Description
Training was going well for at least 90 steps, and then this error appeared. Any idea how I can prevent a complete failure when there are connection issues? This training run was on an 8xH100 instance from PrimeIntellect.
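To make the question concrete, the behavior I have in mind is retrying the failed request with backoff instead of exiting. This is only a minimal sketch, not how verifiers or prime-rl actually handle it: `call_with_retry`, `make_request`, and all the parameters are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception the client really raises (with the OpenAI client that would presumably be `openai.APIConnectionError`).

```python
import asyncio
import random


async def call_with_retry(
    make_request,                   # hypothetical: zero-argument coroutine function doing one request
    *,
    retry_on=(ConnectionError,),    # assumption: swap in the client's own connection error type
    max_attempts=5,
    base_delay=1.0,
    max_delay=30.0,
):
    """Await make_request(), retrying with exponential backoff on connection errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the original error.
                raise
            # Exponential backoff capped at max_delay, plus jitter so that
            # many concurrent rollouts do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Retrying would only help if the server on `localhost:8000` comes back by itself. If the inference server process has died, the run would still need to fail or restart it.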
Generating rollouts (train): 75%|███████▌ | 96/128 [01:31<00:12, 2.63it/s]
2025-12-02 03:58:17 - verifiers.envs.CrystalRelaxationMultiTurnEnv - ERROR - Error getting model response: Connection error.
Exiting...
2025-12-02 03:58:17.856 | ERROR | asyncio.events:_run:88 - An error has been caught in function '_run', process 'MainProcess' (11969), thread 'MainThread' (139832729312128):
Traceback (most recent call last):

File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
yield
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
resp = await self._pool.handle_async_request(req)
│ │ │ └ <Request [b'POST']>
│ │ └ <function AsyncConnectionPool.handle_async_request at 0x7f2b63f58540>
│ └ <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
└ <httpx.AsyncHTTPTransport object at 0x7f2b5685e3f0>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
raise exc from None
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
response = await connection.handle_async_request(
│ └ <function AsyncHTTPConnection.handle_async_request at 0x7f2b63f4b740>
└ <AsyncHTTPConnection [CONNECTION FAILED]>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
raise exc
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 78, in handle_async_request
stream = await self._connect(request)
│ │ └ <Request [b'POST']>
│ └ <function AsyncHTTPConnection._connect at 0x7f2b63f4b7e0>
└ <AsyncHTTPConnection [CONNECTION FAILED]>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 124, in _connect
stream = await self._network_backend.connect_tcp(**kwargs)
│ │ │ └ {'host': 'localhost', 'port': 8000, 'local_address': None, 'timeout': 1200, 'socket_options': None}
│ │ └ <function AutoBackend.connect_tcp at 0x7f2b63f498a0>
│ └ <httpcore._backends.auto.AutoBackend object at 0x7f2b5685e2d0>
└ <AsyncHTTPConnection [CONNECTION FAILED]>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_backends/auto.py", line 31, in connect_tcp
return await self._backend.connect_tcp(
│ │ └ <function AnyIOBackend.connect_tcp at 0x7f2b63f5b2e0>
│ └ <httpcore.AnyIOBackend object at 0x7f2a87ad3110>
└ <httpcore._backends.auto.AutoBackend object at 0x7f2b5685e2d0>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 113, in connect_tcp
with map_exceptions(exc_map):
│ └ {<class 'TimeoutError'>: <class 'httpcore.ConnectTimeout'>, <class 'OSError'>: <class 'httpcore.ConnectError'>, <class 'anyio...
└ <function map_exceptions at 0x7f2b640c6b60>
File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
self.gen.throw(value)
│ │ │ └ OSError('All connection attempts failed')
│ │ └ <method 'throw' of 'generator' objects>
│ └ <generator object map_exceptions at 0x7f26c33f7c40>
└ <contextlib._GeneratorContextManager object at 0x7f26e0f529c0>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
└ <class 'httpcore.ConnectError'>

httpcore.ConnectError: All connection attempts failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1529, in request
response = await self._client.send(
│ │ └ <function AsyncClient.send at 0x7f2ba5d2f060>
│ └ <httpx.AsyncClient object at 0x7f2b56a18440>
└ <openai.AsyncOpenAI object at 0x7f2b56a183b0>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
response = await self._send_handling_auth(
│ └ <function AsyncClient._send_handling_auth at 0x7f2ba5d2f100>
└ <httpx.AsyncClient object at 0x7f2b56a18440>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
response = await self._send_handling_redirects(
│ └ <function AsyncClient._send_handling_redirects at 0x7f2ba5d2f1a0>
└ <httpx.AsyncClient object at 0x7f2b56a18440>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
response = await self._send_single_request(request)
│ │ └ <Request('POST', 'http://localhost:8000/v1/chat/completions')>
│ └ <function AsyncClient._send_single_request at 0x7f2ba5d2f240>
└ <httpx.AsyncClient object at 0x7f2b56a18440>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
response = await transport.handle_async_request(request)
│ │ └ <Request('POST', 'http://localhost:8000/v1/chat/completions')>
│ └ <function AsyncHTTPTransport.handle_async_request at 0x7f2ba5d23ec0>
└ <httpx.AsyncHTTPTransport object at 0x7f2b5685e3f0>
File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
with map_httpcore_exceptions():