Connection Error Mid Training #599

@Jgmedina95

Training ran fine for at least 90 steps, and then the error below appeared. Any idea how I can prevent a complete failure when there are connection issues? This run was on an 8xH100 instance from PrimeIntellect.

Generating rollouts (train): 75%|███████▌ | 96/128 [01:31<00:12, 2.63it/s]
2025-12-02 03:58:17 - verifiers.envs.CrystalRelaxationMultiTurnEnv - ERROR - Error getting model response: Connection error.
Exiting...
2025-12-02 03:58:17.856 | ERROR | asyncio.events:_run:88 - An error has been caught in function '_run', process 'MainProcess' (11969), thread 'MainThread' (139832729312128):
Traceback (most recent call last):

  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
    │ │ │ └ <Request [b'POST']>
    │ │ └ <function AsyncConnectionPool.handle_async_request at 0x7f2b63f58540>
    │ └ <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
    └ <httpx.AsyncHTTPTransport object at 0x7f2b5685e3f0>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
    │ └ <function AsyncHTTPConnection.handle_async_request at 0x7f2b63f4b740>
    └ <AsyncHTTPConnection [CONNECTION FAILED]>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
    raise exc
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 78, in handle_async_request
    stream = await self._connect(request)
    │ │ └ <Request [b'POST']>
    │ └ <function AsyncHTTPConnection._connect at 0x7f2b63f4b7e0>
    └ <AsyncHTTPConnection [CONNECTION FAILED]>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 124, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
    │ │ │ └ {'host': 'localhost', 'port': 8000, 'local_address': None, 'timeout': 1200, 'socket_options': None}
    │ │ └ <function AutoBackend.connect_tcp at 0x7f2b63f498a0>
    │ └ <httpcore._backends.auto.AutoBackend object at 0x7f2b5685e2d0>
    └ <AsyncHTTPConnection [CONNECTION FAILED]>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_backends/auto.py", line 31, in connect_tcp
    return await self._backend.connect_tcp(
    │ │ └ <function AnyIOBackend.connect_tcp at 0x7f2b63f5b2e0>
    │ └ <httpcore.AnyIOBackend object at 0x7f2a87ad3110>
    └ <httpcore._backends.auto.AutoBackend object at 0x7f2b5685e2d0>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 113, in connect_tcp
    with map_exceptions(exc_map):
    │ └ {<class 'TimeoutError'>: <class 'httpcore.ConnectTimeout'>, <class 'OSError'>: <class 'httpcore.ConnectError'>, <class 'anyio...
    └ <function map_exceptions at 0x7f2b640c6b60>
  File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
    │ │ │ └ OSError('All connection attempts failed')
    │ │ └ <method 'throw' of 'generator' objects>
    │ └ <generator object map_exceptions at 0x7f26c33f7c40>
    └ <contextlib._GeneratorContextManager object at 0x7f26e0f529c0>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
    └ <class 'httpcore.ConnectError'>

httpcore.ConnectError: All connection attempts failed


The above exception was the direct cause of the following exception:


Traceback (most recent call last):

  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1529, in request
    response = await self._client.send(
    │ │ └ <function AsyncClient.send at 0x7f2ba5d2f060>
    │ └ <httpx.AsyncClient object at 0x7f2b56a18440>
    └ <openai.AsyncOpenAI object at 0x7f2b56a183b0>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
    │ └ <function AsyncClient._send_handling_auth at 0x7f2ba5d2f100>
    └ <httpx.AsyncClient object at 0x7f2b56a18440>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
    │ └ <function AsyncClient._send_handling_redirects at 0x7f2ba5d2f1a0>
    └ <httpx.AsyncClient object at 0x7f2b56a18440>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
    │ │ └ <Request('POST', 'http://localhost:8000/v1/chat/completions')>
    │ └ <function AsyncClient._send_single_request at 0x7f2ba5d2f240>
    └ <httpx.AsyncClient object at 0x7f2b56a18440>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
    │ │ └ <Request('POST', 'http://localhost:8000/v1/chat/completions')>
    │ └ <function AsyncHTTPTransport.handle_async_request at 0x7f2ba5d23ec0>
    └ <httpx.AsyncHTTPTransport object at 0x7f2b5685e3f0>
  File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
    with map_httpcore_exceptions():
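
For reference, the failed request in the traceback targets `localhost:8000`, which suggests the local inference server stopped accepting connections rather than an external network outage. If the server does come back, one way to survive the gap is to retry the model call with exponential backoff instead of exiting on the first failure. The following is only a minimal sketch of that idea, not how verifiers or prime-rl handle it: `call_with_retries` and `FlakyServer` are hypothetical names made up for illustration, and `FlakyServer` merely stands in for the real request. With the `openai` client seen in the traceback, the exception class to retry on would presumably be `openai.APIConnectionError`, and the client constructor also accepts a `max_retries` argument.

```python
import asyncio
import random


async def call_with_retries(make_call, *, max_attempts=5, base_delay=1.0,
                            max_delay=30.0, retry_on=(OSError,)):
    """Await make_call(), retrying on the given exception types with backoff.

    ConnectionError is a subclass of OSError, so the default covers both.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff with jitter, capped at max_delay seconds.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))


class FlakyServer:
    """Hypothetical stand-in for the inference endpoint: fails N times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def request(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("All connection attempts failed")
        return "ok"


def demo():
    # Two simulated connection failures, then a successful third attempt.
    server = FlakyServer(failures=2)
    result = asyncio.run(call_with_retries(server.request, base_delay=0.01))
    return result, server.calls
```

Retrying only helps if the server recovers within the backoff window. If the inference process crashed outright, the retries will run out and the original error is raised again, so checking the server's own logs for the cause is still worthwhile.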
