This repository was archived by the owner on Jun 3, 2024. It is now read-only.
RuntimeError: Timed out waiting for send operation to complete #4
Open
Labels
bug: Something isn't working
Description
Describe the bug
I ran the example command python -m openfed.tools.simulator --nproc 6 examples/run.py, as given in the repository, just to check that the code runs, and encountered the following error.
(openfed) ozaland@prec3660c:~/OpenFed$ python -m openfed.tools.simulator --nproc 6 examples/run.py
0%| | 0/10 [00:00<?, ?it/s]/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
10%|████ | 1/10 [30:02<4:30:18, 1802.07s/it]
Traceback (most recent call last):
File "examples/run.py", line 99, in <module>
simulate()
File "examples/run.py", line 52, in simulate
api.run()
File "/home/ozaland/OpenFed/openfed/api.py", line 71, in run
maintainer.step()
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 306, in step
return self._aggregator_step(*args, **kwargs)
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 378, in _aggregator_step
flag = self.upload()
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 298, in upload
self.transfer(to=True)
File "/home/ozaland/OpenFed/openfed/core/functional.py", line 33, in _fed_context
return safe_call(self, *args, **kwargs)
File "/home/ozaland/OpenFed/openfed/core/functional.py", line 24, in safe_call
return func(*args, **kwargs)
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 253, in transfer
self.pipe.upload(self.packaged_data)
File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 164, in upload
self.transfer(True, data)
File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 233, in transfer
self.push(data)
File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 249, in push
distributed_c10d.gather_object(data, None, dst=rank, group=self.pg)
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1981, in gather_object
all_gather(object_size_list, local_size, group=group)
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2282, in all_gather
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7fc6515eff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f2c126eff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f60a89eff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7ff26edaff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f290cfaff80>
warnings.warn(f'Failed to call {func}')
Killing subprocess 1740079
Killing subprocess 1740080
Killing subprocess 1740081
Killing subprocess 1740082
Killing subprocess 1740083
Killing subprocess 1740084
Traceback (most recent call last):
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 221, in <module>
main()
File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 204, in main
sigkill_handler(signal.SIGTERM, None)
File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 138, in sigkill_handler
returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ozaland/anaconda3/envs/openfed/bin/python', '-u', 'examples/run.py', '--props=/tmp/collaborator-5.json']' returned non-zero exit status 1.
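For reference, the 1800000ms in the RuntimeError above works out to 30 minutes, which lines up with the 30:02 shown on the progress bar for the first round, so the first upload appears to block until the timeout expires. As far as I can tell this is also the default process group timeout in torch.distributed for this PyTorch version. The snippet below is only a small sketch of that conversion using the standard library; parse_gloo_timeout is a hypothetical helper name, not part of OpenFed or PyTorch.

```python
import re
from datetime import timedelta

# Error line copied verbatim from the traceback above.
error_line = (
    "RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] "
    "Timed out waiting 1800000ms for send operation to complete"
)


def parse_gloo_timeout(message):
    """Hypothetical helper: extract the '<N>ms' timeout from a Gloo error message.

    Returns a timedelta, or None if the message has no such timeout.
    """
    match = re.search(r"Timed out waiting (\d+)ms", message)
    if match is None:
        return None
    return timedelta(milliseconds=int(match.group(1)))


timeout = parse_gloo_timeout(error_line)
print(timeout)  # 0:30:00
```

If that reading is right, the send never completes at all rather than being slow, which would suggest the processes cannot reach each other over the Gloo TCP transport.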
Environment (please complete the following information):
- OS Platform and Distribution (e.g., Linux Ubuntu 22.04):
- Python package versions: PyTorch v1.13.1, openfed v0.0.0, torchvision v0.14.1
- Python version: 3.7
- CUDA/cuDNN version: 11.7