This repository was archived by the owner on Jun 3, 2024. It is now read-only.

RuntimeError: Timed out waiting for send operation to complete #4

@obaidullahzaland

Describe the bug
I ran the command python -m openfed.tools.simulator --nproc 6 examples/run.py, as given in the repository, just to check that the code runs, and encountered the following error.

(openfed) ozaland@prec3660c:~/OpenFed$ python -m openfed.tools.simulator --nproc 6 examples/run.py
  0%|                                                    | 0/10 [00:00<?, ?it/s]/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
 10%|████                                    | 1/10 [30:02<4:30:18, 1802.07s/it]
Traceback (most recent call last):
  File "examples/run.py", line 99, in <module>
    simulate()
  File "examples/run.py", line 52, in simulate
    api.run()
  File "/home/ozaland/OpenFed/openfed/api.py", line 71, in run
    maintainer.step()
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 306, in step
    return self._aggregator_step(*args, **kwargs)
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 378, in _aggregator_step
    flag = self.upload()
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 298, in upload
    self.transfer(to=True)
  File "/home/ozaland/OpenFed/openfed/core/functional.py", line 33, in _fed_context
    return safe_call(self, *args, **kwargs)
  File "/home/ozaland/OpenFed/openfed/core/functional.py", line 24, in safe_call
    return func(*args, **kwargs)
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 253, in transfer
    self.pipe.upload(self.packaged_data)
  File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 164, in upload
    self.transfer(True, data)
  File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 233, in transfer
    self.push(data)
  File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 249, in push
    distributed_c10d.gather_object(data, None, dst=rank, group=self.pg)
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1981, in gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2282, in all_gather
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7fc6515eff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f2c126eff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f60a89eff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7ff26edaff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f290cfaff80>
  warnings.warn(f'Failed to call {func}')
Killing subprocess 1740079
Killing subprocess 1740080
Killing subprocess 1740081
Killing subprocess 1740082
Killing subprocess 1740083
Killing subprocess 1740084
Traceback (most recent call last):
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 221, in <module>
    main()
  File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 204, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 138, in sigkill_handler
    returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ozaland/anaconda3/envs/openfed/bin/python', '-u', 'examples/run.py', '--props=/tmp/collaborator-5.json']' returned non-zero exit status 1.
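
A note on reading the log above: the 1800000 ms in the Gloo error is exactly 30 minutes, which matches the default process-group timeout in torch.distributed, and the progress bar stops at 30:02 on the first iteration. That suggests the first collective call (gather_object) never completed, rather than the run being slow. A minimal sketch of the arithmetic, using only the standard library; the helper name gloo_timeout is hypothetical and is not part of OpenFed or PyTorch:

```python
import re
from datetime import timedelta

# Error text copied from the traceback above.
ERROR = (
    "[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] "
    "Timed out waiting 1800000ms for send operation to complete"
)


def gloo_timeout(message: str) -> timedelta:
    """Extract the 'Timed out waiting <N>ms' duration from a Gloo error message.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"Timed out waiting (\d+)ms", message)
    if match is None:
        raise ValueError("no Gloo timeout found in message")
    return timedelta(milliseconds=int(match.group(1)))


timeout = gloo_timeout(ERROR)
print(timeout)  # 0:30:00
```

torch.distributed.init_process_group accepts a timeout argument, so a shorter value there would make a hang like this fail sooner while debugging. Whether the OpenFed simulator exposes that setting is not shown in this report.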

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 22.04):
  • Python package versions: PyTorch v1.13.1, openfed v0.0.0, torchvision v0.14.1
  • Python version: 3.7
  • CUDA/cuDNN version: 11.7

Labels: bug
