
torch distributed error #156

Open
9B8DY6 opened this issue Jul 22, 2022 · 6 comments

Comments

@9B8DY6

9B8DY6 commented Jul 22, 2022

```
(imaginaire) da981116@lait:/disk1/da981116/imaginaire$ bash scripts/test_training.sh
/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
Traceback (most recent call last):
  File "train.py", line 21, in <module>
    from imaginaire.utils.logging import init_logging, make_logging_dir
  File "/disk1/da981116/imaginaire/imaginaire/utils/logging.py", line 10, in <module>
    from imaginaire.utils.meters import set_summary_writer
  File "/disk1/da981116/imaginaire/imaginaire/utils/meters.py", line 11, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 253476) of binary: /home/da981116/.conda/envs/imaginaire/bin/python
Traceback (most recent call last):
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/da981116/.conda/envs/imaginaire/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
            train.py FAILED
=======================================
Root Cause:
[0]:
  time: 2022-07-22_10:24:36
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 253476)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

 python -m torch.distributed.launch --nproc_per_node=1 train.py  --config configs/unit_test/spade.yaml >> /tmp/unit_test.log  [Failure]
(imaginaire) da981116@lait:/disk1/da981116/imaginaire$
```

I got a `distutils` error (`module 'distutils' has no attribute 'version'`), which then raised `torch.distributed.elastic.multiprocessing.errors.ChildFailedError`.
I tried to install imaginaire with conda, but it does not work. Could you tell me how to fix it? If I solve the `distutils` problem with `pip install setuptools`, will the torch.distributed problem also be solved?
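(For what it's worth, the `FutureWarning` at the top of the log is a separate issue from the crash: `torch.distributed.run` sets `--use_env`, so it exports a `LOCAL_RANK` environment variable instead of passing a `--local_rank` argument. A minimal sketch of a launcher-agnostic way to read it, assuming the script still declares the old argument:)

```python
import argparse
import os

# torch.distributed.run exports LOCAL_RANK instead of passing --local_rank;
# defaulting the argument from the environment keeps a script compatible
# with both the old launch.py and the newer torch.distributed.run.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args([])  # empty argv here so the sketch runs standalone
print(args.local_rank)
```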

@digvijayad

@9B8DY6
I solved it by downgrading setuptools to version 59.5.0, as suggested here:

pip install setuptools==59.5.0
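(Background on why the pin works: setuptools >= 60 replaces the stdlib `distutils` with its own shim, and a plain `import distutils` no longer exposes the `version` submodule as an attribute, which is exactly the `distutils.version.LooseVersion` lookup that tensorboard's `__init__.py` does. Newer PyTorch/tensorboard releases compare versions without `distutils`; the numeric-tuple helper below is a hypothetical stand-in, not the library's actual code, just to show the idea:)

```python
# A distutils-free stand-in for LooseVersion-style comparisons.
# version_tuple("1.10.0") -> (1, 10, 0), and tuples compare element-wise.
def version_tuple(v: str) -> tuple:
    return tuple(int(part) for part in v.split(".") if part.isdigit())

assert version_tuple("1.10.0") > version_tuple("1.8.1")
assert version_tuple("59.5.0") < version_tuple("60.0.0")
```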

@9B8DY6

9B8DY6 commented Jul 28, 2022

> @9B8DY6 I solved it by downgrading the setuptools to version 59 as suggested here
>
> pip install setuptools==59.5.0

Thank you for your reply. @digvijayad

@digvijayad

@9B8DY6 have you managed to run the model? I'm getting another error: version 'GLIBCXX_3.4.XX' not found

@9B8DY6

9B8DY6 commented Jul 29, 2022

> @9B8DY6 have you managed to run the model? I'm getting another error: version 'GLIBCXX_3.4.XX' not found

```
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=1, worker_count=4, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 567) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70
```

No, I still have the torch.distributed problem.

@digvijayad

That looks like a 30-minute timeout on your worker, so it timed out before it could finish. Not too sure about it, though.
Have you finished building the unprojections, or are you just testing with the default Cityscapes?
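One more thing worth checking: the barrier message reports `world_size=1` but `worker_count=4`, i.e. three extra workers registered in the same store. That often means processes from an earlier, crashed launch are still alive and reusing the rendezvous port. A quick sketch to look for survivors (the `train.py` script name is an assumption; adjust it to your entrypoint):

```python
import subprocess

# List any leftover train.py workers whose command line matches; pgrep -a
# prints pid plus full command, and exits nonzero (harmless here) on no match.
result = subprocess.run(["pgrep", "-af", "train.py"],
                        capture_output=True, text=True)
stale = result.stdout.strip()
print(stale if stale else "no stale workers found")
```

If anything shows up, kill those PIDs before relaunching, or pass a different `--master_port` to the launcher so the new run rendezvous on a fresh port.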

@Nyoko74

Nyoko74 commented Oct 26, 2022

> @9B8DY6 have you managed to run the model? I'm getting another error: version 'GLIBCXX_3.4.XX' not found
>
> RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=1, worker_count=4, timeout=0:30:00)
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 567) of binary: /opt/conda/bin/python
> ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
> INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
> /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70
>
> No, I still got the problem of torch.distributed

Hi @9B8DY6, I am having the same issue. Did you manage to fix it?
