
[BUG] Getting "SymIntArrayRef expected to contain only concrete integers" error when > 1 GPU #6806

Open
rileyhun opened this issue Nov 28, 2024 · 0 comments
Labels: bug, training


rileyhun commented Nov 28, 2024

Describe the bug
When training on 8 GPUs on a single instance, an obscure error surfaces that results in an NCCL communication timeout. It consistently fails on the last step of the first epoch.

We are able to train the model with DeepSpeed on a single GPU; the error only surfaces when using more than one GPU. Additionally, we trained a sample model on the MNIST data with DeepSpeed and 16 GPUs and were not able to reproduce the issue there either, so it is likely not a configuration, environment, or OS error.

The error:

```
Training errored after 0 iterations at 2024-11-28 20:38:59. Total running time: 51min 4s
Error file: /tmp/ray/session_2024-11-28_19-41-03_581423_461/artifacts/2024-11-28_19-47-54/***_20241128-194737/driver_artifacts/TorchTrainer_a8e76_00000_0_2024-11-28_19-47-54/error.txt
2024-11-28 20:39:15,732 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'com-****-amp-experim-****/****-experiments/****_20241128-194737' in 15.9896s.

2024-11-28 20:39:15,733 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_a8e76_00000]
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=790, ip=10.119.112.122, actor_id=89e092e47068d2ddb308d36f02000000, repr=TorchTrainer)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=930, ip=10.119.112.122, actor_id=daefc30e025aaf5ca0e04d4202000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f68c0658f10>)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/ray_lightning.py", line 264, in _train_function_per_worker
    trainer.fit(
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
    self.fit_loop.run()
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in run
    self.on_advance_end()
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 376, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 210, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/tools/ray.py", line 68, in on_train_epoch_end
    self.save_checkpoint(trainer, pl_module)
  File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/tools/ray.py", line 44, in save_checkpoint
    trainer.save_checkpoint(ckpt_path, weights_only=False)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/deepspeed.py", line 625, in save_checkpoint
    filepath = self.broadcast(filepath)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2422, in broadcast_object_list
    object_tensor = torch.empty(  # type: ignore[call-overload]
RuntimeError: aten/src/ATen/RegisterCUDA.cpp:7216: SymIntArrayRef expected to contain only concrete integers
```
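
For context, the call that fails at the bottom of the trace is Lightning's `DDPStrategy.broadcast`, which wraps the checkpoint filepath in a list and passes it to `torch.distributed.broadcast_object_list`; that function pickles the object, broadcasts its byte length, and then allocates an empty receive buffer, which is the `torch.empty` call that raises. Below is a minimal single-process sketch of that call path (not our actual code; the backend, addresses, and checkpoint path are made up):

```python
# Minimal single-process sketch of the failing call path; the backend,
# addresses, and checkpoint path below are hypothetical, not our setup.
import os

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Lightning's DDPStrategy.broadcast effectively does this with the
# checkpoint filepath on every save:
obj = ["/tmp/checkpoints/epoch=0-step=1000.ckpt"]  # hypothetical path
# broadcast_object_list pickles the object, broadcasts its byte length,
# then allocates the receive buffer with torch.empty(total_size); that
# allocation is where "SymIntArrayRef expected to contain only concrete
# integers" is raised in our multi-GPU runs.
dist.broadcast_object_list(obj, src=0)
print(obj[0])

dist.destroy_process_group()
```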

To Reproduce
The model and code are proprietary, so we cannot share them.

  • We are using an IterableDataset for the DataLoader (a rough sketch of the pipeline shape is below)
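
Since the dataset itself is proprietary, here is a rough sketch of the general shape of the data pipeline (class name and sizes are made up). The relevant property is that an `IterableDataset` exposes no `__len__`, so the trainer only discovers the epoch boundary when the stream is exhausted, which is where the run fails:

```python
# Hypothetical stand-in for the proprietary streaming dataset.
import torch
from torch.utils.data import DataLoader, IterableDataset


class StreamingDataset(IterableDataset):
    """Streams (features, label) pairs; there is no __len__, so the epoch
    length is only known once the iterator is exhausted."""

    def __iter__(self):
        for i in range(1000):  # made-up stream length
            yield torch.randn(16), torch.tensor(i % 10)


loader = DataLoader(StreamingDataset(), batch_size=32)
```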

System info:

  • OS: Ubuntu 22.04
  • GPU count and types: 1 machine with 8 H100 GPUs
  • Python version: 3.10.1

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Ray VM using PyTorch Lightning (via Ray Train's TorchTrainer, per the traceback above; a rough sketch of the launch path is below).
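
For reference, here is a rough sketch of this launch path using Ray Train's public Lightning integration (assumption: our internal adsk_ailab_ray wrapper behaves roughly like this; the toy module, data, and ZeRO stage are made up):

```python
# Hedged sketch of the launch path using Ray Train's public API; our
# internal wrapper differs, and the module/data below are placeholders.
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDeepSpeedStrategy,
    RayLightningEnvironment,
    prepare_trainer,
)


class ToyModule(pl.LightningModule):
    """Hypothetical stand-in for the proprietary LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def train_func_per_worker():
    loader = DataLoader(
        TensorDataset(torch.randn(256, 16), torch.randint(0, 10, (256,))),
        batch_size=32,
    )
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="gpu",
        devices="auto",
        strategy=RayDeepSpeedStrategy(stage=3),  # made-up stage; ours differs
        plugins=[RayLightningEnvironment()],
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModule(), loader)


# 8 workers mirrors the 8x H100 setup where the error appears.
TorchTrainer(
    train_func_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
).fit()
```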
