
[BUG] Getting "SymIntArrayRef expected to contain only concrete integers" error when > 1 GPU #6806

Open
rileyhun opened this issue Nov 28, 2024 · 0 comments
Labels: bug, training


rileyhun commented Nov 28, 2024

Describe the bug
When training on 8 GPUs on a single instance, an obscure error surfaces that results in an NCCL communication timeout. It consistently fails on the last step of the first epoch.

We are able to train the model with DeepSpeed on a single GPU; the error only surfaces when using more than one GPU. Additionally, we trained a sample model on the MNIST data with DeepSpeed and 16 GPUs and were not able to reproduce the issue there either, so it is likely not a configuration, environment, or OS error.

The error:

```
Training errored after 0 iterations at 2024-11-28 20:38:59. Total running time: 51min 4s
Error file: /tmp/ray/session_2024-11-28_19-41-03_581423_461/artifacts/2024-11-28_19-47-54/***_20241128-194737/driver_artifacts/TorchTrainer_a8e76_00000_0_2024-11-28_19-47-54/error.txt
2024-11-28 20:39:15,732 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'com-****-amp-experim-****/****-experiments/****_20241128-194737' in 15.9896s.

2024-11-28 20:39:15,733 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_a8e76_00000]
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=790, ip=10.119.112.122, actor_id=89e092e47068d2ddb308d36f02000000, repr=TorchTrainer)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=930, ip=10.119.112.122, actor_id=daefc30e025aaf5ca0e04d4202000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f68c0658f10>)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/ray_lightning.py", line 264, in _train_function_per_worker
    trainer.fit(
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
    self.fit_loop.run()
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in run
    self.on_advance_end()
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 376, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 210, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/tools/ray.py", line 68, in on_train_epoch_end
    self.save_checkpoint(trainer, pl_module)
  File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/tools/ray.py", line 44, in save_checkpoint
    trainer.save_checkpoint(ckpt_path, weights_only=False)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/deepspeed.py", line 625, in save_checkpoint
    filepath = self.broadcast(filepath)
  File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2422, in broadcast_object_list
    object_tensor = torch.empty(  # type: ignore[call-overload]
RuntimeError: aten/src/ATen/RegisterCUDA.cpp:7216: SymIntArrayRef expected to contain only concrete integers
```
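
For context, the call that fails at the bottom of the trace is Lightning's `DDPStrategy.broadcast`, which wraps the checkpoint filepath in a list and passes it to `torch.distributed.broadcast_object_list`; that function pickles the object, broadcasts its byte length, and then allocates an empty receive buffer, which is the `torch.empty` call that raises. Below is a minimal single-process sketch of that call path (not our actual code; the backend, addresses, and checkpoint path are made up):

```python
# Minimal single-process sketch of the failing call path; the backend,
# addresses, and checkpoint path below are hypothetical, not our setup.
import os

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Lightning's DDPStrategy.broadcast effectively does this with the
# checkpoint filepath on every save:
obj = ["/tmp/checkpoints/epoch=0-step=1000.ckpt"]  # hypothetical path
# broadcast_object_list pickles the object, broadcasts its byte length,
# then allocates the receive buffer with torch.empty(total_size); that
# allocation is where "SymIntArrayRef expected to contain only concrete
# integers" is raised in our multi-GPU runs.
dist.broadcast_object_list(obj, src=0)
print(obj[0])

dist.destroy_process_group()
```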

To Reproduce
The model and code are proprietary, so we cannot share them.

  • We are using an IterableDataset for the DataLoader (a rough sketch of the pipeline shape is below)
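
Since the dataset itself is proprietary, here is a rough sketch of the general shape of the data pipeline (class name and sizes are made up). The relevant property is that an `IterableDataset` exposes no `__len__`, so the trainer only discovers the epoch boundary when the stream is exhausted, which is where the run fails:

```python
# Hypothetical stand-in for the proprietary streaming dataset.
import torch
from torch.utils.data import DataLoader, IterableDataset


class StreamingDataset(IterableDataset):
    """Streams (features, label) pairs; there is no __len__, so the epoch
    length is only known once the iterator is exhausted."""

    def __iter__(self):
        for i in range(1000):  # made-up stream length
            yield torch.randn(16), torch.tensor(i % 10)


loader = DataLoader(StreamingDataset(), batch_size=32)
```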

System info:

  • OS: Ubuntu 22.04
  • GPU count and types: 1 machine with 8 H100 GPUs
  • Python version: 3.10.1

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Ray VM using PyTorch Lightning (via Ray Train's TorchTrainer, per the traceback above; a rough sketch of the launch path is below).
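
For reference, here is a rough sketch of this launch path using Ray Train's public Lightning integration (assumption: our internal adsk_ailab_ray wrapper behaves roughly like this; the toy module, data, and ZeRO stage are made up):

```python
# Hedged sketch of the launch path using Ray Train's public API; our
# internal wrapper differs, and the module/data below are placeholders.
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDeepSpeedStrategy,
    RayLightningEnvironment,
    prepare_trainer,
)


class ToyModule(pl.LightningModule):
    """Hypothetical stand-in for the proprietary LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def train_func_per_worker():
    loader = DataLoader(
        TensorDataset(torch.randn(256, 16), torch.randint(0, 10, (256,))),
        batch_size=32,
    )
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="gpu",
        devices="auto",
        strategy=RayDeepSpeedStrategy(stage=3),  # made-up stage; ours differs
        plugins=[RayLightningEnvironment()],
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModule(), loader)


# 8 workers mirrors the 8x H100 setup where the error appears.
TorchTrainer(
    train_func_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
).fit()
```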
