Describe the bug
When training on 8 GPUs on a single instance, we hit an obscure error that results in an NCCL communication timeout. It consistently fails on the last step of the first epoch.
We are able to train the model with DeepSpeed on a single GPU; the error only surfaces when using more than one GPU. We also trained a sample model on the MNIST data with DeepSpeed and 16 GPUs and could not reproduce the issue there, so it is likely not a configuration, environment, or OS problem.
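Since the actual code is proprietary, here is a minimal, self-contained sketch of how the run is launched. It uses Ray Train's public Lightning integration (TorchTrainer, RayDeepSpeedStrategy, prepare_trainer) directly rather than our internal wrapper, and ToyStream / ToyModule are placeholder stand-ins for the real data pipeline and model, so it will not reproduce the error by itself:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, IterableDataset
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDeepSpeedStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)


class ToyStream(IterableDataset):
    """Stand-in for the proprietary IterableDataset-backed data pipeline."""

    def __iter__(self):
        for _ in range(1024):
            yield torch.randn(32), torch.randn(1)


class ToyModule(pl.LightningModule):
    """Stand-in for the proprietary LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def train_func_per_worker(config):
    # One copy of this function runs on each of the 8 GPU workers.
    trainer = pl.Trainer(
        max_epochs=config["max_epochs"],
        accelerator="gpu",
        devices="auto",
        strategy=RayDeepSpeedStrategy(stage=2),  # DeepSpeed stage chosen arbitrarily here
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModule(), train_dataloaders=DataLoader(ToyStream(), batch_size=32))


trainer = TorchTrainer(
    train_func_per_worker,
    train_loop_config={"max_epochs": 1},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # 1 node, 8 x H100
)
result = trainer.fit()
```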
The error:
Training errored after 0 iterations at 2024-11-28 20:38:59. Total running time: 51min 4s
Error file: /tmp/ray/session_2024-11-28_19-41-03_581423_461/artifacts/2024-11-28_19-47-54/***_20241128-194737/driver_artifacts/TorchTrainer_a8e76_00000_0_2024-11-28_19-47-54/error.txt
2024-11-28 20:39:15,732 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'com-****-amp-experim-****/****-experiments/****_20241128-194737' in 15.9896s.
2024-11-28 20:39:15,733 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_a8e76_00000]
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=790, ip=10.119.112.122, actor_id=89e092e47068d2ddb308d36f02000000, repr=TorchTrainer)
File "/opt/miniconda/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=930, ip=10.119.112.122, actor_id=daefc30e025aaf5ca0e04d4202000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f68c0658f10>)
File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/opt/miniconda/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
train_func(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/ray_lightning.py", line 264, in _train_function_per_worker
trainer.fit(
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
results = self._run_stage()
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
self.fit_loop.run()
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in run
self.on_advance_end()
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 376, in on_advance_end
call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 210, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/tools/ray.py", line 68, in on_train_epoch_end
self.save_checkpoint(trainer, pl_module)
File "/opt/miniconda/lib/python3.10/site-packages/adsk_ailab_ray/tools/ray.py", line 44, in save_checkpoint
trainer.save_checkpoint(ckpt_path, weights_only=False)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in save_checkpoint
self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/deepspeed.py", line 625, in save_checkpoint
filepath = self.broadcast(filepath)
File "/opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2422, in broadcast_object_list
object_tensor = torch.empty( # type: ignore[call-overload]
RuntimeError: aten/src/ATen/RegisterCUDA.cpp:7216: SymIntArrayRef expected to contain only concrete integers
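The failure is raised inside torch.distributed.broadcast_object_list, which Lightning's DeepSpeedStrategy.save_checkpoint uses (via self.broadcast(filepath)) to send the checkpoint filepath from rank 0 to every other rank before writing. A standalone illustration of that collective, independent of our code and assuming an NCCL process group launched with torchrun, looks like this:

```python
import os
import torch
import torch.distributed as dist

# Run with, e.g.:  torchrun --nproc_per_node=2 broadcast_demo.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Rank 0 decides the checkpoint filepath; every other rank receives it.
# Lightning's DeepSpeedStrategy.save_checkpoint does the equivalent via
# strategy.broadcast(filepath) before writing the checkpoint.
obj = ["/tmp/checkpoints/epoch=0.ckpt"] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(obj, src=0)
print(f"rank {dist.get_rank()} received: {obj[0]}")

dist.destroy_process_group()
```

In our run it appears to be the allocation of the receive buffer inside broadcast_object_list (the torch.empty call in the trace) that raises the SymIntArrayRef error on some ranks; the remaining ranks then presumably sit in the collective until the NCCL watchdog fires, which would explain why the symptom shows up as a communication timeout.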
To Reproduce
The model and code are proprietary, so we cannot share a reproduction.
We are using an IterableDataset for the DataLoader.
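The traceback goes through an epoch-end checkpoint hook in our internal adsk_ailab_ray package; its shape is roughly the callback below (the path handling is a placeholder, not the real implementation):

```python
import os
import pytorch_lightning as pl


class EpochCheckpoint(pl.Callback):
    """Rough stand-in for the internal callback seen in the trace (tools/ray.py),
    which saves a full checkpoint at the end of every training epoch."""

    def __init__(self, ckpt_dir: str = "/tmp/checkpoints"):
        self.ckpt_dir = ckpt_dir

    def on_train_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        ckpt_path = os.path.join(self.ckpt_dir, f"epoch={trainer.current_epoch}.ckpt")
        # Under the DeepSpeed strategy, save_checkpoint broadcasts ckpt_path from
        # rank 0 to all ranks, which is where the SymIntArrayRef error is raised.
        trainer.save_checkpoint(ckpt_path, weights_only=False)
```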
System info (please complete the following information):
OS: Ubuntu 22.04
GPU count and types: 1 machine with 8 H100 GPUs
Python version: 3.10.1
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Ray VM using PyTorch Lightning