RuntimeError: CUDA error: an illegal memory access was encountered #1577

Open
albertz opened this issue Jul 9, 2024 · 1 comment

albertz (Member) commented Jul 9, 2024

...
ep 1 train, step 294, ctc_4 4.553, ctc_8 4.531, ctc 4.510, num_seqs 11, max_size:time 201384, max_size:out-spatial 149, mem_usage:cuda:0 5.9GB, 0.411 sec/step
ep 1 train, step 294, ctc_4 4.516, ctc_8 4.528, ctc 4.528, num_seqs 10, max_size:time 239009, max_size:out-spatial 133, mem_usage:cuda:2 5.9GB, 0.455 sec/step
ep 1 train, step 295, ctc_4 4.569, ctc_8 4.623, ctc 4.650, num_seqs 9, max_size:time 245433, max_size:out-spatial 136, mem_usage:cuda:1 5.9GB, 0.404 sec/step
ep 1 train, step 295, ctc_4 4.467, ctc_8 4.479, ctc 4.519, num_seqs 9, max_size:time 247369, max_size:out-spatial 135, mem_usage:cuda:3 5.9GB, 0.428 sec/step
ep 1 train, step 295, ctc_4 4.500, ctc_8 4.590, ctc 4.528, num_seqs 9, max_size:time 245081, max_size:out-spatial 131, mem_usage:cuda:0 5.9GB, 0.405 sec/step
ep 1 train, step 295, ctc_4 4.620, ctc_8 4.670, ctc 4.536, num_seqs 10, max_size:time 236369, max_size:out-spatial 135, mem_usage:cuda:2 5.9GB, 0.476 sec/step
ep 1 train, step 296, ctc_4 4.598, ctc_8 4.540, ctc 4.563, num_seqs 9, max_size:time 248953, max_size:out-spatial 156, mem_usage:cuda:3 5.9GB, 0.400 sec/step
ep 1 train, step 296, ctc_4 4.707, ctc_8 4.549, ctc 4.544, num_seqs 12, max_size:time 199296, max_size:out-spatial 131, mem_usage:cuda:0 5.9GB, 0.408 sec/step
ep 1 train, step 296, ctc_4 4.515, ctc_8 4.595, ctc 4.611, num_seqs 10, max_size:time 223920, max_size:out-spatial 121, mem_usage:cuda:1 5.9GB, 0.484 sec/step
ep 1 train, step 296, ctc_4 4.560, ctc_8 4.889, ctc 4.619, num_seqs 10, max_size:time 236457, max_size:out-spatial 144, mem_usage:cuda:2 5.9GB, 0.405 sec/step
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/extern_data.py", line 55, in raw_dict_to_extern_data
    line: data.raw_tensor = raw_tensor.to(device)
    locals:
      data = <local> Tensor{'data', [B?,T|'time'[B?],F|F'audio'(1)]}
      data.raw_tensor = <local> None
      raw_tensor = <local> tensor[9, 242353, 1] n=2181177 (8.3Mb) x∈[-1.033, 1.001] μ=0.000 σ=0.087
      raw_tensor.to = <local> <built-in method to of Tensor object at 0x7ca9d419d6d0>
      device = <local> 'cuda:0', len = 6
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

...
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7cab17f92617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7cab17f4d98d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7cab182cd9f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x169b6 (0x7cab182969b6 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1947d (0x7cab1829947d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1989d (0x7cab1829989d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x513c46 (0x7caad8d30c46 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x55ca7 (0x7cab17f77ca7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7cab17f6fcb3 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7cab17f6fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x4bd16c7 (0x7caac64da6c7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::deleteNode(torch::autograd::Node*) + 0xa9 (0x7caac64d2b59 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: std::_Sp_counted_deleter<torch::autograd::generated::SumBackward0*, void (*)(torch::autograd::Node*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xe (0x7caac5baf1ee in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4ba8990 (0x7caac64b1990 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: c10::TensorImpl::~TensorImpl() + 0x1da (0x7cab17f6fcaa in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #15: c10::TensorImpl::~TensorImpl() + 0x9 (0x7cab17f6fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #16: <unknown function> + 0x7c84d8 (0x7caad8fe54d8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #17: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7caad8fe5865 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #33: <unknown function> + 0x291b7 (0x7cab445ab1b7 in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #34: __libc_start_main + 0x7c (0x7cab445ab26c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #35: _start + 0x21 (0x401071 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11)

Fatal Python error: Aborted

Current thread 0x00007cab44581000 (most recent call first):
  Garbage-collecting
  <no Python frame>
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/c14b833885/native_signal_handler.so(signal_handler+0x4b)[0x7cab18e3b20b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7cab445bef40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7cab44608e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7cab445beea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7cab445bef40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7cab44608e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7cab445beea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7cab445aa45c]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xa58d9)[0x7cab1992b8d9]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xb0f0a)[0x7cab19936f0a]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xaff79)[0x7cab19935f79]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(__gxx_personality_v0+0x86)[0x7cab19936696]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libgcc_s.so.1(+0x17934)[0x7cab43ce2934]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7cab43ce338d]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x117f7)[0x7cab182917f7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x1989d)[0x7cab1829989d]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x513c46)[0x7caad8d30c46]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(+0x55ca7)[0x7cab17f77ca7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1e3)[0x7cab17f6fcb3]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7cab17f6fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4bd16c7)[0x7caac64da6c7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN5torch8autograd10deleteNodeEPNS0_4NodeE+0xa9)[0x7caac64d2b59]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZNSt19_Sp_counted_deleterIPN5torch8autograd9generated12SumBackward0EPFvPNS1_4NodeEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0xe)[0x7caac5baf1ee]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4ba8990)[0x7caac64b1990]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1da)[0x7cab17f6fcaa]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7cab17f6fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7c84d8)[0x7caad8fe54d8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(_Z28THPVariable_subclass_deallocP7_object+0x305)[0x7caad8fe5865]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1edb1d)[0x7cab44a63b1d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1ec6e3)[0x7cab44a626e3]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e1a5d)[0x7cab44a57a5d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1ec564)[0x7cab44a62564]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e003d)[0x7cab44a5603d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1dfe7d)[0x7cab44a55e7d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e1a56)[0x7cab44a57a56]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1ec564)[0x7cab44a62564]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e003d)[0x7cab44a5603d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x258a56)[0x7cab44acea56]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f85b)[0x7cab44b1585b]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29ff60)[0x7cab44b15f60]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_FinalizeEx+0x7b)[0x7cab44b0a92b]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_RunMain+0x180)[0x7cab44b14d40]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_BytesMain+0x29)[0x7cab44b14ab9]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x291b7)[0x7cab445ab1b7]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(__libc_start_main+0x7c)[0x7cab445ab26c]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11(_start+0x21)[0x401071]
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.6.9.55]:26753
...

I have seen this error before a couple of times, but usually a restart of the job "fixed" it, and it was rare. So I attributed it to some hardware hiccup (we have many similar issues with our 1080s... e.g. #1520, #1558, #1496, ...).

However, here I have a case which does not go away after restarts and always occurs at exactly the same step.

It also happens for many other similar setups where the vocabulary dimension is low, so that is probably the key factor: other setups with a higher vocab dim work just fine, but all of the setups with SPM 1k, 512, and 128 have now crashed with this error. The step at which they crashed differed slightly depending on the vocab. Maybe some long sequence triggers this in the CTC calculation; a rough isolated check for that is sketched below.
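
To narrow this down, one thing I could try is running the PyTorch CTC loss in isolation with a similarly long input and a small vocab. This is only a sketch of such a check, not a confirmed reproduction; all shapes and the vocab size (SPM 128 + blank) are assumptions chosen to roughly mimic the failing steps, not values taken from the actual batch:

```python
# Hypothetical standalone check (not confirmed to reproduce the crash):
# long time axis, small vocab, CTC loss forward + backward on the GPU.
import torch

torch.manual_seed(0)
dev = "cuda:0"
T, N, C = 6000, 1, 129   # assumed: downsampled time frames, batch 1, SPM-128 vocab + blank
S = 150                  # assumed target length, similar to max_size:out-spatial above

log_probs = torch.randn(T, N, C, device=dev).log_softmax(-1).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), device=dev, dtype=torch.int64)
input_lengths = torch.full((N,), T, device=dev, dtype=torch.int64)
target_lengths = torch.full((N,), S, device=dev, dtype=torch.int64)

loss = torch.nn.functional.ctc_loss(
    log_probs, targets, input_lengths, target_lengths, blank=0, reduction="sum"
)
loss.backward()
torch.cuda.synchronize()  # force any asynchronously reported kernel error to surface here
print("ok", float(loss))
```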

This is also multi-GPU training, but I'm not sure whether that is relevant.

Some more log (stripped down):

RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660812, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660809, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660811, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
Hostname: cn-255
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
Hostname: cn-255
Hostname: cn-255 
RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660810, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
Hostname: cn-255
Installed native_signal_handler.so.
Installed native_signal_handler.so.
Installed native_signal_handler.so.
Installed native_signal_handler.so.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-255, pid 660811, using GPU 2.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-255, pid 660812, using GPU 3.
Torch: Hostname cn-255, pid 660810, using GPU 1.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-255, pid 660809, using GPU 0.
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
Available CUDA devices:
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
  1/4: cuda:0
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 0
  2/4: cuda:1
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 1
...
ep 1 train, step 97, ctc_4 4.669, ctc_8 4.739, ctc 4.654, num_seqs 11, max_size:time 209881, max_size:out-spatial 129, mem_usage:cuda:2 5.6GB, 0.441 sec/step
ep 1 train, step 97, ctc_4 4.700, ctc_8 4.769, ctc 5.140, num_seqs 17, max_size:time 137448, max_size:out-spatial 104, mem_usage:cuda:3 5.6GB, 0.424 sec/step
ep 1 train, step 98, ctc_4 4.736, ctc_8 4.656, ctc 4.864, num_seqs 13, max_size:time 184185, max_size:out-spatial 105, mem_usage:cuda:1 5.6GB, 0.418 sec/step
ep 1 train, step 98, ctc_4 4.655, ctc_8 4.644, ctc 4.731, num_seqs 15, max_size:time 157360, max_size:out-spatial 99, mem_usage:cuda:3 5.6GB, 0.396 sec/step
ep 1 train, step 98, ctc_4 4.710, ctc_8 4.653, ctc 4.759, num_seqs 13, max_size:time 172080, max_size:out-spatial 109, mem_usage:cuda:0 5.6GB, 0.448 sec/step
ep 1 train, step 98, ctc_4 4.644, ctc_8 5.009, ctc 4.551, num_seqs 11, max_size:time 212609, max_size:out-spatial 115, mem_usage:cuda:2 5.6GB, 0.458 sec/step
cn-255:660809:660809 [0] NCCL INFO Bootstrap : Using enp5s0:10.6.9.55<0>
cn-255:660809:660809 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
cn-255:660809:660809 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
cn-255:660809:660809 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.1+cuda12.1
MEMORY: main proc python3.11(660810) increased RSS: rss=1.6GB pss=1.2GB uss=1.0GB shared=549.1MB
MEMORY: main proc python3.11(660811) increased RSS: rss=1.6GB pss=1.2GB uss=1.0GB shared=551.5MB
MEMORY: total (main 660810, 2024-07-09, 07:44:51, 21 procs): pss=6.2GB uss=5.9GB
MEMORY: main proc python3.11(660812) increased RSS: rss=1.5GB pss=1.2GB uss=1.0GB shared=549.4MB
cn-255:660809:663643 [0] NCCL INFO NET/IB : No device found.
cn-255:660809:663643 [0] NCCL INFO NET/Socket : Using [0]enp5s0:10.6.9.55<0>
cn-255:660809:663643 [0] NCCL INFO Using network Socket
cn-255:660809:663643 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
cn-255:660809:663643 [0] NCCL INFO NVLS multicast support is not available on dev 0
cn-255:660809:663643 [0] NCCL INFO Channel 00/02 :    0   1   2   3
cn-255:660809:663643 [0] NCCL INFO Channel 01/02 :    0   1   2   3
cn-255:660809:663643 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cn-255:660809:663643 [0] NCCL INFO P2P Chunksize set to 131072
cn-255:660809:663643 [0] NCCL INFO Channel 00 : 0[2000] -> 1[3000] via SHM/direct/direct
cn-255:660809:663643 [0] NCCL INFO Channel 01 : 0[2000] -> 1[3000] via SHM/direct/direct
cn-255:660809:663643 [0] NCCL INFO Connected all rings
cn-255:660809:663643 [0] NCCL INFO Connected all trees
cn-255:660809:663643 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cn-255:660809:663643 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MEMORY: total (main 660811, 2024-07-09, 07:44:51, 21 procs): pss=6.3GB uss=6.0GB
cn-255:660809:663643 [0] NCCL INFO comm 0x1214fd90 rank 0 nranks 4 cudaDev 0 busId 2000 commId 0x86dd52ba5e99125f - Init COMPLETE
ep 1 train, step 99, ctc_4 4.801, ctc_8 4.946, ctc 4.701, num_seqs 11, max_size:time 215425, max_size:out-spatial 116, mem_usage:cuda:2 5.6GB, 2.135 sec/step
ep 1 train, step 99, ctc_4 4.787, ctc_8 4.728, ctc 4.716, num_seqs 13, max_size:time 180401, max_size:out-spatial 106, mem_usage:cuda:3 5.6GB, 2.225 sec/step
ep 1 train, step 99, ctc_4 4.687, ctc_8 4.699, ctc 4.861, num_seqs 12, max_size:time 187969, max_size:out-spatial 110, mem_usage:cuda:1 5.6GB, 2.252 sec/step
ep 1 train, step 99, ctc_4 4.756, ctc_8 4.632, ctc 4.686, num_seqs 12, max_size:time 193520, max_size:out-spatial 114, mem_usage:cuda:0 5.6GB, 2.207 sec/step
MEMORY: total (main 660812, 2024-07-09, 07:44:52, 21 procs): pss=6.2GB uss=5.9GB
ep 1 train, step 100, ctc_4 4.880, ctc_8 4.915, ctc 4.879, num_seqs 15, max_size:time 154224, max_size:out-spatial 109, mem_usage:cuda:3 5.6GB, 0.468 sec/step
...

Log-file at i6: /u/zeyer/setups/combined/2021-05-31/alias/ctc/v6-relPosAttDef-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-maxSeqLenAudio19_5-wd1e_2-lrlin1e_5_295k-featBN-speedpertV2-spm128/train/engine/i6_core.returnn.training.ReturnnTrainingJob.5AWTwj5VHV2P.run.7998996.1

albertz added a commit to rwth-i6/i6_experiments that referenced this issue Jul 9, 2024

albertz (Member, Author) commented Jul 9, 2024

Ah, it's a heisenbug. With CUDA_LAUNCH_BLOCKING=1, the bug does not appear anymore. (Or maybe different hardware? Now running on cn-238, but it's also 4x1080, just as before.)
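
Since CUDA_LAUNCH_BLOCKING=1 changes the timing enough to hide it, a less invasive way to localize the faulting op might be to synchronize explicitly around the suspected code (e.g. the CTC loss call). Just a sketch; step_fn is a placeholder and not the actual RETURNN training-step code:

```python
import torch

def run_step_checked(step_fn, *args, **kwargs):
    """Run one step with explicit synchronization so an asynchronous CUDA error
    is raised here rather than at some later, unrelated call (like the .to(device)
    in the traceback above)."""
    torch.cuda.synchronize()   # make sure all earlier kernels finished cleanly
    out = step_fn(*args, **kwargs)
    torch.cuda.synchronize()   # surface any illegal access caused by step_fn itself
    return out
```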

albertz added a commit to rwth-i6/i6_experiments that referenced this issue Jul 9, 2024