I observe non-deterministic crashes when running Pennant C++ in debug mode on 4 nodes with 4 GPUs/node. It manifests with different backtraces (running with CUDA_LAUNCH_BLOCKING=1 so I can see exactly which kernel/copy is failing):
#6 0x00007f9a34469859 in __GI_abort () at abort.c:79
#7 0x0000560cac7f5d67 in Realm::Cuda::launch_kernel (func_info=..., params=0x7f9a281f83f0, num_elems=22, stream=0x560cdf48f190)
at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1105
#8 0x0000560cac7f65a1 in Realm::Cuda::GPU::launch_batch_affine_fill_kernel (this=0x560cdf3045a0, fill_info=0x7f9a281f83f0, dim=2,
elem_size=8, volume=22, stream=0x560cdf48f190) at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1172
#9 0x0000560cac84b5da in Realm::Cuda::GPUfillXferDes::progress_xd (this=0x7f9a1b5c8b30, channel=0x560ce28eaf30, work_until=...)
at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:2083
#10 0x0000560cac8535c0 in Realm::XDQueue<Realm::Cuda::GPUfillChannel, Realm::Cuda::GPUfillXferDes>::do_work (this=0x560ce28eaf68,
work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#11 0x0000560cac437566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f9a281f99c0, max_time_in_ns=-1,
interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#12 0x0000560cac434f46 in Realm::BackgroundWorkThread::main_loop (this=0x560ce1816e00)
at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#13 0x0000560cac438d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
(obj=0x560ce1816e00) at /home/mebauer/legion/runtime/realm/threads.inl:97
#14 0x0000560cac5626ea in Realm::KernelThread::pthread_entry (data=0x560ce1816ea0)
at /home/mebauer/legion/runtime//realm/threads.cc:854
#15 0x00007f9a3673f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f9a34566353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
and also:
#6 0x00007f850d54a859 in __GI_abort () at abort.c:79
#7 0x000055960b4ee5bd in Realm::Cuda::GPUXferDes::progress_xd (this=0x7f84700a9130, channel=0x559621f13a50, work_until=...)
at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:854
#8 0x000055960b4fda4c in Realm::XDQueue<Realm::Cuda::GPUChannel, Realm::Cuda::GPUXferDes>::do_work (this=0x559621f13a88,
work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#9 0x000055960b0e1566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f84fbffd9c0, max_time_in_ns=-1,
interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#10 0x000055960b0def46 in Realm::BackgroundWorkThread::main_loop (this=0x559620e3fd50)
at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#11 0x000055960b0e2d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
(obj=0x559620e3fd50) at /home/mebauer/legion/runtime/realm/threads.inl:97
#12 0x000055960b20c6ea in Realm::KernelThread::pthread_entry (data=0x559620e3fe70)
at /home/mebauer/legion/runtime//realm/threads.cc:854
#13 0x00007f850f820609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f850d647353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
It crashes in the DMA system nearly every time; only the crash location varies.
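Because the abort site differs from run to run, it can help to reduce each backtrace to its first Realm frame before comparing crashes. The following is a minimal Python sketch, assuming gdb-style frame lines like the ones in the traces above. The helper name `first_realm_frame` is hypothetical and is not part of Legion or Realm.

```python
import re
from typing import Optional

# Matches the start of a gdb frame line, for example
#   "#7 0x0000560cac7f5d67 in Realm::Cuda::launch_kernel (func_info=..."
# and captures the function name that precedes the argument list.
# Frames whose argument list wraps onto the next line are not matched.
FRAME_RE = re.compile(
    r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s+\(",
    re.MULTILINE,
)


def first_realm_frame(backtrace: str) -> Optional[str]:
    """Return the innermost frame in the Realm namespace, or None."""
    for match in FRAME_RE.finditer(backtrace):
        func = match.group(1)
        if func.startswith("Realm::"):
            return func
    return None


if __name__ == "__main__":
    # Two frames copied from the first backtrace in this report.
    sample = (
        "#6 0x00007f9a34469859 in __GI_abort () at abort.c:79\n"
        "#7 0x0000560cac7f5d67 in Realm::Cuda::launch_kernel "
        "(func_info=..., params=0x7f9a281f83f0, num_elems=22, "
        "stream=0x560cdf48f190)\n"
    )
    print(first_realm_frame(sample))
```

Applied to the two traces in this report, this separates the fill path (`Realm::Cuda::launch_kernel`) from the copy path (`Realm::Cuda::GPUXferDes::progress_xd`).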
To reproduce, download the master branches of Legion and Pennant C++. Modify the Makefile to set DEBUG=1 and enable (uncomment) -DPRECOMPACTED_RECT_POINTS and disable (comment out) -DENABLE_GATHER_COPIES.
After building, use the following script to submit jobs with sbatch (note that REALM_FREEZE_ON_ERROR=1 means processes will freeze on a crash, so you'll need to kill your jobs explicitly):
I am probably missing some key step here, but I am not able to reproduce this off master, either on sapling2 or on my local workstation, even when following the exact steps provided here.
It is non-deterministic, and you need all four GPU nodes with four GPUs/node. If you want, I can make a version of it that crashes under your account on sapling and give you a frozen process.
Submit the launch script with sbatch -n 4 -N 4 --exclusive <script_name>.