Realm DMA Crash in Pennant C++ #1803

lightsighter · 2024-12-06T00:47:23Z

I observe non-deterministic crashes when running Pennant C++ in debug mode on 4 nodes with 4 GPUs/node. The crash shows up with different backtraces (I am running with CUDA_LAUNCH_BLOCKING=1 so I can see exactly which kernel/copy is failing):

#6  0x00007f9a34469859 in __GI_abort () at abort.c:79
#7  0x0000560cac7f5d67 in Realm::Cuda::launch_kernel (func_info=..., params=0x7f9a281f83f0, num_elems=22, stream=0x560cdf48f190)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1105
#8  0x0000560cac7f65a1 in Realm::Cuda::GPU::launch_batch_affine_fill_kernel (this=0x560cdf3045a0, fill_info=0x7f9a281f83f0, dim=2, 
    elem_size=8, volume=22, stream=0x560cdf48f190) at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1172
#9  0x0000560cac84b5da in Realm::Cuda::GPUfillXferDes::progress_xd (this=0x7f9a1b5c8b30, channel=0x560ce28eaf30, work_until=...)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:2083
#10 0x0000560cac8535c0 in Realm::XDQueue<Realm::Cuda::GPUfillChannel, Realm::Cuda::GPUfillXferDes>::do_work (this=0x560ce28eaf68, 
    work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#11 0x0000560cac437566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f9a281f99c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#12 0x0000560cac434f46 in Realm::BackgroundWorkThread::main_loop (this=0x560ce1816e00)
    at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#13 0x0000560cac438d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
    (obj=0x560ce1816e00) at /home/mebauer/legion/runtime/realm/threads.inl:97
#14 0x0000560cac5626ea in Realm::KernelThread::pthread_entry (data=0x560ce1816ea0)
    at /home/mebauer/legion/runtime//realm/threads.cc:854
#15 0x00007f9a3673f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f9a34566353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and also:

#6  0x00007f850d54a859 in __GI_abort () at abort.c:79
#7  0x000055960b4ee5bd in Realm::Cuda::GPUXferDes::progress_xd (this=0x7f84700a9130, channel=0x559621f13a50, work_until=...)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:854
#8  0x000055960b4fda4c in Realm::XDQueue<Realm::Cuda::GPUChannel, Realm::Cuda::GPUXferDes>::do_work (this=0x559621f13a88, 
    work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#9  0x000055960b0e1566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f84fbffd9c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#10 0x000055960b0def46 in Realm::BackgroundWorkThread::main_loop (this=0x559620e3fd50)
    at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#11 0x000055960b0e2d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
    (obj=0x559620e3fd50) at /home/mebauer/legion/runtime/realm/threads.inl:97
#12 0x000055960b20c6ea in Realm::KernelThread::pthread_entry (data=0x559620e3fe70)
    at /home/mebauer/legion/runtime//realm/threads.cc:854
#13 0x00007f850f820609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f850d647353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It crashes in the DMA system on pretty much every run; only the location of the crash varies.

To reproduce, download the master branches of Legion and Pennant C++. Modify the Makefile to set DEBUG=1 and enable (uncomment) -DPRECOMPACTED_RECT_POINTS and disable (comment out) -DENABLE_GATHER_COPIES.
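The two define toggles could be scripted along these lines. This is only a sketch under a loud assumption: that each define sits on its own Makefile line and that a disabled define is simply prefixed with `#`. The actual Pennant C++ Makefile layout may differ, and `enable_flag` / `disable_flag` are hypothetical helper names, not part of Legion or Pennant. DEBUG=1 is left to be set by hand.

```shell
# Hypothetical helpers; they assume one define per line and '#' as the comment prefix.
# Each call leaves a backup of the edited file next to it with a .bak suffix.

# Uncomment every line that mentions the given flag.
enable_flag() {
  sed -i.bak -E "/$1/ s/^([[:space:]]*)#+[[:space:]]*/\1/" "$2"
}

# Comment out every line that mentions the given flag and is not already commented.
disable_flag() {
  sed -i.bak -E "/$1/ { /^[[:space:]]*#/! s/^/#/; }" "$2"
}

# Example (run from the Pennant C++ source directory):
#   enable_flag  "-DPRECOMPACTED_RECT_POINTS" Makefile
#   disable_flag "-DENABLE_GATHER_COPIES"     Makefile
```

Check the result against the backup (for example with diff Makefile.bak Makefile) before building, since the layout assumption above may not hold.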

After building, submit jobs with sbatch using the following script (note that REALM_FREEZE_ON_ERROR=1 means processes freeze when they crash, so you'll need to kill your jobs explicitly):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=00:30:00

root_dir="$PWD"

export LD_LIBRARY_PATH="$PWD"

export GASNET_PHYSMEM_MAX=16G

export CUDA_LAUNCH_BLOCKING=1
export REALM_FREEZE_ON_ERROR=1

ulimit -S -c 0 # disable core dumps

export LEGION_DEFAULT_ARGS="-ll:gpu 4 -ll:util 2 -ll:bgwork 2 -ll:csize 15000 -ll:fsize 14000 -ll:zsize 1024 -ll:rsize 512 -ll:gsize 0 -gex:obcount 8192 -lg:prof 1 -lg:prof_logfile /home/mebauer/pennant-legion/prof_%.log"

srun -n 4 -N 4 --ntasks-per-node 1 --cpu_bind none "$root_dir/pennant" -f "$root_dir"/test/leblanc/leblanc.pnt -n 16

Submit with sbatch -n 4 -N 4 --exclusive <script_name>.
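Since the crash is non-deterministic, a single run may pass. A minimal sketch of a retry helper follows. It assumes REALM_FREEZE_ON_ERROR is left unset, so that a crash ends the run with a non-zero exit status instead of freezing the processes; `run_until_failure` is a hypothetical name, not something provided by Legion or Slurm.

```shell
# run_until_failure MAX CMD [ARGS...]
# Runs CMD up to MAX times and stops at the first non-zero exit status.
# Returns 1 if a failing attempt was seen, 0 if every attempt passed.
run_until_failure() {
  local max attempt status
  max="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
      echo "attempt ${attempt} failed with exit status ${status}"
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "no failure in ${max} attempts"
  return 0
}

# Example, replacing the bare srun line in the batch script:
#   run_until_failure 20 srun -n 4 -N 4 --ntasks-per-node 1 --cpu_bind none \
#     "$root_dir/pennant" -f "$root_dir"/test/leblanc/leblanc.pnt -n 16
```

With REALM_FREEZE_ON_ERROR=1 still exported, a crashed run would hang rather than return, so the loop would stall on the first failure until the job is killed or times out.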


apryakhin · 2024-12-11T18:51:16Z

I am working on this issue now.

apryakhin · 2024-12-17T15:59:38Z

I am probably missing some key part here, but I am not able to reproduce this off master, either on sapling2 or on a local workstation, following the exact steps provided here.

lightsighter · 2024-12-17T18:25:29Z

It is non-deterministic, and you need all four GPU nodes with four GPUs/node. If you want, I can make a version of it that crashes under your account on sapling and give you a frozen process.

lightsighter assigned apryakhin Dec 6, 2024

lightsighter added the bug and Realm (Issues pertaining to Realm) labels Dec 6, 2024

lightsighter mentioned this issue Dec 6, 2024

Realm: Profiling breaks with CUPTI #1800

Open

lightsighter mentioned this issue Dec 13, 2024

Realm Gather Copy Hang #1802

Closed
