Realm DMA Crash in Pennant C++ #1803

Open
lightsighter opened this issue Dec 6, 2024 · 3 comments
Labels
bug Realm Issues pertaining to Realm

@lightsighter
Contributor

I observe non-deterministic crashes when running Pennant C++ in debug mode on 4 nodes with 4 GPUs/node. It manifests with different backtraces (running with CUDA_LAUNCH_BLOCKING=1 so I can see exactly which kernel/copy is failing):

#6  0x00007f9a34469859 in __GI_abort () at abort.c:79
#7  0x0000560cac7f5d67 in Realm::Cuda::launch_kernel (func_info=..., params=0x7f9a281f83f0, num_elems=22, stream=0x560cdf48f190)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1105
#8  0x0000560cac7f65a1 in Realm::Cuda::GPU::launch_batch_affine_fill_kernel (this=0x560cdf3045a0, fill_info=0x7f9a281f83f0, dim=2, 
    elem_size=8, volume=22, stream=0x560cdf48f190) at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1172
#9  0x0000560cac84b5da in Realm::Cuda::GPUfillXferDes::progress_xd (this=0x7f9a1b5c8b30, channel=0x560ce28eaf30, work_until=...)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:2083
#10 0x0000560cac8535c0 in Realm::XDQueue<Realm::Cuda::GPUfillChannel, Realm::Cuda::GPUfillXferDes>::do_work (this=0x560ce28eaf68, 
    work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#11 0x0000560cac437566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f9a281f99c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#12 0x0000560cac434f46 in Realm::BackgroundWorkThread::main_loop (this=0x560ce1816e00)
    at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#13 0x0000560cac438d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
    (obj=0x560ce1816e00) at /home/mebauer/legion/runtime/realm/threads.inl:97
#14 0x0000560cac5626ea in Realm::KernelThread::pthread_entry (data=0x560ce1816ea0)
    at /home/mebauer/legion/runtime//realm/threads.cc:854
#15 0x00007f9a3673f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f9a34566353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and also:

#6  0x00007f850d54a859 in __GI_abort () at abort.c:79
#7  0x000055960b4ee5bd in Realm::Cuda::GPUXferDes::progress_xd (this=0x7f84700a9130, channel=0x559621f13a50, work_until=...)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:854
#8  0x000055960b4fda4c in Realm::XDQueue<Realm::Cuda::GPUChannel, Realm::Cuda::GPUXferDes>::do_work (this=0x559621f13a88, 
    work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#9  0x000055960b0e1566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f84fbffd9c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#10 0x000055960b0def46 in Realm::BackgroundWorkThread::main_loop (this=0x559620e3fd50)
    at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#11 0x000055960b0e2d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
    (obj=0x559620e3fd50) at /home/mebauer/legion/runtime/realm/threads.inl:97
#12 0x000055960b20c6ea in Realm::KernelThread::pthread_entry (data=0x559620e3fe70)
    at /home/mebauer/legion/runtime//realm/threads.cc:854
#13 0x00007f850f820609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f850d647353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It crashes in the DMA system on nearly every run; only the location of the crash varies.

To reproduce, download the master branches of Legion and Pennant C++. Modify the Makefile to set DEBUG=1 and enable (uncomment) -DPRECOMPACTED_RECT_POINTS and disable (comment out) -DENABLE_GATHER_COPIES.

After building, use the following script to submit the job with sbatch (note that REALM_FREEZE_ON_ERROR=1 makes processes freeze instead of exiting when they crash, so you will need to kill your jobs explicitly):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=00:30:00

root_dir="$PWD"

export LD_LIBRARY_PATH="$PWD"

export GASNET_PHYSMEM_MAX=16G

export CUDA_LAUNCH_BLOCKING=1
export REALM_FREEZE_ON_ERROR=1

ulimit -S -c 0 # disable core dumps

export LEGION_DEFAULT_ARGS="-ll:gpu 4 -ll:util 2 -ll:bgwork 2 -ll:csize 15000 -ll:fsize 14000 -ll:zsize 1024 -ll:rsize 512 -ll:gsize 0 -gex:obcount 8192 -lg:prof 1 -lg:prof_logfile /home/mebauer/pennant-legion/prof_%.log"

srun -n 4 -N 4 --ntasks-per-node 1 --cpu_bind none "$root_dir/pennant" -f "$root_dir"/test/leblanc/leblanc.pnt -n 16

Submit with sbatch -n 4 -N 4 --exclusive <script_name>.
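Because REALM_FREEZE_ON_ERROR=1 leaves crashed processes frozen, the allocation stays held until the job is cancelled. A minimal cleanup sketch follows, assuming a standard Slurm installation where `squeue` and `scancel` are available. The function name `cancel_jobs_named` is hypothetical and not part of the original report.

```bash
#!/bin/bash
# Hypothetical helper (not from the original report): cancel every job
# owned by the current user whose Slurm job name matches the argument.
cancel_jobs_named() {
    local name="$1" jobid
    # squeue flags: -h drops the header, -u filters by user,
    # -n filters by job name, -o '%i' prints only the job id.
    squeue -h -u "$USER" -n "$name" -o '%i' | while read -r jobid; do
        echo "cancelling job $jobid"
        scancel "$jobid"
    done
}
```

By default sbatch names a job after the submitted script, so `cancel_jobs_named <script_name>` would target the jobs submitted above, provided no `--job-name` was given.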

@apryakhin
Contributor

I am working on this issue now

@apryakhin
Contributor

I am probably missing some key part here, but I am not able to reproduce this off master, either on sapling2 or on my local workstation, following the exact steps provided here.

@lightsighter
Contributor Author

It is non-deterministic and you need all four GPU nodes with four GPUs/node. If you want I can make a version of it that crashes under your account on sapling and give you a frozen process.
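Since the failure is non-deterministic, one way to chase it is to rerun the same command until a run fails. The sketch below is only an illustration: the helper name `run_until_failure` is hypothetical, and it assumes a crashed run exits with a non-zero status, which is not the case while REALM_FREEZE_ON_ERROR=1 is set (a frozen process hangs instead of exiting).

```bash
#!/bin/bash
# Hypothetical helper (not from the original report): rerun a command
# until it fails or until max_runs attempts have completed.
# Returns 1 if some run failed, 0 if every run passed.
run_until_failure() {
    local max_runs="$1" i
    shift
    for ((i = 1; i <= max_runs; i++)); do
        # "$@" is the command to run, passed through unchanged.
        if ! "$@"; then
            echo "run $i of $max_runs failed"
            return 1
        fi
    done
    echo "all $max_runs runs passed"
    return 0
}
```

Inside the batch script, the existing `srun ...` line could be passed as the command, for example `run_until_failure 20 srun -n 4 -N 4 --ntasks-per-node 1 --cpu_bind none "$root_dir/pennant" -f "$root_dir"/test/leblanc/leblanc.pnt -n 16`, where the count of 20 is an arbitrary choice.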
