Non-deterministic segmentation fault #1632

Closed
Tracked by #1032
mariodirenzo opened this issue Feb 3, 2024 · 4 comments

@mariodirenzo

Some of my multi-node executions fail randomly with a segmentation fault that has the following backtrace:

#0  0x000020000beaeb88 in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x000020000beae8bc in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x0000000013f45718 in Realm::realm_freeze (signal=<optimized out>) at /g/g92/direnzo1/legion/runtime/realm/runtime_impl.cc:204
#3  <signal handler called>
#4  0x66664f2200000000 in ?? ()
#5  0x0000000013c02e00 in Legion::Internal::GatherCollective::perform_collective_async (this=0x18828038, precondition=...) at /g/g92/direnzo1/legion/runtime/legion/legion_replication.cc:12029
#6  0x0000000013c00968 in Legion::Internal::ShardCollective::handle_deferred_collective (args=<optimized out>) at /g/g92/direnzo1/legion/runtime/legion/legion_replication.cc:11811
#7  0x000000001356c554 in Legion::Internal::Runtime::legion_runtime_task (args=0x20408a69cdc0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/direnzo1/legion/runtime/legion/runtime.cc:32671
#8  0x0000000013f274d8 in Realm::LocalTaskProcessor::execute_task (this=0x3a8fd840, func_id=<optimized out>, task_args=...) at /g/g92/direnzo1/legion/runtime/realm/bytearray.inl:150
#9  0x0000000013f8ee78 in Realm::Task::execute_on_processor (this=0x20408a69cc40, p=...) at /g/g92/direnzo1/legion/runtime/realm/bytearray.inl:39
#10 0x0000000013f8efe4 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/direnzo1/legion/runtime/realm/tasks.cc:1421
#11 0x0000000013f8cda8 in Realm::ThreadedTaskScheduler::scheduler_loop (this=this@entry=0x3a8fdb90) at /g/g92/direnzo1/legion/runtime/realm/tasks.cc:1158
#12 0x0000000013f923a4 in scheduler_loop_wlock (this=0x3a8fdb90) at /g/g92/direnzo1/legion/runtime/realm/tasks.cc:1272
#13 Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x3a8fdb90) at /g/g92/direnzo1/legion/runtime/realm/threads.inl:97
#14 0x0000000013f97540 in Realm::KernelThread::pthread_entry (data=0x20408a76cc20) at /g/g92/direnzo1/legion/runtime/realm/threads.cc:831
#15 0x0000200000128cd4 in start_thread (arg=0x2000fd04f8b0) at pthread_create.c:309
#16 0x000020000bef7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

I am using Legion at commit ec0c8500ed8491c8122fc83319e824e026e0f95b, compiled in release mode with debug symbols. I have been able to reproduce this bug by running HTR on 4 nodes for a few hours.

@elliottslaughter, can you please add this issue to #1032?

@lightsighter
Contributor

Try this patch and report back if it fixes the issue:

diff --git a/runtime/legion/legion_replication.cc b/runtime/legion/legion_replication.cc
index 760f76dac..6d0f5a8c3 100644
--- a/runtime/legion/legion_replication.cc
+++ b/runtime/legion/legion_replication.cc
@@ -11979,7 +11979,7 @@ namespace Legion {
         received_notifications(0)
     //--------------------------------------------------------------------------
     {
-      if (expected_notifications > 1)
+      //if (expected_notifications > 1)
         done_event = Runtime::create_rt_user_event();
     }
 
@@ -11992,7 +11992,7 @@ namespace Legion {
         received_notifications(0)
     //--------------------------------------------------------------------------
     {
-      if (expected_notifications > 1)
+      //if (expected_notifications > 1)
         done_event = Runtime::create_rt_user_event();
     }
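
For readers skimming the diff: both hunks comment out the `expected_notifications > 1` guard, so `done_event` is now created unconditionally in the two constructors. The sketch below is a toy model of that change only, not Legion code. `ToyEvent`, `create_toy_event`, and `ToyGather` are hypothetical names invented for illustration, and the thread does not state the root cause of the crash, so this shows what the patch changes rather than why it helps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a runtime event handle; id 0 means "no event".
struct ToyEvent {
  std::uint64_t id;
  explicit ToyEvent(std::uint64_t i = 0) : id(i) {}
  bool exists() const { return id != 0; }
};

// Hypothetical factory that hands out fresh, non-zero event ids.
inline ToyEvent create_toy_event() {
  static std::uint64_t next_id = 1;
  return ToyEvent(next_id++);
}

// Toy collective modelling the two construction policies in the diff.
struct ToyGather {
  std::size_t expected_notifications;
  std::size_t received_notifications;
  ToyEvent done_event;

  ToyGather(std::size_t expected, bool always_create)
      : expected_notifications(expected), received_notifications(0) {
    // always_create == false: guarded creation, as before the patch.
    // always_create == true:  unconditional creation, as after the patch.
    if (always_create || expected_notifications > 1)
      done_event = create_toy_event();
  }

  // Record one notification; returns true once all have arrived.
  bool notify() {
    assert(received_notifications < expected_notifications);
    return ++received_notifications == expected_notifications;
  }
};
```

With the guard in place, `ToyGather(1, false)` ends up with a `done_event` for which `exists()` is false, while `ToyGather(1, true)` always has a real event. Any code that later relies on `done_event` would see that difference only in the single-notification case.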

@mariodirenzo
Author

I've been running for 12 hours without seeing a segmentation fault. I think the patch fixes the issue.

@lightsighter
Contributor

@mariodirenzo Try the latest control replication branch without the patch and see whether it still runs cleanly. If so, you can close the issue.

@mariodirenzo
Author

This bug has been fixed. Thanks!
