Deterministic error affecting HTR #1466

Closed · Tracked by #1032
cmelone opened this issue May 4, 2023 · 10 comments

@cmelone commented May 4, 2023

We have been seeing one of our test cases deterministically fail on Lassen in the last week or so with the following error message:

...
prometeo_variables.cu:68: void UpdatePropertiesFromPrimitive_kernel(Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, MyArray<double, 1>, 3, long long, Realm::AffineAccessor<MyArray<double, 1>, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, MyArray<double, 3>, 3, long long, Realm::AffineAccessor<MyArray<double, 3>, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)268435458, MyArray<double, 1>, 3, long long, Realm::AffineAccessor<MyArray<double, 1>, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)268435458, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)268435458, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)268435458, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)268435458, MyArray<double, 1>, 3, long long, Realm::AffineAccessor<MyArray<double, 1>, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)268435458, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Realm::Rect<3, long long>, long long, long long, long long): block: [0,3,14], thread: [0,2,0] Assertion `mix.CheckMixture(MolarFracs[p])` failed.
...
prometeo_ConstPropMix.exec: /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/cuda/cuda_module.cc:340: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.

I believe this is the relevant backtrace:

Thread 70 (Thread 0x20005ba9f8b0 (LWP 119373)):
#0  0x0000200002abeb88 in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000200002abe8bc in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x0000200000fa8e98 in Realm::realm_freeze (signal=<optimized out>) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/runtime_impl.cc:183
#3  <signal handler called>
#4  0x0000200002a1fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#5  0x0000200002a2200c in __GI_abort () at abort.c:90
#6  0x0000200002a157d4 in __assert_fail_base (fmt=0x200002b7b7d0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x200001b287e0 "0", 
    file=0x200001b47280 "/usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/cuda/cuda_module.cc", line=<optimized out>, function=<optimized out>) at assert.c:92
#7  0x0000200002a158c4 in __GI___assert_fail (assertion=0x200001b287e0 "0", file=0x200001b47280 "/usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/cuda/cuda_module.cc", 
    line=<optimized out>, function=0x200001b45d40 <Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit)::__PRETTY_FUNCTION__> "bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit)")
    at assert.c:101
#8  0x00002000013063c0 in Realm::Cuda::GPUStream::reap_events (this=0x11abf450, work_until=...) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/cuda/cuda_module.cc:340
#9  0x000020000130c9d4 in Realm::Cuda::GPUWorker::do_work (this=0x1261c230, work_until=...) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/cuda/cuda_module.cc:2218
#10 0x0000200000fc9ff4 in Realm::BackgroundWorkManager::Worker::do_work (this=0x20005ba9eb58, max_time_in_ns=-1, interrupt_flag=0x0)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/timers.inl:255
#11 0x0000200000fca8b0 in Realm::BackgroundWorkThread::main_loop (this=0x17728340) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/bgwork.cc:125
#12 0x00002000010c0684 in Realm::KernelThread::pthread_entry (data=0x17728d80) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/threads.cc:781
#13 0x0000200002c08cd4 in start_thread (arg=0x20005ba9f8b0) at pthread_create.c:309
#14 0x0000200002b07f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

Legion was built with CXXFLAGS="-g -O2" in release mode. The run uses 1 node, 1 rank per node, with control_replication.

This execution last succeeded on cb61755 and began failing no later than d1ecc4b.

This may not be relevant, but I have only been able to reproduce this on a POWER9 machine, not on an Intel cluster.

Also, could this be added to #1032? Thanks

@streichler

This looks like an application-level assert? Can we determine what about MolarFracs[p] is unexpected enough to cause the assert to fire?
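
Presumably a check like this verifies that the molar-fraction vector is physically sensible, something along the lines of the sketch below (purely illustrative, not HTR's actual code):

// Illustrative guess (not HTR's actual code) at what a check like
// mix.CheckMixture(MolarFracs[p]) enforces: every molar fraction is finite
// and non-negative, and the fractions sum to ~1.
#include <cmath>

inline bool CheckMixture(const double *Xi, int nSpec, double tol = 1e-3) {
  double sum = 0.0;
  for (int i = 0; i < nSpec; i++) {
    if (!(Xi[i] >= 0.0)) return false;  // rejects negative values and NaNs
    sum += Xi[i];
  }
  return std::fabs(sum - 1.0) < tol;
}

Under that reading, the assert firing would mean the task received fractions that are NaN, negative, or no longer normalized.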

@cmelone commented May 4, 2023

I believe it asserts when a task has received bad data. I've re-run the problem and it succeeds on cb61755. Two commits later (d1ecc4b), it fails. No modifications have been made to HTR's codebase in the meantime.

Additionally, this error has come up non-deterministically for the same problem. Not sure if it's related:

assertion failed: FastInterpInitData: something wrong in the input region

backtrace:

(gdb) bt
#0  0x0000200002abeb88 in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000200002abe8bc in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x0000200000fa8e98 in Realm::realm_freeze (signal=<optimized out>) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/runtime_impl.cc:183
#3  <signal handler called>
#4  0x0000200002a1fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#5  0x0000200002a2200c in __GI_abort () at abort.c:90
#6  0x00000000100781b8 in $<FastInterpInitData> ()
#7  0x000000001007726c in $__regent_task_FastInterpInitData_primary ()
#8  0x000020000124818c in Realm::LocalTaskProcessor::execute_task (this=0x1881ac50, func_id=<optimized out>, task_args=...)
    at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/bytearray.inl:150
#9  0x00002000010e37e4 in Realm::Task::execute_on_processor (this=0x2000c08b45b0, p=...) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/runtime_impl.h:399
#10 0x00002000010e3994 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/tasks.cc:1632
#11 0x00002000010e6a34 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x1881aef0) at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/tasks.cc:1103
#12 0x00002000010c6e58 in Realm::UserThread::uthread_entry () at /usr/WS1/stanf_ci/psaap-ci/artifacts/1167868/legion/runtime/realm/threads.cc:1355
#13 0x0000200002a3388c in makecontext () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/makecontext.S:137
#14 0x0000000000000000 in ?? ()

@streichler

@lightsighter the commit range that @cmelone is referring to looks to be the merge of the collectiveup branch. Does that suggest any promising debug paths to follow?

@lightsighter

Not unless they are using collective copies/reductions (I would be a bit surprised if HTR is doing that). I did fix a bug related to that branch today, but it will only matter if you have a collective reduction.

@cmelone commented May 5, 2023

After re-compiling Legion this morning with the new changes and running the problem again, it seems to succeed now. Thanks

cmelone closed this as completed May 5, 2023
@lightsighter

That's interesting; it means you are in fact using collective reduction copies.

@cmelone commented May 5, 2023

Yeah, @mariodirenzo was wondering about that too.

@elliottslaughter

You can end up in this situation if you write code like:

for i in ... do
   my_task(r) -- reduces+(r)
end

Where my_task does a reduction on r. My understanding is that the default mapper turns these into collective instances.
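
For reference, the rough equivalent through the C++ runtime API is something like the following sketch (MY_TASK_ID, FID_X, and REDOP_SUM_F64 are placeholders for whatever the application actually registers):

// Each iteration launches my_task with a reduction privilege on the same
// logical region r; this is the pattern that the mapper can turn into
// collective reduction instances.
#include "legion.h"
using namespace Legion;

// Placeholder IDs the application would normally register.
enum { MY_TASK_ID = 1, FID_X = 1, REDOP_SUM_F64 = 1 };

void launch_loop(Runtime *runtime, Context ctx, LogicalRegion r, int num_iters)
{
  for (int i = 0; i < num_iters; i++) {
    TaskLauncher launcher(MY_TASK_ID, TaskArgument(NULL, 0));
    launcher.add_region_requirement(
        RegionRequirement(r, REDOP_SUM_F64, EXCLUSIVE, r));
    launcher.add_field(0, FID_X);
    runtime->execute_task(ctx, launcher);
  }
}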

Otherwise I'm not sure how you'd be impacted by this.

@lightsighter

Yeah, but HTR doesn't use the default mapper.

@mariodirenzo

The mapper of HTR is derived from the default mapper, so I think @elliottslaughter is correct: we are inheriting that aspect.
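
Concretely, the setup is roughly the following (a simplified sketch, not the actual HTR mapper; the real class name and overrides differ):

// Sketch of a mapper class inheriting from Legion's Default Mapper.
// Any callback that is not overridden, including the instance-creation
// policy for reduction launches, falls through to DefaultMapper.
#include "legion.h"
#include "default_mapper.h"

class HTRStyleMapper : public Legion::Mapping::DefaultMapper {
public:
  HTRStyleMapper(Legion::Mapping::MapperRuntime *rt,
                 Legion::Machine machine,
                 Legion::Processor local)
    : DefaultMapper(rt, machine, local, "htr_style_mapper") { }

  // Application-specific overrides (task placement, instance layout, ...)
  // would go here; everything else keeps the DefaultMapper behavior.
};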
