HTR crash on multiple nodes #1573
Please report the name of the instance that has failed to have its metadata (ask someone from the Realm team how to do this) and the logs with
@lightsighter I can help with debugging this
I suspect it is likely to be a Legion issue if we're deleting an instance before it is done being used, but in order to understand that I need to know the name of the instance and what it was used for (region/future/etc), which requires the name of the instance and logging information from a failed run.
@cmelone I think we need to run this with the debug option
@apryakhin I think the idea was to replace this line:
@cmelone I haven't tried compiling the above, but hopefully it'll work. Can you try adding it to your build and running? (I edited the code from the first version - I forgot that fatal logging messages automatically terminate the application now.)
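For readers following along, here is a minimal sketch of the kind of diagnostic logging being discussed, assuming Realm's Logger API is available; the logger category, function name, and message are hypothetical and are not the actual patch:

```cpp
// Minimal sketch only: the logger category ("htr_debug") and the helper
// below are hypothetical, not the actual change referenced above.
#include "realm.h"

// Realm loggers are grouped by category name and enabled at runtime.
Realm::Logger log_htr("htr_debug");

void report_suspect_instance(Realm::RegionInstance inst)
{
  // An error-level message is reported but lets the run continue,
  // so the instance name and logs can still be collected from a failed run.
  log_htr.error() << "suspect instance: " << inst;

  // A fatal-level message terminates the application immediately
  // (the reason the first version of the change was edited).
  // log_htr.fatal() << "suspect instance: " << inst;
}
```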
will do -- thanks
Logs available at
Also got this backtrace from a generic segfault:
Since this only requires two processes, please make a reproducer on sapling.
That is unrelated. Something corrupted the heap; it could be the runtime, but it could also be the application. Whatever went wrong happened way before this backtrace, so it is not interesting.
I got the following segmentation fault while trying to reproduce this issue on sapling:
If you want to take a look, the process is hanging at
That is also a memory corruption that occurred a long time ago and was either caused by the same underlying issue as this one or by something unrelated.
The memory corruption seems to go away after reverting to the commit. I can still reproduce on
There are no obvious commits between that commit and the current top of control replication that would cause an issue with that. I suspect that they are the same bug and something else just perturbed the timing so the failure mode changed. If you want to bisect it, though, you can do that. Or you can just make the reproducer with the commit that you know fails.
At this point, let's debug using
@elliottslaughter, can you please add this issue to #1032?
@cmelone please try the
@lightsighter has this been merged? (Sorry if I'm late, I was without my laptop.) I can't find the branch. The original crash is still coming up on the latest commit of CR.
Which commit are you on? |
Also, are you running with
Pull and try again, and be sure you are on this commit: If it still reproduces after that, then make a reproducer on sapling.
Not seeing this crash anymore, thank you! |
Latest control_replication. 4 ranks, 1 rank per node, with GPUs. The crash is non-deterministic and occurs only with specific test cases of the solver.
I think this is the same crash as #1415. Feel free to move this back there if you'd prefer.
crash:
backtrace:
@elliottslaughter, please add to #1032, thanks!