Shutdown is still messy. Proposal to fix it (probably for Edward to implement?) #192
Comments
Many of these shutdown tasks are covered by Group::leave(), ViewManager::leave(), and the destructors of the two classes. But some cleanup operations are missing, such as sst::lf_destroy() and rdmc::shutdown(). Edward suggests adding them in ViewManager::leave().
I discovered one reason why Derecho processes often end with a segmentation fault when attempting to shut down "cleanly" (as noted in issues #135, #192, etc.): When a node marks itself as failed, the SST will be told to freeze the node's own row (in process_suspicions()), but SST::freeze() will dereference a null pointer if it is called on the local row (res_vec has no entry for the node's own row). The solution is to add a check for row_index == my_index in freeze(), and also to ensure a node shuts itself down more promptly when it detects that it has been marked as "failed" by the rest of the group.
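To make the freeze() fix concrete, here is a minimal sketch of the guard. It is a toy model, not Derecho's actual SST code: ToySST, RowResources, report_failure() and is_frozen() are hypothetical stand-ins, and the only things taken from the description above are the names freeze(), res_vec, row_index and my_index.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a per-row connection resource; not Derecho's real type.
struct RowResources {
    bool frozen = false;
    void report_failure() { frozen = true; }
};

// Toy model of the table: one resource entry per remote row and a null
// entry for the local row, mirroring the situation described above.
class ToySST {
public:
    ToySST(std::size_t num_rows, std::size_t local_index)
            : my_index(local_index), res_vec(num_rows) {
        assert(local_index < num_rows);
        for(std::size_t i = 0; i < num_rows; ++i) {
            if(i != my_index) {
                res_vec[i] = std::make_unique<RowResources>();
            }
        }
    }

    // Returns true if a remote row was frozen, false if the call was a no-op.
    bool freeze(std::size_t row_index) {
        assert(row_index < res_vec.size());
        // The guard: res_vec has no entry for the local row, so skip it
        // instead of dereferencing a null pointer.
        if(row_index == my_index) {
            return false;
        }
        res_vec[row_index]->report_failure();
        return true;
    }

    bool is_frozen(std::size_t row_index) const {
        return res_vec[row_index] && res_vec[row_index]->frozen;
    }

private:
    std::size_t my_index;
    std::vector<std::unique_ptr<RowResources>> res_vec;
};
```

With this check, freezing the local row is a harmless no-op, so a node that marks itself as failed no longer crashes on that path. Shutting the node down promptly once it sees itself marked as failed is a separate change and is not modeled here.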
The current shutdown process in the main branch still prints error messages in a given situation. It turns out that the A better solution could be disabling all heartbeats () on For future improvement:
This is a trivial proposal, but it turns out to be important to our users (notably the AFRL funding people who have supported us for several years).
The issue: If a process is a member of the top-level group and exits "abruptly" but without crashing (for example, by returning from main), our handling is messy and often causes error messages. Users think Derecho is broken. If all members exit, we can be VERY messy right now.
Solution: Leverage C++ destructors for static objects. In the C++ standard, destructors for static objects are called when a program exits normally, that is, by returning from main() or calling std::exit() (abnormal exit won't necessarily do this, but I'm not worried about that case).
We would add a simple class to Derecho:
You can test this... you'll see that for any normal shutdown, the destructor gets called.
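A minimal sketch of such a class, for illustration only. The names ExitHook, hook_runs and check_hook_runs are hypothetical and not part of Derecho; the counter exists only so the behavior can be checked.

```cpp
#include <cassert>
#include <cstdio>

// Counts destructor runs so the behavior can be observed (demo only).
int hook_runs = 0;

// Hypothetical name; any class with a non-trivial destructor would do.
class ExitHook {
public:
    ~ExitHook() {
        // For the static instance below, this runs during normal termination:
        // return from main() or a call to std::exit(). It does not run after
        // std::abort(), std::_Exit(), or std::quick_exit().
        ++hook_runs;
        std::puts("ExitHook destructor ran");
    }
};

// Static storage duration: destroyed after main() returns.
static ExitHook exit_hook;

// Quick self-check: destroying a scoped instance must bump the counter.
void check_hook_runs() {
    int before = hook_runs;
    { ExitHook scoped; }
    assert(hook_runs == before + 1);
}
```

Linking this into a program and returning from main() should print the message once more at exit, from the static instance.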
Then we offer people a derecho::detach() and a derecho::shutdown() API:
As you can see, my proposal is that the destructor in this static object will cause derecho::shutdown to be called automatically if you didn't call derecho::detach.
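The interaction between the static object and the two calls could be sketched roughly as follows. Everything here is an assumption for illustration: the namespace derecho_sketch, the ShutdownGuard class, the detached flag and the shutdown_calls counter are hypothetical, and shutdown() is a placeholder that does not model the 2-phase protocol.

```cpp
#include <atomic>
#include <cassert>

namespace derecho_sketch {  // hypothetical; not the real derecho namespace

std::atomic<bool> detached{false};
std::atomic<bool> shutdown_in_progress{false};
std::atomic<int> shutdown_calls{0};  // demo-only counter

// Placeholder for the real shutdown logic; the phases are not modeled.
void shutdown() {
    // exchange() returns the previous value, so only the first caller proceeds.
    if(shutdown_in_progress.exchange(true)) {
        return;
    }
    ++shutdown_calls;
}

// The application calls this to opt out of the automatic shutdown.
void detach() {
    detached.store(true);
}

class ShutdownGuard {
public:
    ~ShutdownGuard() {
        if(!detached.load()) {
            shutdown();
            assert(shutdown_in_progress.load());
        }
    }
};

// Destroyed at normal program exit, which triggers shutdown() unless
// detach() was called first.
static ShutdownGuard guard;

}  // namespace derecho_sketch
```

Making shutdown() idempotent means an explicit call by the application followed by the automatic call from the destructor is harmless.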
derecho::shutdown would use the simple 2-phase approach:
Additionally:
6) Inhibit all aspects of Edward's new view logic once shutdown_in_progress is true: We don't want to mess up our logs at the very last moment.
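Point 6 could be as simple as an early return in the view-change path. A rough sketch, assuming a single flag named shutdown_in_progress as in the text; on_view_change() and view_log are hypothetical stand-ins and do not correspond to actual ViewManager members.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

namespace view_sketch {  // hypothetical namespace for this illustration

std::atomic<bool> shutdown_in_progress{false};

// Stand-in for whatever the view logic would persist.
std::vector<std::string> view_log;

// Hypothetical entry point for new-view handling.
// Returns true if the view change was processed, false if it was inhibited.
bool on_view_change(const std::string& description) {
    assert(!description.empty());
    // Once shutdown has started, do nothing: no log writes at the last moment.
    if(shutdown_in_progress.load()) {
        return false;
    }
    view_log.push_back(description);
    return true;
}

}  // namespace view_sketch
```

A view change that arrives after the flag is set is dropped, so the log keeps the last state written before shutdown began.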