Streamlining the destructors in the UCX API #498
Conversation
446085b to 8bd80ab (compare)
Running with this PR I am seeing the following error:
This is in the middle of a workflow.
Apologies, this is with a 4-node IB stress test: https://gist.github.com/quasiben/73cf6d7c2131fe41370014ddc21ecb56. Perhaps this is still the bug that @pentschev, myself, and others have been tracking.
I've done some testing with this and I have some cautiously optimistic news:
All tests above were on a 4 node/32 GPU cluster. I'm cautiously optimistic because IB was previously hanging and crashing most of the time, and I haven't been able to reproduce that in any of these runs. The number of runs is still not enough to be sure our bugs have been fixed, and the hangs with IB+NVLink also point to some remaining issues. Finally, just as the merge operation starts, I still see the following error several times when new endpoints are being created for worker-worker connections:

future: <Task finished name='Task-1578' coro=<_listener_handler() done, defined at /datasets/pentschev/miniconda3/envs/pydbg/lib/python3.8/site-packages/ucp/core.py:116> exception=ValueError('Both peers must set guarantee_msg_order identically')>
    raise ValueError("Both peers must set guarantee_msg_order identically")
ValueError: Both peers must set guarantee_msg_order identically

For @VibhuJawa and @beckernick, please mind the NVLink+IB hang if you're testing. If you have the chance, a test with IB only (NVLink disabled) would be useful as well, to see whether you experience any hangs or segfaults.
I intend to do some more testing tomorrow to confirm whether I can still reproduce the IB errors.
Thanks Mads! Happy to see these cleanup improvements. 😄
Made some suggestions below. Where possible I tried to include a code suggestion for simplicity.
# Close the endpoint
# TODO: Support UCP_EP_CLOSE_MODE_FORCE
status = ucp_ep_close_nb(handle, UCP_EP_CLOSE_MODE_FLUSH)
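For context, the usual way this non-blocking close is driven to completion: ucp_ep_close_nb() returns a request pointer that must be progressed on the owning worker until it finishes, then released. The sketch below only illustrates that pattern; the extern declarations and the helper name are assumptions, not this PR's actual code.

# A rough sketch (not this PR's code) of completing a flush-mode close.
cdef extern from "ucp/api/ucp.h":
    ctypedef void* ucp_ep_h
    ctypedef void* ucp_worker_h
    ctypedef void* ucs_status_ptr_t
    ctypedef int ucs_status_t
    int UCP_EP_CLOSE_MODE_FLUSH
    int UCS_INPROGRESS
    ucs_status_ptr_t ucp_ep_close_nb(ucp_ep_h ep, unsigned mode)
    unsigned ucp_worker_progress(ucp_worker_h worker)
    ucs_status_t ucp_request_check_status(void* request)
    void ucp_request_free(void* request)
    bint UCS_PTR_IS_PTR(ucs_status_ptr_t ptr)
    bint UCS_PTR_IS_ERR(ucs_status_ptr_t ptr)

cdef void close_endpoint_blocking(ucp_ep_h ep, ucp_worker_h worker) except *:
    # Flush outstanding operations, then tear the endpoint down.
    cdef ucs_status_ptr_t req = ucp_ep_close_nb(ep, UCP_EP_CLOSE_MODE_FLUSH)
    if UCS_PTR_IS_ERR(req):
        raise RuntimeError("ucp_ep_close_nb() failed")
    if UCS_PTR_IS_PTR(req):
        # The close is still in flight: keep progressing the owning worker
        # until the request completes, then release the request object.
        while ucp_request_check_status(req) == UCS_INPROGRESS:
            ucp_worker_progress(worker)
        ucp_request_free(req)
    # A NULL return means the close completed immediately.

A forced close (UCP_EP_CLOSE_MODE_FORCE, the TODO above) would instead release the endpoint without flushing outstanding operations.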
Can we file an issue about this?
Added #505
Co-Authored-By: jakirkham <[email protected]>
Thanks Mads! Looks much nicer 😄
Had a couple follow-up comments on UCXEndpoint so that we can use __cinit__ as is typical.
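For reference, the typical Cython pattern being alluded to: __cinit__ runs exactly once, before __init__, with the C-level fields already zero-initialized, so low-level allocation belongs there and __dealloc__ can always clean up safely. The class below is a generic sketch of that pattern; the names are illustrative and not taken from the actual UCXEndpoint code.

from libc.stdlib cimport malloc, free

cdef class HandleWrapper:
    # Generic sketch of the usual __cinit__/__dealloc__ pairing; the names
    # here are illustrative, not taken from ucx-py.
    cdef void* handle

    def __cinit__(self):
        # __cinit__ is guaranteed to run exactly once, before __init__,
        # so the C-level handle is always in a known state.
        self.handle = malloc(64)
        if self.handle == NULL:
            raise MemoryError()

    def __dealloc__(self):
        # Safe even if __init__ never ran or raised.
        if self.handle != NULL:
            free(self.handle)
            self.handle = NULL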
I did more testing today and the results were very similar to what I observed yesterday. Having NVLink enabled makes the cluster more likely to hang, while IB was (almost) hang-free; it hung once in some 30 runs. I also did various runs with larger data -- multiplying the sizes of the workflow in #402 (comment) by 8 -- which takes roughly 4 minutes to create the data and another 11-12 minutes for the merge to complete, and they all seemed to behave the same as the original size: hanging sometimes with NVLink and passing with IB. Due to the long wait between runs, I did only 3 successful runs of each, during which I observed 2 or 3 hangs for NVLink (totalling 5-6 runs in that case). I am still very suspicious of what happens at Line 56 in 37e4445
How do folks generally feel about merging this PR, given that other work now depends on it?
I made a few minor suggestions above that would be nice to see addressed first.
Definitely agreed -- I wanted to get a sense of where folks were at.
Yeah, after that's addressed I'm +1 on merging.
I'm +1 on merging this as well.
Co-Authored-By: jakirkham <[email protected]>
@pentschev I think you are on to something: #506!
Thank you @Mads and @jakirkham for the reviews.
LGTM, any final comments before we merge, @jakirkham?
Thanks Mads for the PR! Also thanks Peter and Ben for the reviews!
This PR makes sure that all calls into UCX use valid UCX handles, and streamlines the destruction of UCX handles.