Streamlining the destructors in the UCX API #498

Merged: 20 commits from the ucx_api_cleanup branch into rapidsai:branch-0.14 on Apr 23, 2020

Conversation

madsbk
Member

@madsbk madsbk commented Apr 20, 2020

This PR makes sure that all calls into UCX use valid UCX handles and streamlines the destruction of UCX handles.
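To illustrate the lifetime discipline this refers to, here is a minimal pure-Python sketch (not the PR's actual Cython code; all names below are made up for illustration): every wrapper tracks whether its handle is still valid, every access checks that flag, closing is idempotent, and children are closed before their parent.

# Illustrative sketch only -- not the implementation in ucp/_libs/ucx_api.pyx.
import weakref

class ClosedHandleError(RuntimeError):
    pass

class Handle:
    """Hypothetical wrapper around a native UCX handle."""

    def __init__(self, native, parent=None):
        self._native = native          # stand-in for the real UCX pointer
        self._closed = False
        self._children = weakref.WeakSet()
        if parent is not None:
            parent._children.add(self)

    @property
    def native(self):
        # Every UCX call goes through this guard, so no call can ever be
        # made with a handle that has already been destroyed.
        if self._closed:
            raise ClosedHandleError("handle has already been destroyed")
        return self._native

    def close(self):
        if self._closed:               # idempotent: closing twice is a no-op
            return
        for child in list(self._children):
            child.close()              # children are closed before the parent
        self._closed = True
        self._native = None            # real code would release the UCX handle here

With this shape, closing a worker also invalidates any endpoints created from it, and a late access to endpoint.native fails loudly instead of passing a stale handle into UCX.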

@madsbk madsbk force-pushed the ucx_api_cleanup branch 3 times, most recently from 446085b to 8bd80ab on April 21, 2020 at 13:19
@madsbk madsbk marked this pull request as ready for review April 21, 2020 16:17
@madsbk madsbk requested a review from a team as a code owner April 21, 2020 16:17
@madsbk
Member Author

madsbk commented Apr 21, 2020

cc. @beckernick @VibhuJawa

@quasiben
Member

Running with this PR, I am seeing the following error:

Task exception was never retrieved
future: <Task finished coro=<_listener_handler() done, defined at /gpfs/fs1/bzaitlen/miniconda3/envs/20200417/lib/python3.7/site-packages/ucp/core.py:116> exception=ValueError('Both peers must set guarantee_msg_order identically')>
Traceback (most recent call last):
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20200417/lib/python3.7/site-packages/ucp/core.py", line 127, in _listener_handler
    guarantee_msg_order=guarantee_msg_order,
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20200417/lib/python3.7/site-packages/ucp/core.py", line 56, in exchange_peer_info
    raise ValueError("Both peers must set guarantee_msg_order identically")
ValueError: Both peers must set guarantee_msg_order identically

This is in the middle of a workflow.

@quasiben
Member

Apologies, this is with an IB stress test with 4 nodes: https://gist.github.com/quasiben/73cf6d7c2131fe41370014ddc21ecb56

Perhaps this is still the bug that @pentschev, I, and others have been tracking.

@pentschev
Member

I've done some testing with this and I have some cautiously optimistic news:

  1. TCP: not a single hang or crash in about 3 runs;
  2. NVLink: not a single hang or crash in about 3 runs;
  3. IB: not a single hang or crash in 10+ runs;
  4. IB+NVLink: in some 7-8 runs, it hung every time except for two consecutive runs.

All tests above were on a 4 node/32 GPU cluster.

I'm cautiously optimistic because IB used to hang and crash most of the time, and I haven't been able to reproduce that in any of these runs. The number of observations is still not enough to be confident our bugs have been fixed, and the hangs with IB+NVLink point to some remaining issues.

Finally, just as the merge operation starts, I still see the following error several times while new endpoints are being created for worker-to-worker connections:

future: <Task finished name='Task-1578' coro=<_listener_handler() done, defined at /datasets/pentschev/miniconda3/envs/pydbg/lib/python3.8/site-packages/ucp/core.py:116> exception=ValueError('Both peers must set guarantee_msg_order identically')>
    raise ValueError("Both peers must set guarantee_msg_order identically")
ValueError: Both peers must set guarantee_msg_order identically

@VibhuJawa and @beckernick, please mind the NVLink+IB hang if you're testing. If you have the chance, a test with IB only (NVLink disabled) would also be useful, to see whether you experience any hangs or segfaults.

@pentschev
Member

I intend to do some more testing tomorrow to confirm that the IB errors really can no longer be reproduced.

Member

@jakirkham jakirkham left a comment


Thanks Mads! Happy to see these cleanup improvements. 😄

Made some suggestions below. Where possible I tried to include a code suggestion for simplicity.

(resolved review thread on ucp/_libs/ucx_api.pyx)
Comment on lines +324 to +331
# Close the endpoint
# TODO: Support UCP_EP_CLOSE_MODE_FORCE
status = ucp_ep_close_nb(handle, UCP_EP_CLOSE_MODE_FLUSH)
Member

Can we file an issue about this?

Member Author

Added #505

(additional resolved review threads on ucp/_libs/ucx_api.pyx and ucp/core.py)
@madsbk madsbk mentioned this pull request Apr 22, 2020
Member

@jakirkham jakirkham left a comment


Thanks Mads! Looks much nicer 😄

Had a couple follow-up comments on UCXEndpoint so that we can use __cinit__ as is typical.
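For readers unfamiliar with the Cython idiom referenced here: __cinit__ and __dealloc__ are guaranteed to run exactly once, at C-level allocation and deallocation, which is why resource acquisition and release typically live there. A rough pure-Python analogue of that pairing, purely illustrative and not the actual UCXEndpoint code, could use weakref.finalize:

# Illustrative pure-Python analogue of the __cinit__/__dealloc__ pairing;
# the real UCXEndpoint is a Cython class in ucp/_libs/ucx_api.pyx.
import weakref

def _release(native):
    # Stand-in for the native release call (e.g. closing a UCX endpoint).
    print("releasing", native)

class Endpoint:
    def __init__(self, native):
        # Acquire the resource at construction time, like __cinit__ would,
        # so the object never exists in a half-initialized state.
        self._native = native
        # Guarantee release even if close() is never called, like __dealloc__.
        self._finalizer = weakref.finalize(self, _release, native)

    def close(self):
        # Explicit, idempotent close: the finalizer runs at most once.
        self._finalizer()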

(resolved review threads on ucp/_libs/ucx_api.pyx)
@pentschev
Member

I did more testing today and the results were very similar to what I observed yesterday: having NVLink enabled makes the cluster more likely to hang, while IB was (almost) hang-free, hanging once in some 30 runs. I also did various runs with larger data -- multiplying the sizes of the workflow in #402 (comment) by 8 -- which takes roughly 4 minutes to create the data and another 11-12 minutes for the merge to complete. They all seemed to behave the same as the original size, hanging sometimes with NVLink and passing with IB. Due to the long wait between runs, I did only 3 successful runs of each, during which I observed 2 or 3 hangs for NVLink (totalling 5-6 runs in that case).

I am still very suspicious of what happens during exchange_peer_info: it seems that the dask-cuda workers' listeners either receive corrupted responses or no response at all (I'm not sure which of the two is really happening), causing tags and guarantee_msg_order to be invalid. Perhaps this will eventually be fixed by #503, but that isn't the case at the time of writing. To be more specific, the reason I believe the hangs are connected to exchange_peer_info is that the hangs happen just a little while after many of the workers raise

raise ValueError("Both peers must set guarantee_msg_order identically")

something that happens at the beginning of the merge task stream, when workers are establishing connections to each other.
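For context, here is a stripped-down sketch of the kind of handshake being described (purely illustrative; the real exchange_peer_info lives in ucp/core.py and exchanges this data over UCX, not via the helpers below): each peer sends its own tag and guarantee_msg_order setting and validates whatever it receives back, so a corrupted or missing reply surfaces as exactly the ValueError quoted above.

# Illustrative sketch only -- not ucp.core.exchange_peer_info itself.
import pickle

def make_peer_info(msg_tag, guarantee_msg_order):
    # What one side sends during the handshake.
    return pickle.dumps(
        {"msg_tag": msg_tag, "guarantee_msg_order": guarantee_msg_order}
    )

def check_peer_info(raw_reply, my_guarantee_msg_order):
    # What the receiving side does with the peer's reply.
    if not raw_reply:
        raise ValueError("peer sent no connection info")
    peer = pickle.loads(raw_reply)
    if peer["guarantee_msg_order"] != my_guarantee_msg_order:
        raise ValueError("Both peers must set guarantee_msg_order identically")
    return peer["msg_tag"]

A reply that is missing, truncated, or belongs to a different handshake would fail either at deserialization or at this comparison, which matches the symptom described above.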

@quasiben
Member

How do folks generally feel about merging this PR, given that other work now depends on it?

@jakirkham
Member

I made a few minor suggestions above that would be nice to see addressed first.

@quasiben
Member

I made a few minor suggestions above that would be nice to see addressed first.

Definitely agreed -- I wanted to get a sense of where folks were at.

@jakirkham
Member

Yeah after that's addressed I'm +1 on merging.

@pentschev
Member

I'm +1 on merging this as well.

(resolved review thread on ucp/_libs/ucx_api.pyx)
@madsbk madsbk mentioned this pull request Apr 23, 2020
@madsbk
Member Author

madsbk commented Apr 23, 2020

I am still very suspicious of what happens during exchange_peer_info

@pentschev I think you are on to something: #506!

@quasiben
Member

Thank you @madsbk and @jakirkham for the reviews

@pentschev
Member

LGTM, any final comments before we merge, @jakirkham?

@jakirkham jakirkham merged commit 90c596d into rapidsai:branch-0.14 Apr 23, 2020
@jakirkham
Member

Thanks Mads for the PR! Also thanks Peter and Ben for the reviews!
