Expose setting multiple protocols and ports via the dask-scheduler CLI #6898

Merged

jacobtomlinson
Member

@jacobtomlinson jacobtomlinson commented Aug 17, 2022

Closes #6891

  • Tests added / passed
  • Passes pre-commit run --all-files

In #6891 @mrocklin mentioned that Scheduler can take a list for protocol and port. This PR updates the dask-scheduler CLI to also allow lists. Users can optionally specify a comma-separated list for each.

$ dask-scheduler                                   
2022-08-17 11:54:30,973 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.51.100.80:8786

$ dask-scheduler --protocol ws                     
2022-08-17 11:56:14,423 - distributed.scheduler - INFO -   Scheduler at:    ws://10.51.100.80:8786

$ dask-scheduler --protocol tcp,ws                            
2022-08-17 11:55:00,675 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.51.100.80:8786
2022-08-17 11:55:00,675 - distributed.scheduler - INFO -   Scheduler at:   ws://10.51.100.80:52663

$ dask-scheduler --protocol tcp,ws --port 8786,8788
2022-08-17 11:55:24,119 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.51.100.80:8786
2022-08-17 11:55:24,119 - distributed.scheduler - INFO -   Scheduler at:    ws://10.51.100.80:8788
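
As a rough illustration of the comma-separated handling described above, here is a minimal sketch. The helper names `split_csv` and `parse_ports` are hypothetical and are not taken from this PR's diff; the sketch only assumes that a single value should stay a scalar and that several values should become a list, which matches what `Scheduler` is said to accept.

```python
from __future__ import annotations


def split_csv(value: str | None) -> str | list[str] | None:
    """Turn a CLI value such as "tcp,ws" into a list; keep "tcp" as a plain string.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    if value is None:
        return None
    parts = [part.strip() for part in value.split(",") if part.strip()]
    if not parts:
        return None
    # A single entry stays a scalar so existing single-protocol usage is unchanged.
    return parts[0] if len(parts) == 1 else parts


def parse_ports(value: str | None) -> int | list[int] | None:
    """Same idea for ports, converting each entry to an integer."""
    parsed = split_csv(value)
    if parsed is None:
        return None
    if isinstance(parsed, list):
        return [int(port) for port in parsed]
    return int(parsed)


protocols = split_csv("tcp,ws")
ports = parse_ports("8786,8788")
print(protocols, ports)
```

Keeping the single-value case as a scalar is a design choice of this sketch; the actual CLI may normalize its arguments differently.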

@github-actions
Contributor

github-actions bot commented Aug 17, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 50m 7s ⏱️ - 15m 48s
  3 001 tests +1    2 911 ✔️ +  3       88 💤 - 1  2 ❌ - 1
22 253 runs  +7  21 205 ✔️ +13  1 046 💤 - 5  2 ❌ - 1

For more details on these failures, see this check.

Results for commit e4485c3. ± Comparison against base commit 1d0701b.

♻️ This comment has been updated with latest results.

Member

@pentschev pentschev left a comment

Thanks @jacobtomlinson, this is a clean, simple solution, awesome! I did some brief testing with UCX as well and didn't immediately see any issues, so it feels like it should work as expected.

For reference, this is what I ran:

# on scheduler
dask-scheduler --protocol ucx,ws --port 8786,8788

# on worker
dask-cuda-worker ucx://SCHEDULER_IP:8786

# on client -- from https://github.com/rapidsai/dask-cuda/blob/branch-22.10/dask_cuda/benchmarks/local_cudf_merge.py
python local_cudf_merge.py --runs 10 --scheduler-address ucx://SCHEDULER_IP:8786 -c 50_000_000

I didn't test anything related to ws, mainly because I wouldn't know how to. 🙂

@pentschev
Member

The interface change seems to work; at least it reports the listening interfaces correctly:

$ dask-scheduler --protocol ucx,tcp,ws --port 8786,8788,8789 --interface ib0,enp1s0f0,enp1s0f0
2022-08-17 07:49:40,367 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:49:40,985 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:49:41,028 - distributed.scheduler - INFO - State start
2022-08-17 07:49:41,039 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:49:41,039 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:49:42,568 - distributed.scheduler - INFO -   Scheduler at:  ucx://10.33.225.163:8786
2022-08-17 07:49:42,568 - distributed.scheduler - INFO -   Scheduler at:  tcp://10.33.227.163:8788
2022-08-17 07:49:42,568 - distributed.scheduler - INFO -   Scheduler at:   ws://10.33.227.163:8789
2022-08-17 07:49:42,568 - distributed.scheduler - INFO -   dashboard at:        10.33.225.163:8787

For the cases I'm testing, the IP addresses match the interfaces I passed to --interface. However, it seems that Dask in general (even before this PR) isn't binding to the IP it reports. For example:

$ dask-scheduler --protocol tcp &
[1] 5397
$ 2022-08-17 07:51:30,454 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:51:31,064 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:51:31,104 - distributed.scheduler - INFO - State start
2022-08-17 07:51:31,115 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:51:31,116 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:51:31,117 - distributed.scheduler - INFO -   Scheduler at:  tcp://10.33.227.163:8786
2022-08-17 07:51:31,117 - distributed.scheduler - INFO -   dashboard at:                     :8787

$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:8786            0.0.0.0:*               LISTEN      5397/python3.8
tcp        0      0 0.0.0.0:8787            0.0.0.0:*               LISTEN      5397/python3.8
tcp6       0      0 :::8786                 :::*                    LISTEN      5397/python3.8
tcp6       0      0 :::8787                 :::*                    LISTEN      5397/python3.8

In the example above, Dask reports that it is binding to a specific IP address, but netstat shows it is bound to 0.0.0.0 (all interfaces). This looks like a bug to me. In this particular case it is harmless, since you can connect to the scheduler on any of the interfaces, but it isn't doing what it's supposed to.
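
To make the distinction between a wildcard bind and a specific-address bind concrete, here is a small standard-library sketch. It does not use Dask at all; `bound_address` is a hypothetical helper, and the loopback address stands in for a real interface IP such as the ones in the logs above.

```python
import socket


def bound_address(host: str) -> tuple[str, int]:
    """Bind a TCP listener to `host` on an OS-chosen port and return what the kernel reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        sock.listen(1)
        return sock.getsockname()


# Wildcard bind: reachable on every interface, shown by netstat as 0.0.0.0:<port>.
wildcard_ip, _ = bound_address("0.0.0.0")

# Specific bind: reachable only through that address (loopback used here as a stand-in).
loopback_ip, _ = bound_address("127.0.0.1")

print(wildcard_ip)
print(loopback_ip)
```

A server can advertise one address in its logs while the listening socket is bound to the wildcard address, which is the kind of mismatch the netstat output above shows.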

@pentschev
Member

Never mind: in the example above I didn't specify --interface. When I do, I see the correct behavior:

$ dask-scheduler --protocol tcp --interface ib0 &
[1] 9736
$ 2022-08-17 07:59:58,304 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:59:59,078 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:59:59,120 - distributed.scheduler - INFO - State start
2022-08-17 07:59:59,130 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:59:59,130 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:59:59,131 - distributed.scheduler - INFO -   Scheduler at:  tcp://10.33.225.163:8786
2022-08-17 07:59:59,131 - distributed.scheduler - INFO -   dashboard at:        10.33.225.163:8787

$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 10.33.225.163:8786      0.0.0.0:*               LISTEN      9736/python3.8
tcp        0      0 0.0.0.0:8787            0.0.0.0:*               LISTEN      9736/python3.8
tcp6       0      0 :::8787                 :::*                    LISTEN      9736/python3.8

That was an extrapolation of the incorrect behavior I see with individual protocols. The TCP interface is bound correctly, but the others are not:

$ UCX_TCP_CM_REUSEADDR=y dask-scheduler --protocol ucx,tcp,ws --port 8786,8788,8789 --interface enp1s0f0,enp1s0f0,enp1s0f0 &
[1] 11971
$ 2022-08-17 08:03:30,634 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 08:03:31,227 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 08:03:31,266 - distributed.scheduler - INFO - State start
2022-08-17 08:03:31,275 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 08:03:31,276 - distributed.scheduler - INFO - Clear task state
2022-08-17 08:03:32,929 - distributed.scheduler - INFO -   Scheduler at:  ucx://10.33.227.163:8786
2022-08-17 08:03:32,929 - distributed.scheduler - INFO -   Scheduler at:  tcp://10.33.227.163:8788
2022-08-17 08:03:32,929 - distributed.scheduler - INFO -   Scheduler at:   ws://10.33.227.163:8789
2022-08-17 08:03:32,929 - distributed.scheduler - INFO -   dashboard at:        10.33.227.163:8787

$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 10.33.227.163:8788      0.0.0.0:*               LISTEN      11971/python3.8
tcp        0      0 0.0.0.0:8789            0.0.0.0:*               LISTEN      11971/python3.8
tcp        0      0 0.0.0.0:8786            0.0.0.0:*               LISTEN      11971/python3.8
tcp        0      0 0.0.0.0:8787            0.0.0.0:*               LISTEN      11971/python3.8
tcp6       0      0 :::8789                 :::*                    LISTEN      11971/python3.8
tcp6       0      0 :::8787                 :::*                    LISTEN      11971/python3.8

I can confirm the behavior above for individual protocols (without this PR). There is definitely a bug with UCX, but I don't know whether the websockets behavior is a bug or a known limitation.

@jacobtomlinson
Member Author

Thanks for digging into this @pentschev. It sounds like this has identified some bugs but they are not related to this PR specifically. Should we open an issue to track that?

@pentschev
Member

For UCX I've filed rapidsai/ucx-py#871 and #6901 to correct this behavior. But it may be worth filing an issue for someone to investigate whether websockets should be fixed too.

@jacobtomlinson
Member Author

Test failures appear unrelated. Unless there are further review comments, I intend to merge on Monday.

@jacobtomlinson jacobtomlinson merged commit 11616a3 into dask:main Aug 22, 2022
@jacobtomlinson jacobtomlinson deleted the dask-scheduler-multiple-protocols branch August 22, 2022 10:14
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Successfully merging this pull request may close these issues.

Support for multiple protocols or heterogenous protocols
2 participants