Expose setting multiple protocols and ports via the dask-scheduler CLI #6898
Conversation
Unit Test Results. See test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 15 files ±0, 15 suites ±0, 6h 50m 7s ⏱️ -15m 48s. For more details on these failures, see this check. Results for commit e4485c3. ± Comparison against base commit 1d0701b. ♻️ This comment has been updated with latest results.
Thanks @jacobtomlinson, this is a clean, simple solution, awesome! I did some brief testing with UCX as well and didn't immediately see any issues, so it feels like it should work as expected.
For reference, this is what I ran:
```
# on scheduler
dask-scheduler --protocol ucx,ws --port 8786,8788

# on worker
dask-cuda-worker ucx://SCHEDULER_IP:8786

# on client -- from https://github.com/rapidsai/dask-cuda/blob/branch-22.10/dask_cuda/benchmarks/local_cudf_merge.py
python local_cudf_merge.py --runs 10 --scheduler-address ucx://SCHEDULER_IP:8786 -c 50_000_000
```
I didn't do any test related to ws, mainly because I wouldn't know how to. 🙂
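A minimal way to smoke-test the ws listener from Python would be to point a `Client` at the ws address directly. This is just a sketch, not something from this PR; it assumes the scheduler above was started with `--protocol ucx,ws --port 8786,8788` (so ws is on 8788) and that `SCHEDULER_IP` is reachable:
```python
# Sketch of a ws smoke test, assuming the scheduler above is listening on
# ws://SCHEDULER_IP:8788 (the second protocol/port pair in the command).
from distributed import Client

client = Client("ws://SCHEDULER_IP:8788")
print(client.scheduler_info()["address"])  # expect a ws:// scheduler address
client.close()
```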
The interface change seems to work; it reports the listening interfaces correctly at least:
```
$ dask-scheduler --protocol ucx,tcp,ws --port 8786,8788,8789 --interface ib0,enp1s0f0,enp1s0f0
2022-08-17 07:49:40,367 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:49:40,985 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:49:41,028 - distributed.scheduler - INFO - State start
2022-08-17 07:49:41,039 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:49:41,039 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - Scheduler at: ucx://10.33.225.163:8786
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.227.163:8788
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - Scheduler at: ws://10.33.227.163:8789
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - dashboard at: 10.33.225.163:8787
```
For the cases I'm testing, the IP addresses match those I specified to `--interface`. However, I noticed the following:
```
$ dask-scheduler --protocol tcp &
[1] 5397
$ 2022-08-17 07:51:30,454 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:51:31,064 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:51:31,104 - distributed.scheduler - INFO - State start
2022-08-17 07:51:31,115 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:51:31,116 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:51:31,117 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.227.163:8786
2022-08-17 07:51:31,117 - distributed.scheduler - INFO - dashboard at: :8787
$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:8786 0.0.0.0:* LISTEN 5397/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 5397/python3.8
tcp6 0 0 :::8786 :::* LISTEN 5397/python3.8
tcp6 0 0 :::8787 :::* LISTEN 5397/python3.8
```
In the example above we see Dask reporting it's binding to a specific IP address, but `netstat` shows it listening on all addresses (`0.0.0.0`).
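To double-check which IP an interface name resolves to, here is a quick sketch using the helper distributed uses internally for `--interface` resolution; it assumes the interface names from the commands above exist on the host:
```python
# Sketch: resolve interface names to IPs the same way the scheduler does.
# Assumes ib0 and enp1s0f0 exist on this host, as in the commands above.
from distributed.utils import get_ip_interface

print(get_ip_interface("ib0"))       # e.g. 10.33.225.163
print(get_ip_interface("enp1s0f0"))  # e.g. 10.33.227.163
```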
Nevermind, in the example above I didn't specify `--interface`; with it set, the TCP listener binds correctly:
```
$ dask-scheduler --protocol tcp --interface ib0 &
[1] 9736
$ 2022-08-17 07:59:58,304 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:59:59,078 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:59:59,120 - distributed.scheduler - INFO - State start
2022-08-17 07:59:59,130 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:59:59,130 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:59:59,131 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.225.163:8786
2022-08-17 07:59:59,131 - distributed.scheduler - INFO - dashboard at: 10.33.225.163:8787
$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.33.225.163:8786 0.0.0.0:* LISTEN 9736/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 9736/python3.8
tcp6 0 0 :::8787 :::* LISTEN 9736/python3.8
```
That was an extrapolation of the incorrect behavior I see with individual protocols. The TCP interface is bound correctly, but the others are not:
```
$ UCX_TCP_CM_REUSEADDR=y dask-scheduler --protocol ucx,tcp,ws --port 8786,8788,8789 --interface enp1s0f0,enp1s0f0,enp1s0f0 &
[1] 11971
$ 2022-08-17 08:03:30,634 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 08:03:31,227 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 08:03:31,266 - distributed.scheduler - INFO - State start
2022-08-17 08:03:31,275 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 08:03:31,276 - distributed.scheduler - INFO - Clear task state
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - Scheduler at: ucx://10.33.227.163:8786
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.227.163:8788
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - Scheduler at: ws://10.33.227.163:8789
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - dashboard at: 10.33.227.163:8787
$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.33.227.163:8788 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8789 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8786 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 11971/python3.8
tcp6 0 0 :::8789 :::* LISTEN 11971/python3.8
tcp6 0 0 :::8787 :::* LISTEN 11971/python3.8
```
I can confirm the behavior above for individual protocols (without this PR). There's definitely a bug with UCX, but I don't know whether the websockets behavior is a bug or a known limitation.
Thanks for digging into this @pentschev. It sounds like this has identified some bugs, but they are not related to this PR specifically. Should we open an issue to track that?
For UCX I've filed rapidsai/ucx-py#871 and #6901 to correct this behavior. But it may be worth filing an issue for someone to investigate whether websockets should be fixed too.
Test failures appear unrelated. Unless there are further reviews or comments, I intend to merge on Monday.
Closes #6891
- Passes `pre-commit run --all-files`

In #6891 @mrocklin mentioned that `Scheduler` can take a list for `protocol` and `port`. This PR updates the `dask-scheduler` CLI to also allow lists. Users can optionally specify a comma-separated list for each.
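For illustration only (this is not the PR's actual diff), the comma-splitting could be done with a click callback; `split_csv` below is a made-up helper name, and the option names merely mirror the CLI's:
```python
# Illustrative sketch of comma-splitting CLI options, not the PR's actual diff.
import click


def split_csv(ctx, param, value):
    """Turn 'ucx,ws' into ['ucx', 'ws']; leave None (option unset) alone."""
    return value.split(",") if value else None


@click.command()
@click.option("--protocol", default=None, callback=split_csv)
@click.option("--port", default=None, callback=split_csv)
def main(protocol, port):
    # Scheduler(protocol=..., port=...) already accepts lists (see #6891),
    # so parsed values like these could be passed straight through.
    click.echo(f"protocol={protocol} port={port}")


if __name__ == "__main__":
    main()
```
Running this sketch with `--protocol ucx,ws --port 8786,8788` yields `['ucx', 'ws']` and `['8786', '8788']` (ports would still need casting to int before being handed to the scheduler).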