I am encountering an issue when specifying the port for the worker to listen on. When using the traditional Dask Distributed with dask-worker (excluding GPU usage), I can utilize the --worker-port parameter to define this behavior. However, with dask-cuda-worker (version 23.10.0), I am unable to locate any option for this purpose, except for the --host parameter.
Consequently, when I run the following command, it fails with the error shown below:
CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file scheduler.json --host 127.0.0.1:12345
warnings.warn(f'''
2023-09-29 13:39:00,329 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-bpnddwo9', purging
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-09-29 13:39:00,338 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1540, in close
self.log_event(self.address, {"action": "closing-worker", "reason": reason})
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 723, in address
raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2023-09-29 13:39:00,340 - distributed.worker - INFO - Stopping worker. Reason: failure-to-start-<class 'OSError'>
2023-09-29 13:39:00,340 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2023-09-29 13:39:00,341 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
async with worker:
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
await self
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,386 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
result = await self.process.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
msg = await self._wait_until_connected(uid)
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
raise msg["exception"]
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
async with worker:
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
await self
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:12345'. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,406 - distributed.nanny - INFO - Worker process 15064 was killed by signal 15
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 362, in start_unsafe
response = await self.instantiate()
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
result = await self.process.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
msg = await self._wait_until_connected(uid)
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
raise msg["exception"]
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
async with worker:
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
await self
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/bin/dask-cuda-worker", line 8, in <module>
sys.exit(worker())
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 442, in worker
loop.run_sync(run)
File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 434, in run
await worker
File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 244, in _wait
await asyncio.gather(*self.nannies)
File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
Without using the --host parameter, everything functions as expected, although I am unable to specify the desired port. Is there a method to achieve this?
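For readers following the traceback: every chain above bottoms out in `OSError: [Errno 98] Address already in use`, raised from `sock.bind(sockaddr)`. One plausible reading of the log, not confirmed in this thread, is that the nanny had already bound `tcp://127.0.0.1:12345` (see the "Closing Nanny at" line) when the worker process tried to bind the same address. The sketch below reproduces only the underlying socket behavior with the standard library; the helper name `second_bind_errno` is made up for illustration and is not part of Dask or dask-cuda.

```python
import errno
import socket
from typing import Optional


def second_bind_errno(host: str = "127.0.0.1") -> Optional[int]:
    """Bind a listener, then try to bind a second socket to the same address.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            # Same host and port as the listener that is already running.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code == errno.EADDRINUSE)
```

On Linux, `errno.EADDRINUSE` is 98, which matches the number in the log.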
IIRC, --host should only bind to the IP address, so specifying a port as well will indeed not work. I guess the --worker-port parameter was just never needed and thus never added, but there's no technical reason it's not there.
If --worker-port is important for your use case, would you care to submit a pull request adding it?
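To make the distinction in this reply concrete, here is a minimal sketch of telling a bare host apart from a `host:port` value such as the `127.0.0.1:12345` used in the report. The helper `split_host_port` is hypothetical and does not reflect how dask-cuda parses `--host`; it also assumes IPv4 addresses or hostnames only, since IPv6 literals contain colons of their own.

```python
from typing import Optional, Tuple


def split_host_port(value: str) -> Tuple[str, Optional[int]]:
    """Split 'host:port' into (host, port); return (value, None) if no port.

    Assumption: value is an IPv4 address or hostname, not an IPv6 literal.
    """
    host, sep, port = value.rpartition(":")
    if not sep:
        # No colon at all, so the value is a bare host.
        return value, None
    if not port.isdigit():
        raise ValueError(f"port is not a number in {value!r}")
    return host, int(port)


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "127.0.0.1:12345"):
        host, port = split_host_port(candidate)
        if port is None:
            print(f"{candidate}: host only")
        else:
            print(f"{candidate}: carries port {port}, which a host-only option does not expect")
```

A check like this could be used to reject or warn about a port in a host-only option rather than letting the bind fail later.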