dispatch.run is not resilient to worker loss #18

Open
hendrikmakait opened this issue Jun 11, 2024 · 0 comments

dispatch.run uses worker restrictions to pin each task to the worker it should execute on. If a worker is removed (or possibly restarted), the tasks pinned to it transition to the no-worker state and remain there indefinitely (see dask/distributed#7346). As far as I can see, nothing is implemented to prevent this.
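
To illustrate the failure mode, here is a minimal toy model. It is a sketch under stated assumptions, not distributed's actual scheduler code: `ToyScheduler`, `ToyTask`, `submit`, and `remove_worker` are hypothetical names invented for this example, and the worker addresses are made up. It only mimics the rule that a task with worker restrictions may run solely on a live worker listed in those restrictions.

```python
# Toy model of the failure mode; NOT distributed's real scheduler code.
# All class and method names below are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToyTask:
    key: str
    worker_restrictions: set  # addresses the task is allowed to run on
    state: str = "waiting"


@dataclass
class ToyScheduler:
    workers: set = field(default_factory=set)
    tasks: dict = field(default_factory=dict)

    def submit(self, key, restrictions):
        task = ToyTask(key, set(restrictions))
        self.tasks[key] = task
        self._schedule(task)
        return task

    def _schedule(self, task):
        # A task may only run on a live worker that is in its restrictions.
        valid = self.workers & task.worker_restrictions
        task.state = "processing" if valid else "no-worker"

    def remove_worker(self, address):
        self.workers.discard(address)
        # Re-evaluate every task; pinned tasks have nowhere else to go.
        for task in self.tasks.values():
            self._schedule(task)


scheduler = ToyScheduler(workers={"tcp://a:1", "tcp://b:1"})
pinned = scheduler.submit("train-rank-0", {"tcp://a:1"})
state_before = pinned.state

scheduler.remove_worker("tcp://a:1")
# Worker b is still alive, but the restriction does not allow it.
state_after = pinned.state
```

In this model the pinned task ends up in "no-worker" even though another worker is available, and nothing moves it out of that state unless a worker with the exact restricted address rejoins.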

To circumvent this, dask-pytorch-ddp would probably also benefit from dask/distributed#8624.
