Help: Scheduler on cluster doesn't seem to work #341
Yeah we should definitely fix that. Could you please raise a separate issue for this?
The scheduler should exit after it is idle for a certain amount of time. If that isn't working correctly then could you please raise a separate issue for that?
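For context, the knob that controls this in distributed itself is the scheduler idle timeout; below is a minimal sketch of inspecting and overriding it, assuming that standard setting is the relevant one for dask-kubernetes:

```python
import dask
import distributed  # noqa: F401  -- importing registers the distributed config defaults

# The scheduler shuts itself down after this much idle time;
# null/None (the default) means it never exits on its own.
print(dask.config.get("distributed.scheduler.idle-timeout", default=None))

# Override it for clusters started from this process. "1 hour" is only an
# illustrative value, not a recommendation.
dask.config.set({"distributed.scheduler.idle-timeout": "1 hour"})
```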
This sounds like a Dask or Python version mismatch between your local environment and the one that is running on Kubernetes. Can you check that your versions match in both environments?
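A quick way to surface such a mismatch, assuming there is a running Client connected to the cluster, is distributed's built-in version check:

```python
from dask.distributed import Client

# Hypothetical scheduler address; in practice this would be Client(cluster)
# with the KubeCluster object from the failing setup.
client = Client("tcp://dask-scheduler:8786")

# Compares Python, dask, distributed, msgpack, etc. across the client,
# the scheduler and every worker, and raises if they disagree.
client.get_versions(check=True)
```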
This is currently undecided. In the long term I want to replace all of […]. The current remote scheduler mode is a step towards this, but has drawbacks as you say. Perhaps instead of modifying […]
Created separate tickets for some of the issues: […]
@jacobtomlinson The only thing that is different is Python 3.7.3 vs 3.7.10, but this works fine if I set […]
Thanks for raising separate tickets for those. I am surprised to see the 127.0.0.1 address in that error; it doesn't match up with what I would expect given the code (dask-kubernetes/dask_kubernetes/core.py, lines 195 to 199 at 1390a6b).
Deserialization issues are almost always a version mismatch issue, although I would be surprised if a minor Python version difference caused this. Are you certain that all the package versions in the conda environment match? I am unable to reproduce this locally, and without a minimal reproducible example this is pretty hard to debug. When you opened this issue there was a template that requests a bunch of info; it would really help if you could provide that.
I will extract a reproducible example. All versions are the same, using the latest dask packages (I created a fresh environment and Docker image before testing today). I'm not sure why this localhost address is shipped to the worker; it seems to be coming from dask-kubernetes.
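Not the author's actual reproducer, but a minimal sketch of the kind of example being asked for, assuming the classic KubeCluster API with a deploy_mode argument (image and scaling are placeholders):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

# Placeholder pod spec; a real report would use the exact image and
# resources from the failing environment.
pod_spec = make_pod_spec(image="daskdev/dask:latest")

# deploy_mode="remote" runs the scheduler in its own pod (the "new mode").
cluster = KubeCluster(pod_spec, deploy_mode="remote")
cluster.scale(1)

client = Client(cluster)
print(client.submit(lambda x: x + 1, 10).result())  # expected: 11
```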
The issue shows up when you ship a queue to a task, which normally works fine.
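To make "ship a queue to a task" concrete, here is a minimal sketch of that pattern, with made-up names and a hypothetical scheduler address:

```python
from dask.distributed import Client, Queue

client = Client("tcp://dask-scheduler:8786")  # hypothetical address
results = Queue("results")  # a distributed Queue coordinated by the scheduler

def produce(q):
    # The Queue object is serialized and shipped to the worker along with
    # the task; on arrival it reconnects to the scheduler to put items.
    q.put(42)
    return True

client.submit(produce, results).result()
print(results.get())  # expected: 42
```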
Could you share your pod spec?
This is the default from the docs (besides the large cpu/mem requests):
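The spec itself did not survive the export; a sketch of the documented starting point built with make_pod_spec, with placeholder numbers standing in for the "large cpu/mem requests":

```python
from dask_kubernetes import KubeCluster, make_pod_spec

# Roughly the default from the docs; the cpu/memory values below are
# placeholders, not the figures from the original report.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="16G",
    memory_request="16G",
    cpu_limit=4,
    cpu_request=4,
)

cluster = KubeCluster(pod_spec)
```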
The classic […]
This might need to be split into several tickets. I just tried to upgrade to a newer version of dask-kubernetes. If I switch to legacy mode, this seems to work fine, but if I switch to the new mode, where the scheduler runs as a separate pod, I run into several issues. I might be missing something that would resolve all three of these (a sketch contrasting the two modes follows below):
A small issue: the scheduler pod takes the same name as a worker (so you can't tell which pod is the scheduler by looking at the name), and worse, it also uses the same resource requests (which it doesn't really need). Also, because the scheduler runs as a separate pod, cleanup becomes a nightmare when the client pod is killed or crashes (rather than terminating cleanly): nothing gets cleaned up, and instead of the old situation (workers exiting after 60 seconds) the workers and the scheduler just stick around forever.
The bigger issue: I can't get it working at all; there are pickle errors when both the worker and the client try to connect to the scheduler:
distributed.protocol.pickle - INFO - Failed to deserialize
This error seems to be masked by a timeout error, though. Is the legacy mode going to disappear in the long run (the name suggests it), or is it safe to keep using it?
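For clarity, the sketch referred to above: the difference between the two modes boils down to the deploy_mode argument on the classic KubeCluster (image is a placeholder):

```python
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(image="daskdev/dask:latest")  # placeholder image

# Legacy mode: the scheduler runs inside the client process and the
# worker pods connect back to it. This is the mode that works here.
cluster = KubeCluster(pod_spec, deploy_mode="local")

# New mode: the scheduler gets its own pod on the cluster, which is
# where the deserialization/timeout errors appear.
# cluster = KubeCluster(pod_spec, deploy_mode="remote")
```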