Support adaptive scaling in Helm cluster #277
As I understand it, the difference between `KubeCluster` and `HelmCluster` is that `KubeCluster` creates an ephemeral cluster from the client and can scale it up and down, while `HelmCluster` connects to an existing, long-running cluster deployed with the Dask Helm chart.

My requirement is: I would like a long-running scheduler in the cluster that multiple clients can connect to in order to submit tasks. The worker resources can come from the same Kubernetes cluster as the scheduler, and they can be scaled up and down based on the load, like what `KubeCluster` provides.

This seems to be a combination of `KubeCluster` and `HelmCluster`. Did the community consider this case when adding Kubernetes support? Are there any technical blockers? If this is something reasonable, I can help work on this feature request.
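In code, the requested usage pattern would look something like this minimal sketch; the scheduler address is an illustrative assumption, not a value from this thread:

```python
from dask.distributed import Client

# Multiple analysts/processes connect to the same long-running scheduler.
# The service DNS name and port are assumptions for illustration.
client = Client("tcp://dask-scheduler.default.svc.cluster.local:8786")

future = client.submit(sum, [1, 2, 3])
print(future.result())  # 6
```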
Thanks for raising this @Jeffwan. A couple of corrections to your understanding:
In Dask terminology we refer to both `KubeCluster` and `HelmCluster` as cluster managers.

Running a single cluster with multiple clients connected is generally discouraged, as Dask has no concepts to ensure fair usage. Tasks are executed on a first-come-first-served basis, and this makes it very possible for a single client to hog a cluster. Generally for these use cases we recommend each client creates their own ephemeral cluster with `KubeCluster`.

You're not the first to ask about adding adaptive functionality to `HelmCluster` (see dask-kubernetes/dask_kubernetes/helm.py, lines 246 to 256 at ccb9864).
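As a sketch of that recommendation, assuming the classic `KubeCluster` API from the time of this thread (the container image and worker bounds are illustrative):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

# Each client creates its own ephemeral cluster and lets it scale adaptively.
pod_spec = make_pod_spec(image="daskdev/dask:latest")  # image is illustrative
cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=0, maximum=10)  # add/remove workers based on load

client = Client(cluster)
# ... submit work; the cluster disappears with the client process.
```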
The problem is that Dask workers are stateful, and the Dask scheduler manages that state. When an adaptive cluster decides to scale down, state is intentionally removed from a worker before removing it, similar to draining a k8s node before removing it. However, for the Helm Chart we use a Deployment resource for workers and can only set the desired number of replicas. So when removing a worker, Dask would drain the state from one Pod, tell the worker process to exit, and then decrement the number of replicas in the deployment. Typically this causes a race condition where k8s can restart the worker before the new number of replicas updates, and then remove a different pod when it does, resulting in lost state. This is not a factor in `KubeCluster`, which creates each worker as an individual Pod that it can delete directly.

I would be really interested to hear your thoughts on this. Every assumption here is definitely up for debate. It would also be great to hear more about your use case so we can take that into consideration.
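To make the race concrete, here is a rough sketch of that scale-down sequence using the `kubernetes` Python client; the function, its parameters, and the choice of which worker to retire are all illustrative assumptions, not the actual `HelmCluster` implementation:

```python
from dask.distributed import Client
from kubernetes import client as k8s_client, config as k8s_config

def scale_down(scheduler_address, deployment_name, namespace, new_replicas):
    """Drain one worker, then shrink the worker Deployment (illustrative only)."""
    # Step 1: gracefully move state off a worker and shut it down.
    dask_client = Client(scheduler_address)
    victim = list(dask_client.scheduler_info()["workers"])[0]
    dask_client.retire_workers(workers=[victim], close_workers=True)

    # Step 2: lower the Deployment's desired replica count.
    # Between step 1 and step 2 the ReplicaSet still wants the old count,
    # so k8s may restart the drained worker, and when the count finally
    # drops it may delete a *different* Pod, losing that Pod's state.
    k8s_config.load_kube_config()  # or load_incluster_config() inside k8s
    apps = k8s_client.AppsV1Api()
    apps.patch_namespaced_deployment(
        name=deployment_name,
        namespace=namespace,
        body={"spec": {"replicas": new_replicas}},
    )
```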
Hi @jacobtomlinson - I wanted to piggyback off of this exact question to perhaps add some clarity for people who are looking at Dask as a small-business solution for scheduling workflows. By the way, thanks for everything you have done - we need more people like you. I am at a crossroads for my small business to deploy Dask as a way for our projected ~10 analysts to execute long-running Python computations. Here's the workflow that I run:
My Attempts:
The Actual Question:
Will spread the love if this is answered and if the last implementation I outlined is indeed the way to go! If I wrote something confusing, I'll be more than happy to correct myself.
Thanks for the comment @omarsumadi (although it should probably be a new issue). Thanks for the praise! To answer your question: yes, that workflow sounds totally reasonable. Although I would like to point you to Prefect, which is a workflow management tool built on Dask. It sounds like you are trying to build the same thing. cc @jcrist
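For a flavour of what that looks like, a minimal sketch using the Prefect 1.x-style API that was current when this thread was written; the scheduler address is an assumption:

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def analyze(x):
    return x ** 2

with Flow("analyst-workflow") as flow:
    results = analyze.map(list(range(10)))

# Run the flow's tasks on an existing Dask scheduler
# (the address is illustrative, not from this thread).
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```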
@jacobtomlinson Hey Jacob - ok great, that takes a lot of weight off my shoulders. I'm new to GitHub, so I didn't want to make a new issue because it didn't fall under any of the listed categories:
What would you suggest doing next time if something like this came up? Also, you should let people sponsor you on GitHub! Thanks! Oh, and about Prefect - I'll look into it. I use the word 'potential' analysts very strongly, as in, we are looking to get some people on board but are still reaching out for funding. I'll reach out when the time comes, hopefully, but nothing is in the bag right now!
Sure, I understand that. This is probably the kind of question that would be asked on a forum, something we have discussed creating in the past.
That's very kind of you, but I'll leave my employer to do the sponsoring 😄. If you want to give back to the Dask community you can donate via NumFOCUS.
@jacobtomlinson I've been thinking about what adaptive scaling for HelmCluster could look like, and I'm curious what you think about a hybrid approach of the two current options (and also whether it overlaps with, or is even totally redundant with, what you're thinking about in #318, dask/distributed#4605, etc.). Basically I was imagining that the scheduler would be managed externally by Helm, but would start with either no workers or a very limited static worker deployment. Then the cluster object would connect to the existing scheduler like `HelmCluster` does today, but create and remove individual worker Pods the way `KubeCluster` does.
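For reference, here is the half of that hybrid which exists today, as a sketch assuming a Helm release named `my-dask`:

```python
from dask.distributed import Client
from dask_kubernetes import HelmCluster

# What exists today: HelmCluster attaches to the Helm-managed scheduler and
# supports manual scaling by patching the worker Deployment's replica count.
cluster = HelmCluster(release_name="my-dask")  # release name is illustrative
cluster.scale(5)    # manual scaling works
# cluster.adapt()   # raises NotImplementedError at the time of this thread

client = Client(cluster)
```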
Thanks for sharing your interest here @bnaul. I think rather than trying to build some hybrid, the best way to move forward would be a Dask operator for Kubernetes. We could have a custom resource that describes a Dask cluster, with an operator managing the scheduler and worker Pods for it. Then we could shuffle both the Helm Chart and `KubeCluster` over to using that operator.
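To make the operator idea concrete, a hypothetical manifest for such a custom resource might look like the following; the API group and every field name here are invented for illustration and are not from any released operator:

```yaml
# Hypothetical custom resource; all names and fields are illustrative only.
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: analytics-cluster
spec:
  scheduler:
    image: daskdev/dask:latest
  workers:
    image: daskdev/dask:latest
    minimum: 0    # the operator could scale worker Pods adaptively
    maximum: 10   # within these bounds, draining state before removal
```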