Automatically rescale to recover from pod deletion #717
Conversation
This is such an elegant solution!
We are exploring replacing every worker Pod with a worker Deployment in the future, but this could be a good short-term fix.
Adding a deletion finalizer does have downsides, mainly when it comes to testing and CI. Any worker Pod that gets created can only be deleted if the Dask Operator controller is running, which can be problematic in tests. But we can probably work around this if this solution works.
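For illustration only, here is a minimal sketch of the kind of test/CI escape hatch that implies — clearing a worker Pod's finalizers so it can be deleted even when the controller is not running. It assumes the official `kubernetes` Python client, and the function name is illustrative rather than anything in the operator:

```python
# Sketch of a test/CI workaround: strip a worker Pod's finalizers so it can be
# deleted even when the Dask Operator controller is not running.
# Assumes the official `kubernetes` Python client; names are illustrative.
from kubernetes import client, config

def force_delete_worker_pod(name: str, namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    core = client.CoreV1Api()
    # Setting finalizers to None removes them, unblocking garbage collection.
    core.patch_namespaced_pod(name, namespace, {"metadata": {"finalizers": None}})
    core.delete_namespaced_pod(name, namespace)
```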
Co-authored-by: Jacob Tomlinson <[email protected]>
Unfortunately it seems like pull requests with head branches in an org don't support "allow edits from maintainers" https://github.com/orgs/community/discussions/5634 - I can refork to my own account and reopen if that's easier.
Ah fair enough. Not to worry, thanks for pushing in my suggestion so quickly.
Looks like the test is hanging so I don't think this is working as expected.
What version of `helm` do you have installed?
I'm not seeing the same error when I try it myself. I asked for the version because I seem to remember this error message from the Helm v2 days. Can you try the command outside of `pytest`?

```console
$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.17.12"}

$ helm upgrade dask-gateway dask-gateway --install --repo=https://helm.dask.org --create-namespace --namespace dask-gateway
Release "dask-gateway" does not exist. Installing it now.
NAME: dask-gateway
LAST DEPLOYED: Thu May 18 17:20:28 2023
NAMESPACE: dask-gateway
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You've installed Dask-Gateway version 2023.1.1, from chart
version 2023.1.1!

Your release is named "dask-gateway" and installed into the
namespace "dask-gateway".

You can find the public address(es) at:

  $ kubectl --namespace=dask-gateway get service traefik-dask-gateway
```
This looks good to me (assuming tests eventually work).
I can definitely add a +1 to the difficulty of getting tests running locally and on my own fork. I've occasionally made a start on trying to fix this, then given up after spending an hour or two trying to run tests :(
A little off track, but is there an issue/PR to comment on regarding this? IMO a Deployment per worker doesn't add much over managing Pods ourselves directly, but a Deployment for an entire worker group might work with the pod deletion cost annotation and simplify the implementation of the dask operator.
#603 is the only issue related to this really. There is a small mention of it on our roadmap epic issue rapidsai/deployment#216.

Managing Pods directly is causing us a bit of a headache. I recently spoke to some folks from the Kubernetes batch-sig who suggested we move away from managing Pods altogether and think of it as an anti-pattern. We also need each worker to have a Service in order to avoid pod-to-pod communication, which can be disabled in some deployments by tools like Istio. Having many Services and one Deployment could work in theory, but feels funny.

The scheduler/controller need to coordinate to decide which workers to remove when scaling down. I guess they could then reduce the Pod deletion cost on the worker they want to remove and scale the Deployment down. However, this feels more fragile than deleting Pods/Deployments explicitly. For instance, does reducing the deletion cost on one Pod guarantee that it will be the next Pod to be removed, or does it just make it more likely?
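For illustration of that last idea (not something the operator does today), a scale-down could lower the `controller.kubernetes.io/pod-deletion-cost` annotation on the chosen worker and then shrink the Deployment. The sketch below assumes the official `kubernetes` Python client and illustrative names; note that the Kubernetes docs describe the annotation as best-effort, so it biases rather than guarantees which Pod the ReplicaSet controller removes first:

```python
# Hypothetical scale-down path using the pod-deletion-cost annotation.
# Assumes the official `kubernetes` Python client; the Pod/Deployment names
# are illustrative, not part of the dask operator.
from kubernetes import client, config

def retire_worker(pod: str, deployment: str, namespace: str, current_replicas: int) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()
    # Mark the chosen worker as the cheapest Pod to delete (best-effort hint only).
    core.patch_namespaced_pod(
        pod,
        namespace,
        {"metadata": {"annotations": {"controller.kubernetes.io/pod-deletion-cost": "-1000"}}},
    )
    # Then shrink the worker-group Deployment by one replica; the controller
    # prefers, but does not guarantee, removing the lowest-cost Pod.
    apps.patch_namespaced_deployment_scale(
        deployment, namespace, {"spec": {"replicas": current_replicas - 1}}
    )
```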
The command works outside of `pytest` […]
I'm wondering if there is a broken install of […] hanging around from a previous run. You could try deleting the `pytest-kind` cluster:

```console
$ kind delete cluster --name pytest-kind
# or
$ docker rm -f pytest-kind-control-plane
```
Nice - the […]
Whatever I do, […]
It sounds like you're running into the problem I mention in #717 (review). If a Pod has a finalizer […] There are three options here: […]
If we patch the finalizer, though, I imagine the pod won't come back up at all. Maybe the timer-based approach is the better one here.
The kopf docs mention: […] I think […]
Well at least now the thing that fails the test is the thing we're actually testing... progress? 🙃 I wonder if making the finaliser optional means it doesn't run at all?
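For reference, a generic kopf sketch (not the operator's actual handlers) of what the required versus optional finalizer looks like on a delete handler; the label selector mirrors the one used in this PR:

```python
import kopf

# With the default (blocking) behaviour, kopf adds a finalizer to matching
# objects so this handler is guaranteed to run, but deletion blocks until the
# operator has processed it (the CI problem mentioned earlier in this thread).
@kopf.on.delete(kind="pod", labels={"dask.org/component": "worker"})
def guaranteed_cleanup(name, logger, **kwargs):
    logger.info(f"cleaning up after worker pod {name}")

# With optional=True kopf adds no finalizer: deletion is never blocked, but if
# the object vanishes before the operator reacts, this handler may never fire.
@kopf.on.delete(kind="pod", labels={"dask.org/component": "worker"}, optional=True)
def best_effort_cleanup(name, logger, **kwargs):
    logger.info(f"best-effort cleanup for worker pod {name}")
```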
Crucially: […]

So I don't think the pod disappearing (the thing we're trying to handle here) would trigger deletion handlers at all. Maybe we can't do better than the timer method, though I can try the workaround in this issue.
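If the timer route wins out, a reconciliation loop might look roughly like this — a sketch only, assuming the replica count lives at `spec.worker.replicas` on the DaskWorkerGroup and using an assumed worker-group label, not the operator's real code:

```python
import kopf
from kubernetes import client

def get_worker_count(name: str, namespace: str) -> int:
    # Hypothetical helper: count this group's worker Pods (label name assumed).
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=f"dask.org/workergroup-name={name}"
    )
    return len(pods.items)

# Periodically reconcile each DaskWorkerGroup against the Pods that actually
# exist, instead of relying on deletion events or finalizers.
@kopf.timer("daskworkergroup.kubernetes.dask.org", interval=10.0)
async def reconcile_worker_replicas(spec, name, namespace, logger, **kwargs):
    desired = spec["worker"]["replicas"]  # assumed spec layout
    actual = get_worker_count(name, namespace)
    if actual < desired:
        logger.info(f"{name}: {actual}/{desired} workers running, recreating the rest")
        # ...create the missing worker Pods here, as the operator already
        # does when scaling up...
```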
```python
@kopf.on.event(
    kind="pod", when=resource_is_deleted, labels={"dask.org/component": "worker"}
)
```
I think with this decorator the `spec` that will be passed to `daskworkergroup_replica_update` will be the spec for the Pod, not the DaskWorkerGroup. I think we probably need to put this on a separate function that gets the spec for the worker group and then calls `daskworkergroup_replica_update`.
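Something like the following, perhaps — a rough sketch where a separate Pod event handler looks up the owning DaskWorkerGroup and passes on its spec. The `dask.org/workergroup-name` label, the CRD version, and the arguments given to `daskworkergroup_replica_update` are assumptions for illustration, not the operator's actual interface:

```python
import kopf
from kubernetes import client

# `resource_is_deleted` and `daskworkergroup_replica_update` are the functions
# defined in the operator module this PR touches.
@kopf.on.event(
    kind="pod", when=resource_is_deleted, labels={"dask.org/component": "worker"}
)
async def daskworker_pod_deleted(meta, name, namespace, spec, **kwargs):
    # Label name and CRD version below are assumptions for illustration.
    group_name = meta["labels"]["dask.org/workergroup-name"]
    crds = client.CustomObjectsApi()
    group = crds.get_namespaced_custom_object(
        group="kubernetes.dask.org",
        version="v1",
        namespace=namespace,
        plural="daskworkergroups",
        name=group_name,
    )
    # Pass the worker group's spec (not the Pod's) to the reconciler;
    # the exact arguments it expects are an assumption here.
    await daskworkergroup_replica_update(
        spec=group["spec"], name=group_name, namespace=namespace
    )
```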
Has this issue now been solved by #730?
@NakulK48 yes it has :). I'll close this out. Thanks for all the effort here.
Fixes #603 (hopefully!), ping @kwohlfahrt whose idea this is.

This is useful in a couple of cases:
- […]
- `kubectl delete` the pod and let Dask recover it.

I wasn't able to get the test running locally; I saw the following error and stack trace: […]