
Automatically rescale to recover from pod deletion #717

Closed
wants to merge 12 commits

Conversation

NakulK48

Fixes #603 (hopefully!), ping @kwohlfahrt whose idea this is

This is useful in a couple of cases:

  • When running on a cloud deployment (especially on something like AWS spot) the node underlying your pod can suddenly disappear, taking down the pod
  • When you want to restart a faulty pod (we've seen things like CUDA errors on long-running pods), you can just kubectl delete the pod and let Dask recover it.

I wasn't able to get the test running locally; I saw the following error and stack trace

Error: UPGRADE FAILED: "dask-gateway" has no deployed releases
    @pytest.fixture(scope="session", autouse=True)
    def install_gateway(k8s_cluster):
        check_dependency("helm")
        # To ensure the operator can coexist with Gateway
>       subprocess.run(
            [
                "helm",
                "upgrade",
                "dask-gateway",
                "dask-gateway",
                "--install",
                "--repo=https://helm.dask.org",
                "--create-namespace",
                "--namespace",
                "dask-gateway",
            ],
            check=True,
            env={**os.environ},
        )

@jacobtomlinson (Member) left a comment

This is such an elegant solution!

We are looking at replacing every worker Pod with a worker Deployment in the future, but this could be a good short-term fix.

Adding a deletion finalizer does have downsides, mainly when it comes to testing and CI. Any worker Pod that gets created can only be deleted if the Dask Operator controller is running, which can be problematic in tests. But we can probably work around this if this solution works.

dask_kubernetes/operator/controller/controller.py (review thread outdated; resolved)
@jacobtomlinson marked this pull request as ready for review May 18, 2023 12:40
@NakulK48 (Author)

Unfortunately it seems that pull requests whose head branch lives in an org don't support "allow edits from maintainers" (https://github.com/orgs/community/discussions/5634). I can re-fork to my own account and reopen if that's easier.

@jacobtomlinson (Member)

Ah fair enough. Not to worry, thanks for pushing in my suggestion so quickly.

@jacobtomlinson (Member)

Looks like the test is hanging so I don't think this is working as expected.

I wasn't able to get the test running locally; I saw the following error and stack trace

What version of helm do you have installed?

@NakulK48 (Author)

$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.17.12"}

@jacobtomlinson (Member)

I'm not seeing the same error when I try it myself. I asked for the version because I seem to remember this error message from the Helm v2 days.

Can you try the command outside of pytest and see if it works?

$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.17.12"}

$ helm upgrade dask-gateway dask-gateway --install --repo=https://helm.dask.org --create-namespace --namespace dask-gateway
Release "dask-gateway" does not exist. Installing it now.
NAME: dask-gateway
LAST DEPLOYED: Thu May 18 17:20:28 2023
NAMESPACE: dask-gateway
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You've installed Dask-Gateway version 2023.1.1, from chart
version 2023.1.1!

Your release is named "dask-gateway" and installed into the
namespace "dask-gateway".

You can find the public address(es) at:

  $ kubectl --namespace=dask-gateway get service traefik-dask-gateway

@kwohlfahrt left a comment

This looks good to me (assuming tests eventually work).

I can definitely add a +1 to the difficulty of getting tests running locally and on my own fork. I've occasionally made a start on trying to fix this, then given up after spending an hour or two trying to run the tests :(

@kwohlfahrt

We are looking at replacing every worker Pod with a worker Deployment in the future, but this could be a good short-term fix.

A little off track, is there an issue/PR to comment on regarding this? IMO a deployment per worker doesn't add much over managing pods ourselves directly, but a deployment for an entire worker group might work with the pod deletion cost annotation, and simplify the implementation of the dask operator.

@jacobtomlinson (Member)

A little off track, is there an issue/PR to comment on regarding this? IMO a deployment per worker doesn't add much over managing pods ourselves directly

#603 is really the only issue related to this. There is a brief mention of it on our roadmap epic issue rapidsai/deployment#216.

Managing Pods directly is causing us a bit of a headache. I recently spoke to some folks from the Kubernetes batch-sig who suggested we move away from managing Pods altogether and that we should think of it as an anti-pattern.

We also need each worker to have a Service in order to avoid pod-to-pod communication, which can be disabled in some deployments by tools like Istio. Having many Services and one Deployment could work in theory, but feels funny.

The scheduler/controller need to coordinate to decide which workers to remove when scaling down. I guess they could then reduce the Pod deletion cost on the worker they want to remove and scale the Deployment down. However, this feels more fragile than deleting Pods/Deployments explicitly. For instance, does reducing the deletion cost on one Pod guarantee that it will be the next Pod to be removed? Or does it just make it more likely?
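
For reference, the annotation in question is controller.kubernetes.io/pod-deletion-cost. A rough sketch of the idea, using the official Kubernetes Python client (the Pod and namespace names are placeholders, not anything from the operator):

from kubernetes import client, config

# Sketch only: lower the deletion cost on the worker Pod we want to retire so
# that the ReplicaSet controller prefers to remove it when the Deployment is
# scaled down. Pod and namespace names are placeholders.
config.load_kube_config()
core = client.CoreV1Api()
core.patch_namespaced_pod(
    name="example-dask-worker",
    namespace="default",
    body={
        "metadata": {
            "annotations": {"controller.kubernetes.io/pod-deletion-cost": "-1000"}
        }
    },
)

As far as I understand the Kubernetes docs, the cost is best-effort: Pods with a lower cost are preferred for removal, but it is not a hard guarantee.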

@NakulK48 (Author)

I'm not seeing the same error when I try it myself. I asked for the version because I seem to remember this error message from the Helm v2 days.

Can you try the command outside of pytest and see if it works?

The command works outside of pytest but continues to fail inside it - maybe an issue with KIND specifically?

@jacobtomlinson (Member)

I'm wondering if there is a broken install of dask-gateway hanging around on your kind cluster. Can you delete the kind cluster and try again?

$ kind delete cluster --name pytest-kind
# or
$ docker rm -f pytest-kind-control-plane

@NakulK48 (Author)

Nice - the kind command didn't work (I don't have the tool installed), but the docker rm did, and I'm now able to get tests running locally.

@NakulK48 (Author) commented May 22, 2023

Whatever I do, kubectl delete pod seems to time out - is there some special invocation I need, or some better way to see what happens when a pod disappears (maybe somehow deleting the underlying node, if that's a thing in kind?)

@jacobtomlinson (Member)

It sounds like you're running into the problem I mention in #717 (review).

If a Pod has an @on.delete decorator then kopf will register a finalizer that blocks the Pod from being deleted until the finalizer is removed. If the with kopf_runner(): context manager has exited before the finalizer has been removed then it will deadlock and time out.

There are three options here:

  • Don't use @on.delete events (our current approach)
  • Ensure the runner doesn't exit until all resources have been deleted (this may need to be done throughout the test suite)
  • Patch the finalizers at the end of each test with a fixture (a rough sketch of this is below).
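
For that third option, a hypothetical teardown fixture might look roughly like this (a sketch only, using the official Kubernetes Python client; the fixture name and label selector are assumptions rather than anything in the test suite):

import pytest
from kubernetes import client, config

@pytest.fixture
def clear_worker_finalizers():
    # Hypothetical sketch: after the test body finishes, strip any finalizers
    # left on worker Pods so they can actually be deleted once the kopf runner
    # has exited.
    yield
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_pod_for_all_namespaces(
        label_selector="dask.org/component=worker"
    )
    for pod in pods.items:
        if pod.metadata.finalizers:
            core.patch_namespaced_pod(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
                body={"metadata": {"finalizers": None}},
            )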

@NakulK48 (Author)

If we patch the finalizer, though, I imagine the pod won't come back up at all. Maybe the timer-based approach is the better one here.

@kwohlfahrt

The kopf docs mention:

If a delete handler is added but finalizers are not required to block the actual deletion, i.e. the handler is optional, the optional argument optional=True can be passed to the delete cause decorator

I think optional=True is what we want anyway - we don't want to block pod deletion, just respond to it when it happens.
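
As a sketch (the handler name and body are placeholders, not the controller's actual code), that would look something like:

import kopf

@kopf.on.delete(
    "pods", labels={"dask.org/component": "worker"}, optional=True
)
async def daskworker_pod_deleted(name, namespace, logger, **kwargs):
    # With optional=True, kopf should not attach a blocking finalizer, so Pod
    # deletion is not held up by the operator. Whether this handler still fires
    # when a Pod vanishes without a finalizer is questioned further down the
    # thread.
    logger.info(f"Worker pod {name} in namespace {namespace} was deleted")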

@NakulK48 (Author) commented May 23, 2023

Well at least now the thing that fails the test is the thing we're actually testing... progress? 🙃

I wonder if making the finaliser optional means it doesn't run at all?

@NakulK48 (Author) commented May 24, 2023

nolar/kopf#701

Crucially:

When a resource is deleted without finalizers, it is just deleted, and that's it. Only a DELETED event is sent.

However, Kopf interprets the DELETED event as a "resource is gone" reason/cause type, i.e. not the actual deletion. And for this, it just logs that the resource is gone, without even trying to invoke any handlers.

So I don't think the pod disappearing (the thing we're trying to handle here) would trigger deletion handlers at all. Maybe we can't do better than the timer method, though I can try the workaround in this issue.

Comment on lines +469 to +471
@kopf.on.event(
kind="pod", when=resource_is_deleted, labels={"dask.org/component": "worker"}
)
Member

I think with this decorator the spec that will be passed to daskworkergroup_replica_update will be the spec for the Pod, not the DaskWorkerGroup.

I think we probably need to put this on a separate function that gets the spec for the worker group and then calls daskworkergroup_replica_update.
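
Something along those lines, as a heavily hypothetical sketch (the predicate body, the workergroup-name label, and the CRD coordinates are assumptions; daskworkergroup_replica_update's real signature is whatever controller.py defines, so the final call is only indicated in a comment):

import kopf
from kubernetes import client

def resource_is_deleted(event, **kwargs):
    # Assumed predicate: only react to raw DELETED watch events.
    return event["type"] == "DELETED"

@kopf.on.event(
    kind="pod", when=resource_is_deleted, labels={"dask.org/component": "worker"}
)
def daskworker_pod_deleted(meta, namespace, **kwargs):
    # Hypothetical wrapper: look up the owning DaskWorkerGroup so that its spec,
    # not the Pod's, is what ends up driving the replica update.
    group_name = meta["labels"]["dask.org/workergroup-name"]  # assumed label
    api = client.CustomObjectsApi()
    worker_group = api.get_namespaced_custom_object(
        group="kubernetes.dask.org",
        version="v1",
        plural="daskworkergroups",
        namespace=namespace,
        name=group_name,
    )
    # ...then call daskworkergroup_replica_update with worker_group's spec/meta.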

@NakulK48 (Author)

Has this issue now been solved by #730?

@jacobtomlinson (Member) commented Jul 11, 2023

@NakulK48 yes it has :). I'll close this out. Thanks for all the effort here.

Linked issue: Worker group doesn't recover after pod deletion
3 participants