
Automatically rescale to recover from pod deletion #717

Closed
wants to merge 12 commits

Conversation

NakulK48

Fixes #603 (hopefully!), ping @kwohlfahrt whose idea this is

This is useful in a couple of cases:

  • When running on a cloud deployment (especially on something like AWS spot) the node underlying your pod can suddenly disappear, taking down the pod
  • When you want to restart a faulty pod (we've seen things like CUDA errors on long-running pods), you can just kubectl delete the pod and let Dask recover it.

I wasn't able to get the test running locally; I saw the following error and stack trace

Error: UPGRADE FAILED: "dask-gateway" has no deployed releases
    @pytest.fixture(scope="session", autouse=True)
    def install_gateway(k8s_cluster):
        check_dependency("helm")
        # To ensure the operator can coexist with Gateway
>       subprocess.run(
            [
                "helm",
                "upgrade",
                "dask-gateway",
                "dask-gateway",
                "--install",
                "--repo=https://helm.dask.org",
                "--create-namespace",
                "--namespace",
                "dask-gateway",
            ],
            check=True,
            env={**os.environ},
        )

@jacobtomlinson (Member) left a comment

This is such an elegant solution!

We are looking at replacing every worker Pod with a worker Deployment in the future, but this could be a good short-term fix.

Adding a deletion finalizer does have downsides, mainly when it comes to testing and CI. Any worker Pod that gets created can only be deleted if the Dask Operator controller is running, which can be problematic in tests. But we can probably work around this if this solution works.

dask_kubernetes/operator/controller/controller.py (review thread outdated; resolved)
@jacobtomlinson marked this pull request as ready for review May 18, 2023 12:40
@NakulK48 (Author)

Unfortunately it seems that pull requests whose head branch lives in an org don't support "allow edits from maintainers" (https://github.com/orgs/community/discussions/5634). I can re-fork to my own account and reopen if that's easier.

@jacobtomlinson (Member)

Ah fair enough. Not to worry, thanks for pushing in my suggestion so quickly.

@jacobtomlinson (Member)

Looks like the test is hanging so I don't think this is working as expected.

I wasn't able to get the test running locally; I saw the following error and stack trace

What version of helm do you have installed?

@NakulK48 (Author)

$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.17.12"}

@jacobtomlinson (Member)

I'm not seeing the same error when I try it myself. I asked for the version because I seem to remember this error message from the Helm v2 days.

Can you try the command outside of pytest and see if it works?

$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.17.12"}

$ helm upgrade dask-gateway dask-gateway --install --repo=https://helm.dask.org --create-namespace --namespace dask-gateway
Release "dask-gateway" does not exist. Installing it now.
NAME: dask-gateway
LAST DEPLOYED: Thu May 18 17:20:28 2023
NAMESPACE: dask-gateway
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You've installed Dask-Gateway version 2023.1.1, from chart
version 2023.1.1!

Your release is named "dask-gateway" and installed into the
namespace "dask-gateway".

You can find the public address(es) at:

  $ kubectl --namespace=dask-gateway get service traefik-dask-gateway

@kwohlfahrt left a comment

This looks good to me (assuming tests eventually work).

I can definitely add a +1 to the difficulty of getting tests running locally and on my own fork. I've occasionally made a start on trying to fix this, then given up after spending an hour or two trying to run the tests :(

@kwohlfahrt

We are looking at replacing every worker Pod with a worker Deployment in the future, but this could be a good short-term fix.

A little off track, is there an issue/PR to comment on regarding this? IMO a deployment per worker doesn't add much over managing pods ourselves directly, but a deployment for an entire worker group might work with the pod deletion cost annotation, and simplify the implementation of the dask operator.

@jacobtomlinson (Member)

A little off track, is there an issue/PR to comment on regarding this? IMO a deployment per worker doesn't add much over managing pods ourselves directly

#603 is really the only issue related to this. There is a brief mention of it on our roadmap epic issue rapidsai/deployment#216.

Managing Pods directly is causing us a bit of a headache. I recently spoke to some folks from the Kubernetes batch-sig who suggested we move away from managing Pods altogether and that we should think of it as an anti-pattern.

We also need each worker to have a Service in order to avoid pod-to-pod communication, which can be disabled in some deployments by tools like Istio. Having many Services and one Deployment could work in theory, but feels funny.

The scheduler/controller need to coordinate to decide which workers to remove when scaling down. I guess they could then reduce the Pod deletion cost on the worker they want to remove and scale the Deployment down. However, this feels more fragile than deleting Pods/Deployments explicitly. For instance, does reducing the deletion cost on one Pod guarantee that it will be the next Pod to be removed? Or does it just make it more likely?
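
For reference, the annotation in question is controller.kubernetes.io/pod-deletion-cost. A rough sketch of the idea, using the official Kubernetes Python client (the Pod and namespace names are placeholders, not anything from the operator):

from kubernetes import client, config

# Sketch only: lower the deletion cost on the worker Pod we want to retire so
# that the ReplicaSet controller prefers to remove it when the Deployment is
# scaled down. Pod and namespace names are placeholders.
config.load_kube_config()
core = client.CoreV1Api()
core.patch_namespaced_pod(
    name="example-dask-worker",
    namespace="default",
    body={
        "metadata": {
            "annotations": {"controller.kubernetes.io/pod-deletion-cost": "-1000"}
        }
    },
)

As far as I understand the Kubernetes docs, the cost is best-effort: Pods with a lower cost are preferred for removal, but it is not a hard guarantee.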

@NakulK48 (Author)

I'm not seeing the same error when I try it myself. I asked for the version because I seem to remember this error message from the Helm v2 days.

Can you try the command outside of pytest and see if it works?

The command works outside of pytest but continues to fail inside it - maybe an issue with KIND specifically?

@jacobtomlinson (Member)

I'm wondering if there is a broken install of dask-gateway hanging around on your kind cluster. Can you delete the kind cluster and try again?

$ kind delete cluster --name pytest-kind
# or
$ docker rm -f pytest-kind-control-plane

@NakulK48 (Author)

Nice - the kind command didn't work (I don't have the tool installed), but the docker rm did, and I'm now able to get tests running locally.

@NakulK48 (Author) commented May 22, 2023

Whatever I do, kubectl delete pod seems to time out - is there some special invocation I need, or some better way to see what happens when a pod disappears (maybe somehow deleting the underlying node, if that's a thing in kind?)

@jacobtomlinson (Member)

It sounds like you're running into the problem I mention in #717 (review).

If a Pod has an @on.delete decorator then kopf will register a finalizer that blocks the Pod from being deleted until the finalizer is removed. If the with kopf_runner(): context manager has exited before the finalizer has been removed then it will deadlock and time out.

There are three options here:

  • Don't use @on.delete events (our current approach)
  • Ensure the runner doesn't exit until all resources have been deleted (this may need to be done throughout the test suite)
  • Patch the finalizers at the end of each test with a fixture (a rough sketch of this is below).
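
For that third option, a hypothetical teardown fixture might look roughly like this (a sketch only, using the official Kubernetes Python client; the fixture name and label selector are assumptions rather than anything in the test suite):

import pytest
from kubernetes import client, config

@pytest.fixture
def clear_worker_finalizers():
    # Hypothetical sketch: after the test body finishes, strip any finalizers
    # left on worker Pods so they can actually be deleted once the kopf runner
    # has exited.
    yield
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_pod_for_all_namespaces(
        label_selector="dask.org/component=worker"
    )
    for pod in pods.items:
        if pod.metadata.finalizers:
            core.patch_namespaced_pod(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
                body={"metadata": {"finalizers": None}},
            )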

@NakulK48 (Author)

If we patch the finalizer, though, I imagine the pod won't come back up at all. Maybe the timer-based approach is the better one here.

@kwohlfahrt

The kopf docs mention:

If a delete handler is added but finalizers are not required to block the actual deletion, i.e. the handler is optional, the optional argument optional=True can be passed to the delete cause decorator

I think optional=True is what we want anyway - we don't want to block pod deletion, just respond to it when it happens.
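
As a sketch (the handler name and body are placeholders, not the controller's actual code), that would look something like:

import kopf

@kopf.on.delete(
    "pods", labels={"dask.org/component": "worker"}, optional=True
)
async def daskworker_pod_deleted(name, namespace, logger, **kwargs):
    # With optional=True, kopf should not attach a blocking finalizer, so Pod
    # deletion is not held up by the operator. Whether this handler still fires
    # when a Pod vanishes without a finalizer is questioned further down the
    # thread.
    logger.info(f"Worker pod {name} in namespace {namespace} was deleted")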

@NakulK48 (Author) commented May 23, 2023

Well at least now the thing that fails the test is the thing we're actually testing... progress? 🙃

I wonder if making the finaliser optional means it doesn't run at all?

@NakulK48 (Author) commented May 24, 2023

nolar/kopf#701

Crucially:

When a resource is deleted without finalizers, it is just deleted, and that's it. Only a DELETED event is sent.

However, Kopf interprets the DELETED event as a "resource is gone" reason/cause type, i.e. not the actual deletion. And for this, it just logs that the resource is gone, without even trying to invoke any handlers.

So I don't think the pod disappearing (the thing we're trying to handle here) would trigger deletion handlers at all. Maybe we can't do better than the timer method, though I can try the workaround in this issue.

Comment on lines +469 to +471
@kopf.on.event(
kind="pod", when=resource_is_deleted, labels={"dask.org/component": "worker"}
)
Member

I think with this decorator the spec that will be passed to daskworkergroup_replica_update will be the spec for the Pod, not the DaskWorkerGroup.

I think we probably need to put this on a separate function that gets the spec for the worker group and then calls daskworkergroup_replica_update.
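
Something along those lines, as a heavily hypothetical sketch (the predicate body, the workergroup-name label, and the CRD coordinates are assumptions; daskworkergroup_replica_update's real signature is whatever controller.py defines, so the final call is only indicated in a comment):

import kopf
from kubernetes import client

def resource_is_deleted(event, **kwargs):
    # Assumed predicate: only react to raw DELETED watch events.
    return event["type"] == "DELETED"

@kopf.on.event(
    kind="pod", when=resource_is_deleted, labels={"dask.org/component": "worker"}
)
def daskworker_pod_deleted(meta, namespace, **kwargs):
    # Hypothetical wrapper: look up the owning DaskWorkerGroup so that its spec,
    # not the Pod's, is what ends up driving the replica update.
    group_name = meta["labels"]["dask.org/workergroup-name"]  # assumed label
    api = client.CustomObjectsApi()
    worker_group = api.get_namespaced_custom_object(
        group="kubernetes.dask.org",
        version="v1",
        plural="daskworkergroups",
        namespace=namespace,
        name=group_name,
    )
    # ...then call daskworkergroup_replica_update with worker_group's spec/meta.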

@NakulK48 (Author)

Has this issue now been solved by #730?

@jacobtomlinson (Member) commented Jul 11, 2023

@NakulK48 yes it has :). I'll close this out. Thanks for all the effort here.

Linked issue: Worker group doesn't recover after pod deletion
3 participants