Operator unable to delete Kubernetes Deployment #910

Open
thaisarcanjo-ow opened this issue Oct 14, 2024 · 1 comment
thaisarcanjo-ow commented Oct 14, 2024

There is an issue with the default settings from the docs where the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. The Operator tries to delete a Deployment named after the worker Pod, and no Deployment with that name exists.

Steps to reproduce:

  1. Install the Operator with helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator, i.e. this quick start step.
  2. Create the cluster using the default YAML from this guide as is. At this stage, two workers are running, each backed by its own Deployment.
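For reference, the same cluster can presumably also be created from Python instead of applying the YAML; a minimal sketch, assuming the KubeCluster API from dask_kubernetes.operator with the operator already installed:

# Sketch: create the same "simple" cluster with two workers via the operator's Python API.
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(name="simple", n_workers=2, namespace="default")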
  3. Create an autoscaler with the minimum number of workers set to 0:
# autoscaler.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: simple
spec:
  cluster: "simple"
  minimum: 0  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
  maximum: 10 # you can place a hard limit on the number of workers regardless of what the scheduler requests
  4. Apply the autoscaler settings:
kubectl apply -f autoscaler.yaml
daskautoscaler.kubernetes.dask.org/simple created
  5. At this stage, the operator already tries to remove a deployment, but it attempts to delete a Deployment resource whose name matches the worker Pod name, which doesn't exist:
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Autoscaler updated simple worker count from 2 to 1
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-14 09:22:42,662] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/kubernetes.dask.org/v1/namespaces/default/daskclusters?fieldSelector=metadata.name%3Dsimple "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,668] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/apps/v1/namespaces/default/deployments?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,673] kopf.objects         [INFO    ] [default/simple-default] Scaled worker group simple-default up to 1 workers.
[2024-10-14 09:22:42,677] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,687] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,693] kopf.objects         [WARNING ] [default/simple-default] Scaling simple-default failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html.
[2024-10-14 09:22:42,697] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/pods?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,701] kopf.objects         [INFO    ] [default/simple-default] Workers to close: ['simple-default-worker-057ae426b6-79bcbdb84b-vlcn7']
[2024-10-14 09:22:42,705] httpx                [INFO    ] HTTP Request: DELETE https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 "HTTP/1.1 404 Not Found"
[2024-10-14 09:22:42,705] kopf.objects         [ERROR   ] [default/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found

If I check the pods, the name simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 that it tried to delete does exist, but as a worker Pod, not a Deployment:

kubectl get pods -l dask.org/cluster-name=simple
NAME                                                READY   STATUS    RESTARTS   AGE
simple-default-worker-057ae426b6-79bcbdb84b-vlcn7   1/1     Running   0          9m36s
simple-default-worker-54afdedac5-6bdb8f746b-7lzsg   1/1     Running   0          9m36s
simple-scheduler-78db7fbfd8-zmwgr                   1/1     Running   0          9m36s

However, the Deployment that controls this Pod has a different name:

kubectl get deployments -l dask.org/cluster-name=simple
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
simple-default-worker-057ae426b6   1/1     1            1           15m
simple-default-worker-54afdedac5   1/1     1            1           15m
simple-scheduler                   1/1     1            1           15m

As you can see, the Deployment that controls that worker Pod is actually named simple-default-worker-057ae426b6 rather than simple-default-worker-057ae426b6-79bcbdb84b-vlcn7, so the operator is unable to delete the deployments and the workers are never removed from the namespace. It could be coming from this line, where the deletion uses the worker name as the expected Deployment name.
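For illustration, one way to resolve the Deployment that actually owns a worker Pod is to follow the Pod's ownerReferences (Pod -> ReplicaSet -> Deployment) rather than assuming the Deployment shares the Pod's name. A rough sketch using kr8s (not the operator's actual code; the helper name is made up):

# Sketch: given a worker Pod name, delete the Deployment that owns it by
# walking ownerReferences (Pod -> ReplicaSet -> Deployment).
from kr8s.objects import Deployment, Pod, ReplicaSet

def delete_owning_deployment(pod_name: str, namespace: str = "default") -> None:
    pod = Pod.get(pod_name, namespace=namespace)
    rs_ref = next(r for r in pod.metadata.get("ownerReferences", []) if r["kind"] == "ReplicaSet")
    rs = ReplicaSet.get(rs_ref["name"], namespace=namespace)
    deploy_ref = next(r for r in rs.metadata.get("ownerReferences", []) if r["kind"] == "Deployment")
    Deployment.get(deploy_ref["name"], namespace=namespace).delete()

delete_owning_deployment("simple-default-worker-057ae426b6-79bcbdb84b-vlcn7")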

Anything else we need to know?:
This may be related to #855.

Environment:

  • Dask version: 2024.9.1
  • Python version: 3.11
  • Operating System: Mac/Linux
  • Install method (conda, pip, source): pip

thaisarcanjo-ow commented Oct 15, 2024

To provide some extra information, it seems the operator tries three approaches to work out which worker/deployment to remove:

  1. Dashboard HTTP API here
  2. Dask RPC here
  3. Kubernetes API here (I think the fallback option should list Deployments rather than Pods here)

From the logs, we see that the first two failed, which was a bit unexpected given that the operator can scale the workers up.
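As a sanity check (not something from the operator itself), the worker names that the scheduler reports can be listed directly; they should be the DASK_WORKER_NAME values, i.e. the Deployment names rather than the Pod names. A minimal sketch, assuming the scheduler service is port-forwarded on the default comm port:

# Sketch: list the worker names known to the Dask scheduler.
# Assumes: kubectl port-forward svc/simple-scheduler 8786:8786
from distributed import Client

client = Client("tcp://localhost:8786")
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["name"])
client.close()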
We added some parameters to the operator to get debug logs with

helm install --repo https://helm.dask.org --create-namespace -n dask-operator dask-kubernetes-operator dask-kubernetes-operator --set kopfArgs="{--all-namespaces,--verbose,--debug}"

and could see that there were some 404 responses in the body (it would be useful to see which request each one belonged to). After digging through the issues here, #807 shed some light on adding distributed.http.scheduler.api to the distributed.scheduler.http.routes Dask config, so we added that to the config map as:

    # config map settings applied to the dask-cluster
    distributed:
      scheduler:
        http:
          routes:
          - distributed.http.scheduler.prometheus
          - distributed.http.scheduler.info
          - distributed.http.scheduler.json
          - distributed.http.health
          - distributed.http.proxy
          - distributed.http.statics
          - distributed.http.scheduler.api 

then recreated the scheduler. We could see that the first HTTP call to get the workers to retire now returned the right name (which seems to be the value of the DASK_WORKER_NAME env var, given that the dashboard shows the workers named like that, i.e. matching the Deployment name), and the workers are then removed once all tasks have been computed:

[2024-10-15 15:40:04,912] kopf.objects         [INFO    ] [my-namespace/dask-autoscaler] Autoscaler updated dask-cluster worker count from 2 to 1
[2024-10-15 15:40:04,912] kopf.objects         [INFO    ] [my-namespace/dask-autoscaler] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-15 15:40:04,997] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?fieldSelector=metadata.name%3Ddask-cluster "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,022] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?labelSelector=dask.org%2Fworkergroup-name%3Ddask-cluster-default "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,034] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default up to 1 workers.
[2024-10-15 15:40:05,041] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services?fieldSelector=metadata.name%3Ddask-cluster-scheduler "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,057] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Retired workers {'tcp://172.18.69.197:34793': {'type': 'Worker', 'id': 'dask-cluster-default-worker-9e4e522e22', 'host': '172.18.69.197', 'resources': {}, 'local_directory': '/tmp/dask-scratch-space/worker-20y99qa3', 'name': 'dask-cluster-default-worker-9e4e522e22', 'nthreads': 1, 'memory_limit': 12000000000, 'last_seen': 1729006804.7547565, 'services': {'dashboard': 44215}, 'metrics': {'task_counts': {}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'tick-duration': 0.5005748271942139, 'latency': 0.0019073486328125}, 'managed_bytes': 0, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 12, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 36}, 'event_loop_interval': 0.020009407997131346, 'cpu': 4.0, 'memory': 187879424, 'time': 1729006804.256867, 'host_net_io': {'read_bps': 285.6785263997705, 'write_bps': 1480.334182253356}, 'host_disk_io': {'read_bps': 8182.791917017202, 'write_bps': 270032.1332615676}, 'num_fds': 22}, 'status': 'closed', 'nanny': 'tcp://172.18.69.197:41727'}}
[2024-10-15 15:40:05,058] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Workers to close: ['dask-cluster-default-worker-9e4e522e22']
[2024-10-15 15:40:05,067] httpx                [INFO    ] HTTP Request: DELETE https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/dask-cluster-default-worker-9e4e522e22 "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,067] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default down to 1 workers.
[2024-10-15 15:40:05,068] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2024-10-15 15:40:05,068] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Updating is processed: 1 succeeded; 0 failed.
[2024-10-15 15:40:07,830] kopf.objects         [INFO    ] [my-namespace/dask-cluster] Timer 'daskcluster_autoshutdown' succeeded.

Is adding the distributed.http.scheduler.api setting the correct way to get the downscale part of the autoscaler working? It wasn't required to get scale-up working (workers are created correctly).
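For anyone checking whether the scheduler HTTP API is actually enabled after adding distributed.http.scheduler.api, a small sketch (assuming the dashboard/HTTP port is port-forwarded on 8787 and that the API routes live under /api/v1, as discussed in #807):

# Sketch: probe the scheduler HTTP API after enabling distributed.http.scheduler.api.
# Assumes: kubectl port-forward svc/simple-scheduler 8787:8787
import httpx

resp = httpx.get("http://localhost:8787/api/v1")
print(resp.status_code)  # a 404 here would suggest the API routes are still not enabled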
