Missing idleTimeout key in daskcluster_autoshutdown #882
It looks like the idle timeout option isn't making it through to the resource in Kubernetes. Could you describe the cluster resource and check that it is set correctly? Could you also ensure you have the latest version of the operator installed?
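One way to do that check, as a minimal sketch using kr8s (the same client the operator uses); the namespace and the `daskclusters` kind lookup are assumptions based on the config shared below:

```python
import kr8s

# List the DaskCluster resources in the operator's namespace and print whether
# each one actually carries an idleTimeout in its spec.
for cluster in kr8s.get("daskclusters", namespace="dask-operator"):
    print(cluster.name, cluster.spec.get("idleTimeout"))
```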
This is the config YAML of the DaskCluster (I removed unnecessary parts), if this helps:

```yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  annotations:
    kopf.zalando.org/last-handled-configuration: >
      {"spec":} # Same spec dict as below
  creationTimestamp: '2024-04-15T15:37:26Z'
  finalizers:
    - kopf.zalando.org/KopfFinalizerMarker
  generation: 4
  managedFields:
    - apiVersion: kubernetes.dask.org/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:scheduler:
            .: {}
            f:service:
              .: {}
              f:ports:
                .: {}
                k:{"port":8786,"protocol":"TCP"}:
                  .: {}
                  f:name: {}
                  f:port: {}
                  f:protocol: {}
                  f:targetPort: {}
                k:{"port":8787,"protocol":"TCP"}:
                  .: {}
                  f:name: {}
                  f:port: {}
                  f:protocol: {}
                  f:targetPort: {}
              f:selector:
                .: {}
                f:dask.org/cluster-name: {}
                f:dask.org/component: {}
              f:type: {}
            f:spec:
              .: {}
              f:containers: {}
              f:imagePullSecrets: {}
          f:worker:
            .: {}
            f:replicas: {}
            f:spec:
              .: {}
              f:containers: {}
              f:imagePullSecrets: {}
              f:volumes: {}
        f:status:
          f:phase: {}
      manager: kr8s
      operation: Update
      time: '2024-04-15T15:37:26Z'
    - apiVersion: kubernetes.dask.org/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kopf.zalando.org/last-handled-configuration: {}
          f:finalizers:
            .: {}
            v:"kopf.zalando.org/KopfFinalizerMarker": {}
        f:status: {}
      manager: kopf
      operation: Update
      time: '2024-04-15T15:37:27Z'
  name: dask-cluster
  namespace: dask-operator
  resourceVersion: '712042645'
  uid: 3c7db72d-8f94-4904-b0c6-3e496f9b1ff6
spec:
  scheduler:
    ...
  worker:
    ...
status:
  phase: Running
```

The operator is running.
I'm using dask in conjunction with prefect, and the creation of the KubeCluster is handed over to the prefect DaskTaskRunner, however the spec is created like this:

```python
spec = make_cluster_spec(
    name=f"dask-cluster-{getuser()}-{now}",
    # ...
    n_workers=n_workers,
    resources=resources,
    idle_timeout=5,
)
runner = DaskTaskRunner(
    cluster_class="dask_kubernetes.operator.KubeCluster",
    cluster_kwargs={
        "idle_timeout": 5,
        "custom_cluster_spec": spec,
        "namespace": "dask-operator",
    },
)
```

Not sure if this is related somehow.
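It may also help to confirm the timeout survives spec generation before prefect is involved at all. A minimal sketch, assuming `make_cluster_spec` places the value under `spec["spec"]["idleTimeout"]` (the field name from the issue title); the name and worker count here are placeholders:

```python
from getpass import getuser

from dask_kubernetes.operator import make_cluster_spec

# Placeholder arguments; only idle_timeout matters for this check.
spec = make_cluster_spec(
    name=f"dask-cluster-{getuser()}-debug",
    n_workers=2,
    idle_timeout=5,
)

# If this prints None, the key is already missing before the resource is ever
# created in Kubernetes; if it prints 5, the drop happens later.
print(spec["spec"].get("idleTimeout"))
```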
This is strange, I don't see `idleTimeout` in that spec anywhere.
Can you confirm that
What about
I can confirm, it is set in the spec dict.

The same spec dict is also present in the `kopf.zalando.org/last-handled-configuration` annotation. So, as you pointed out, it is set correctly but not given to the resource properly. The worker and scheduler
Thanks for confirming. That dict gets passed straight to the create call, so there's nowhere for the key to get dropped in between. The only thing I can think of is that perhaps your CRDs are out of date and don't contain that property, and so Kubernetes is silently dropping it. Can you uninstall the operator and ensure the CRDs have been cleaned up, then install it again?
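One way to test that theory before reinstalling, as a sketch with kr8s; the CRD name here assumes the standard `<plural>.<group>` naming for the `kubernetes.dask.org` group:

```python
import json

import kr8s

# Fetch the installed DaskCluster CRD and check whether "idleTimeout" appears
# anywhere in its schema. If it doesn't, a structural schema would make
# Kubernetes silently prune the field on create, matching the symptom above.
crd = kr8s.objects.CustomResourceDefinition.get("daskclusters.kubernetes.dask.org")
print("idleTimeout" in json.dumps(crd.raw["spec"]))
```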
I did uninstall the operator and made sure everything related to it is gone, followed this guide, installed again, but unfortunately the problem persists. The exception is a bit annoying because it spams the logs of the kubernetes cluster, but it is not critical. My core issue was that resources were not deleted properly, but as a workaround I solved that by making sure to manually delete all deployments related to dask using

I'm open to other suggestions, otherwise if other people do not see this problem feel free to close the issue. Thanks for the quick help so far!
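For the cleanup workaround, a sketch of one way to script it (not necessarily what was used above), matching on the `dask.org/cluster-name` label visible in the resource earlier in the thread; the cluster name and namespace are the ones from this report:

```python
import kr8s

# Delete whatever the operator left behind for one cluster, selecting on the
# dask.org/cluster-name label it puts on its child resources.
for kind in ("deployments", "services", "pods"):
    for obj in kr8s.get(
        kind,
        namespace="dask-operator",
        label_selector={"dask.org/cluster-name": "dask-cluster"},
    ):
        obj.delete()
```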
Yeah, it's just strange that the key is being dropped somewhere. I also feel like it may be specific to your setup because nobody else has reported it. We could easily change
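Purely to illustrate the kind of defensive change being discussed (this is not the operator's actual implementation), a kopf timer could tolerate a missing key along these lines; the resource selector, interval, and handler body are assumptions, only the handler name comes from the issue title:

```python
import kopf

@kopf.timer("daskclusters.kubernetes.dask.org", interval=5.0)
async def daskcluster_autoshutdown(spec, name, namespace, logger, **kwargs):
    # Treat a missing or zero idleTimeout as "autoshutdown disabled" instead of
    # raising a KeyError that kopf would then retry indefinitely.
    idle_timeout = spec.get("idleTimeout")
    if not idle_timeout:
        return
    logger.debug("Checking %s for idleness (timeout %ss)", name, idle_timeout)
    # ... idleness check and cluster deletion would go here ...
```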
Hello, I think I am facing the same issue. None of my attempts to have the cluster automatically shut down after the idle timeout have worked.
I used to get the same error as the OP before updating to the latest version of dask + operator. Now, I see the following log message every 5 seconds, but the cluster never shuts down:
If I describe the DaskCluster resource, the idleTimeout field is missing there as well. Also, I have verified using debug breakpoints that the parameter is set correctly on the client side. So as @jacobtomlinson said, it seems like this parameter is dropped somewhere in between the constructor call and the resource creation in the cluster. Uninstalling and re-installing the operator unfortunately did not fix the issue.

Package versions: dask-kubernetes-operator-2024.5.0 helm chart with app version 2022.4.1
Describe the issue:
My KubeClusters sometimes do not get shut down properly on kubernetes when they're done with their work. Kubernetes logs state that there's an exception in a kopf finalizer which is retried indefinitely, apparently due to the spec dict given to `daskcluster_autoshutdown`:

When I remove these lines from the DaskCluster resource YAML in kubernetes, the problem is gone.

Is it correct that `daskcluster_autoshutdown` as below receives `spec` as a specification dict, e.g. from `make_cluster_spec(..., idle_timeout=5)`? I tried explicitly adding the `idle_timeout`, but the problem persists.

Not sure if this is a proper bug, or an issue with kopf, or whether anything is misconfigured on my end. Appreciate any help.
I'd also be fine with just removing the timer/finalizer if that's possible.
Anything else we need to know?:
Environment: