-
-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cannot close cluster after interrupt (ctrl-c) #444
Comments
I can try to work on it though not sure I know enough yet. Is there a "development" guide that I can use to get up and running? |
BTW, as a workaround, I am using the following to "manually" find and delete the nodes: import pathlib
from kubernetes import client as k8sClient, config
DASK_CLUSTER_NAME_LABEL = "dask.org/cluster-name"
DASK_COMPONENT_LABEL = "dask.org/component"
def load_k8s_config():
"""load config - either from .kube in home directory (for local clients) or in-cluster (for e.g. CI)"""
if (pathlib.Path.home() / ".kube" / "config").exists():
config.load_kube_config()
else:
config.load_incluster_config()
def delete_adhoc_cluster(cluster_name: str, namespace: str):
"""delete the dask cluster created by KubeCluster"""
load_k8s_config()
v1 = k8sClient.CoreV1Api()
pods_to_delete = []
pods_list = v1.list_namespaced_pod(namespace=namespace)
for pod in pods_list.items:
if (
DASK_CLUSTER_NAME_LABEL in pod.metadata.labels
and pod.metadata.labels[DASK_CLUSTER_NAME_LABEL] == cluster_name
):
pods_to_delete.append(
dict(
name=pod.metadata.name,
component=pod.metadata.labels.get(DASK_COMPONENT_LABEL, "n/a")
)
)
for pod in pods_to_delete:
print(f"deleting {pod['name']} ({pod['component']})...")
v1.delete_namespaced_pod(pod['name'], "research") |
Apologies looks like I had typed a response here but forgot to press the button, nice that GitHub saves these things! Thanks @cagantomer I am able to reproduce this locally. My guess from looking at the traceback is that the cluster manager is unable to connect to the scheduler after things start closing out. Perhaps it's worth exploring the port forwarding code in case it has been closed. You might also be interested in trying the operator we've been working on which is intended to supersede the current |
I saw the operator efforts and I kind of have mixed feeling towards it. I am reluctant regarding anything that requires extra installations :-) I really liked the KubeCluster experience - just having the right permissions to k8s and everything works relatively smoothly... What are the advantages of using the operator? |
That's totally understandable. I'm optimistic that the couple of extra lines you need to run the first time you use it will be worth it.
We often find that this is rare and folks don't have the right permissions. Particularly creating pods as that can be a large security risk.
I'll try to enumerate some of the goals: Switching to Custom Resource Definitions and an Operator hugely simplifies the client-side code and provides much better decoupling. This will help maintainability as the current In multi-user clusters an admin can install the operator once for the cluster and then user's can just happily create Dask clusters. With the current Many folks use Dask clusters in a multi-stage pipeline, where each stage is a separate Python script or notebook. The current We are also in the process of adding a The decoupled state means we have been able to implement support for dask-ctl so that clusters can be managed via the CLI/API that provides along with the new CLI dashboard (once we get that finished). We currently struggle with supporting clusters that have Istio installed due to the way our pods talk directly to each other. The operator aims to solve that by handling the complex services required in a way that is transparent to the user. The autoscaling logic is moved from the cluster manager to the operator which will also benefit multi-stage pipeline workflows. |
@jacobtomlinson - thanks for the information. Once we are stable with our current implementation, I'll look into trying the operator (with hope the DevOps will help with it 😄) |
The classic |
What happened:
I am using
KubeCluster
for running ad-hoc dask cluster as part of my script.I am trying to gracefully shut down the cluster in case there was an interrupt (SIGINT / ctrl-C, SIGKILL) with the example below.
When the scripts runs without interrupt, it will correctly clean up the cluster created. When interrupt it, with ctrl-c, the handler catches the exception, it but calling
client.cluster.close
in the finally clause raises a timeout error (see below). The workers and scheduler pods remain active and not terminated.What you expected to happen:
The pods of the ad-hoc cluster should terminate and there should not be any exception.
Minimal Complete Verifiable Example:
The output from execution:
Anything else we need to know?:
Click ctrl-c once "waiting for tasks to finish..." appears on the screen.
Also, not related, but the dashboard_link always shows localhost instead of something more useful (I recall reading a discussion about it - but it was a while ago).
Environment:
The text was updated successfully, but these errors were encountered: