TOCTOU Bug while scaling down workers #855
Comments
Is it possible to allow the scheduler to delete the deployment directly, maybe with something like a plugin?
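As a rough illustration of what such a plugin could look like (a hypothetical sketch, not something dask-kubernetes provides; the class name, the use of the official kubernetes client, and the assumption that a worker's Dask name matches its Deployment name are all mine):

```python
# Hypothetical sketch: a scheduler plugin that deletes a worker's Deployment
# as soon as the scheduler removes that worker. NOT part of dask-kubernetes;
# it assumes the scheduler pod has RBAC permission to delete Deployments and
# that the worker's Dask name equals its Deployment name.
from distributed.diagnostics.plugin import SchedulerPlugin
from kubernetes import client, config


class DeleteDeploymentOnRetire(SchedulerPlugin):
    def __init__(self, namespace: str):
        config.load_incluster_config()  # running inside the scheduler pod
        self.apps = client.AppsV1Api()
        self.namespace = namespace
        self.names = {}  # worker address -> worker name

    def add_worker(self, scheduler, worker, **kwargs):
        # Remember the worker's name while it is still registered.
        self.names[worker] = scheduler.workers[worker].name

    def remove_worker(self, scheduler, worker, **kwargs):
        name = self.names.pop(worker, None)
        if name:
            # Delete the Deployment backing this worker (cascade-deletes pods).
            self.apps.delete_namespaced_deployment(
                name=name, namespace=self.namespace
            )
```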
Are you seeing this in practice, or is it a hypothetical race condition? Each worker should have a unique ID, and once the scheduler retires that ID a worker with it cannot reconnect in the future. So even if Kubernetes restarts the Pod before the delete call happens and the Pod is then cascade-deleted, the new worker should just repeatedly fail to connect to the scheduler.
I don't think we should give the scheduler the ability to interact with the Kubernetes API, as that would require granting permissions to the scheduler Pod, which can execute arbitrary user code.
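One quick way to check whether a restarted pod actually rejoins the scheduler is to list the worker names the scheduler currently knows about (a diagnostic sketch; the scheduler address is a placeholder):

```python
# Diagnostic sketch: list the worker names currently connected to the
# scheduler, to see whether a "retired" worker ID has reappeared after its
# pod was restarted. The scheduler address below is a placeholder.
from dask.distributed import Client


def connected_worker_names(scheduler_address: str = "tcp://scheduler:8786") -> set:
    with Client(scheduler_address) as c:
        info = c.scheduler_info()  # includes a "workers" mapping
        return {w.get("name", addr) for addr, w in info["workers"].items()}


if __name__ == "__main__":
    print(connected_worker_names())
```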
Yes, I'm seeing this during large-scale cluster scaling, e.g. with 100-200 workers.
I'm seeing this behavior every time I try to scale down, even with 2 workers (see #856), so my cluster never scales down.
Hi, scheduler logs:
Logs from one worker:
Operator logs:
My worker deployments are never deleted and my worker group never scales down; each worker restarts immediately after it stops.
The currently implemented logic for deleting a worker can be summarized as:
1. The operator asks the scheduler to retire the worker.
2. The retired worker shuts down and its pod exits.
3. The operator deletes the worker's Deployment.
Ref: https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L600-L611
However, between steps 2 and 3 Kubernetes may interfere and restart the worker, so a new pod is created and joins the cluster for a while before the operator deletes the Deployment, effectively interrupting that pod mid-run.
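For illustration, here is a minimal sketch of the sequence above and where the race window sits. This is not the controller code: it uses the official kubernetes client and a distributed Client, and the addresses and names are placeholders.

```python
# Hypothetical sketch of the scale-down sequence, marking the race window.
# Not the actual controller code; addresses and names are placeholders.
from dask.distributed import Client
from kubernetes import client as k8s, config


def scale_down_one(scheduler_address: str, worker_address: str,
                   deployment_name: str, namespace: str) -> None:
    config.load_kube_config()  # or load_incluster_config() when in-cluster
    apps = k8s.AppsV1Api()

    # Steps 1-2: ask the scheduler to retire the worker; the worker process
    # exits and its pod terminates.
    with Client(scheduler_address) as c:
        c.retire_workers(workers=[worker_address], close_workers=True)

    # <-- race window: Kubernetes notices the worker exited and may restart
    #     it (a new pod joins the cluster) before the delete below is issued.

    # Step 3: delete the worker's Deployment (its pods are cascade-deleted).
    apps.delete_namespaced_deployment(name=deployment_name, namespace=namespace)
```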