-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KubeRay] KubeRay introduces new APIs for RayJob termination #3994
Comments
IIUC the deletion policy only affects complete Jobs: kueue/pkg/controller/core/workload_controller.go Lines 696 to 711 in 097262a
However, I didn't go deeply into the code of the Ray changes. Do you have some specific scenario in mind @kevin85421 ? |
I tihnk we just need to update the validation logic to check for kueue/pkg/controller/jobs/rayjob/rayjob_webhook.go Lines 102 to 104 in 097262a
|
It's correct.
If a RayJob finishes and terminates all Ray worker Pods, Kueue assumes that the entire RayCluster resource is released. Is it possible that for a subsequent RayJob, Kueue thinks the Kubernetes cluster has enough resources when it actually does not due to leaked head Pods? |
Yes, when the main workload is finished keueue assumes all quota resources are reclaimed, and so reclaima the quota for the leaked pods, and may give it to the subsequent workload which, as a result may got admitted but not scheduled. Are the leaked head pods leaked for good or just a short while?is it a bug in ray or there are legitimate use cases to leak the pods? |
This is intended behavior. The use case is that some users want to log into the head Pod to check the Ray dashboard after the job finishes, especially if the Ray job fails. |
What would you like to be added:
ray-project/kuberay#2643 KubeRay v1.3.0 will introduce a new API to support different deletion policies for RayJob. For example, KubeRay can delete all worker Pods while keeping the head Pod alive for troubleshooting. Will this make Kueue make incorrect scheduling decisions?
cc @andrewsykim @rueian @MortalHappiness
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.
The text was updated successfully, but these errors were encountered: