
[KubeRay] KubeRay introduces new APIs for RayJob termination #3994

Open
3 tasks
kevin85421 opened this issue Jan 17, 2025 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kevin85421

What would you like to be added:

ray-project/kuberay#2643: KubeRay v1.3.0 will introduce a new API to support different deletion policies for RayJob. For example, KubeRay can delete all worker Pods while keeping the head Pod alive for troubleshooting. Will this cause Kueue to make incorrect scheduling decisions?

cc @andrewsykim @rueian @MortalHappiness
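
For context, here is a minimal Go sketch of how the new field sits alongside the existing flag. The field and value names follow the KubeRay doc comment quoted later in this thread and may not match the real KubeRay types exactly:

// Illustrative sketch only: names are taken from the doc comment quoted
// later in this thread; the actual KubeRay API types may differ.
type DeletionPolicy string

const (
	DeleteCluster DeletionPolicy = "DeleteCluster" // delete the whole RayCluster
	DeleteWorkers DeletionPolicy = "DeleteWorkers" // delete worker Pods, keep the head Pod
	DeleteSelf    DeletionPolicy = "DeleteSelf"    // delete the RayJob itself
	DeleteNone    DeletionPolicy = "DeleteNone"    // delete nothing
)

type RayJobSpec struct {
	// Existing field: tear down the RayCluster when the job finishes.
	ShutdownAfterJobFinishes bool `json:"shutdownAfterJobFinishes,omitempty"`
	// New in KubeRay v1.3.0, gated by the RayJobDeletionPolicy feature gate.
	// If unset, deletion behavior falls back to shutdownAfterJobFinishes.
	DeletionPolicy *DeletionPolicy `json:"deletionPolicy,omitempty"`
}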

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kevin85421 added the kind/feature label Jan 17, 2025
@mimowo
Contributor

mimowo commented Jan 17, 2025

IIUC the deletion policy only affects complete Jobs:
"
// DeletionPolicy indicates what resources of the RayJob are deleted upon job completion.
// Valid values are 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf' or 'DeleteNone'.
// If unset, deletion policy is based on 'spec.shutdownAfterJobFinishes'.
// This field requires the RayJobDeletionPolicy feature gate to be enabled.
"
IIUC the policies only affect the RayJob / RayCluster once it is finished, so I think this should not impact quota management by Kueue, because Kueue does not count quota for finished Jobs. Once a workload is marked as Finished, Kueue deletes its resource usage from the cache, see the following snippet:

case status == workload.StatusFinished || !active:
	if !active {
		log.V(2).Info("Workload will not be queued because the workload is not active", "workload", klog.KObj(wl))
	}
	// The workload could have been in the queues if we missed an event.
	r.queues.DeleteWorkload(wl)
	// trigger the move of associated inadmissibleWorkloads, if there are any.
	r.queues.QueueAssociatedInadmissibleWorkloadsAfter(ctx, wl, func() {
		// Delete the workload from cache while holding the queues lock
		// to guarantee that requeued workloads are taken into account before
		// the next scheduling cycle.
		if err := r.cache.DeleteWorkload(oldWl); err != nil && prevStatus == workload.StatusAdmitted {
			log.Error(err, "Failed to delete workload from cache")
		}
	})

However, I didn't go deeply into the code of the Ray changes. Do you have a specific scenario in mind, @kevin85421?

@andrewsykim
Member

I think we just need to update the validation logic to check for shutdownAfterJobFinishes OR deletionPolicy == DeleteCluster:

if !spec.ShutdownAfterJobFinishes {
	allErrors = append(allErrors, field.Invalid(specPath.Child("shutdownAfterJobFinishes"), spec.ShutdownAfterJobFinishes, "a kueue managed job should delete the cluster after finishing"))
}
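
For illustration, a rough sketch of what that updated check could look like, assuming deletionPolicy is a pointer field; the DeleteCluster constant name below is a placeholder, not the actual KubeRay identifier:

// Sketch only: accept the RayJob if either the legacy flag or the new
// deletionPolicy guarantees the whole RayCluster is deleted on completion.
// "DeletionPolicyDeleteCluster" is a placeholder for the real constant name.
deletesCluster := spec.ShutdownAfterJobFinishes ||
	(spec.DeletionPolicy != nil && *spec.DeletionPolicy == rayv1.DeletionPolicyDeleteCluster)
if !deletesCluster {
	allErrors = append(allErrors, field.Invalid(
		specPath.Child("shutdownAfterJobFinishes"),
		spec.ShutdownAfterJobFinishes,
		"a kueue managed job should delete the cluster after finishing"))
}

Under this sketch, a Kueue-managed RayJob that sets neither shutdownAfterJobFinishes nor DeleteCluster would fail validation.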

@kevin85421
Author

IIUC the deletion policy only affects complete Jobs:

That's correct.

because Kueue does not include quota for Jobs which are finished - once a workload is marked as Finished, then kueue deleted its resource usage in cache, ...

If a RayJob finishes and terminates all Ray worker Pods, Kueue assumes that the entire RayCluster resource is released. Is it possible that for a subsequent RayJob, Kueue thinks the Kubernetes cluster has enough resources when it actually does not due to leaked head Pods?

@mimowo
Contributor

mimowo commented Jan 17, 2025

Yes, when the main workload is finished Kueue assumes all quota resources are reclaimed, so it reclaims the quota for the leaked pods and may give it to a subsequent workload, which as a result may get admitted but not scheduled.

Are the leaked head pods leaked for good or just for a short while? Is it a bug in Ray, or are there legitimate use cases for leaking the pods?

@kevin85421
Author

This is intended behavior. The use case is that some users want to log into the head Pod to check the Ray dashboard after the job finishes, especially if the Ray job fails.
