
[KubeRay] KubeRay introduces new APIs for RayJob termination #3994

Open
3 tasks
kevin85421 opened this issue Jan 17, 2025 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kevin85421

What would you like to be added:

ray-project/kuberay#2643: KubeRay v1.3.0 will introduce a new API to support different deletion policies for RayJob. For example, KubeRay can delete all worker Pods while keeping the head Pod alive for troubleshooting. Will this cause Kueue to make incorrect scheduling decisions?

cc @andrewsykim @rueian @MortalHappiness
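
For context, here is a minimal Go sketch of how the new field sits alongside the existing flag. The field and value names follow the KubeRay doc comment quoted later in this thread and may not match the real KubeRay types exactly:

// Illustrative sketch only: names are taken from the doc comment quoted
// later in this thread; the actual KubeRay API types may differ.
type DeletionPolicy string

const (
	DeleteCluster DeletionPolicy = "DeleteCluster" // delete the whole RayCluster
	DeleteWorkers DeletionPolicy = "DeleteWorkers" // delete worker Pods, keep the head Pod
	DeleteSelf    DeletionPolicy = "DeleteSelf"    // delete the RayJob itself
	DeleteNone    DeletionPolicy = "DeleteNone"    // delete nothing
)

type RayJobSpec struct {
	// Existing field: tear down the RayCluster when the job finishes.
	ShutdownAfterJobFinishes bool `json:"shutdownAfterJobFinishes,omitempty"`
	// New in KubeRay v1.3.0, gated by the RayJobDeletionPolicy feature gate.
	// If unset, deletion behavior falls back to shutdownAfterJobFinishes.
	DeletionPolicy *DeletionPolicy `json:"deletionPolicy,omitempty"`
}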

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kevin85421 added the kind/feature label Jan 17, 2025
@mimowo
Contributor

mimowo commented Jan 17, 2025

IIUC the deletion policy only affects complete Jobs:
"
// DeletionPolicy indicates what resources of the RayJob are deleted upon job completion.
// Valid values are 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf' or 'DeleteNone'.
// If unset, deletion policy is based on 'spec.shutdownAfterJobFinishes'.
// This field requires the RayJobDeletionPolicy feature gate to be enabled.
"
IIUC the policies only affect the RayJob / RayCluster once it is finished, so I think this should not impact quota management by Kueue, because Kueue does not count quota for finished Jobs. Once a workload is marked as Finished, Kueue deletes its resource usage from the cache, see the following snippet:

case status == workload.StatusFinished || !active:
	if !active {
		log.V(2).Info("Workload will not be queued because the workload is not active", "workload", klog.KObj(wl))
	}
	// The workload could have been in the queues if we missed an event.
	r.queues.DeleteWorkload(wl)
	// trigger the move of associated inadmissibleWorkloads, if there are any.
	r.queues.QueueAssociatedInadmissibleWorkloadsAfter(ctx, wl, func() {
		// Delete the workload from cache while holding the queues lock
		// to guarantee that requeued workloads are taken into account before
		// the next scheduling cycle.
		if err := r.cache.DeleteWorkload(oldWl); err != nil && prevStatus == workload.StatusAdmitted {
			log.Error(err, "Failed to delete workload from cache")
		}
	})

However, I didn't go deeply into the code of the Ray changes. Do you have a specific scenario in mind, @kevin85421?

@andrewsykim
Member

I think we just need to update the validation logic to check for shutdownAfterJobFinishes OR deletionPolicy == DeleteCluster:

if !spec.ShutdownAfterJobFinishes {
	allErrors = append(allErrors, field.Invalid(specPath.Child("shutdownAfterJobFinishes"), spec.ShutdownAfterJobFinishes, "a kueue managed job should delete the cluster after finishing"))
}
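
For illustration, a rough sketch of what that updated check could look like, assuming deletionPolicy is a pointer field; the DeleteCluster constant name below is a placeholder, not the actual KubeRay identifier:

// Sketch only: accept the RayJob if either the legacy flag or the new
// deletionPolicy guarantees the whole RayCluster is deleted on completion.
// "DeletionPolicyDeleteCluster" is a placeholder for the real constant name.
deletesCluster := spec.ShutdownAfterJobFinishes ||
	(spec.DeletionPolicy != nil && *spec.DeletionPolicy == rayv1.DeletionPolicyDeleteCluster)
if !deletesCluster {
	allErrors = append(allErrors, field.Invalid(
		specPath.Child("shutdownAfterJobFinishes"),
		spec.ShutdownAfterJobFinishes,
		"a kueue managed job should delete the cluster after finishing"))
}

Under this sketch, a Kueue-managed RayJob that sets neither shutdownAfterJobFinishes nor DeleteCluster would fail validation.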

@kevin85421
Author

IIUC the deletion policy only affects complete Jobs:

That's correct.

because Kueue does not include quota for Jobs which are finished - once a workload is marked as Finished, then kueue deleted its resource usage in cache, ...

If a RayJob finishes and terminates all Ray worker Pods, Kueue assumes that the entire RayCluster resource is released. Is it possible that for a subsequent RayJob, Kueue thinks the Kubernetes cluster has enough resources when it actually does not due to leaked head Pods?

@mimowo
Contributor

mimowo commented Jan 17, 2025

Yes, when the main workload is finished Kueue assumes all quota resources are reclaimed, so it reclaims the quota for the leaked pods and may give it to a subsequent workload, which as a result may get admitted but not scheduled.

Are the leaked head pods leaked for good or just for a short while? Is it a bug in Ray, or are there legitimate use cases for leaking the pods?

@kevin85421
Author

This is intended behavior. The use case is that some users want to log into the head Pod to check the Ray dashboard after the job finishes, especially if the Ray job fails.
