KEP-2170: Support hundreds and thousands worker nodes for a single training Job #2318

tenzen-y · 2024-11-04T07:09:41Z

What you would like to be added?

We should support multiple replicas per ReplicatedJob, for example:

[...]
spec:
  replicatedJobs:
  - name:
    replicas: 5
[...]

Why is this needed?

Currently, we force the JobSet ReplicatedJob replicas to 1:

training-operator/pkg/runtime.v2/core/trainingruntime.go

Lines 108 to 110 in 9e46f9d

    
    for _, rJob := range jobSetTemplateSpec.Spec.ReplicatedJobs {
    	// By default every ReplicatedJob has only 1 replica.
    	opts = append(opts, runtime.WithPodSpecReplicas(rJob.Name, 1, rJob.Template.Spec.Template.Spec))

However, when the single worker ReplicatedJob holds a batch/v1 Job with hundreds or thousands of completions (.spec.completions), reconciliation is significantly delayed: the job-controller (which runs inside kube-controller-manager) takes much longer to reconcile a Job with thousands of Pods, so the Jobs queued behind it get stuck in the workqueue.

spec:
  replicatedJobs:
  - name: training-node
    replicas: 1
    template:
      spec:
        completions: 2000
        parallelism: 2000

After that, the kube-controller-manager workqueue grows much deeper, which could potentially cause a memory leak.
Finally, the kube-controller-manager keeps restarting, and every kind of workload (even StatefulSets and Deployments) is left unhandled.
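
To make the request concrete, here is a minimal sketch of how a large node count could be spread across several ReplicatedJob replicas so that no single batch/v1 Job owns thousands of Pods. The helper `splitNodes` and the per-Job cap `maxPodsPerJob` are hypothetical names used only for illustration; they are not part of training-operator or JobSet, and the cap of 400 is an arbitrary value, not a measured limit.

```go
package main

import "fmt"

// splitNodes is a hypothetical helper (not part of training-operator).
// It spreads totalNodes Pods across several ReplicatedJob replicas so that
// no single batch/v1 Job owns more than maxPodsPerJob Pods.
// It returns the replica count and the completions/parallelism for each Job.
// When totalNodes is not evenly divisible, replicas*podsPerJob may exceed
// totalNodes; the caller has to decide how to handle the remainder.
func splitNodes(totalNodes, maxPodsPerJob int32) (replicas, podsPerJob int32, err error) {
	if totalNodes <= 0 || maxPodsPerJob <= 0 {
		return 0, 0, fmt.Errorf("totalNodes and maxPodsPerJob must be positive")
	}
	// Ceiling division: the smallest number of Jobs that keeps each under the cap.
	replicas = (totalNodes + maxPodsPerJob - 1) / maxPodsPerJob
	// Ceiling division again, so that replicas*podsPerJob >= totalNodes.
	podsPerJob = (totalNodes + replicas - 1) / replicas
	return replicas, podsPerJob, nil
}

func main() {
	// 2000 nodes with an assumed cap of 400 Pods per Job.
	replicas, podsPerJob, err := splitNodes(2000, 400)
	if err != nil {
		panic(err)
	}
	fmt.Printf("replicas=%d completions=%d parallelism=%d\n", replicas, podsPerJob, podsPerJob)
}
```

With these assumed inputs, the 2000-Pod Job from the example above becomes 5 replicas of 400 Pods each, which matches the `replicas: 5` shape requested at the top of this issue.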

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.


tenzen-y · 2024-11-04T07:10:57Z

/remove-label lifecycle/needs-triage

tenzen-y added kind/feature lifecycle/needs-triage labels Nov 4, 2024

google-oss-prow bot removed the lifecycle/needs-triage label Nov 4, 2024

tenzen-y added this to KEP-2170: Kubeflow Training V2 API Nov 4, 2024

github-project-automation bot moved this to Todo in KEP-2170: Kubeflow Training V2 API Nov 4, 2024
