KEP-2170: Support hundreds and thousands worker nodes for a single training Job #2318

Open
tenzen-y opened this issue Nov 4, 2024 · 1 comment

Comments

@tenzen-y
Member

tenzen-y commented Nov 4, 2024

What you would like to be added?

We should support multiple replicas per ReplicatedJob, for example:

[...]
spec:
  replicatedJobs:
  - name:
    replicas: 5
[...]

Why is this needed?

Currently, we force the JobSet ReplicatedJob replicas to 1:

for _, rJob := range jobSetTemplateSpec.Spec.ReplicatedJobs {
	// By default every ReplicatedJob has only 1 replica.
	opts = append(opts, runtime.WithPodSpecReplicas(rJob.Name, 1, rJob.Template.Spec.Template.Spec))
}

However, when a single worker ReplicatedJob contains a batch/v1 Job with hundreds or thousands of completions (.spec.completions), reconciliation is significantly delayed: the job-controller (which runs inside kube-controller-manager) takes much longer to reconcile a Job that owns thousands of Pods, and the Jobs queued behind it get stuck in the workqueue.

spec:
  replicatedJobs:
  - name: training-node
    replicas: 1
    template:
      spec:
        completions: 2000
        parallelism: 2000

After that, the kube-controller-manager workqueue depth grows much deeper, which could potentially cause a memory leak.
Finally, the kube-controller-manager keeps restarting, and every kind of workload (even StatefulSets and Deployments) goes unhandled.
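
To make the idea concrete, here is a rough sketch of how a total worker node count could be spread over several ReplicatedJob replicas so that no single batch/v1 Job owns thousands of Pods. Everything in it is an assumption for illustration: splitNodes is a hypothetical helper, not an existing Kubeflow Trainer or JobSet function, and the cap of 500 Pods per Job is an arbitrary number rather than a measured limit.

```go
package main

import (
	"errors"
	"fmt"
)

// splitNodes is a hypothetical helper (not part of any existing API).
// It picks a ReplicatedJob replica count and a per-Job completions value
// such that replicas * completions == totalNodes and
// completions <= maxPodsPerJob.
func splitNodes(totalNodes, maxPodsPerJob int) (replicas, completions int, err error) {
	if totalNodes <= 0 || maxPodsPerJob <= 0 {
		return 0, 0, errors.New("totalNodes and maxPodsPerJob must be positive")
	}
	// Fewest Jobs that keep each Job at or below the cap (ceiling division).
	minReplicas := (totalNodes + maxPodsPerJob - 1) / maxPodsPerJob
	// Walk upward to the first replica count that divides totalNodes evenly.
	// r == totalNodes always divides, so the loop always returns.
	for r := minReplicas; r <= totalNodes; r++ {
		if totalNodes%r == 0 {
			return r, totalNodes / r, nil
		}
	}
	return 0, 0, errors.New("no valid split found")
}

func main() {
	// The 2000-node case from this issue, with an assumed cap of 500 Pods per Job.
	replicas, completions, err := splitNodes(2000, 500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("replicas=%d completions=%d\n", replicas, completions)
	// prints "replicas=4 completions=500"
}
```

With this split, the 2000-Pod Job above would become replicas: 4 with completions: 500 and parallelism: 500, so each job-controller sync handles a quarter of the Pods.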

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@tenzen-y
Member Author

tenzen-y commented Nov 4, 2024

/remove-label lifecycle/needs-triage
