What you would like to be added?
We should support multiple replicas per replicatedJob, like:
Why is this needed?
Currently, we force the JobSet ReplicatedJob replicas to 1:
training-operator/pkg/runtime.v2/core/trainingruntime.go
Lines 108 to 110 in 9e46f9d
However, when the single worker replicatedJob has a batch/v1 Job with hundreds or thousands of completions (.spec.completions), reconciliation is significantly delayed: the job-controller (part of kube-controller-manager) takes much longer to reconcile a Job with thousands of Pods, so the Jobs that follow get stuck in the workqueue.
After that, the kube-controller-manager workqueue grows much deeper, which could potentially cause a memory leak.
Finally, the kube-controller-manager keeps restarting, and every kind of workload (even StatefulSet and Deployment) is left unhandled.
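To make the arithmetic behind the request concrete, here is a minimal sketch of how a fixed number of worker Pods could be spread over several Jobs, so that no single Job carries thousands of completions. The helper `completionsPerJob` and its parameters are hypothetical and not part of training-operator. The sketch requires an even split because every Job of a replicatedJob is created from the same Job template, so all Jobs get the same completions value.

```go
package main

import (
	"fmt"
)

// completionsPerJob is a hypothetical helper (not part of training-operator).
// It returns the completions each Job would need when totalPods worker Pods
// are spread evenly over `replicas` Jobs of one replicatedJob.
func completionsPerJob(totalPods, replicas int32) (int32, error) {
	if totalPods <= 0 || replicas <= 0 {
		return 0, fmt.Errorf("totalPods and replicas must be positive, got %d and %d", totalPods, replicas)
	}
	// All Jobs share one template, so an uneven split cannot be expressed.
	if totalPods%replicas != 0 {
		return 0, fmt.Errorf("%d Pods cannot be split evenly across %d Jobs", totalPods, replicas)
	}
	return totalPods / replicas, nil
}

func main() {
	// Compare the current behavior (replicas forced to 1) with larger values.
	for _, replicas := range []int32{1, 10, 50} {
		perJob, err := completionsPerJob(2000, replicas)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("replicas=%d -> completions per Job=%d\n", replicas, perJob)
	}
}
```

With replicas fixed at 1, the single Job owns all 2000 Pods. With a higher replica count, each Job reconciled by the job-controller owns proportionally fewer Pods, which is the effect this request is after.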
Love this feature?
Give it a 👍 We prioritize the features with most 👍