fix controller deadlock when waiting for dependsOn task #2898
I don't think this is a good fix. If A waits for B, but B needs to pull a large image, it will take a long time, and two minutes is too short. In fact, I don't think setting a timeout is a good solution.
You are right: as long as you are waiting, you never know how long you should wait. We should remove the wait entirely, because the controller evaluates the job many times. On each pass, it creates the pods if the creation condition is met; otherwise it can create them on a later pass.
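The "check on every pass instead of waiting" idea above can be sketched as follows. This is a minimal illustration, not the Volcano implementation: the `task` type, `syncOnce`, and `dependenciesReady` are hypothetical names invented for this example, and readiness is reduced to a boolean flag.

```go
package main

import "fmt"

// task is a hypothetical, simplified stand-in for a job task.
type task struct {
	dependsOn []string // names of tasks that must be ready first
	created   bool     // pods have been created
	ready     bool     // pods are running and ready
}

// dependenciesReady reports whether every task that t depends on is ready.
func dependenciesReady(tasks map[string]*task, t *task) bool {
	for _, dep := range t.dependsOn {
		if d, ok := tasks[dep]; !ok || !d.ready {
			return false
		}
	}
	return true
}

// syncOnce makes a single non-blocking pass over the tasks in the given
// order. It creates each task whose dependencies are all ready, skips the
// rest, and returns the names of the tasks it created on this pass.
func syncOnce(tasks map[string]*task, order []string) []string {
	var created []string
	for _, name := range order {
		t := tasks[name]
		if t.created || !dependenciesReady(tasks, t) {
			continue // skip for now; a later pass retries
		}
		t.created = true
		created = append(created, name)
	}
	return created
}

func main() {
	tasks := map[string]*task{
		"A": {dependsOn: []string{"B"}},
		"B": {},
	}
	order := []string{"A", "B"}

	fmt.Println(syncOnce(tasks, order)) // [B]: A is skipped, not waited on
	tasks["B"].ready = true             // B's pods become ready
	fmt.Println(syncOnce(tasks, order)) // [A]
}
```

No pass ever blocks, so no timeout value has to be chosen. The sketch only works if something triggers another pass once B becomes ready, which is the question raised later in this thread.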
Can you explain in detail how |
Branch updated from d9ffb45 to a49af3a
If the do you think this is a good solution?
Branch updated from a49af3a to 96c6acf
Signed-off-by: Ren <[email protected]>
Branch updated from 96c6acf to 1d7c42b
I think I understand your idea: you want pod creation to be re-triggered through the controller. But are you sure it will be re-triggered?
I tested it and it works, although I don't understand why Volcano doesn't use resync.
Suppose A dependsOn B. If B's pods are not ready, we skip creating A. When B's pods become ready, the job controller adds B's event to the queue. Maybe there is something I don't know, but we have tested this fix and it works. Without the fix, preemption has a bug. For example, A dependsOn B, and both A and B are killed. If the controller creates B first, it uses one controller worker to wait. If all 3 workers (the default workNum) are used to wait, the controller deadlocks.
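The worker exhaustion described above can be reproduced in a small model. This is a sketch under stated assumptions, not the controller's code: `readyEventHandled` is a hypothetical function, the work queue is a plain buffered channel, and "B is ready" is modelled as an event queued behind the dependent jobs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// readyEventHandled reports whether the "dependency ready" event is
// processed within timeoutMs milliseconds by a pool of `workers`
// goroutines. One dependent job per worker is queued ahead of the event.
func readyEventHandled(workers int, blockOnDependency bool, timeoutMs int) bool {
	ready := make(chan struct{})   // closed when the dependency is ready
	handled := make(chan struct{}) // closed when the ready event is processed
	var once sync.Once
	release := func() { once.Do(func() { close(ready) }) }

	queue := make(chan func(), workers+1)
	for i := 0; i < workers; i++ {
		queue <- func() {
			if blockOnDependency {
				<-ready // old behaviour: hold the worker until ready
			}
			// new behaviour: return at once; a later event retries
		}
	}
	queue <- func() { release(); close(handled) } // the ready event
	close(queue)

	for i := 0; i < workers; i++ {
		go func() {
			for work := range queue {
				work()
			}
		}()
	}

	select {
	case <-handled:
		return true
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		release() // unblock the stuck workers so they do not leak
		return false
	}
}

func main() {
	fmt.Println("blocking wait:", readyEventHandled(3, true, 200))   // false
	fmt.Println("skip and retry:", readyEventHandled(3, false, 200)) // true
}
```

With blocking waits, every worker is parked on a dependency, so nothing is left to process the event that would release them. With skip and retry, workers return immediately and the event is handled.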
@xiefan-github |
Tested, this code is ok, thanks for your contribution, this fix is awesome!
/approve
@wangyang0616 please trigger the CI
/lgtm
Signed-off-by: Ren <[email protected]>
Branch updated from 7c40dab to 7a9f1ba
@@ -378,7 +377,16 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
 go func(taskName string, podToCreateEachTask []*v1.Pod) {
     taskIndex := jobhelpers.GetTasklndexUnderJob(taskName, job)
     if job.Spec.Tasks[taskIndex].DependsOn != nil {
-        cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job)
+        if !cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job) {
+            klog.V(4).Infof("Job %s/%s depends on task not ready", job.Name, job.Namespace)
Please fix the CI failures.
I also think the log level should be changed.
@@ -378,7 +377,16 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
 go func(taskName string, podToCreateEachTask []*v1.Pod) {
     taskIndex := jobhelpers.GetTasklndexUnderJob(taskName, job)
     if job.Spec.Tasks[taskIndex].DependsOn != nil {
-        cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job)
+        if !cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job) {
+            klog.Errorf("Job %s/%s depends on task not ready", job.Name, job.Namespace)
-klog.Errorf("Job %s/%s depends on task not ready", job.Name, job.Namespace)
+klog.V(3).Infof("Job %s/%s depends on task not ready", job.Name, job.Namespace)
Branch updated from 8d3e5fe to e1e2963
Signed-off-by: Ren <[email protected]>
Branch updated from e1e2963 to 6d0ab3b
Code verification succeeds locally, but it fails in Volcano's verify CI job. How can I fix it?
It may be an intermittent error, but we're not in a hurry to fix it now because Spark's CI also has an error (not caused by you). We can wait for the Spark CI error to be fixed and then retry the verify CI.
I'll wait for the Spark fix. Please help me re-run CI afterwards, thank you.
/lgtm
/close
@hwdef: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen
@hwdef: Reopened this PR. In response to this:
/approve
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hwdef, william-wang
The full list of commands accepted by this bot can be found here. The pull request process is described here.
issues #2897