Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix controller deadlook when wait dependson task #2898

Merged

Conversation

renwenlong-github
Copy link

issues #2897

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 6, 2023
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good fix. If A waits for B, but B needs to pull a large image, it will take a long time, and two minutes is too short. In fact, I don't think setting a timeout is a good solution.

@renwenlong-github
Copy link
Author

I don't think this is a good fix. If A waits for B, but B needs to pull a large image, it will take a long time, and two minutes is too short. In fact, I don't think setting a timeout is a good solution.

You are right; as long as you are waiting, you will never know how long you should wait;

we should remove the waiting, because the controller can make multiple judgments. In multiple control, create it if the create condition is met. Otherwise, we can do it next time.

@hwdef
Copy link
Member

hwdef commented Jun 7, 2023

Can you explain in detail how controller can make multiple judgments

@renwenlong-github
Copy link
Author

renwenlong-github commented Jun 7, 2023

I don't think this is a good fix. If A waits for B, but B needs to pull a large image, it will take a long time, and two minutes is too short. In fact, I don't think setting a timeout is a good solution.

You are right; as long as you are waiting, you will never know how long you should wait;

we should remove the waiting, because the controller can make multiple judgments. In multiple control, create it if the create condition is met. Otherwise, we can do it next time.

I don't think this is a good fix. If A waits for B, but B needs to pull a large image, it will take a long time, and two minutes is too short. In fact, I don't think setting a timeout is a good solution.

i think bug code here

func (cc *jobcontroller) waitDependsOnTaskMeetCondition(taskName string, taskIndex int, podToCreateEachTask []*v1.Pod, job *batch.Job) {
	if job.Spec.Tasks[taskIndex].DependsOn == nil {
		return
	}

	dependsOn := *job.Spec.Tasks[taskIndex].DependsOn
	if len(dependsOn.Name) > 1 && dependsOn.Iteration == batch.IterationAny {
		wait.PollInfinite(detectionPeriodOfDependsOntask, func() (bool, error) {
			for _, task := range dependsOn.Name {
				if cc.isDependsOnPodsReady(task, job) {
					return true, nil
				}
			}
			return false, nil
		})
	} else {
		for _, dependsOnTask := range dependsOn.Name {
			wait.PollInfinite(detectionPeriodOfDependsOntask, func() (bool, error) {
				if cc.isDependsOnPodsReady(dependsOnTask, job) {
					return true, nil
				}
				return false, nil
			})
		}
	}
}

If the wait.PollInfinite is not terminated, it will wait forever in job preempted

do you think this is a good solution?

func (cc *jobcontroller) waitDependsOnTaskMeetCondition(taskName string, taskIndex int, podToCreateEachTask []*v1.Pod, job *batch.Job) bool {
	if job.Spec.Tasks[taskIndex].DependsOn == nil {
		return true
	}
	dependsOn := *job.Spec.Tasks[taskIndex].DependsOn
	if len(dependsOn.Name) > 1 && dependsOn.Iteration == batch.IterationAny {
		// any ready to create task
		for _, task := range dependsOn.Name {
			if cc.isDependsOnPodsReady(task, job) {
				return true
			}
		}
		return false
	} else {
		for _, dependsOnTask := range dependsOn.Name {
			// any not ready to skip
			if !cc.isDependsOnPodsReady(dependsOnTask, job) {
				return false
			}
			return true
		}
	}
	return true
}

@hwdef
Copy link
Member

hwdef commented Jun 7, 2023

I probably understand your idea, you want to re-trigger the pod creation through the controller, but are you sure it will re-trigger?

@renwenlong-github
Copy link
Author

renwenlong-github commented Jun 8, 2023

I probably understand your idea, you want to re-trigger the pod creation through the controller, but are you sure it will re-trigger?

I tested it ok, although I don't understand why the volcano doesn't use resync,
Even so, the watch event is enough for the volcano, I will prove my idea through the source code,job controller watch event as follows:

// https://github.com/volcano-sh/volcano/blob/master/pkg/controllers/job/job_controller.go#L155
	cc.jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    cc.addJob,
		UpdateFunc: cc.updateJob,
		DeleteFunc: cc.deleteJob,
	})
// https://github.com/volcano-sh/volcano/blob/master/pkg/controllers/job/job_controller.go#LL189C6-L189C6
	cc.podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    cc.addPod,
		UpdateFunc: cc.updatePod,
		DeleteFunc: cc.deletePod,
	})

A dependsOn B, if B pods no ready, we skip to create A. when pod B ready, job controller add the B event to queue,
it will run SyncJobAction -> SyncJob -> syncJob(jobInfo *apis.JobInfo, updateStatus state.UpdateStatusFn) -> waitDependsOnTaskMeetCondition -> ok -> create A

Maybe, there is something I don't know, but we have tested this fix, It's ok.

If we doesn't fix it, preempt has a bug, example A dependsOn B, A and B was killed all,if controller create B first, it will use a controller work to wait, if 3 work(default workNum) used to wait, the controller will die.
issues #2897

@hwdef
Copy link
Member

hwdef commented Jun 9, 2023

@xiefan-github
Thanks for your test, I will test your submitted code again later, if there is no problem, I will approve this modification.

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested, this code is ok, thanks for your contribution, this fix is awesome!
/approve

@hwdef
Copy link
Member

hwdef commented Jun 11, 2023

@wangyang0616 please trigger the CI

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Jun 11, 2023
Signed-off-by: Ren <[email protected]>
@@ -378,7 +377,16 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
go func(taskName string, podToCreateEachTask []*v1.Pod) {
taskIndex := jobhelpers.GetTasklndexUnderJob(taskName, job)
if job.Spec.Tasks[taskIndex].DependsOn != nil {
cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job)
if !cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job) {
klog.V(4).Infof("Job %s/%s depends on task not ready", job.Name, job.Namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make CI happy,
I think we should improve log level

@@ -378,7 +377,16 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
go func(taskName string, podToCreateEachTask []*v1.Pod) {
taskIndex := jobhelpers.GetTasklndexUnderJob(taskName, job)
if job.Spec.Tasks[taskIndex].DependsOn != nil {
cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job)
if !cc.waitDependsOnTaskMeetCondition(taskName, taskIndex, podToCreateEachTask, job) {
klog.Errorf("Job %s/%s depends on task not ready", job.Name, job.Namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("Job %s/%s depends on task not ready", job.Name, job.Namespace)
klog.V(3).Infof("Job %s/%s depends on task not ready", job.Name, job.Namespace)

@renwenlong-github
Copy link
Author

I verify codes success in my local, but it not work in volcano verity, how can I fix it?

@hwdef
Copy link
Member

hwdef commented Jun 25, 2023

It may be an occasional error, but we're not in a hurry to fix it now because spark's CI also has an error (it's not you), we can wait for the spark CI error to be fixed and then retry verify's CI

@renwenlong-github
Copy link
Author

It may be an occasional error, but we're not in a hurry to fix it now because spark's CI also has an error (it's not you), we can wait for the spark CI error to be fixed and then retry verify's CI

Wait for spark fix, help me to ci, thank you

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2023
@hwdef
Copy link
Member

hwdef commented Jun 28, 2023

/close

@volcano-sh-bot
Copy link
Contributor

@hwdef: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hwdef
Copy link
Member

hwdef commented Jun 28, 2023

/reopen

@volcano-sh-bot
Copy link
Contributor

@hwdef: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hwdef, william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023
@volcano-sh-bot volcano-sh-bot merged commit f9d22d6 into volcano-sh:master Jul 18, 2023
20 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants