Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic issue with proportional scheduling #2968

Merged

Conversation

Cdayz
Copy link

@Cdayz Cdayz commented Jul 13, 2023

No description provided.

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 13, 2023
@Cdayz
Copy link
Author

Cdayz commented Jul 13, 2023

/assign @Thor-wl

@Cdayz Cdayz force-pushed the fix-issue-with-proportional-plugin branch 4 times, most recently from dc2acb6 to fa7a78a Compare July 13, 2023 10:20
@lowang-bh
Copy link
Member

/hold

@volcano-sh-bot volcano-sh-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2023
@Cdayz
Copy link
Author

Cdayz commented Jul 14, 2023

@lowang-bh Hi can you describe why you set hold label? Is there any problems with this PR, can i change it to make better?

@lowang-bh
Copy link
Member

Hi, @Cdayz can you see this comment? In my opinion, a more elegent fix is to replace comapre function with comparing cpu and memory independently.

image

@@ -33,6 +33,13 @@ func checkNodeResourceIsProportional(task *api.TaskInfo, node *api.NodeInfo, pro
return status, nil
}
}

if node.Idle.LessPartly(task.Resreq, api.Zero) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a good way is don't use the compare function. Just compare CPU and Memory differently at this position:

r := node.Idle.Clone()
r = r.Sub(task.Resreq)
if r.MilliCPU < cpuReserved || r.Memory < memoryReserved {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, probably it's not a right decision, this PR fixes bug when node have ANY of resources lower than requested by task, and it leads to panic on this line.

Copy link
Author

@Cdayz Cdayz Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think that maybe this pr made in a wrong place, because for example in allocate action of scheduler, checking that node resources sufficient for run task made before run other checks.

predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) ([]*api.Status, error) {
// Check for Resource Predicate
if ok, reason := task.InitResreq.LessEqualWithReason(node.FutureIdle(), api.Zero); !ok {
return nil, api.NewFitError(task, node, reason)
}
var statusSets util.StatusSets
statusSets, err := ssn.PredicateFn(task, node)
if err != nil {
return nil, fmt.Errorf("predicates failed in allocate for task <%s/%s> on node <%s>: %v",
task.Namespace, task.Name, node.Name, err)
}
if statusSets.ContainsUnschedulable() || statusSets.ContainsUnschedulableAndUnresolvable() ||
statusSets.ContainsErrorSkipOrWait() {
return nil, fmt.Errorf("predicates failed in allocate for task <%s/%s> on node <%s>, status is not success",
task.Namespace, task.Name, node.Name)
}
return nil, nil
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add this check to all actions in predicateFn ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lowang-bh check my comments please

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add this check to all actions in predicateFn ?

It doesn't all need resorce compare in predicateFn, eg, preempt and backfill action.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about use:

if r.MilliCPU - task.Resreq.MilliCPU  < cpuReserved || r.Memory - task.Resreq.Memory < memoryReserved 

to replace

 r := node.Idle.Clone() 
 r = r.Sub(task.Resreq) 
 if r.MilliCPU < cpuReserved || r.Memory < memoryReserved { 

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lowang-bh hey, i changed as you suggested

@lowang-bh
Copy link
Member

/hold cancel

@volcano-sh-bot volcano-sh-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 19, 2023
@Cdayz Cdayz force-pushed the fix-issue-with-proportional-plugin branch from fa7a78a to 46e014c Compare July 20, 2023 12:29
r = r.Sub(task.Resreq)
if r.MilliCPU < cpuReserved || r.Memory < memoryReserved {

if node.Idle.MilliCPU-task.Resreq.MilliCPU < cpuReserved || node.Idle.Memory-task.Resreq.Memory < memoryReserved {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Cdayz Hi, I'm sorry for the late review. Can you demonstrate how this introduce panic? I've not gotten the point here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if node.idle has some dimension less than requst, there will be a assert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if node.idle has some dimension less than requst, there will be a assert.

OK. That's reasonable. I'm OK about the fix.

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2023
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Thor-wl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2023
@volcano-sh-bot volcano-sh-bot merged commit 2bbaab8 into volcano-sh:master Jul 26, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants