fix: schedule message burst with node number increase #3051
Conversation
Signed-off-by: lowang-bh <[email protected]>
e442c01 to cf25a10
/priority critical-urgent
/assign @wangyang0616 @hwdef @Thor-wl @Yikun @william-wang
/lgtm
Thanks
/assign @Thor-wl @william-wang
1. Improve performance. 2. Fix pod messages containing many ','. Signed-off-by: lowang-bh <[email protected]>
A successfully predicated node has an empty reason, so the joined string contains empty strings between ','. Signed-off-by: lowang-bh <[email protected]>
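As a rough illustration of this second fix (dropping the empty reasons of nodes that passed a predicate before joining the messages), here is a minimal, self-contained Go sketch; the helper name joinNonEmpty is hypothetical and is not the code in this PR.

package main

import (
	"fmt"
	"strings"
)

// joinNonEmpty drops empty reasons before joining, so predicates that
// succeeded (and therefore produced an empty reason) no longer leave
// consecutive ',' in the aggregated failure message.
func joinNonEmpty(reasons []string) string {
	nonEmpty := make([]string, 0, len(reasons))
	for _, r := range reasons {
		if r != "" {
			nonEmpty = append(nonEmpty, r)
		}
	}
	return strings.Join(nonEmpty, ", ")
}

func main() {
	// The "" entries stand for predicates that succeeded on the node.
	reasons := []string{"", "Insufficient cpu", "", "node(s) didn't match pod affinity rules"}
	fmt.Println(joinNonEmpty(reasons))
	// Prints: Insufficient cpu, node(s) didn't match pod affinity rules
}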
Hi @hwdef, I updated this PR. Please help to review. Thanks.
if err != nil {
	return predicateStatus, fmt.Errorf("plugin %s predicates failed %s", nodeports.Name, status.Message())
}
if nodePortStatus.Code != api.Success {
I think there will be problems in judging the return of api.Success, for example the following scenario:
When preempt filters candidate nodes, the nodeports plugin filter returns Unschedulable while the podAffinity filter returns UnschedulableAndUnresolvable. If you simply check for non-api.Success and return, the node will be mistakenly added to the candidate node list for preemption, even though it does not actually meet the preemption conditions.
What's the root cause? Allocate, preempt, and backfill each have their own function to check the returned predicateStatus set, and the preempt action does not check the Unschedulable reason.
volcano/pkg/scheduler/actions/preempt/preempt.go
Lines 211 to 225 in 258ad60
predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) ([]*api.Status, error) {
	// Allows scheduling to nodes that are in Success or Unschedulable state after filtering by predicate.
	var statusSets util.StatusSets
	statusSets, err := ssn.PredicateFn(task, node)
	if err != nil {
		return nil, fmt.Errorf("preempt predicates failed for task <%s/%s> on node <%s>: %v",
			task.Namespace, task.Name, node.Name, err)
	}
	if statusSets.ContainsUnschedulableAndUnresolvable() || statusSets.ContainsErrorSkipOrWait() {
		return nil, fmt.Errorf("predicates failed in preempt for task <%s/%s> on node <%s>, status is not success or unschedulable",
			task.Namespace, task.Name, node.Name)
	}
	return nil, nil
}
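For readers unfamiliar with the helpers used above, here is a self-contained sketch of what a StatusSets-style check might look like. The Status and Code types and the exact set of codes are stand-ins modelled on how api.Status is used in this diff, not the actual definitions in the scheduler packages.

package main

import "fmt"

// Stand-in types for volcano's api.Status and its status codes; the real
// definitions live in pkg/scheduler/api and may differ in detail.
type Code int

const (
	Success Code = iota
	Error
	Unschedulable
	UnschedulableAndUnresolvable
	Wait
	Skip
)

type Status struct {
	Code   Code
	Reason string
}

// StatusSets mirrors the checks in the preempt predicateFn above: a node stays
// a preemption candidate only if every status is Success or plain Unschedulable.
type StatusSets []*Status

func (s StatusSets) ContainsUnschedulableAndUnresolvable() bool {
	for _, st := range s {
		if st != nil && st.Code == UnschedulableAndUnresolvable {
			return true
		}
	}
	return false
}

func (s StatusSets) ContainsErrorSkipOrWait() bool {
	for _, st := range s {
		if st != nil && (st.Code == Error || st.Code == Skip || st.Code == Wait) {
			return true
		}
	}
	return false
}

func main() {
	statuses := StatusSets{
		{Code: Unschedulable, Reason: "Insufficient cpu"},                             // preemption could free cpu
		{Code: UnschedulableAndUnresolvable, Reason: "node(s) had untolerated taint"}, // eviction cannot fix a taint
	}
	// true: this node must not be treated as a preemption candidate.
	fmt.Println(statuses.ContainsUnschedulableAndUnresolvable())
}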
@wangyang0616 I know what you mean; the root cause is that some predicate plugins do not need to run during preemption, right? There are two solutions:
- keep the original changes: put all predicate results into the final result and filter out those with reason == "" when joining them, although continuing on the same node with the other predicate plugins has some performance cost.
- add an additional plugin-enable switch, independent of each action, to control which plugins each action should use (a rough sketch of this idea follows below).
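A very rough sketch of what that second option could look like; the per-action switch below is purely hypothetical (it is not an existing Volcano configuration field), and the specific plugin entries are placeholders rather than a recommendation.

package main

import "fmt"

// Hypothetical per-action predicate-plugin switch: each action only runs the
// plugins enabled for it.
var enabledPredicatePlugins = map[string]map[string]bool{
	"allocate": {"nodeports": true, "interpodaffinity": true, "tainttoleration": true},
	"preempt":  {"interpodaffinity": true, "tainttoleration": true},
	"backfill": {"nodeports": true},
}

func pluginEnabled(action, plugin string) bool {
	return enabledPredicatePlugins[action][plugin]
}

func main() {
	fmt.Println(pluginEnabled("preempt", "nodeports"))        // false: preempt would skip this predicate entirely
	fmt.Println(pluginEnabled("preempt", "interpodaffinity")) // true
}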
For resource-quota-style plugins, the predicate returns an Unschedulable status, indicating that the current node does not allow allocation but can be preempted.
For plugins tied to inherent properties of nodes (such as taints or affinity), UnschedulableAndUnresolvable is returned, indicating that neither allocate nor preempt is possible.
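To make that convention concrete, a small illustration reusing the stand-in Status and Code types from the sketch above (again, not the actual plugin code): a resource-style predicate reports plain Unschedulable, which preemption may still resolve, while a node-property predicate such as taints reports UnschedulableAndUnresolvable, which eviction cannot fix.

// Illustrative only, built on the stand-in Status/Code types from the earlier sketch.
func quotaLikePredicate(freeCPU, requestedCPU int64) *Status {
	if requestedCPU > freeCPU {
		// Evicting pods can free cpu, so preemption may still succeed here.
		return &Status{Code: Unschedulable, Reason: "Insufficient cpu"}
	}
	return &Status{Code: Success}
}

func taintLikePredicate(hasUntoleratedTaint bool) *Status {
	if hasUntoleratedTaint {
		// A taint is a property of the node; evicting pods cannot remove it,
		// so neither allocate nor preempt can place the pod on this node.
		return &Status{Code: UnschedulableAndUnresolvable, Reason: "node(s) had untolerated taint"}
	}
	return &Status{Code: Success}
}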
I think plan 1 will be better. The action formulates the predicate rules, and each plugin implements its own capabilities according to those rules; it is best for the action to be unaware of each plugin's capabilities.
What do the other reviewers think? @hwdef @Thor-wl @william-wang
I looked at the processing of preempt. During preemption, pods on the node keep being evicted until certain conditions are met, and then eviction stops.
The condition for stopping eviction is that cpu, memory, and extended resources such as gpu meet the preemptor's requirements. There is no check of the corresponding affinity, antiAffinity, topologySpread, or other conditions, so preempt thinks the node already meets the preemptor's scheduling conditions, but the actual allocation will still be blocked by plugins such as affinity and antiAffinity, resulting in preemption failure.
volcano/pkg/scheduler/actions/preempt/preempt.go
Lines 274 to 279 in 36abf1b
for !victimsQueue.Empty() {
	// If reclaimed enough resources, break loop to avoid Sub panic.
	// If preemptor's queue is overused, it means the preemptor can not be allocated, so no need to care about the node's idle resource.
	if !ssn.Overused(currentQueue) && preemptor.InitResreq.LessEqual(node.FutureIdle(), api.Zero) {
		break
	}
For example, issue #3068 describes this problem.
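A self-contained toy model of the gap being described (names and checks are illustrative, not Volcano code): the eviction loop stops once resources suffice, so a node can look preemptable even though an affinity, anti-affinity, or topology-spread predicate would still reject the pod at allocation time.

package main

import "fmt"

type victim struct {
	name string
	cpu  int64
}

// evictUntilFit models the loop above: keep "evicting" victims until the idle
// cpu covers the preemptor's request, then stop without any further checks.
func evictUntilFit(idleCPU, requestCPU int64, victims []victim) (int64, []string) {
	evicted := []string{}
	for _, v := range victims {
		if idleCPU >= requestCPU {
			break
		}
		idleCPU += v.cpu
		evicted = append(evicted, v.name)
	}
	return idleCPU, evicted
}

func main() {
	idle, evicted := evictUntilFit(1, 4, []victim{{"pod-a", 2}, {"pod-b", 2}})
	resourcesFit := idle >= 4
	affinityStillFails := true // e.g. a required anti-affinity rule that eviction cannot change
	fmt.Println(resourcesFit, evicted)
	// Preempt declares success based on resourcesFit alone, but allocation is
	// later rejected because affinityStillFails — the scenario in #3068.
	fmt.Println("wasted preemption:", resourcesFit && affinityStillFails)
}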
> I think there will be problems in judging the return of api.Success, for example the following scenario:
> When preempt filters candidate nodes, the nodeports plugin filter returns Unschedulable while the podAffinity filter returns UnschedulableAndUnresolvable. If you simply check for non-api.Success and return, the node will be mistakenly added to the candidate node list for preemption, even though it does not actually meet the preemption conditions.
Preemption should be rejected in this scenario. Currently, preemption for this kind of constraint is not supported; preemption is only effective for resource usage.
I agree with your current modification plan.
/lgtm
if err != nil {
	return predicateStatus, fmt.Errorf("plugin %s predicates failed %s", interpodaffinity.Name, status.Message())
}
if podAffinityStatus.Code != api.Success {
	predicateStatus = append(predicateStatus, podAffinityStatus)
	return predicateStatus, nil
According to the analysis of issue #3068, I think it is better to return an error here. The reason is as described in the comments: currently, Volcano only supports resource preemption and does not support preemption on algorithmic constraints such as affinity or topology.
If it is convenient, you can update it in your PR, or I can open another PR to fix it.
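As I read the suggestion, the change would look roughly like the fragment below, continuing the excerpt shown above; the Reason field and the exact error wording are assumptions, and whether this lands in this PR or a follow-up is left open in the thread.

if podAffinityStatus.Code != api.Success {
	predicateStatus = append(predicateStatus, podAffinityStatus)
	// Suggested: fail hard instead of returning nil, because Volcano's preemption
	// currently cannot resolve affinity/topology constraints, so this node must
	// not be kept as a preemption candidate.
	return predicateStatus, fmt.Errorf("plugin %s predicates failed: %s",
		interpodaffinity.Name, podAffinityStatus.Reason)
}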
Please help to review. Thanks. @william-wang
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: william-wang.
Fixes #3049, #2975.
Fixes #3050 (by the second commit).
Signed-off-by: lowang-bh [email protected]