-
Notifications
You must be signed in to change notification settings - Fork 952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add hierarchical queues for capacity plugin #3743
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Rui-Gan <[email protected]>
ref: #3590 |
/assign @Monokaix @lowang-bh @shinytang6 |
attr := cp.queueOpts[job.Queue] | ||
attr.allocated.Sub(event.Task.Resreq) | ||
metrics.UpdateQueueAllocated(attr.name, attr.allocated.MilliCPU, attr.allocated.Memory) | ||
childAttr.deserved = helpers.Max(childAttr.deserved, childAttr.guarantee) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems every queue's deserved value is calculated in openSession
, why re-calculate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the openSession function, I didn't calculate the deserved value for each queue. I'm not quite sure I understand what you mean. Could you please provide more specific details to clarify your point?
|
||
cp.updateShare(attr) | ||
// Check if the parent queue's deserved resources are less than the total deserved resources of child queues | ||
if attr.deserved.LessPartly(totalDeserved, api.Zero) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openSession
will not retuen err when there is an err of a plugin , so even thougt we return an err here, the openSession
will not end and the plugin will contiune its logic, so does this check really take effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest implementation, if an error occurs during openSession, the injected functions will directly return the logic related to rejecting the scheduling. For example, in ssn.AddJobEnqueueableFn, it immediately returns a rejection, preventing the job from being enqueued. What do you think about this approach?
pkg/webhooks/admission/queues/mutate/mutate_hierarchical_queue.go
Outdated
Show resolved
Hide resolved
|
||
switch ar.Request.Operation { | ||
case admissionv1.Create, admissionv1.Update: | ||
err = closeQueue(queue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The close or closing state is updated by queue controller, and it will retry when err occured, so if we return err when user want to close root queue, the queue controller will keep re-enqueue and process the queue and never end, which is unacceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also seems that we didn't check whether hierarchical queue is enable of capacity plugin and mutate it directly, which is also inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this scenario, I have conducted tests. It indeed results in the queue controller repeatedly processing the queue until it reaches the maximum number of attempts and stops retrying.
As mentioned in your comment below, I agree that we can consider adding the logic for opening/closing queues in a hierarchical queue structure within the queue controller. However, we can discuss whether it is still necessary to include the relevant logic in the webhook as a fallback handling mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also seems that we didn't check whether hierarchical queue is enable of capacity plugin and mutate it directly, which is also inconsistent.
In the latest implementation, if the hierarchical queue feature of the capacity plugin needs to be enabled, the webhook's configuration file must set "enableHierarchyCapacity" to true. What do you think about this approach, or do you have any suggestions?
} | ||
|
||
// HierarchicalQueues validate queues when capacity plugin enables hierarchy. | ||
func HierarchicalQueues(ar admissionv1.AdmissionReview) *admissionv1.AdmissionResponse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the consideration here?
Here will also cause the problem that queue controller keep retrying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't cause the queue controller to keep retrying. The issue you mentioned only occurs when performing operations related to opening or closing queues. However, there might be some scenarios that I haven't considered. Could you provide a specific example to illustrate such a case?
Is somewhere validate that job can only be submitted to leaf node? |
Please add some E2E test for this function. Because this function is very important. |
} | ||
} | ||
|
||
func closeQueue(queue *schedulingv1beta1.Queue) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put these logic to volcano controller is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that the logic for open/close queue can be added to the controller, but do you think it's necessary to include this logic in the webhook as well for fallback handling
Signed-off-by: Rui-Gan <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I will add it later |
I have already added this logic in the validate webhook of the job to ensure that jobs can only be submitted to leaf queues. |
config.ConfigData.Unlock() | ||
if enableHierarchyCapacity { | ||
if queue.Name == "root" { | ||
msg += fmt.Sprintf(" can not submit job to root queue;") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a scenario where there is only one root queue, can the root queue be considered a leaf queue and allocated jobs on it?
No description provided.