Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add hierarchical queues for capacity plugin #3743

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Rui-Gan
Copy link

@Rui-Gan Rui-Gan commented Sep 23, 2024

No description provided.

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 23, 2024
@hwdef
Copy link
Member

hwdef commented Sep 23, 2024

ref: #3590

@hwdef
Copy link
Member

hwdef commented Sep 23, 2024

/assign @Monokaix @lowang-bh @shinytang6
Please take a look 👀

attr := cp.queueOpts[job.Queue]
attr.allocated.Sub(event.Task.Resreq)
metrics.UpdateQueueAllocated(attr.name, attr.allocated.MilliCPU, attr.allocated.Memory)
childAttr.deserved = helpers.Max(childAttr.deserved, childAttr.guarantee)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems every queue's deserved value is calculated in openSession, why re-calculate here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the openSession function, I didn't calculate the deserved value for each queue. I'm not quite sure I understand what you mean. Could you please provide more specific details to clarify your point?


cp.updateShare(attr)
// Check if the parent queue's deserved resources are less than the total deserved resources of child queues
if attr.deserved.LessPartly(totalDeserved, api.Zero) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openSession will not retuen err when there is an err of a plugin , so even thougt we return an err here, the openSession will not end and the plugin will contiune its logic, so does this check really take effect?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest implementation, if an error occurs during openSession, the injected functions will directly return the logic related to rejecting the scheduling. For example, in ssn.AddJobEnqueueableFn, it immediately returns a rejection, preventing the job from being enqueued. What do you think about this approach?


switch ar.Request.Operation {
case admissionv1.Create, admissionv1.Update:
err = closeQueue(queue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The close or closing state is updated by queue controller, and it will retry when err occured, so if we return err when user want to close root queue, the queue controller will keep re-enqueue and process the queue and never end, which is unacceptable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seems that we didn't check whether hierarchical queue is enable of capacity plugin and mutate it directly, which is also inconsistent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this scenario, I have conducted tests. It indeed results in the queue controller repeatedly processing the queue until it reaches the maximum number of attempts and stops retrying.
As mentioned in your comment below, I agree that we can consider adding the logic for opening/closing queues in a hierarchical queue structure within the queue controller. However, we can discuss whether it is still necessary to include the relevant logic in the webhook as a fallback handling mechanism.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seems that we didn't check whether hierarchical queue is enable of capacity plugin and mutate it directly, which is also inconsistent.

In the latest implementation, if the hierarchical queue feature of the capacity plugin needs to be enabled, the webhook's configuration file must set "enableHierarchyCapacity" to true. What do you think about this approach, or do you have any suggestions?

}

// HierarchicalQueues validate queues when capacity plugin enables hierarchy.
func HierarchicalQueues(ar admissionv1.AdmissionReview) *admissionv1.AdmissionResponse {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the consideration here?
Here will also cause the problem that queue controller keep retrying.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't cause the queue controller to keep retrying. The issue you mentioned only occurs when performing operations related to opening or closing queues. However, there might be some scenarios that I haven't considered. Could you provide a specific example to illustrate such a case?

@Monokaix
Copy link
Member

Is somewhere validate that job can only be submitted to leaf node?

@hwdef
Copy link
Member

hwdef commented Sep 26, 2024

Please add some E2E test for this function. Because this function is very important.

}
}

func closeQueue(queue *schedulingv1beta1.Queue) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put these logic to volcano controller is better.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that the logic for open/close queue can be added to the controller, but do you think it's necessary to include this logic in the webhook as well for fallback handling

Signed-off-by: Rui-Gan <[email protected]>
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lowang-bh
You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Rui-Gan
Copy link
Author

Rui-Gan commented Sep 29, 2024

Please add some E2E test for this function. Because this function is very important.

I will add it later

@Rui-Gan
Copy link
Author

Rui-Gan commented Sep 29, 2024

Is somewhere validate that job can only be submitted to leaf node?

I have already added this logic in the validate webhook of the job to ensure that jobs can only be submitted to leaf queues.

config.ConfigData.Unlock()
if enableHierarchyCapacity {
if queue.Name == "root" {
msg += fmt.Sprintf(" can not submit job to root queue;")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a scenario where there is only one root queue, can the root queue be considered a leaf queue and allocated jobs on it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants