volcano calculates the true capacity #3684

Open
ls-2018 opened this issue Aug 20, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ls-2018

ls-2018 commented Aug 20, 2024

Description

Current total resources

Only one node is schedulable; the other nodes cannot be scheduled.

That node has 60 cpu cores (60c).


Queue settings

Three queues are configured:

1. default queue: weight=1, with 0.7c allocated

2. low-priority-queue: weight=1 (the default) and guarantee=15c

3. deployment-queue: weight=88

Steps to reproduce the issue

Scenario 1

Create 3 podgroups in sequence under deployment-queue:

1. 1 pod with 20c cpu usage

2. 1 pod with 5c cpu usage

3. 2 pods with 20c cpu usage

Expected result: the first two podgroups are Inqueue and the third is Pending.

Actual result: all three podgroups are Inqueue. The third podgroup becomes Pending only if its cpu usage is raised to 21c.

As seen in the source code, whether a podgroup can go from Pending to Inqueue depends on the queue's realCapability, the current podgroup's minResource, and the queue's allocated, inqueue, and elastic resources (elastic being the sum of minResource-Allocated for all podgroups).

All current resource values:

podgroup minResource  cpu 20000.00

queue realCapability cpu 45000.00

queue allocated cpu 0.00

queue inqueue cpu 25000.00

queue elastic cpu 0.00

The most problematic value above is the queue's realCapability. The default queue already has 0.7c allocated and low-priority-queue reserves 15c through its guarantee, so a maximum capacity of 60 - 0.7 - 15 = 44.3c for deployment-queue would be more appropriate.
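To make the arithmetic concrete, here is a minimal Python sketch of the enqueue check as described in this issue. It is not the scheduler's actual code: the function name `can_enqueue` and the exact inequality (minResource + allocated + inqueue - elastic <= realCapability) are assumptions based on the description above. All values are in milli-CPU.

```python
def can_enqueue(min_resource, real_capability, allocated, inqueue, elastic):
    """Hypothetical helper: True if a Pending podgroup fits under realCapability."""
    return min_resource + allocated + inqueue - elastic <= real_capability


# Queue values reported above for deployment-queue (milli-CPU).
reported = dict(real_capability=45000, allocated=0, inqueue=25000, elastic=0)

# Same queue state, but with the capacity suggested in this issue: 60c - 0.7c - 15c.
proposed = dict(reported, real_capability=60000 - 700 - 15000)

print(can_enqueue(20000, **reported))  # True: 45000 <= 45000, third podgroup is Inqueue
print(can_enqueue(21000, **reported))  # False: 46000 > 45000, third podgroup stays Pending
print(can_enqueue(20000, **proposed))  # False: 45000 > 44300, the expected behaviour
```

Under these assumptions, the 20c podgroup lands exactly on the 45000 boundary, which is why it is admitted at 20c and rejected at 21c.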

Scenario 2

  • Create a podgroup and a pod under low-priority-queue so that the queue's guarantee (its reserved resources) takes effect

  • Create 3 podgroups in turn under deployment-queue:

    1. 1 pod with 10c cpu usage

    2. 1 pod with 20c cpu usage

    3. 1 pod with 5c cpu usage

    Then create the 3 pods in turn, one for each of the 3 podgroups above

Modify the 1st podgroup to 2 pods with 20c cpu usage.

At this point, the status of the podgroup becomes Unknown.

After the 2nd pod of the 1st podgroup was started, volcano-scheduler crashed and could no longer be started properly.

From the source code, deserved is calculated as follows (values are listed as default/low-priority-queue/deployment-queue):

1. Compute the share by weight: 731.71/731.71/58536.59.

2. Take the smaller of that and realCapability (45000/60000/45000).

3. Take the smaller of that and request (700.00/100000.00/45000.00).

4. Take the larger of that and guarantee (0/15000/0).

This gives deserved values of 700.00/15000.00/45000.00.

Their sum is 60700.00, which is more than the total resources of 60000.00.

The problem again seems to be the queue's realCapability, which also affects the request value. Since the default queue already has 0.7c allocated and low-priority-queue reserves 15c, a maximum capacity of 60 - 0.7 - 15 = 44.3c for deployment-queue would be more appropriate.
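The calculation can be replayed with a short Python sketch. This is a single-pass simplification of the steps described in this issue, not the plugin's real implementation, and the `deserved` function and dictionary layout are hypothetical. The weights 1:1:80 are an assumption chosen because they reproduce the reported shares 731.71/731.71/58536.59, whereas the queue settings earlier in this issue list weight=88 for deployment-queue. Values are in milli-CPU.

```python
TOTAL = 60000  # total cluster cpu in milli-CPU, as reported

# Per-queue inputs taken from the figures quoted in this issue.
# The weight of 80 is an assumption (see the note above the block).
queues = {
    "default": dict(weight=1, real_capability=45000, request=700, guarantee=0),
    "low-priority-queue": dict(weight=1, real_capability=60000, request=100000, guarantee=15000),
    "deployment-queue": dict(weight=80, real_capability=45000, request=45000, guarantee=0),
}


def deserved(total, queues):
    """Hypothetical one-pass version of the deserved calculation described above."""
    total_weight = sum(q["weight"] for q in queues.values())
    result = {}
    for name, q in queues.items():
        value = total * q["weight"] / total_weight  # share by weight
        value = min(value, q["real_capability"])    # cap at realCapability
        value = min(value, q["request"])            # cap at request
        value = max(value, q["guarantee"])          # raise to guarantee
        result[name] = value
    return result


d = deserved(TOTAL, queues)
print(d)                # default 700, low-priority-queue 15000, deployment-queue 45000
print(sum(d.values()))  # 60700, which exceeds TOTAL by 700
```

With a realCapability of 44300 for deployment-queue instead of 45000, the same steps would sum to exactly 60000.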


Describe the results you received and expected

Expected: pods can be created normally and volcano does not report errors.

What version of Volcano are you using?

1.10

Any other relevant information

No response

@ls-2018 ls-2018 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 20, 2024
@hwdef
Member

hwdef commented Sep 9, 2024

@Monokaix Please check this.

@lowang-bh
Member

Please use the fix in #3106

@Monokaix
Member

cc @JesseStutler

@JesseStutler
Contributor

@ls-2018 For scenario 1, do you mean that we should take the resources already allocated to other queues into account when deciding whether a podgroup can be enqueued? I think the current logic is correct: when deciding whether a podgroup can be enqueued, we use this queue's (allocated + inqueue - elastic) and do not consider other queues' allocated resources. In your scenario the podgroup can be enqueued, but its pods can't be allocated, so it works fine.

@JesseStutler
Contributor

I can't reproduce scenario 2. Are you using a deployment rather than a vcjob? Why is there no 0.7c pod in your image?

@JesseStutler
Contributor

It seems that in scenario 2 the call to Sub causes the panic, but I don't know why. I can't reproduce it.
