volcano calculates the true capacity #3684

ls-2018 · 2024-08-20T10:16:47Z

Description

Current total resources

Only one node is available, and other nodes cannot be scheduled

The value of a node resource is 60 based on cpu

queue Settings

Three queues are configured

1.deault queue: weight=1 and 0.7c are allocated

In the low-priority-queue queue, the default weight is 1 and guarantee is 15

3.deployment-queue. weight=88

Steps to reproduce the issue

Scenario 1

Create 3 podgroups in sequence under the deployment-queue queue

1.1 pod with 20c cpu usage

2.1 pod with 5c cpu usage

3.2 pods with 20c cpu usage

The expected result is: the first two podgroups are in Inqueue and the third is Pending.

The actual result is: all three podgroups are Inqueue, but the third podgroup will be Pending only if its cpu occupancy is 21c.

As seen in the source code, determining whether you can go from Pending to Inqueue is a function of the queue's realCapability-current-resource-queue-allocated-queue-inqueue-queue-elastic (the sum of minResource-Allocated for all podgroups)

All current resource values:

podgroup minResource  cpu 20000.00

queue realCapability cpu 45000.00

queue allocated cpu 0.00

queue inqueue cpu 25000.00

queue elastic cpu 0.00

The above value is the most problematic queue realCapability, the deault queue has been allocated 0.7c, low-priority-queue queue pre-occupation of 15c, then the maximum capacity of the deployment-queue queue should be 60-0.7-15 = 44.3 is more appropriate.

Scenario 2

Create a podgroup and pod under the low-priority-queue queue to enable preemption of the low-priority-queue queue
Create 3 podgroups in turn under the deployment-queue queue

1.1 pod, cpu occupancy 10c

2.1 pods with 20c cpu usage

3.1 pods with 5 cpu usage

Create 3 pods in turn using the last 3 podgroups

! img

Modify the 1st podgroup to 2 pods with 20c cpu usage.

At this point, the status of the podgroup becomes Unknown.

Started the 1st podgroup as the 2nd pod, volcano-scheduler crashed and could no longer be started properly

From the source code, the deserved makes is first calculated by weight (deault/low-priority-queue/deployment-queue:731.71/731.71/58536.59) and then compared to the realCapability (real capacity deault/low- priority-queue/deployment-queue:45000/60,000/45000) is taken to be small, and then compared to request (request volume deault/low-priority-queue/deployment-queue:700.00/100000.00/ 45000.00) is taken as small, and then with guarantee (preoccupation deault/low-priority-queue/deployment-queue:0/15000/0) is taken as large.

The deault/low-priority-queue/deployment-queue:700.00/15000.00/45000.00 is obtained by the above calculation.

And then add the sum to get 60700.00, more than the total resources 60000.00

It feels like the problem is still the queue realCapability, which also affects the request value. It feels like the deault queue has been allocated 0.7c, and the low-priority-queue queue is preoccupied with 15c, so the maximum capacity of the deployment-queue queue should be 60-0.7-15=44.3 more appropriately.

Describe the results you received and expected

Pods can be created normally and volcano does not report errors

What version of Volcano are you using?

1.10

Any other relevant information

No response

The text was updated successfully, but these errors were encountered:

hwdef · 2024-09-09T08:32:03Z

@Monokaix Please check this
.

lowang-bh · 2024-09-09T13:45:20Z

Please use the fix in #3106

Monokaix · 2024-09-10T12:01:23Z

cc @JesseStutler

JesseStutler · 2024-09-12T09:00:39Z

@ls-2018 As for scenario 1, you mean that we should consider other queue already allocated resources when calculating whether the podgroup can be inqueued? I think the logic now works correctly, when we calculate whether the podgroup can be inqueued, we use queue's (allocated + inqueue - elastic), not consider other queues allocated resource, in your scenario, although the podgroup can be inqueued, but the pod can'be allocated, so it works fine.

JesseStutler · 2024-09-12T13:05:03Z

I can't reproduce scenario 2, do you use the deployment rather than vcjob? Why is here no 0.7c pod in your image?

JesseStutler · 2024-09-12T13:06:08Z

Seems that in scenario 2, Sub causes the panic, but why is that...I can't reproduce it

ls-2018 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

volcano calculates the true capacity #3684

volcano calculates the true capacity #3684

ls-2018 commented Aug 20, 2024 •

edited

Loading

hwdef commented Sep 9, 2024

lowang-bh commented Sep 9, 2024

Monokaix commented Sep 10, 2024

JesseStutler commented Sep 12, 2024

JesseStutler commented Sep 12, 2024

JesseStutler commented Sep 12, 2024

volcano calculates the true capacity #3684

volcano calculates the true capacity #3684

Comments

ls-2018 commented Aug 20, 2024 • edited Loading

Description

Current total resources

queue Settings

Steps to reproduce the issue

Scenario 1

Scenario 2

Describe the results you received and expected

What version of Volcano are you using?

Any other relevant information

hwdef commented Sep 9, 2024

lowang-bh commented Sep 9, 2024

Monokaix commented Sep 10, 2024

JesseStutler commented Sep 12, 2024

JesseStutler commented Sep 12, 2024

JesseStutler commented Sep 12, 2024

ls-2018 commented Aug 20, 2024 •

edited

Loading