
vgpu cannot perform high-priority preemption scheduling #3186

Closed
AshinWu opened this issue Nov 9, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

AshinWu commented Nov 9, 2023

What happened:

I am using the latest version of Volcano vGPU and expect high-priority tasks to be able to preempt low-priority tasks.

Node capacity information:

status:
  capacity:
    volcano.sh/vgpu-number: '2'

volcano-scheduler.conf (configmap):

actions: "reclaim, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
- plugins:
  - name: gang
    enableJobOrder: false
    enablePreemptable: false
    enableJobStarving: false
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable GPU sharing
  - name: proportion
  - name: nodeorder
  - name: binpack

priorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "high priority"
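
Note: the two low-priority jobs below do not set a priorityClassName, so (assuming no other class is marked globalDefault in the cluster) they fall back to the default priority of 0. For contrast, a hypothetical explicit low-priority class could look like the sketch below; the name low-priority and the value 100 are illustrative and not part of the setup reported here.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority    # hypothetical, not referenced by the jobs below
value: 100
globalDefault: false
description: "low priority"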

Two low-priority tasks:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-low1
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: testjob
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        metadata:
          annotations: 
            volcano.sh/preemptable: "true"
        spec:
          containers:
            - command:
              - sleep
              - 8m
              name: cuda-container
              image: nvidia/cuda:10.1-base-ubuntu18.04
              resources:
                limits:
                  volcano.sh/vgpu-number: 1
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-low2
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: testjob
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        metadata:
          annotations: 
            volcano.sh/preemptable: "true"
        spec:
          containers:
            - command:
              - sleep
              - 10m
              name: cuda-container
              image: nvidia/cuda:10.1-base-ubuntu18.04
              resources:
                limits:
                  volcano.sh/vgpu-number: 1

One high-priority task:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-high
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: testjob
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - command:
              - sleep
              - 2m
              name: cuda-container
              image: nvidia/cuda:10.1-base-ubuntu18.04
              resources:
                limits:
                  volcano.sh/vgpu-number: 1

What you expected to happen:

The two low-priority jobs already occupy the node's entire vGPU capacity (2 vGPUs). When a high-priority job requesting 1 vGPU is created, it should evict one of the low-priority jobs so the high-priority job can run, and the evicted low-priority job should go to the Pending state. For example:

---begin---
NAME     STATUS
job-low1 Running
job-low2 Running

---wait---
NAME     STATUS
job-high Running
job-low1 Running
job-low2 Pending

---wait---
NAME     STATUS
job-high Completed
job-low1 Running
job-low2 Running

---end---
NAME     STATUS
job-high Completed
job-low1 Completed
job-low2 Completed

Is this a configuration error on my side, or a bug?

How to reproduce it (as minimally and precisely as possible):

  1. When the jobs request only CPU or memory, priority-based preemption is triggered as expected.
  2. When they request gpushare or vgpu resources, priority-based preemption never happens. The most telling part of the scheduler log is:

I1109 12:18:49.214378 1 preempt.go:43] Enter Preempt ...
I1109 12:18:49.214390 1 job_info.go:728] job job-high-14881f23-c9a4-44b9-a3cf-46e130a51b99/default actual: map[], ji.TaskMinAvailable: map[nginx:1]
I1109 12:18:49.214407 1 preempt.go:58] Job <default/job-high-14881f23-c9a4-44b9-a3cf-46e130a51b99> Queue skip preemption, reason: NotEnoughPodsOfTask, message Not enough valid pods of each task for gang-scheduling
I1109 12:18:49.214463 1 job_info.go:728] job job-low2-1d2e78fa-028d-475e-9ffc-5598d837d80b/default actual: map[nginx:1], ji.TaskMinAvailable: map[testjob:1]
I1109 12:18:49.214488 1 job_info.go:728] job job-low1-b024ff24-37f1-489b-8956-93e78c46a70c/default actual: map[nginx:1], ji.TaskMinAvailable: map[testjob:1]
I1109 12:18:49.214509 1 preempt.go:194] No Preemptors in Queue , break
I1109 12:18:49.214522 1 statement.go:378] Committing operations
I1109 12:18:49.214536 1 preempt.go:194] Leaving Preempt ...


Anything else we need to know?:

Similar problems:
#2547
#2916
...

Environment:

  • Volcano Version: 1.8.1
  • Kubernetes version (use kubectl version): 1.19
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (image)
  • Kernel (e.g. uname -a): 5.4.0-150-generic
AshinWu added the kind/bug label on Nov 9, 2023
Monokaix (Member) commented Nov 10, 2023

Hi, please try to modify volcano-scheduler.conf's actions field to "allocate, preempt, backfill" to see the result.
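
For reference, a sketch of the full volcano-scheduler.conf with only the actions line changed as suggested; the tiers section is copied unchanged from the configmap posted above.

actions: "allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
- plugins:
  - name: gang
    enableJobOrder: false
    enablePreemptable: false
    enableJobStarving: false
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable GPU sharing
  - name: proportion
  - name: nodeorder
  - name: binpack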

AshinWu (Author) commented Nov 10, 2023

Hi, please try to modify volcano-scheduler.conf's actions field to "allocate, preempt, backfill" to see the result.

@Monokaix Thank you for your reply.
Following your suggestion, it still doesn't work, and the high-priority job stays in the Pending state.
The scheduler log shows the following:

E1110 07:30:56.874510 1 device_info.go:187] deviceSharing err= not enough gpu fitted on this node
I1110 07:30:56.874524 1 predicate_helper.go:75] Predicates failed for task <default/job-testjob-nginx-0> on node : task default/job-high-testjob-0 on node node-gpu fit failed: not enough gpu fitted on this node
I1110 07:30:56.874588 1 preempt.go:108] No preemptor task in job <default/job-high-cb92f563-74f3-4912-bacd-fe230e57915a>.
I1110 07:30:56.874605 1 statement.go:352] Discarding operations ...
I1110 07:30:56.874629 1 predicates.go:384] pod(default/job--high-testjob-0) affinity require information is nil, plugin InterPodAffinity is skipped
I1110 07:30:56.874676 1 statement.go:378] Committing operations ...
I1110 07:30:56.874683 1 statement.go:378] Committing operations ...
I1110 07:30:56.885269 1 cache.go:262] Updating pod condition for default/job-high-testjob-0 to (PodScheduled==False)
I1110 07:30:56.930763 1 session.go:240] Close Session

AshinWu (Author) commented Nov 11, 2023

It seems to be the same cause as in #2916, but I noticed that that issue was fixed and the fix went into version 1.8.0.
My understanding is: because vgpu sharing is enabled, the GPU resource check in the FilterNode() function in device_info.go fails and throws "not enough gpu fitted on this node", so no candidate nodes survive the predicateNodes phase and the preemption logic is never executed. Is that correct?

Could you please provide me with some solutions or ideas? @wangyang0616 @william-wang

william-wang (Member) commented:

@archlitchi Have you encountered the same issue in your env?

Monokaix (Member) commented:

Can preemption happen for CPU/memory after modifying the scheduler config?

AshinWu (Author) commented Nov 17, 2023

Can preemption happen for CPU/memory after modifying the scheduler config?

Exactly. For CPU/memory, preemption works normally, but for vgpu/gpu-share resources preemption does not happen and the high-priority task stays in the Pending state.
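
For comparison, a minimal sketch of the kind of CPU/memory-only high-priority job that does trigger preemption here. The job name and resource amounts are illustrative; the structure mirrors the vGPU jobs above.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-high-cpu          # hypothetical name, not from the report
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  tasks:
    - replicas: 1
      name: testjob
      template:
        spec:
          containers:
            - command:
              - sleep
              - 2m
              name: cuda-container
              image: nvidia/cuda:10.1-base-ubuntu18.04
              resources:
                limits:
                  cpu: "1"          # hypothetical CPU/memory limits in place of volcano.sh/vgpu-number
                  memory: 1Gi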

Monokaix (Member) commented:

What's the node's allocatable status?
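
(For context, this refers to the node's status.allocatable section, the counterpart of the capacity block quoted at the top of the issue. On this 2-vGPU node it would be expected to report something like the sketch below; the exact value depends on the device plugin.)

status:
  allocatable:
    volcano.sh/vgpu-number: '2'   # expected to mirror the capacity shown above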

Monokaix (Member) commented:

#3450 and #3458 can solve this; you can try them using the latest version. :)

Monokaix (Member) commented:

/close

volcano-sh-bot (Contributor) commented:

@Monokaix: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
