Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for volumeBinding plugin #2973

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

chxk
Copy link

@chxk chxk commented Jul 13, 2023

This submission is based on a previous PR(#2506), and fixes the issue where the volumeBinding plugin did not correctly filter nodes when performing predicates on multiple tasks of a job.

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign qiankunli
You can assign the PR to them by writing /assign @qiankunli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 13, 2023
@chxk
Copy link
Author

chxk commented Jul 13, 2023

/assign @qiankunli @k82cn @william-wang

@william-wang
Copy link
Member

@chxk Thanks for the pr :) Would you log a issue firstly, and add provide the following information?
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):

@chxk
Copy link
Author

chxk commented Jul 16, 2023

@chxk Thanks for the pr :) Would you log a issue firstly, and add provide the following information? What happened: What you expected to happen: How to reproduce it (as minimally and precisely as possible):

ok @william-wang
What happened:
Currently Volcano can't filter nodes by PV, which can't meet our requirements. We found a PR(#2506) which added this feature support (based on the volumeBinding plugin) before, but it was reverted.
After investigation, we found the problem introduced by this PR:
When allocating nodes for each task in a job, it uses the ssn.cache.AllocateVolumes, which based on an already initialized SchedulerVolumeBinder, and will update its pvCache and pvcCache. However, every time a task is predicated, a new volumeBinding plugin is initialized, and the empty pvCache and pvcCache are used for Filter, instead of which in the initialized SchedulerVolumeBinder. This can potentially cause a PV to be bound to multiple different PVCs.

What you expected to happen:
Add volumeBinding feature to Volcano and it should work normally.
After our changes were merged and deployed, we have not observed the memory leak issue that was previously mentioned in #2555

How to reproduce it (as minimally and precisely as possible):
based on a version with PR(#2506):

  1. Create a local PV with storageClass fast-disks.
  2. Create a job with two pods, each requesting a different PVC(storageClass is both fast-disks) and using a nodeSelector pointing to the node of the local PV. The yaml file is as following
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: test-binding
  namespace: default
spec:
  minMember: 2
  queue: default
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-binding-1
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: fast-disks
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-binding-2
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: fast-disks
---
apiVersion: v1
kind: Pod
metadata:
  name: test-binding-1
  namespace: default
  annotations:
    scheduling.k8s.io/group-name: test-binding
spec:
  schedulerName: volcano
  nodeSelector:
    testpvc: owner
  containers:
  - name: test
    image: <your image>
    resources:
      limits:
        cpu: 3
    command:
      - /bin/bash
      - -c
      - sleep 1d
    volumeMounts:
    - name: hahaha
      mountPath: /fast-disks
  volumes:
  - name: hahaha
    persistentVolumeClaim:
      claimName: test-binding-1
---
apiVersion: v1
kind: Pod
metadata:
  name: test-binding-2
  namespace: default
  annotations:
    scheduling.k8s.io/group-name: test-binding
spec:
  schedulerName: volcano
  nodeSelector:
    testpvc: owner
  containers:
  - name: test
    image: <your image>
    resources:
      limits:
        cpu: 3
    command:
      - /bin/bash
      - -c
      - sleep 1d
    volumeMounts:
    - name: hahaha
      mountPath: /fast-disks
  volumes:
  - name: hahaha
    persistentVolumeClaim:
      claimName: test-binding-2
  1. Observe the result: the job gets scheduled, but only one pod's PVC gets bound to the local PV.

@Thor-wl Thor-wl requested review from william-wang, hwdef and Thor-wl and removed request for merryzhou and alcorj-mizar July 24, 2023 03:18
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about this area.
I think the code is ok.

@hwdef
Copy link
Member

hwdef commented Jul 25, 2023

lgtm

@chxk
Copy link
Author

chxk commented Aug 1, 2023

I don't know much about this area. I think the code is ok.

Thank you for review.
Could you help review this pr please? @Thor-wl @william-wang

@stale
Copy link

stale bot commented Oct 15, 2023

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2023
@stale stale bot closed this Dec 15, 2023
@lowang-bh
Copy link
Member

/reopen

@volcano-sh-bot
Copy link
Contributor

@lowang-bh: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2023
@bibibox
Copy link

bibibox commented Feb 27, 2024

i met same problem, and i think this PR can solve it greatly

@bibibox
Copy link

bibibox commented Feb 27, 2024

lgtm

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2024
@volcano-sh-bot
Copy link
Contributor

@chxk: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants