-
Notifications
You must be signed in to change notification settings - Fork 971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: Add dynamic resource allocation(DRA) plugin #3799
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cc |
@@ -25,6 +25,7 @@ custom: | |||
leader_elect_enable: false | |||
enabled_admissions: "/jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate" | |||
colocation_enable: false | |||
feature_gates: DynamicResourceAllocation=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to enable this feature-gate by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? Do we need to enable it by default? I think that if users need to enable dra plugin, they also need to reconfigure the scheduler plugins configmap to enable it, it also looks like featuregate. If we don't enable this feture-gate, users need to do reconfiguring the scheduler plugins configmap and enabling the feature-gate two steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK we need not to enable this feature-gate by default, I will add some informers in schedulerCache, some users may need to use k8s below v1.31, if there are no APIs like DeviceClass
, ResourceClaim
, etc registered, it will cause some errors.
@@ -0,0 +1,808 @@ | |||
/* | |||
Copyright 2022 The Kubernetes Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this part directly imported into the scheduler code? Or has it been modified and adapted (I don't know much about it)? If not, should it be marked in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just changed the New
to NewDRAPlugin
and deleted EventsToRegister
and some related codes with it, I think volcano doesn't need them right now. These codes are copied from the kubernetes repo, so the license needs to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understand, if so, if the upstream code changes, do I need to make the same changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, before the dynamicResources
struct changed to DynamicResources
, we can only copy the whole codes related into volcano, otherwise we can't directly call those extension points. It may cause us to miss tracking some repaired codes, but if we can access the DynamicResources
struct in v1.32, the problem will be solved.
de77070
to
b120b42
Compare
Signed-off-by: jessestutler <[email protected]>
Signed-off-by: jessestutler <[email protected]>
Signed-off-by: jessestutler <[email protected]>
Signed-off-by: jessestutler <[email protected]>
b120b42
to
72d27ea
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it:
Background
Inspired by #3798 , introduce DRA into volcano. It solves the following things:
AddPreBindFn
orAddBindFn
in session #3618dynamicresources
struct can't be accessed outside the k8s.io/kuberentes package, if we need to call k8s extension points such asPreFilter
,Filter
, etc., we have to access the struct. After this pr is merged and volcano upgrade the k8s version to 1.32: Refactor the dynamicResources struct to DynamicResources kubernetes/kubernetes#128399, we may remove these codes.See docs/design/dynamicresources-plugin.md for more details.
Verfication
I followed up https://github.com/kubernetes-sigs/dra-example-driver to deploy an example resource driver and deploy the demo yaml gpu-test{1,2,3,4,5}.yaml, the pod can be successfully scheduled:
GPU information printed in containers:
And allocation result in ResourceClaim is successfully updated: