KEP: add scheduler policy design doc and code #152
Conversation
Yes, but we need to take NUMA information into consideration. Because we bind GPUs to tasks in hami-scheduler, we can't use the TopologyManager in the kubelet configuration. If we implement the 'binpack' and 'spread' scheduling policies combined with NUMA, we need to implement four strategies: 'binpack-numaEnforce', 'binpack-numaBesteffort', 'spread-numaEnforce', and 'spread-numaBesteffort'. Information about the TopologyManager and NUMA policies: https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/topology-manager/
A NUMA-aware scheduler could reuse the noderesourcetopology plugin from https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/noderesourcetopology.
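As a rough illustration of the four combined strategies mentioned above, here is a minimal Go sketch; the type, constant, and function names (NodeSchedulerPolicy, NumaPolicy, CombinedStrategy) are hypothetical and not existing HAMi identifiers.

package policy

// NodeSchedulerPolicy is the binpack/spread choice made by hami-scheduler.
type NodeSchedulerPolicy string

// NumaPolicy mirrors the kubelet TopologyManager preferences that cannot be
// used directly, because GPU binding happens in hami-scheduler, not in kubelet.
type NumaPolicy string

const (
	Binpack NodeSchedulerPolicy = "binpack"
	Spread  NodeSchedulerPolicy = "spread"

	NumaEnforce    NumaPolicy = "numaEnforce"    // reject placements that cross NUMA nodes
	NumaBesteffort NumaPolicy = "numaBesteffort" // prefer single-NUMA placement, fall back otherwise
)

// CombinedStrategy names one of the four strategies, e.g. "binpack-numaEnforce".
func CombinedStrategy(p NodeSchedulerPolicy, n NumaPolicy) string {
	return string(p) + "-" + string(n)
}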
@archlitchi @wawa0210 PTAL.
Test Cluster: one cluster with two nodes, and two GPU devices per node.

Test Case: Node binpack policy, GPU binpack policy

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    hami.io/node-scheduler-policy: "binpack"
    hami.io/gpu-scheduler-policy: "binpack"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "binpack"
    hami.io/gpu-scheduler-policy: "binpack"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10

Test Result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-pod 1/1 Running 0 6m55s 10.233.74.99 controller-node-1 <none> <none>
gpu-pod1 1/1 Running 0 6m55s 10.233.74.114 controller-node-1 <none> <none>
$ kubectl get pods gpu-pod -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-e441928e-e386-c020-4f78-dddd4debb238,NVIDIA,1000,10:;
$ kubectl get pods gpu-pod1 -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-e441928e-e386-c020-4f78-dddd4debb238,NVIDIA,1000,10:;

Test Case: Node binpack policy, GPU spread policy

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    hami.io/node-scheduler-policy: "binpack"
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "binpack"
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10

Test Result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-pod 1/1 Running 0 2m13s 10.233.84.237 worker-node-1 <none> <none>
gpu-pod1 1/1 Running 0 2m13s 10.233.84.198 worker-node-1 <none> <none>
$ kubectl get pods gpu-pod -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-a784a920-1cc2-5aee-072f-6d4ea477e2b4,NVIDIA,1000,10:;
$ kubectl get pods gpu-pod1 -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-ebe7c3f7-303d-558d-435e-99a160631fe4,NVIDIA,1000,10:;

Test Case: Node spread policy, GPU binpack policy

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    hami.io/node-scheduler-policy: "spread"
    hami.io/gpu-scheduler-policy: "binpack"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread"
    hami.io/gpu-scheduler-policy: "binpack"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
  annotations:
    hami.io/node-scheduler-policy: "spread"
    hami.io/gpu-scheduler-policy: "binpack"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10

Test Result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-pod 1/1 Running 0 93s 10.233.74.83 controller-node-1 <none> <none>
gpu-pod1 1/1 Running 0 93s 10.233.84.247 worker-node-1 <none> <none>
gpu-pod2 1/1 Running 0 93s 10.233.74.68 controller-node-1 <none> <none>
$ kubectl get pods gpu-pod -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-e441928e-e386-c020-4f78-dddd4debb238,NVIDIA,1000,10:;
$ kubectl get pods gpu-pod1 -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-ebe7c3f7-303d-558d-435e-99a160631fe4,NVIDIA,1000,10:;
$ kubectl get pods gpu-pod2 -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-e441928e-e386-c020-4f78-dddd4debb238,NVIDIA,1000,10:;

Test Case: Node spread policy, GPU spread policy

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    hami.io/node-scheduler-policy: "spread"
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread"
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
  annotations:
    hami.io/node-scheduler-policy: "spread"
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
    - name: ubuntu-container
      image: chrstnhntschl/gpu_burn
      args:
        - "6000"
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10

Test Result:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-pod 1/1 Running 0 25s 10.233.74.125 controller-node-1 <none> <none>
gpu-pod1 1/1 Running 0 25s 10.233.84.241 worker-node-1 <none> <none>
gpu-pod2 1/1 Running 0 25s 10.233.74.127 controller-node-1 <none> <none>
$ kubectl get pods gpu-pod -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-70a7e30d-99a5-1117-8e85-759a592fb582,NVIDIA,1000,10:;
$ kubectl get pods gpu-pod1 -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-a784a920-1cc2-5aee-072f-6d4ea477e2b4,NVIDIA,1000,10:;
$ kubectl get pods gpu-pod2 -o jsonpath="{.metadata.annotations['hami\.io/vgpu-devices-allocated']}"
GPU-e441928e-e386-c020-4f78-dddd4debb238,NVIDIA,1000,10:;
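For context on the behavior observed above (binpack packs pods onto the same node and GPU, spread fans them out), here is a minimal Go sketch of how a per-device score could be computed under the two policies. The names deviceUsage and scoreDevice are illustrative assumptions, not HAMi's actual implementation.

package policy

// deviceUsage is an illustrative view of one GPU's current allocation state.
type deviceUsage struct {
	UsedCores  int32 // cores already allocated, out of TotalCores
	TotalCores int32
}

// scoreDevice returns a score in [0, 100]; higher is preferred.
// Under "binpack" the most-used device scores highest, so requests pile onto
// the same GPU (as in the binpack/binpack case); under "spread" the least-used
// device scores highest, so requests fan out across GPUs and nodes.
func scoreDevice(policy string, d deviceUsage) int32 {
	if d.TotalCores == 0 {
		return 0
	}
	utilization := d.UsedCores * 100 / d.TotalCores
	switch policy {
	case "binpack":
		return utilization
	case "spread":
		return 100 - utilization
	default:
		return 0
	}
}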
issue: #141