SKS-1978: Add 'cluster-api/accelerator' label to GPU node to support autoscaling #154

huaqing1994 · 2023-11-01T05:54:04Z

Problem

[SKS-1978] Autoscaling support for GPU resources - Jira
slack: https://smartx1.slack.com/archives/C037V8QMKQA/p1698722686956959
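After a pod requesting a GPU triggered a scale-up of a GPU node, the new node was scaled down before its GPU resources became ready.
The cause is that the Node was already Ready, but CA (Cluster Autoscaler) did not know that it was a GPU node.
Following https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/README.md#special-note-on-gpu-instances, the node needs a label so that CA knows the node is still preparing its GPU resources.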

Changes

Add the label cluster-api/accelerator to nodes that have GPUs. The value is the type of the first GPU or vGPU.
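The labeling rule above can be sketched in Go as follows. This is only an illustration, not the code merged in this PR: the `gpuDevice` and `vgpuDevice` types, the function names, and the choice to prefer a passthrough GPU over a vGPU when both are present are all assumptions made for the example. Only the label key `cluster-api/accelerator` comes from the description.

```go
package main

import "fmt"

// acceleratorLabel is the label key named in the PR description.
const acceleratorLabel = "cluster-api/accelerator"

// gpuDevice and vgpuDevice are hypothetical stand-ins for the GPU and vGPU
// entries attached to a machine; the real types in the repository may differ.
type gpuDevice struct{ Model string }
type vgpuDevice struct{ Type string }

// acceleratorLabelValue returns the type of the first GPU, or of the first
// vGPU if there is no GPU. The GPU-before-vGPU order is an assumption.
// The boolean is false when the machine has no accelerator at all.
func acceleratorLabelValue(gpus []gpuDevice, vgpus []vgpuDevice) (string, bool) {
	if len(gpus) > 0 {
		return gpus[0].Model, true
	}
	if len(vgpus) > 0 {
		return vgpus[0].Type, true
	}
	return "", false
}

// withAcceleratorLabel returns a copy of labels with the accelerator label
// added. Nodes without a GPU or vGPU get their labels back unchanged.
func withAcceleratorLabel(labels map[string]string, gpus []gpuDevice, vgpus []vgpuDevice) map[string]string {
	value, ok := acceleratorLabelValue(gpus, vgpus)
	if !ok {
		return labels
	}
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out[acceleratorLabel] = value
	return out
}

func main() {
	// "example-gpu-model" is a placeholder value, not a real device name.
	labels := withAcceleratorLabel(
		map[string]string{"kubernetes.io/os": "linux"},
		[]gpuDevice{{Model: "example-gpu-model"}},
		nil,
	)
	fmt.Println(labels[acceleratorLabel])
}
```

A real implementation would probably also need to make sure the device type is a valid Kubernetes label value (at most 63 characters, limited to alphanumerics, `-`, `_` and `.`), which this sketch does not attempt.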

Testing

Created a GPU node. Shortly after the Node was created, it received the cluster-api/accelerator label:

Triggered a GPU scale-up:
The new node also has the cluster-api/accelerator label after creation:

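The CA log shows that this node's GPU resources are not ready yet. The pending pod can be scheduled onto the upcoming node that CA simulates, so neither a scale-down nor an additional scale-up is triggered: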
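The behaviour seen in the CA log can be modelled roughly as below. This is a simplified sketch of the idea, not Cluster Autoscaler's actual code: the `simNode` type and the function name are invented for the example, and `nvidia.com/gpu` is assumed to be the resource the device plugin advertises. The point is that a node carrying the accelerator label but exposing no allocatable GPU yet is treated as still initializing rather than as an idle node.

```go
package main

import "fmt"

const (
	// gpuLabelKey is the label added by this PR.
	gpuLabelKey = "cluster-api/accelerator"
	// gpuResourceName is an assumption about the advertised GPU resource.
	gpuResourceName = "nvidia.com/gpu"
)

// simNode is a hypothetical, minimal view of a Kubernetes Node.
type simNode struct {
	Labels      map[string]string
	Allocatable map[string]int64
}

// gpuStillInitializing reports whether the node is labeled as a GPU node
// but does not yet advertise any allocatable GPU.
func gpuStillInitializing(n simNode) bool {
	_, labeled := n.Labels[gpuLabelKey]
	return labeled && n.Allocatable[gpuResourceName] == 0
}

func main() {
	// "example-gpu-model" is a placeholder label value.
	fresh := simNode{
		Labels:      map[string]string{gpuLabelKey: "example-gpu-model"},
		Allocatable: map[string]int64{"cpu": 8},
	}
	fmt.Println(gpuStillInitializing(fresh)) // true: labeled, no GPU advertised yet

	// The device plugin starts advertising the GPU.
	fresh.Allocatable[gpuResourceName] = 1
	fmt.Println(gpuStillInitializing(fresh)) // false: GPU is allocatable
}
```

Without the label, the first check would return false, which corresponds to the original problem: the node looked like an ordinary Ready node with nothing scheduled on it.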

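Eventually the GPU resources became ready, the pod was scheduled successfully, and the scale-up flow completed without problems. Events:

Deleted the GPU workload Pod, and the scale-down succeeded: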


codecov · 2023-11-01T06:14:27Z

Codecov Report

Merging #154 (1e93324) into master (fe30437) will decrease coverage by 0.18%.
Report is 1 commits behind head on master.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master     #154      +/-   ##
==========================================
- Coverage   56.35%   56.18%   -0.18%     
==========================================
  Files          17       18       +1     
  Lines        3208     3257      +49     
==========================================
+ Hits         1808     1830      +22     
- Misses       1244     1270      +26     
- Partials      156      157       +1

| Files | Coverage Δ |
|---|---|
| pkg/util/labels/helpers.go | `35.13% <100.00%> (ø)` |
| controllers/elfmachine_controller.go | `72.90% <81.81%> (+0.02%)` ⬆️ |


SKS-1978: Add 'cluster-api/accelerator' label to GPU node to support …

e28d19f

…autoscaling

huaqing1994 requested review from jessehu and haijianyang November 1, 2023 05:54

Add comment

b0d38fb

jessehu reviewed Nov 1, 2023


controllers/elfmachine_controller.go (resolved)

pkg/util/labels/helpers.go (resolved)

jessehu approved these changes Nov 1, 2023


add comment and UT

1e93324

haijianyang approved these changes Nov 1, 2023


haijianyang merged commit b1f986d into smartxworks:master Nov 1, 2023
2 of 3 checks passed
