SKS-1978: Add 'cluster-api/accelerator' label to GPU node to support autoscaling #154

Merged: 3 commits, Nov 1, 2023

huaqing1994
Contributor

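Problem

[SKS-1978] Autoscaling support for GPU resources - Jira
slack: https://smartx1.slack.com/archives/C037V8QMKQA/p1698722686956959
After a pod requesting a GPU triggered a scale-up that added a GPU node, the new node was scaled down before its GPU resources became ready.
The cause: the Node was already Ready, but CA (cluster-autoscaler) did not know it was a GPU node.
The node needs to be labeled as described in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/README.md#special-note-on-gpu-instances so that CA knows the node is still preparing its GPU resources.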
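
To make the failure mode concrete, here is a minimal sketch of the kind of check the label enables. It is not CA's actual code: the `node` type and the `gpuStillInitializing` helper are hypothetical, and the `nvidia.com/gpu` resource name is an assumption about which device plugin is in use. The idea is that a node carrying the GPU label but advertising no allocatable GPU can be treated as still initializing rather than as an idle, removable node.

```go
package main

import "fmt"

// gpuLabel is the label named in this PR and in the clusterapi provider README.
const gpuLabel = "cluster-api/accelerator"

// gpuResource is an assumption: the extended resource name typically
// advertised by the NVIDIA device plugin.
const gpuResource = "nvidia.com/gpu"

// node is a hypothetical, simplified stand-in for a Kubernetes Node object.
type node struct {
	Labels      map[string]string
	Allocatable map[string]int64
}

// gpuStillInitializing reports whether a node is labeled as a GPU node
// but does not yet advertise any allocatable GPU.
func gpuStillInitializing(n node) bool {
	if _, labeled := n.Labels[gpuLabel]; !labeled {
		// Without the label there is no hint that a GPU is expected.
		return false
	}
	// A missing key reads as zero, which also counts as "not ready yet".
	return n.Allocatable[gpuResource] == 0
}

func main() {
	unlabeled := node{}
	warmingUp := node{Labels: map[string]string{gpuLabel: "example-gpu"}}
	ready := node{
		Labels:      map[string]string{gpuLabel: "example-gpu"},
		Allocatable: map[string]int64{gpuResource: 1},
	}
	fmt.Println(gpuStillInitializing(unlabeled), gpuStillInitializing(warmingUp), gpuStillInitializing(ready))
}
```

In this sketch the unlabeled node is indistinguishable from an ordinary node with nothing scheduled on it, which matches the premature scale-down described above.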

Changes

  • Label nodes that have GPUs with cluster-api/accelerator; the value is the type of the first GPU or vGPU.
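
The labeling rule above could be sketched roughly as follows. This is an illustration, not the code merged in this PR: the `gpuDevice` and `vgpuDevice` types, their fields, and both helper functions are hypothetical, and checking physical GPUs before vGPUs is an assumed reading of "first GPU or vGPU".

```go
package main

import "fmt"

// acceleratorLabel is the label key named in this PR.
const acceleratorLabel = "cluster-api/accelerator"

// gpuDevice and vgpuDevice are hypothetical stand-ins for the machine's
// configured passthrough GPUs and vGPUs.
type gpuDevice struct{ Model string }
type vgpuDevice struct{ Type string }

// acceleratorLabelValue returns the type of the first GPU, or failing that
// the first vGPU. The second result is false when the machine has neither.
// Assumption: physical GPUs take precedence over vGPUs.
func acceleratorLabelValue(gpus []gpuDevice, vgpus []vgpuDevice) (string, bool) {
	if len(gpus) > 0 {
		return gpus[0].Model, true
	}
	if len(vgpus) > 0 {
		return vgpus[0].Type, true
	}
	return "", false
}

// withAcceleratorLabel returns a copy of labels with the accelerator label
// set when the machine has a GPU or vGPU; otherwise the copy is unchanged.
func withAcceleratorLabel(labels map[string]string, gpus []gpuDevice, vgpus []vgpuDevice) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	if value, ok := acceleratorLabelValue(gpus, vgpus); ok {
		out[acceleratorLabel] = value
	}
	return out
}

func main() {
	labels := withAcceleratorLabel(
		map[string]string{"kubernetes.io/os": "linux"},
		nil,
		[]vgpuDevice{{Type: "example-vgpu-type"}},
	)
	fmt.Println(labels[acceleratorLabel])
}
```

Kubernetes restricts label values to at most 63 characters drawn from alphanumerics, `-`, `_` and `.`, so a raw device name may need normalizing before use; the sketch leaves that step out.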

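Testing

Created a GPU node. Shortly after the Node was created, it was labeled with cluster-api/accelerator:
[screenshot]
Triggered a GPU scale-up.
After the new node was created, it also carried the cluster-api/accelerator label:
[screenshot]
The CA logs show that the node's GPU resources were not ready yet, and that the pending pod could be scheduled onto the upcoming node that CA simulates during scale-up, so neither a scale-down nor an extra scale-up was triggered:
[screenshot]
Eventually the GPU resources became ready, the pod was scheduled successfully, and the scale-up completed cleanly. Events:
[screenshot]
Deleted the GPU workload Pod; the scale-down succeeded:
[screenshot]
[screenshot]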

codecov bot commented Nov 1, 2023

Codecov Report

Merging #154 (1e93324) into master (fe30437) will decrease coverage by 0.18%.
Report is 1 commit behind head on master.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master     #154      +/-   ##
==========================================
- Coverage   56.35%   56.18%   -0.18%     
==========================================
  Files          17       18       +1     
  Lines        3208     3257      +49     
==========================================
+ Hits         1808     1830      +22     
- Misses       1244     1270      +26     
- Partials      156      157       +1     
Files Coverage Δ
pkg/util/labels/helpers.go 35.13% <100.00%> (ø)
controllers/elfmachine_controller.go 72.90% <81.81%> (+0.02%) ⬆️

@haijianyang haijianyang merged commit b1f986d into smartxworks:master Nov 1, 2023
2 of 3 checks passed