SKS-1978: Add 'cluster-api/accelerator' label to GPU node to support autoscaling #154
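Problem
[SKS-1978] Autoscaling support for GPU resources - Jira
slack: https://smartx1.slack.com/archives/C037V8QMKQA/p1698722686956959
After a pod requesting a GPU triggered a scale-up of GPU nodes, the new node was scaled down before its GPU resources became ready.
The cause is that the Node was already Ready, but CA (Cluster Autoscaler) did not know that it was a GPU node.
The node needs to be labeled as described in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/README.md#special-note-on-gpu-instances so that CA knows the node is still preparing its GPU resources.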
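The effect of the label can be illustrated with a simplified model. This is a sketch only, not the autoscaler's actual code: it assumes the GPU resource is advertised as nvidia.com/gpu, and that a node carrying the label but reporting no allocatable GPU is treated as still preparing rather than as an idle node. The function name is hypothetical.

```python
GPU_LABEL = "cluster-api/accelerator"
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name; other device plugins use other names


def gpu_still_preparing(labels, allocatable):
    """Return True if a node is marked as a GPU node but exposes no GPU yet.

    labels:      the node's labels (dict of str -> str)
    allocatable: the node's allocatable resources (dict of str -> str quantity)
    """
    if GPU_LABEL not in labels:
        # Without the label nothing identifies the node as a GPU node,
        # so a missing GPU resource is not interpreted as "still preparing".
        return False
    # Whole-number quantities such as "0" or "1" are assumed here.
    return int(allocatable.get(GPU_RESOURCE, "0")) == 0
```

In this model an unlabeled node with no GPU yet looks like an ordinary idle node, which matches the premature scale-down described above.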
Changes
Add the 'cluster-api/accelerator' label to GPU nodes, with its value set to the type of the node's first GPU or vGPU.
Testing
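When checking a test node, the expected label can be derived from its device list. The following is a minimal sketch under stated assumptions: devices are represented as dicts with a hypothetical "type" key, listed in order, and nodes without any GPU or vGPU are assumed not to carry the label. The provider's real data model is not shown in this PR description.

```python
ACCELERATOR_LABEL = "cluster-api/accelerator"


def expected_accelerator_label(devices):
    """Return the label expected on a node, given its GPU/vGPU devices.

    devices: ordered list of dicts with a hypothetical "type" key.
    """
    if not devices:
        return {}  # assumption: non-GPU nodes get no accelerator label
    # The value is the type of the first GPU or vGPU on the node.
    return {ACCELERATOR_LABEL: devices[0]["type"]}


def node_has_expected_label(node_labels, devices):
    """Compare a node's actual labels against the expected accelerator label."""
    expected = expected_accelerator_label(devices)
    if not expected:
        return ACCELERATOR_LABEL not in node_labels
    return node_labels.get(ACCELERATOR_LABEL) == expected[ACCELERATOR_LABEL]
```

Because only the first device is considered, a node with mixed device types still gets a single label value in this sketch.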
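Created a GPU node. Shortly after the Node was created, it received the 'cluster-api/accelerator' label:
Triggered a GPU scale-up:
The new node also has the 'cluster-api/accelerator' label after creation:
The CA log shows that the node's GPU resources are not ready yet. The pending pod can be scheduled onto the upcoming node that CA simulates, so neither a scale-down nor an additional scale-up is triggered:
Eventually the GPU resources become ready, the pod is scheduled successfully, and the scale-up completes without problems. Events:
Deleted the GPU workload Pod, and the scale-down succeeded: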