Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SKS-2344: Clean unused CloudTower labels created by CAPE every 24h #167

Merged
merged 3 commits into from
Jan 12, 2024

Conversation

haijianyang
Copy link
Contributor

@haijianyang haijianyang commented Jan 5, 2024

Issue SKS-2344

当前 Tower 每个标签都有一个虚拟机引用计数,这个计数是异步同步近实时更新的。
例如删除一个标签关联的虚拟机之后,标签的计数更新时间可能在 1 秒内,也可能几秒之后。

当前 CAPE 删除 ClusterNameLabel 和 NamespaceLabel 的时候会判断标签的虚拟机引用计数是否大于等于 0,
只有引用计数大于等于 0 的时候才会删除该标签。

因此,CAPE 在删除 ClusterNameLabel 和 NamespaceLabel 的时候会存在遗删的可能。

Change

定期清理(每天一次)没有被使用的 Tower 标签。

为了避免误删在使用的标签,清理标签的时候增加只能清理一天前创建的限制。

Test

程序运行之前,Tower 有很多一天前创建的标签
image

程序运行之后,新建一个集群,观察到只剩下当前集群使用的标签
image

@haijianyang haijianyang requested a review from jessehu January 5, 2024 06:42
Copy link

codecov bot commented Jan 5, 2024

Codecov Report

Attention: 26 lines in your changes are missing coverage. Please review.

Comparison is base (fb5730b) 59.46% compared to head (97bf16d) 59.61%.

Files Patch % Lines
pkg/service/vm.go 0.00% 19 Missing ⚠️
controllers/vm_limiter.go 80.00% 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
+ Coverage   59.46%   59.61%   +0.14%     
==========================================
  Files          20       20              
  Lines        3580     3655      +75     
==========================================
+ Hits         2129     2179      +50     
- Misses       1304     1328      +24     
- Partials      147      148       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return "", nil
}

label := getLabelResp.Payload[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个label 可能并不一定关联的只有VM,只用label.Vms > 0 并不可靠?

Copy link
Contributor Author

@haijianyang haijianyang Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果后续关联了其他的,再加上是否关联其他的就行。

关于并发场景误删:因为标签本身也存在被人为手动误删的可能,所以无论如何系统都需要处理标签丢失补回来的情况。

  1. 对于 CAPE 要避免误删,可以给标签加锁,就可以避免并发操作的问题。
  2. 可以在标签被删除后,ElfMachine reconcile 的时候发现标签没有了就会补回来。

另外,原来通过 VMNum 判断是否还有虚拟机引用,也存在上述问题。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 标签本身也存在被人为手动误删的可能。这是用户的误操作导致的,我们的代码不应该出现这种误操作。
  • 对于 CAPE 要避免误删,可以给标签加锁,就可以避免并发操作的问题。<< 这个怎么加锁?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

原来通过 VMNum 判断是否还有虚拟机引用,也存在上述问题。 <<< 这个要改成 没有任何对象引用此label。 GraphQL是能返回这个总计数的。

Copy link
Contributor Author

@haijianyang haijianyang Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

给某个标签加锁,对所有标签的操作同时只能有一个操作可以进行。例如给虚拟机添加标签的时候不能删除该标签,删除标签的时候不能给虚拟机添加该标签。使用目前 CAPE 的内存锁就能实现。

“这个要改成 没有任何对象引用此label” 指的是?
“GraphQL是能返回这个总计数的” 指的是?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

图片

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haijianyang 这个 total_num 就是引用了此label 的对象个数 ?
标签 - CloudTower
图片

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total_num 是引用标签的所有资源(包括虚拟机)计数,跟 VMNum 一样,也是异步统计的,也是不准确的。

pkg/service/vm.go Show resolved Hide resolved
pkg/service/vm.go Show resolved Hide resolved
@haijianyang haijianyang changed the base branch from release-1.3 to master January 9, 2024 08:08
@haijianyang haijianyang requested a review from jessehu January 9, 2024 11:31
@jessehu jessehu changed the title SKS-2344: Fix deleting Tower labels when cluster is deleted SKS-2344: Clean unused CloudTower labels created by CAPE every 24h Jan 11, 2024
// it will not be retried and will be started again in the next reconcile.
func (r *ElfClusterReconciler) cleanLabels(ctx *context.ClusterContext) {
// Locking ensures that only one coroutine cleans at the same time
if ok := acquireTicketForGCTowerLabels(ctx.ElfCluster.Spec.Tower.Server); ok {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个acquireTicket实际是 acquireLock 吧?可以改成Lock

return
}

ctx.Logger.V(1).Info(fmt.Sprintf("Cleaning labels for Tower %s", ctx.ElfCluster.Spec.Tower.Server))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个改成 "Cleaning orphan labels in Tower %s created by CAPE"


ctx.Logger.V(1).Info(fmt.Sprintf("Cleaning labels for Tower %s", ctx.ElfCluster.Spec.Tower.Server))

keys := []string{towerresources.GetVMLabelClusterName(), towerresources.GetVMLabelVIP(), towerresources.GetVMLabelNamespace(), towerresources.GetVMLabelManaged()}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

towerresources.GetVMLabelManaged 这个不用清理,至多有一个。

keys := []string{towerresources.GetVMLabelClusterName(), towerresources.GetVMLabelVIP(), towerresources.GetVMLabelNamespace(), towerresources.GetVMLabelManaged()}
labelIDs, err := ctx.VMService.CleanLabels(keys)
if err != nil {
ctx.Logger.Error(err, fmt.Sprintf("failed to clean labels for Tower %s", ctx.ElfCluster.Spec.Tower.Server))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里改成 “Warning: failed to clean orphan labels in Tower %s" ,是一个可忽略的error。

@@ -298,6 +324,8 @@ func (r *ElfClusterReconciler) reconcileNormal(ctx *context.ClusterContext) (rec
return reconcile.Result{}, nil
}

r.cleanLabels(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanLabels -> cleanOrphanLabels

@haijianyang haijianyang requested a review from jessehu January 12, 2024 03:01
Copy link
Collaborator

@jessehu jessehu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@haijianyang haijianyang merged commit f6ab21f into smartxworks:master Jan 12, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants