-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SKS-2344: Clean unused CloudTower labels created by CAPE every 24h #167
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #167 +/- ##
==========================================
+ Coverage 59.46% 59.61% +0.14%
==========================================
Files 20 20
Lines 3580 3655 +75
==========================================
+ Hits 2129 2179 +50
- Misses 1304 1328 +24
- Partials 147 148 +1 ☔ View full report in Codecov by Sentry. |
pkg/service/vm.go
Outdated
return "", nil | ||
} | ||
|
||
label := getLabelResp.Payload[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个label 可能并不一定关联的只有VM,只用label.Vms > 0 并不可靠?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果后续关联了其他的,再加上是否关联其他的就行。
关于并发场景误删:因为标签本身也存在被人为手动误删的可能,所以无论如何系统都需要处理标签丢失补回来的情况。
- 对于 CAPE 要避免误删,可以给标签加锁,就可以避免并发操作的问题。
- 可以在标签被删除后,ElfMachine reconcile 的时候发现标签没有了就会补回来。
另外,原来通过 VMNum 判断是否还有虚拟机引用,也存在上述问题。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 标签本身也存在被人为手动误删的可能。这是用户的误操作导致的,我们的代码不应该出现这种误操作。
- 对于 CAPE 要避免误删,可以给标签加锁,就可以避免并发操作的问题。<< 这个怎么加锁?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
原来通过 VMNum 判断是否还有虚拟机引用,也存在上述问题。 <<< 这个要改成 没有任何对象引用此label。 GraphQL是能返回这个总计数的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
给某个标签加锁,对所有标签的操作同时只能有一个操作可以进行。例如给虚拟机添加标签的时候不能删除该标签,删除标签的时候不能给虚拟机添加该标签。使用目前 CAPE 的内存锁就能实现。
“这个要改成 没有任何对象引用此label” 指的是?
“GraphQL是能返回这个总计数的” 指的是?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haijianyang 这个 total_num 就是引用了此label 的对象个数 ?
标签 - CloudTower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_num 是引用标签的所有资源(包括虚拟机)计数,跟 VMNum 一样,也是异步统计的,也是不准确的。
controllers/elfcluster_controller.go
Outdated
// it will not be retried and will be started again in the next reconcile. | ||
func (r *ElfClusterReconciler) cleanLabels(ctx *context.ClusterContext) { | ||
// Locking ensures that only one coroutine cleans at the same time | ||
if ok := acquireTicketForGCTowerLabels(ctx.ElfCluster.Spec.Tower.Server); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个acquireTicket实际是 acquireLock 吧?可以改成Lock
controllers/elfcluster_controller.go
Outdated
return | ||
} | ||
|
||
ctx.Logger.V(1).Info(fmt.Sprintf("Cleaning labels for Tower %s", ctx.ElfCluster.Spec.Tower.Server)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个改成 "Cleaning orphan labels in Tower %s created by CAPE"
controllers/elfcluster_controller.go
Outdated
|
||
ctx.Logger.V(1).Info(fmt.Sprintf("Cleaning labels for Tower %s", ctx.ElfCluster.Spec.Tower.Server)) | ||
|
||
keys := []string{towerresources.GetVMLabelClusterName(), towerresources.GetVMLabelVIP(), towerresources.GetVMLabelNamespace(), towerresources.GetVMLabelManaged()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
towerresources.GetVMLabelManaged 这个不用清理,至多有一个。
controllers/elfcluster_controller.go
Outdated
keys := []string{towerresources.GetVMLabelClusterName(), towerresources.GetVMLabelVIP(), towerresources.GetVMLabelNamespace(), towerresources.GetVMLabelManaged()} | ||
labelIDs, err := ctx.VMService.CleanLabels(keys) | ||
if err != nil { | ||
ctx.Logger.Error(err, fmt.Sprintf("failed to clean labels for Tower %s", ctx.ElfCluster.Spec.Tower.Server)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里改成 “Warning: failed to clean orphan labels in Tower %s" ,是一个可忽略的error。
controllers/elfcluster_controller.go
Outdated
@@ -298,6 +324,8 @@ func (r *ElfClusterReconciler) reconcileNormal(ctx *context.ClusterContext) (rec | |||
return reconcile.Result{}, nil | |||
} | |||
|
|||
r.cleanLabels(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanLabels -> cleanOrphanLabels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Issue SKS-2344
当前 Tower 每个标签都有一个虚拟机引用计数,这个计数是异步同步近实时更新的。
例如删除一个标签关联的虚拟机之后,标签的计数更新时间可能在 1 秒内,也可能几秒之后。
当前 CAPE 删除 ClusterNameLabel 和 NamespaceLabel 的时候会判断标签的虚拟机引用计数是否大于等于 0,
只有引用计数大于等于 0 的时候才会删除该标签。
因此,CAPE 在删除 ClusterNameLabel 和 NamespaceLabel 的时候会存在遗删的可能。
Change
定期清理(每天一次)没有被使用的 Tower 标签。
为了避免误删在使用的标签,清理标签的时候增加只能清理一天前创建的限制。
Test
程序运行之前,Tower 有很多一天前创建的标签
程序运行之后,新建一个集群,观察到只剩下当前集群使用的标签