Script for defragmentation #15477
Here is one possible solution: https://github.com/ugur99/etcd-defrag-cronjob
Thanks for raising the discussion on this. One complication is that while Kubernetes is the largest user of etcd, it is not the only one. With this in mind I think we would need to consider what would be best suited to sit under the etcd operations guide docs versus the Kubernetes etcd operations docs. It may make more sense for this issue, or a tandem issue, to be raised against the Kubernetes etcd operations docs?
I would like to have a solution (or documentation) for etcd.io first. I got bitten by outdated etcd docs on kubernetes.io once, and I think having docs in two places is confusing.
Hi @guettli, I think defragging the leader is equivalent to defragging a follower. Generally speaking, raft is not blocked by rewriting the db file. For example,
@chaochn47 thank you for your answer. What is your advice for defragmenting etcd? How do you handle it?
Hi @guettli, here is how I would suggest doing it: every couple of minutes, evaluate whether etcd should run defrag. It will run defrag if
It is guaranteed that defrag won't run on more than one node at any given time.
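For illustration, a minimal sketch of that periodic evaluation step using the Go client (`go.etcd.io/etcd/client/v3`); the endpoint, the 50% fragmentation ratio, and the 100 MiB floor are placeholder assumptions rather than values taken from the comment above:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// shouldDefrag reports whether an endpoint looks fragmented enough to be worth
// defragmenting: the gap between on-disk size and in-use size must exceed both
// a ratio and an absolute floor. The thresholds are illustrative only.
func shouldDefrag(ctx context.Context, cli *clientv3.Client, endpoint string) (bool, error) {
	st, err := cli.Status(ctx, endpoint)
	if err != nil {
		return false, err
	}
	wasted := st.DbSize - st.DbSizeInUse
	ratio := float64(wasted) / float64(st.DbSize)
	fmt.Printf("%s: dbSize=%d inUse=%d wasted=%d (%.0f%%)\n",
		endpoint, st.DbSize, st.DbSizeInUse, wasted, ratio*100)
	return ratio > 0.5 && wasted > 100*1024*1024, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	fmt.Println(shouldDefrag(ctx, cli, "http://127.0.0.1:2379"))
}
```

A cron or systemd timer could then run this check every few minutes and only trigger the actual defrag when it returns true.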
That's true. In OpenShift we recommend doing the leader last, because the additional IO/memory/cache churn can impact performance negatively. If a defrag takes down the leader, the other nodes are at least safely defragged already and can continue with the next election. We also do not defrag if any member is unhealthy. @guettli Are you looking for a simple bash script in etcd/contrib or something more official as part of the CLI?
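As a sketch of that ordering (followers first, leader last, and bail out if any member looks unhealthy), again with the Go client; treating a failing or error-reporting `Status` call as "unhealthy" is a simplification here, not necessarily what the OpenShift tooling checks:

```go
package defrag

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragLeaderLast defragments members one at a time, followers first and the
// leader last, and aborts the whole run if any member's Status call fails or
// reports errors. This is only a sketch of the ordering discussed above.
func defragLeaderLast(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	var leaderEp string
	var followerEps []string

	for _, ep := range endpoints {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			return fmt.Errorf("member %s looks unhealthy, skipping this defrag run: %w", ep, err)
		}
		if len(st.Errors) > 0 {
			return fmt.Errorf("member %s reports errors %v, skipping this defrag run", ep, st.Errors)
		}
		if st.Header.MemberId == st.Leader {
			leaderEp = ep
		} else {
			followerEps = append(followerEps, ep)
		}
	}
	if leaderEp == "" {
		return fmt.Errorf("no leader found among %v", endpoints)
	}

	// One member at a time: all followers first, then the leader.
	for _, ep := range append(followerEps, leaderEp) {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			return fmt.Errorf("defrag %s: %w", ep, err)
		}
	}
	return nil
}
```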
@tjungblu I don't have a preference about what the solution looks like. It could be a shell script, something added to etcdutil, or maybe just some docs. @chaochn47 explained the steps, but I am not familiar enough with etcd to write a corresponding script to implement this. I hope that someone with more knowledge of etcd can provide an executable solution.
Taking a quick look at how the existing defrag command is implemented: this would simplify any downstream implementations of defrag functionality, as each implementation would not have to reinvent how to prioritize the cluster-wide defrag, provided they were built on top of it. We could then update the website docs or add
@jmhbnz would it be possible to get this into etcdctl:
Then calling defragmentation does not need to be wrapped in a "dirty" shell.
Hey @guettli - I don't think we can build all of that into etcdctl. As mentioned, I do think we can solve the issue of completing defrag for members one at a time, doing the leader last, as a built-in approach in etcdctl. For some of the other requirements we have been working out in this issue, like scheduling or perhaps some of the monitoring-based checks, I think those will need to be handled as either documentation or additional resources elsewhere. @ahrtr, @serathius - Keen for maintainer input on this. If what I have suggested makes sense, feel free to assign this to me and I can work on it.
Apologies, removing my assignment for this as I am about to be traveling for several weeks and attending KubeCon, so I likely won't have much capacity for a while. If anyone else has capacity they are welcome to pick it up.
I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag altogether, instead of adding another feature/subproject that increases maintenance cost.
#9222 looks related. It seems like reducing bbolt fragmentation would be a third option in addition to option 1 and option 2 mentioned here - #9222 (comment). Has this been discussed before, and was there any conclusion on a preferred design approach? Should contributors interested in solving this start from scratch or build upon prior guidance? @serathius @ptabor /cc @chaochn47 @cenkalti
To me, it makes more sense to fix this at the BoltDB layer. By design (based on LMDB), the database should not require any maintenance operation. BoltDB has
From the K8s perspective, most fragmentation we see comes from Events; OpenShift also suffers from Images (CRDs for container image builds) on build-heavy clusters. On larger clusters we advise sharding those to another etcd instance within the cluster, but maybe we can offer some "ephemeral keys" that have more relaxed storage and consistency guarantees? Or that use a different storage backend than bbolt, e.g. rocksdb/leveldb (or anything LSM-based)...
@serathius this would be great. Having a cron job which defragments the non-leaders first, then the leader, is extra overhead, especially since there is no official version of such a script and people solve the same task again and again. Let me know if I can help somehow.
cc @ptabor who mentioned some ideas to limit bbolt fragmentation.
Note that I don't expect a bbolt-side change, at least in the near future, because we are still struggling to reproduce etcd-io/bbolt#402 and etcd-io/bbolt#446. I think it makes sense to provide an official reference (just a reference!) on how to perform defragmentation. The rough idea (on top of all the inputs in this thread, e.g. @tjungblu, @chaochn47, etc.) is:
Please also see Compaction & Defragmentation. I might spend some time to provide such a script for reference.
@ahrtr "I might spend some time to provide such a script for reference." An official script would really help here. The topic is too hot to let everybody re-solve this on their own.
I've been writing my own script for this. I guess my biggest question is:
It does not. Please take a look at the discussion in #15664. You could use etcd/server/storage/backend/backend.go, lines 450-451 (at commit 9e1abba).
It isn't correct. It waits for the defrag to finish before moving on to the next member. FYI, I am implementing a tool to do defragmentation. Hopefully the first version can be ready next week.
It does wait, but it times out after some duration. IIRC it's 30 seconds.
Yes, that's another story. The default command timeout is 5s. It's recommended to set a bigger value (e.g. 1m) for defragmentation, because it may take a long time to defragment a large DB. I don't have performance data for now on how much time it may need for different DB sizes.
The pattern I see is usually 10s per GB. @bradjones1320 you can set a larger timeout with --command-timeout.
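A small sketch of that sizing rule with the Go client, taking the current DB size from `Status` and applying the rough 10s-per-GB figure with a one-minute floor (both numbers are heuristics from this thread, not official guidance):

```go
package defrag

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragWithScaledTimeout derives the per-call timeout from the member's DB
// size, using roughly 10s per GB with a one-minute floor, then runs the defrag.
func defragWithScaledTimeout(cli *clientv3.Client, endpoint string) error {
	st, err := cli.Status(context.Background(), endpoint)
	if err != nil {
		return err
	}
	timeout := time.Duration(st.DbSize/(1<<30)+1) * 10 * time.Second
	if timeout < time.Minute {
		timeout = time.Minute
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_, err = cli.Defragment(ctx, endpoint)
	return err
}
```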
FYI: https://github.com/ahrtr/etcd-defrag. Just as I mentioned in #15477 (comment), the tool etcd-defrag,
When saying "stop-the-world", are you only referring to the following check (etcd/server/etcdserver/v3_server.go, lines 669-671 at commit 63c9fe1), or are there other reasons that might stop the world?
When etcdserver is processing a defragmentation, it can't serve any client requests; see etcd/server/storage/backend/backend.go, lines 456-465 (at commit 63c9fe1).
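One way to observe that blocking on a test member, sketched with the Go client; it assumes the client is configured against only the member being defragmented, and on a small or mostly empty DB the defrag may finish too quickly for the read to visibly stall:

```go
package defrag

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// observeBlockingDuringDefrag starts a defragmentation in the background and
// immediately issues a read with a short timeout against the same member. On a
// sufficiently large DB the read is expected to stall until the defrag finishes.
func observeBlockingDuringDefrag(cli *clientv3.Client, endpoint string) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		if _, err := cli.Defragment(context.Background(), endpoint); err != nil {
			fmt.Println("defrag error:", err)
		}
	}()

	time.Sleep(100 * time.Millisecond) // give the defrag a moment to start

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	start := time.Now()
	_, err := cli.Get(ctx, "some-key") // placeholder key
	fmt.Printf("Get took %v, err=%v\n", time.Since(start), err)

	<-done
}
```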
The main functionality of https://github.com/ahrtr/etcd-defrag is ready; the remaining work is to add more utilities (e.g. a Dockerfile, manifests for K8s, etc.). Please feel free to let me know if you have any suggestions or questions.
Could you share why defragmentation would cause leadership transfers? To my understanding, when the leader is processing a defragmentation, it blocks the system from reading and writing data; however, raft is not blocked, so defrag will not cause a leadership transfer. FYI, I did a test on a 3-node cluster. While defragging, the etcd leader node's health check failed, but there was no leader election. Test logic:
Test output:
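The test logic and output above did not survive extraction; purely as an illustration, here is a sketch of what such a check could look like with the Go client on a 3-member test cluster, polling the leader ID reported by a follower while the leader endpoint is being defragmented (endpoints and polling interval are assumptions, not the original test):

```go
package defrag

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchLeaderDuringDefrag defragments leaderEp in the background and, while it
// runs, polls the leader ID as reported by a follower, printing a message if
// the leadership changes. This only sketches the kind of experiment described
// above; it is not the exact test that was run.
func watchLeaderDuringDefrag(cli *clientv3.Client, leaderEp, followerEp string) {
	before, err := cli.Status(context.Background(), followerEp)
	if err != nil {
		fmt.Println("initial status error:", err)
		return
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		_, err := cli.Defragment(context.Background(), leaderEp)
		fmt.Println("defrag finished, err:", err)
	}()

	for {
		select {
		case <-done:
			return
		case <-time.After(500 * time.Millisecond):
			st, err := cli.Status(context.Background(), followerEp)
			if err != nil {
				fmt.Println("status check failed:", err)
				continue
			}
			if st.Leader != before.Leader {
				fmt.Printf("leader changed: %x -> %x\n", before.Leader, st.Leader)
			}
		}
	}
}
```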
It turns out that the leader doesn't stop the world while processing a defragmentation, because the apply workflow is executed asynchronously; see etcd/server/etcdserver/server.go, line 847 (at commit 63c9fe1).
Confirmed that it doesn't cause a leadership transfer no matter how long the leader is blocked processing a defragmentation. That should be considered an issue, and we should fix it. I have a pending PR, #15440; let me think about how to resolve them together.
Again, it's still recommended to run defragmentation on the leader last, because the leader has more responsibilities (e.g. sending snapshots) than followers; once it's blocked for a long time, none of the duties dedicated to the leader can be carried out. Please also read https://github.com/ahrtr/etcd-defrag
Since we already have https://github.com/ahrtr/etcd-defrag, can we close this ticket? @guettli FYI, I might formally release it.
A defrag call will not cause a leadership transfer by itself, but the resulting IO+CPU load might. Try again on a machine with a very slow disk or limited CPU. We've definitely seen this happening on loaded control planes.
Closing this, since https://github.com/ahrtr/etcd-defrag exists.
It would still be awesome to get built-in / official support for this, yeah? Or will https://github.com/ahrtr/etcd-defrag be the official solution?
@TechDufus good question. If you know the answer, please write it here in this issue. Thank you.
What would you like to be added?
I would like to see an official solution for how to defragment etcd.
AFAIK a one-line cron-job is not enough, since you should not defragment the current leader.
Related: #14975
Maybe it is enough to add a simple example script to the docs.
Why is this needed?
Defragmenting the leader can lead to performance degradation and should be avoided.
I don't think it makes sense that every company running etcd invents its own way to solve this.