feature: add new compactor based revision count #16426
Comments
The existing CompactorModeRevision can also meet your requirement. We just need to set the same value as the
I am afraid it isn't correct. Free pages can't be reused before they are reclaimed (by defragmentation).
Thanks @ahrtr for the quick comment.
The fact is that it does not. The existing revision compactor only runs every 5 minutes, and only compacts if enough new revisions have been produced. That's why I make this proposal. etcd/server/etcdserver/api/v3compactor/revision.go Lines 61 to 78 in 0d89fa7
If I understand correctly, defragmentation replays all the key/values into a new db file. The defrag function just goes through all the buckets and copies all the keys into the new db file. It doesn't delete anything, but the compactor does. Defrag is used to reduce the total db size when there are a lot of free pages. etcd/server/storage/backend/backend.go Lines 563 to 620 in 0d89fa7
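To make the difference concrete, here is a simplified, hypothetical sketch of what defragmentation amounts to (not etcd's actual defrag code from the backend.go lines above; nested buckets and batching are ignored): copy every bucket and key into a fresh bbolt file, dropping free pages but deleting nothing.

```go
// Conceptual sketch only: rewrite every top-level bucket and key/value pair
// from the current bbolt file into a fresh file, so free pages disappear
// while no keys are deleted. Paths are placeholders.
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func defragCopy(srcPath, dstPath string) error {
	src, err := bolt.Open(srcPath, 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := bolt.Open(dstPath, 0o600, nil)
	if err != nil {
		return err
	}
	defer dst.Close()

	return src.View(func(stx *bolt.Tx) error {
		return dst.Update(func(dtx *bolt.Tx) error {
			// Walk every top-level bucket and re-insert every key into the new file.
			return stx.ForEach(func(name []byte, b *bolt.Bucket) error {
				nb, err := dtx.CreateBucketIfNotExists(name)
				if err != nil {
					return err
				}
				return b.ForEach(func(k, v []byte) error {
					return nb.Put(k, v)
				})
			})
		})
	})
}

func main() {
	// Placeholder file names for illustration.
	if err := defragCopy("src.db", "dst.db"); err != nil {
		log.Fatal(err)
	}
}
```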
For example, set up a single etcd server from scratch and disable the auto compactor.
benchmark put --rate=100 --total=10000 --compact-interval=0 --key-space-size=3000 --key-size=256 --val-size=10240
We will get a total size of 136 MB and an in-use size of 123 MB:
etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 | 3.6.0 | 136 MB | 123 MB | true | false | 2 | 10004 | 10004 | |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
There are just a few MB of free pages, so after defragmentation the total size will be 123 MB:
etcdctl defrag dc1e6cd4c757f755 -w table
Finished defragmenting etcd member[127.0.0.1:2379]. took 2.995811265s
etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 | 3.6.0 | 123 MB | 123 MB | true | false | 2 | 10004 | 10004 | |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
It's expected, because defrag doesn't delete anything; it just replays all the key/values.
etcdctl get foo -w json
{"header":{"cluster_id":5358838169441251993,"member_id":15861234598778763093,"revision":10001,"raft_term":2}}
etcdctl compact 10001
compacted revision 10001
etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 | 3.6.0 | 124 MB | 36 MB | true | false | 2 | 10005 | 10005 | |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
After the compaction, the total free-page size is 88 MB (124 MB total minus 36 MB in use). Those free pages can actually be reused.
Run another benchmark:
benchmark put --rate=100 --total=1000 --compact-interval=0 --key-space-size=3000 --key-size=128 --val-size=10240
As the status output below shows, the in-use size increases from 36 MB to 48 MB while the total size stays at 124 MB, because the free pages are reused. NOTE: if there are not enough contiguous free pages, the total size will still grow.
etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 | 3.6.0 | 124 MB | 48 MB | true | false | 2 | 11005 | 11005 | |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
The compactor deletes out-of-date revisions, and the free pages can then be reused. Hope this comment makes it clear. :)
Not super convinced, but overall I don't understand the need for revision-based compaction, so I might be wrong. There is a tradeoff between the predictability of the compaction window and storage size, and I don't know if anyone would prefer storage size, especially for Kubernetes. If there is a burst of changes, you especially don't want to compact too fast, because it risks clients not being able to catch up. Fast compaction will lead to unsynced clients that need to download the whole state again. It seems like not a good change from a scalability standpoint. cc @mborsz @wojtek-t for opinions about revision-based compaction.
@fuweid sorry for not making it clear.
Thanks for the review @serathius @ahrtr. The pain point (at least to me) is that burstable traffic is unpredictable, so I am seeking a way to compact old revisions in time. It's a kind of revision garbage collector.
Yes. For this issue, we can consider introducing
Yes. Burst list-all-huge-dataset traffic is a nightmare.
Yes. But it requires us to GC the old revisions in time to reclaim the pages. Based on that, maybe the quota availability checker should consider the in-use size first and then the total size, since there will be free pages that can be reused after compaction. Looking forward to your feedback.
Not sure I understand your motivation. It's normal to overprovision disk. What kind of burst are you talking about? Are you expecting the data size to double in 5 minutes?
The burst is about too many PUT requests to kube-apiserver. I was facing issues where Argo Workflows submitted cronjobs and the etcd total db size increased by 2-3 GiB in 5 minutes. The number of pods was about 3,000 ~ 6,000, and each pod had one init container and two containers. There were at least 5 PUT requests to etcd for each pod. Most of the pods were short-lived, but the workflow jobs kept creating/deleting/re-creating them for hours. If compaction doesn't happen in time, bbolt expands very fast and then exceeds the quota.
Sorry for the unclear comment. My motivation is to make etcd clear old revisions as soon as possible so that it can reuse the free pages, which reduces downtime caused by the NO_SPACE alarm. Currently, once the etcd server detects that the bbolt db size exceeds the max quota, it raises the NO_SPACE alarm and goes into read-only mode. Even if the compactor reclaims enough contiguous pages to serve most upcoming requests, the etcd server still denies them. The server needs a defragmentation to bring the db size below the max quota, and then the NO_SPACE alarm has to be disarmed. However, there is no API, like Watch for key/value changes, to notify the operator or admin about the NO_SPACE alarm; the operator component has to poll the alarm list. In order to reduce the downtime caused by NO_SPACE, there are only two options for existing etcd releases:
So, I make this proposal to compact old revisions in time and keep the total db size smaller than the max quota size. Besides this proposal, maybe we can consider using the in-use size as the current usage, instead of the bbolt total db size.
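For context, here is a rough clientv3 sketch of the manual recovery flow described above (the endpoint, timeouts, and error handling are simplified placeholders; this is what an operator has to script today, not part of the proposal):

```go
// Sketch of today's manual NO_SPACE recovery: poll the alarm list, compact,
// defragment, then disarm the alarm so writes are accepted again.
package main

import (
	"context"
	"log"
	"time"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// 1. Poll the alarm list; there is no watch-style notification for alarms.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range resp.Alarms {
		if a.Alarm != pb.AlarmType_NOSPACE {
			continue
		}
		// 2. Compact up to the current revision to drop old versions.
		status, err := cli.Status(ctx, "127.0.0.1:2379")
		if err != nil {
			log.Fatal(err)
		}
		if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
			log.Fatal(err)
		}
		// 3. Defragment so the physical file shrinks below the quota.
		if _, err := cli.Defragment(ctx, "127.0.0.1:2379"); err != nil {
			log.Fatal(err)
		}
		// 4. Disarm the NO_SPACE alarm.
		if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{MemberID: a.MemberID, Alarm: a.Alarm}); err != nil {
			log.Fatal(err)
		}
	}
}
```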
We're also seeing those bursty workloads more and more often, e.g. with Argo and batch-style operations.
I wonder if we can make the MVCC layer and bbolt more "generational". What if we were to shard bbolt by revision range? Alternatively, have one "permgen"-style bbolt file where we keep historical revisions that don't change, while new writes go into a "newgen" file. We would occasionally copy older revisions into the permgen and recycle the newgen once fragmentation gets too big. Admittedly, we would lose the out-of-the-box transaction support of a single bbolt file, which is really neat.
+1
What would you like to be added?
Add a new compactor based on revision count, instead of a fixed time interval.
In order to make it happen, the mvcc store needs to export a CompactNotify function to notify the compactor that the configured number of write transactions has occurred since the previous compaction. The new compactor can then pick up the revision change and delete out-of-date data in time, instead of waiting for a fixed time interval, so the underlying bbolt db can reuse the free pages as soon as possible.
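A minimal sketch of how such a compactor might look (the CompactNotify channel, the constructor, and the simplified Compact signature are assumptions for illustration, not the actual etcd API; the interface names loosely mirror the existing v3compactor package):

```go
// Sketch of a revision-count-driven compactor: instead of waking up on a
// timer, it waits for a signal from the store and compacts immediately.
package compactor

import (
	"context"
	"log"
)

// RevGetter reports the store's current revision.
type RevGetter interface {
	Rev() int64
}

// Compactable can compact the keyspace up to a given revision (simplified).
type Compactable interface {
	Compact(ctx context.Context, rev int64) error
}

// RevisionCount compacts whenever the mvcc store signals that the configured
// number of write transactions has happened since the previous compaction.
type RevisionCount struct {
	notify <-chan struct{} // hypothetical CompactNotify channel exported by the store
	rg     RevGetter
	c      Compactable
	stopc  chan struct{}
}

func NewRevisionCount(notify <-chan struct{}, rg RevGetter, c Compactable) *RevisionCount {
	return &RevisionCount{notify: notify, rg: rg, c: c, stopc: make(chan struct{})}
}

// Run starts a loop that compacts up to the latest revision on every signal.
func (rc *RevisionCount) Run() {
	go func() {
		for {
			select {
			case <-rc.stopc:
				return
			case <-rc.notify: // N new revisions since the last compaction
			}
			rev := rc.rg.Rev()
			if err := rc.c.Compact(context.Background(), rev); err != nil {
				log.Printf("revision-count compaction at revision %d failed: %v", rev, err)
			}
		}
	}()
}

// Stop terminates the background loop.
func (rc *RevisionCount) Stop() { close(rc.stopc) }
```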
Why is this needed?
In a Kubernetes cluster, for instance with Argo Workflows, there will be batch requests to create pods, followed by a lot of PATCH requests for pod status, especially when a pod has more than 3 containers. If these burst requests grow the db size in a short time, it is easy to exceed the max quota size. The cluster admin then has to get involved to defragment, which may cause long downtime. So, we hope etcd can delete out-of-date data as soon as possible and slow down the growth of the total db size.
Currently, both the revision and periodic compaction modes are time-based. It is hard to pick a fixed interval that copes with unexpected bursts of update requests. A compactor based on revision count makes the admin's life easier. For instance, say the average object size is 50 KiB and the new compactor is configured to compact every 10,000 revisions: etcd then compacts after roughly 500 MiB of new data has come in, no matter how long it takes to produce those 10,000 revisions. That handles burst update requests well.
There are some test results:
For burst requests, we would need to use a short periodic interval; otherwise the total size becomes large. I think the new compactor can handle this well, and the cluster admin can configure it easily based on the payload size.
Additional Change:
Currently, the quota system only checks the total db size. However, there could be a lot of free pages which can be reused for upcoming requests. Based on this proposal, I also want to extend the current quota system with the db's in-use size.
If the in-use size is less than the max quota size, we should allow update requests. Since bbolt might still be resized when there are no available contiguous pages, we should set up a hard limit for the overflow, like 1 GiB.
And the NO_SPACE alarm could likely be disarmed if compaction reclaims enough free pages, which would reduce downtime.
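A rough sketch of the proposed quota check (the Backend stand-in and the 1 GiB overflow constant are illustrative, not existing etcd code):

```go
// Admit a write while the in-use size stays under the quota, but still cap
// how far the physical db file may grow past the quota.
package quota

const hardOverflowBytes = int64(1) << 30 // assumed 1 GiB hard limit on file growth

// Backend stands in for the storage backend's size accessors.
type Backend interface {
	Size() int64      // total bbolt file size, including free pages
	SizeInUse() int64 // bytes held by pages that actually contain data
}

// Available reports whether a request of reqBytes should be admitted under
// the proposed in-use-size-first policy.
func Available(b Backend, quotaBytes, reqBytes int64) bool {
	// Logical usage check: free pages left behind by compaction don't count.
	if b.SizeInUse()+reqBytes > quotaBytes {
		return false
	}
	// Physical growth check: bbolt may still need to grow the file when no
	// sufficiently large contiguous free region exists, so bound that growth.
	if b.Size()+reqBytes > quotaBytes+hardOverflowBytes {
		return false
	}
	return true
}
```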
Demo: #16427
cc @ahrtr @serathius @wenjiaswe @jmhbnz @chaochn47