feature: add new compactor based revision count #16426

Open
fuweid opened this issue Aug 16, 2023 · 9 comments

@fuweid
Member

fuweid commented Aug 16, 2023

What would you like to be added?

Add a new compactor based on revision count, instead of a fixed time interval.

To make this happen, the mvcc store needs to export a CompactNotify function to notify the compactor that the configured number of write transactions have occurred since the previous compaction. The new compactor can then observe the revision change and delete out-of-date data promptly, instead of waiting for a fixed time interval, so the underlying bbolt db can reuse the free pages as soon as possible.
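A minimal sketch of what such a notification hook could look like, assuming a hypothetical CompactNotify() on the mvcc store that exposes a channel signalled once the configured number of write transactions has been applied (names and signatures are illustrative, not the actual etcd API):

package mvcc

import "sync"

// Hypothetical sketch: the store counts applied write transactions and signals
// a channel once the configured threshold is reached since the previous
// compaction. The names (CompactNotify, writeCountThreshold) are illustrative.
type store struct {
	mu                  sync.Mutex
	writesSinceCompact  int
	writeCountThreshold int
	compactNotifyCh     chan struct{}
}

// CompactNotify lets a compactor wait for "enough writes have happened".
func (s *store) CompactNotify() <-chan struct{} {
	return s.compactNotifyCh
}

// notifyIfNeeded would be called at the end of every write transaction.
func (s *store) notifyIfNeeded() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.writesSinceCompact++
	if s.writesSinceCompact < s.writeCountThreshold {
		return
	}
	s.writesSinceCompact = 0
	select {
	case s.compactNotifyCh <- struct{}{}:
	default: // a notification is already pending; never block the apply path
	}
}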

Why is this needed?

In a kubernetes cluster, for instance one running argo workflow, there will be batch requests to create pods, followed by a lot of PATCH requests for pod status, especially when a pod has more than 3 containers. If the burst of requests grows the db size in a short time, it can easily exceed the max quota size. The cluster admin then has to get involved to defrag, which may cause long downtime. So we hope etcd can delete the out-of-date data as soon as possible and slow down the growth of the total db size.

Currently, both the revision and periodic compactors are time-based. A fixed time interval copes poorly with unexpected bursts of update requests. A compactor based on revision count can make the admin's life easier. For instance, say the average object size is 50 KiB and the new compactor compacts every 10,000 revisions: etcd then compacts after roughly 500 MiB of new data, no matter how long it takes to accumulate those 10,000 revisions. That handles burst update requests well.
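A rough sketch of how such a compactor could consume the notification (assuming the hypothetical CompactNotify channel above; the interfaces below are simplified stand-ins for etcd's compactor interfaces, not the real signatures):

package compactor

import (
	"context"
	"log"
)

// Simplified stand-ins for etcd's compactor interfaces; shapes are illustrative.
type revGetter interface{ Rev() int64 }
type compactable interface {
	Compact(ctx context.Context, rev int64) error
}

// RevisionCount compacts once `retention` new revisions have accumulated,
// driven by the store's notification channel instead of a timer.
type RevisionCount struct {
	retention int64           // revisions to keep, e.g. 10000
	notify    <-chan struct{} // e.g. the hypothetical store.CompactNotify()
	rg        revGetter
	c         compactable
}

func (rc *RevisionCount) Run(ctx context.Context) {
	var lastCompacted int64
	for {
		select {
		case <-ctx.Done():
			return
		case <-rc.notify:
		}
		rev := rc.rg.Rev() - rc.retention
		if rev <= 0 || rev <= lastCompacted {
			continue // not enough new revisions yet
		}
		if err := rc.c.Compact(ctx, rev); err != nil {
			log.Printf("revision-count compaction at rev %d failed: %v", rev, err)
			continue
		}
		lastCompacted = rev
	}
}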

There are some test results:

  • Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000

benchmark put --rate=100 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240

Compactor                      | DB Total Size | DB InUse Size
------------------------------ | ------------- | -------------
Revision(5min,retention:10000) | 570 MiB       | 208 MiB
Periodic(1m)                   | 232 MiB       | 165 MiB
Periodic(30s)                  | 151 MiB       | 127 MiB
NewRevision(retention:10000)   | 195 MiB       | 187 MiB
  • Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000

benchmark put --rate=150 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=1024

Compactor                      | DB Total Size | DB InUse Size
------------------------------ | ------------- | -------------
Revision(5min,retention:10000) | 718 MiB       | 554 MiB
Periodic(1m)                   | 297 MiB       | 246 MiB
Periodic(30s)                  | 185 MiB       | 146 MiB
NewRevision(retention:10000)   | 186 MiB       | 178 MiB
  • Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000

benchmark put --rate=200 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=4096

Compactor                      | DB Total Size | DB InUse Size
------------------------------ | ------------- | -------------
Revision(5min,retention:10000) | 874 MiB       | 221 MiB
Periodic(1m)                   | 357 MiB       | 260 MiB
Periodic(30s)                  | 215 MiB       | 151 MiB
NewRevision(retention:10000)   | 182 MiB       | 176 MiB

For burst requests, we currently need a short periodic interval; otherwise the total size grows large. I think the new compactor handles this well, and the cluster admin can configure it easily based on the payload size.

Additional Change:

Currently, the quota system only checks the DB total size. However, there can be a lot of free pages that could be reused for upcoming requests. Based on this proposal, I also want to extend the current quota system with the DB's InUse size.

If the InUse size is less than the max quota size, we should allow requests to update. Since bbolt may still be resized when there are no available contiguous pages, we should set a hard limit on the overflow, like 1 GiB.

 // Quota represents an arbitrary quota against arbitrary requests. Each request
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
                return true
        }
        // TODO: maybe optimize Backend.Size()
-       return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+       // Since compaction frees pages that can be reallocated, we should
+       // check SizeInUse first. If there are no contiguous pages for a
+       // key/value and boltdb keeps resizing, the file should not grow by
+       // more than 1 GiB. That is a hard limitation.
+       //
+       // TODO: It should be enabled by a flag.
+       if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+               return false
+       }
+       return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
 }
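The maxAllowedOverflowBytes helper referenced in the diff is not shown there. A minimal sketch of the intent, assuming the 1 GiB hard cap mentioned above (the exact policy in the demo PR may differ):

// maxAllowedOverflowBytes caps how far the physical bbolt file may grow past
// the configured quota while requests are still admitted based on SizeInUse.
// Sketch only: the 1 GiB hard limit follows the proposal text above.
func maxAllowedOverflowBytes(maxBackendBytes int64) int64 {
	const hardLimitBytes = int64(1) << 30 // 1 GiB
	// Allow a fraction of the quota as overflow, never exceeding the hard limit.
	overflow := maxBackendBytes / 10
	if overflow > hardLimitBytes {
		overflow = hardLimitBytes
	}
	return overflow
}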

And we could likely avoid the NO_SPACE alarm if compaction frees enough pages, which would reduce downtime.

Demo: #16427

cc @ahrtr @serathius @wenjiaswe @jmhbnz @chaochn47

@ahrtr
Member

ahrtr commented Aug 16, 2023

The existing CompactorModeRevision can also meet your requirement. We just need to set the same value as the revisionThreshold in your case.

However, there could be a lot of free pages which can be reused to upcoming requests. Based on this proposal, I also want to extend current quota system with DB's InUse size.

I am afraid it isn't correct. Free pages can't be reused before they are reclaimed (by defragmentation).

@fuweid
Member Author

fuweid commented Aug 17, 2023

Thanks @ahrtr for the quick comment.

The existing CompactorModeRevision can also meet your requirement. We just need to set the same value as the revisionThreshold in your case.

The fact is that it does not. The existing revision compactor only wakes up every 5 minutes, and only compacts if enough revisions have accumulated by then. That's why I made this proposal.

const revInterval = 5 * time.Minute

// Run runs revision-based compactor.
func (rc *Revision) Run() {
	prev := int64(0)
	go func() {
		for {
			select {
			case <-rc.ctx.Done():
				return
			case <-rc.clock.After(revInterval):
				rc.mu.Lock()
				p := rc.paused
				rc.mu.Unlock()
				if p {
					continue
				}
			}

I am afraid It isn't correct. Free pages can't be reused before they are reclaimed (by defragmentation).

If I understand correctly, defragmentation replays all the key/values into a new db file. The defrag function just goes through all the buckets and copies all the keys into the new db file. It doesn't delete anything, but the compactor does. Defrag is used to reduce the total db size when there are a lot of free pages.

func defragdb(odb, tmpdb *bolt.DB, limit int) error {
	// open a tx on tmpdb for writes
	tmptx, err := tmpdb.Begin(true)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmptx.Rollback()
		}
	}()

	// open a tx on old db for read
	tx, err := odb.Begin(false)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	c := tx.Cursor()

	count := 0
	for next, _ := c.First(); next != nil; next, _ = c.Next() {
		b := tx.Bucket(next)
		if b == nil {
			return fmt.Errorf("backend: cannot defrag bucket %s", string(next))
		}

		tmpb, berr := tmptx.CreateBucketIfNotExists(next)
		if berr != nil {
			return berr
		}
		tmpb.FillPercent = 0.9 // for bucket2seq write in for each

		if err = b.ForEach(func(k, v []byte) error {
			count++
			if count > limit {
				err = tmptx.Commit()
				if err != nil {
					return err
				}
				tmptx, err = tmpdb.Begin(true)
				if err != nil {
					return err
				}
				tmpb = tmptx.Bucket(next)
				tmpb.FillPercent = 0.9 // for bucket2seq write in for each

				count = 0
			}
			return tmpb.Put(k, v)
		}); err != nil {
			return err
		}
	}

	return tmptx.Commit()
}

For example, set up a single etcd server from scratch and disable the auto compactor.

  • step 1: use the following command to ingest 10,000 revisions.
benchmark put --rate=100 --total=10000 --compact-interval=0 --key-space-size=3000 --key-size=256 --val-size=10240

We get a total size of 136 MB and an InUse size of 123 MB:

etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  136 MB |         123 MB |      true |      false |         2 |      10004 |              10004 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
  • step 2: use defrag

There are only a few MB of free pages, so after defrag the total size becomes 123 MB.

 etcdctl defrag dc1e6cd4c757f755 -w table
Finished defragmenting etcd member[127.0.0.1:2379]. took 2.995811265s

etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  123 MB |         123 MB |      true |      false |         2 |      10004 |              10004 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+

It's expected, because defrag doesn't delete anything; it just replays all the key/values.

  • step 3: use compact
etcdctl get foo -w json
{"header":{"cluster_id":5358838169441251993,"member_id":15861234598778763093,"revision":10001,"raft_term":2}}


etcdctl compact 10001
compacted revision 10001

 etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  124 MB |          36 MB |      true |      false |         2 |      10005 |              10005 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+

After the compaction, about 88 MB of the file consists of free pages. They can actually be reused.

  • step 4: ingest 1,000 new revisions

The InUse size increases from 36 MB to 48 MB while the total size stays at 124 MB. NOTE: if there are not enough contiguous free pages, the total size will still grow.

benchmark put --rate=100 --total=1000 --compact-interval=0 --key-space-size=3000 --key-size=128 --val-size=10240

etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  124 MB |          48 MB |      true |      false |         2 |      11005 |              11005 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+

The compactor deletes out-of-date revisions, and the free pages can be reused.
As described in my first comment, a compactor that deletes out-of-date revisions in time can slow down the growth rate of the total size.
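As a way to observe those numbers directly, here is a small sketch using bbolt's public Stats/Info APIs against a copy of the member's db file (the file name is an assumption; the in-use figure is derived the same way as "DB SIZE IN USE": total size minus free pages):

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// Sketch: report total vs. in-use size of a bbolt file, the two numbers shown
// by `etcdctl endpoint status`. Run it against a copy of the db file, since a
// running etcd member holds an exclusive lock on the original.
func main() {
	db, err := bolt.Open("db-copy", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var total int64
	if err := db.View(func(tx *bolt.Tx) error {
		total = tx.Size() // physical file size seen by this transaction
		return nil
	}); err != nil {
		log.Fatal(err)
	}

	// Free pages can be reused by future writes without growing the file.
	freeBytes := int64(db.Stats().FreePageN) * int64(db.Info().PageSize)
	fmt.Printf("total: %d bytes, in use: %d bytes, free: %d bytes\n",
		total, total-freeBytes, freeBytes)
}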

Hope this comment makes it clear. :)

@serathius
Member

serathius commented Aug 17, 2023

Not super convinced, but overall I don't understand the need for revision-based compaction, so I might be wrong.

There is a tradeoff between the predictability of the compaction window and storage size; however, I don't know if anyone would prefer storage size, especially for kubernetes. If there is a burst of changes, you especially don't want to compact too fast, because it risks clients not being able to catch up. Fast compaction will lead to unsynced clients, which will need to download the whole state again. That doesn't seem like a good change from a scalability standpoint.

cc @mborsz @wojtek-t for opinions about revision-based compaction.

@ahrtr
Member

ahrtr commented Aug 17, 2023

@fuweid sorry for not making it clear.

  • One of the key points is not to compact the most recent revisions too fast, just as @serathius mentioned above.
  • Yes, it's correct that the free pages in the bbolt db can be reused.
  • With regard to the performance issues (e.g. OOM) caused by burst traffic, there are already a couple of related discussions (unfortunately we have not had time to take care of them so far. @geetasg)
  • I am not worried about the db size as long as the free pages can be reused later. FYI: Support customizing the rebalance threshold bbolt#422

@fuweid
Member Author

fuweid commented Aug 17, 2023

Thanks for the review @serathius @ahrtr

The pain point (at least to me) is that burst traffic is unpredictable and there is no watchable alarm API (if I understand correctly) to notify an etcd operator or admin to compact in time and defrag if the DB size exceeds the quota in a short time. The etcd cluster then runs into read-only mode, which is effectively an availability issue.

So I am seeking a way to compact old revisions in time; it's a kind of revision garbage collector. The solution is to introduce a compactor based on revision count, which I have been thinking about and which is doable. It can be configured based on the cluster's object size.

you especially don't want to compact too fast because it risks clients not being able to catch up.

Yes. For this issue, we can consider introducing a catch-up-revisions setting to keep recent revisions and avoid too many relist calls. Does that make sense to you?

With regard to the performance issue (e.g. OOM) caused by burst traffic, there are already a couple of related discussion

Yes. Burst list-all-huge-dataset traffic is a nightmare.

I am not worried about the db size as long as the free pages can be reused later. FYI etcd-io/bbolt#422

Yes. But that requires us to GC the old revisions in time to reclaim the pages.
Otherwise, the total db size exceeds the quota size, etcd files the NO_SPACE alarm, and the cluster goes read-only.

Based on that, maybe the quota availability checker should consider InUseSize first and then the total size, since free pages will be available for reuse after compaction.

Looking forward to your feedback.

@serathius
Member

The pain point (at least to me) is that burst traffic is unpredictable and there is no watchable alarm API (if I understand correctly) to notify an etcd operator or admin to compact in time and defrag if the DB size exceeds the quota in a short time. The etcd cluster then runs into read-only mode, which is effectively an availability issue.

Not sure I understand your motivation. It's normal to overprovision disk. What kind of burst are you talking about? Are you expecting the data size to double in 5 minutes?

@fuweid
Member Author

fuweid commented Aug 18, 2023

What kind of burst are you talking about? Are you expecting the data size to double in 5 minutes?

The burst is about too many PUT requests to kube-apiserver. I was facing issues where argo workflow submitted cronjobs and the etcd total db size grew by 2-3 GiB in 5 minutes. There were about 3,000 ~ 6,000 pods, each with one init-container and two containers, and at least 5 PUT requests to etcd per pod. Most of the pods were short-lived, but the workflow job kept creating/deleting/re-creating them for hours. If compaction doesn't happen in time, bbolt expands very fast and then exceeds the quota.

Not sure I understand your motivation. It's normal to overprovision disk.

Sorry for the unclear comment. My motivation is to have etcd clear old revisions as soon as possible so that etcd can reuse the free pages, which can reduce the downtime caused by the NO_SPACE alarm.

Currently, once the etcd server detects that the bbolt db size exceeds the max quota size, it files the NO_SPACE alarm and goes into read-only mode. Even if the compactor then reclaims enough contiguous pages to serve most upcoming requests, the etcd server still denies them. The server needs a defragmentation to shrink the db size below the max quota, and the NO_SPACE alarm has to be disarmed.

However, there is no API, like Watch for key/value changes, to notify the operator or admin about the NO_SPACE alarm; an operator component needs to poll the Alarm list (a minimal polling sketch follows the list below). To reduce the downtime caused by NO_SPACE, there are only two options for existing etcd releases:

  • Shorten the compaction interval
  • Shorten the interval at which the NO_SPACE alarm is polled
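Here is a minimal sketch of that polling loop with clientv3 (the endpoint and interval are illustrative; the reaction to the alarm is left as a comment):

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/api/v3/etcdserverpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// There is no watchable alarm API today, so an operator has to poll the alarm
// list and react when NOSPACE shows up.
func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(30 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.AlarmList(ctx)
		cancel()
		if err != nil {
			log.Printf("alarm list failed: %v", err)
			continue
		}
		for _, a := range resp.Alarms {
			if a.Alarm == etcdserverpb.AlarmType_NOSPACE {
				// Here the operator would compact, defrag, and then disarm.
				log.Printf("member %x raised NOSPACE", a.MemberID)
			}
		}
	}
}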

So I am making this proposal to compact old revisions in time and keep the total DB size smaller than the max quota size.

Besides this proposal, maybe we can consider using In-Use-Size as the current usage, instead of the bbolt total db size. etcd would go into read-only mode only if the In-Use-Size exceeds the quota size, because the bbolt total db size doesn't represent the real usage. The quota availability checker should use In-Use-Size + cost > max-quota-size to deny update requests. That way, the etcd server can return to normal from NO_SPACE after a compaction, even if burst requests are flooding into it.

@tjungblu
Contributor

We're also seeing these bursty workloads more and more often, e.g. with argo and batch-style operations.

So, I am seeking a way to compact old revisions in time. It's kind of revision garbage collector.

I wonder if we can make the MVCC layer and bbolt more "generational". What if we were to shard bbolt by revision range?
I think this would also help with fragmentation, and with the slowness we observe when writing into an almost-full bbolt, since new writes would always go to a new bbolt database.
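A very rough sketch of the routing idea, purely hypothetical (none of these types exist in etcd or bbolt): each revision range gets its own bbolt file, so retiring an old range becomes a file-level operation rather than an in-place compaction.

package sharding

import (
	"fmt"
	"os"

	bolt "go.etcd.io/bbolt"
)

// Hypothetical generational backend: every shardSpan revisions go to a new
// bbolt file, so dropping history is a cheap file delete.
type shardedBackend struct {
	dir       string
	shardSpan int64
	shards    map[int64]*bolt.DB // shard id -> open db
}

func (s *shardedBackend) shardFor(rev int64) (*bolt.DB, error) {
	id := rev / s.shardSpan
	if db, ok := s.shards[id]; ok {
		return db, nil
	}
	db, err := bolt.Open(fmt.Sprintf("%s/gen-%d.db", s.dir, id), 0600, nil)
	if err != nil {
		return nil, err
	}
	s.shards[id] = db
	return db, nil
}

// retireBefore drops every shard that only holds revisions older than rev,
// replacing compaction/defrag for those ranges.
func (s *shardedBackend) retireBefore(rev int64) error {
	cutoff := rev / s.shardSpan
	for id, db := range s.shards {
		if id >= cutoff {
			continue
		}
		path := db.Path()
		if err := db.Close(); err != nil {
			return err
		}
		if err := os.Remove(path); err != nil {
			return err
		}
		delete(s.shards, id)
	}
	return nil
}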

Alternatively, have one "permgen"-style bbolt file that keeps historical revisions that don't change, while new writes go into a "newgen". We would occasionally copy older revisions into the permgen, and then recycle the newgen once its fragmentation gets too big.

Admittedly, we would lose the out-of-the-box transaction support of a single bbolt file, which is really neat.

@lance5890
Contributor

+1
