
backend: add experimental defrag txn limit flag #15511

Draft · wants to merge 3 commits into base: main
Conversation


@sjdot sjdot commented Mar 18, 2023

I've been looking into defrag performance and noticed there is a hardcoded value defragLimit that controls the frequency at which transactions are committed during a defrag (currently set to 10k).

It seems this value has been present since the original defrag implementation, and I couldn't find any docs or comments in the original PR that state why 10k is used.

Through some brief synthetic testing I've observed that both memory usage and defrag run time can be influenced by changing this value, so I figured it might be useful to expose it as a configurable CLI flag.

Is this reasonable, or is there a reason why this should stay at 10k?

sjdot added 2 commits March 18, 2023 13:36
Signed-off-by: Steven Johnson <[email protected]>
Signed-off-by: sjdot <[email protected]>
Signed-off-by: Steven Johnson <[email protected]>
Signed-off-by: sjdot <[email protected]>
@sjdot sjdot force-pushed the sjdot/defrag-limit branch from ffdf4c3 to e9b810b Compare March 18, 2023 17:37
@jmhbnz jmhbnz (Member) left a comment

Thanks for proposing this idea @sjdot. The maintainers can likely give some insight into the original intention for the limit; just in terms of the code, I have one suggestion below.

It sounds to me like a batch limit, so I suggest the new variables are called DefragBatchLimit and ExperimentalDefragBatchLimit. That makes it clear we are dealing with a batch limit rather than an overall limit.

Finally, you mention the performance of defrag can be influenced with this flag; are you able to provide some benchmarks when you're ready for wider review? Thanks!

@chaochn47 (Member) commented

/cc @cenkalti

@ahrtr (Member) commented Mar 18, 2023

Thanks @sjdot. Actually, I am thinking of enhancing the bbolt.Compact interface to support this, and then updating etcd to use the updated API (it should be backward compatible), just as #15470 does.

@sjdot (Author) commented Mar 20, 2023

Thanks @ahrtr, sounds good. Do you anticipate your change would appear in a 3.5.x release, or would this be further down the line?

Would it make sense to have this in place before then so it can be passed in when the refactor is done?

@ahrtr (Member) commented Mar 20, 2023

Do you anticipate your change would appear in a 3.5.x release

Firstly, it depends on whether the change to bbolt.Compact is backward compatible. I think YES, but the interface may be a little ugly.

Secondly, building on the first point above: technically speaking, it's OK to backport the change to 3.5.x, but since it's a feature, and we usually don't backport features, please share your test results on the performance comparison.

@sjdot (Author) commented Mar 21, 2023

@ahrtr

Here is some synthetic benchmarking I ran through.

Methodology was:

  • Put 20k 50kb keys in a single node running locally
  • Run 10 defrags in a row via a shell loop and get the output from etcdctl for how long it took
  • Sample the RSS of the etcd process once a second via /proc to get a rough estimate of peak memory usage
  • Repeat with the 10k default and the 1k txn commit frequency
  • Do this both on disk and in a tmpfs filesystem

Let me know if there is a better/standard way to do this.
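The methodology above can be sketched roughly as the following shell loop. This is a hypothetical reconstruction, not taken from the PR: the endpoint, the use of pgrep to find the etcd process, and sampling VmRSS from /proc are all assumptions about the setup.

```shell
# Run 10 defrags in a row; etcdctl prints how long each one took.
for i in $(seq 1 10); do
  etcdctl --endpoints=http://localhost:2379 defrag
done &

# Meanwhile, sample the RSS of the etcd process once a second via /proc
# to get a rough estimate of peak memory usage.
ETCD_PID=$(pgrep -x etcd)
while kill -0 "$ETCD_PID" 2>/dev/null; do
  grep VmRSS "/proc/${ETCD_PID}/status"
  sleep 1
done
```

Taking the maximum of the sampled VmRSS values gives the "Peak RSS" figures reported below; note that once-a-second sampling can miss short-lived spikes.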

Flag set to 10k, db on disk:

Finished defragmenting etcd member[http://localhost:2379]. took 5.333068127s
Finished defragmenting etcd member[http://localhost:2379]. took 6.433203078s
Finished defragmenting etcd member[http://localhost:2379]. took 6.27974422s
Finished defragmenting etcd member[http://localhost:2379]. took 5.805693345s
Finished defragmenting etcd member[http://localhost:2379]. took 6.329594334s
Finished defragmenting etcd member[http://localhost:2379]. took 6.307598353s
Finished defragmenting etcd member[http://localhost:2379]. took 6.000404371s
Finished defragmenting etcd member[http://localhost:2379]. took 5.874886916s
Finished defragmenting etcd member[http://localhost:2379]. took 5.633190573s
Finished defragmenting etcd member[http://localhost:2379]. took 6.171504848s

Peak RSS ~3178mb

Flag set to 1k, db on disk:

Finished defragmenting etcd member[http://localhost:2379]. took 3.382667041s
Finished defragmenting etcd member[http://localhost:2379]. took 3.052846006s
Finished defragmenting etcd member[http://localhost:2379]. took 3.109257513s
Finished defragmenting etcd member[http://localhost:2379]. took 3.111635591s
Finished defragmenting etcd member[http://localhost:2379]. took 3.099382545s
Finished defragmenting etcd member[http://localhost:2379]. took 3.096814046s
Finished defragmenting etcd member[http://localhost:2379]. took 2.981718829s
Finished defragmenting etcd member[http://localhost:2379]. took 2.933617856s
Finished defragmenting etcd member[http://localhost:2379]. took 2.826742199s
Finished defragmenting etcd member[http://localhost:2379]. took 3.068731601s

Peak RSS ~1700mb

Flag set to 10k, db in tmpfs:

Finished defragmenting etcd member[http://localhost:2379]. took 1.988840824s
Finished defragmenting etcd member[http://localhost:2379]. took 1.831227897s
Finished defragmenting etcd member[http://localhost:2379]. took 1.814925611s
Finished defragmenting etcd member[http://localhost:2379]. took 1.80727876s
Finished defragmenting etcd member[http://localhost:2379]. took 1.888374921s
Finished defragmenting etcd member[http://localhost:2379]. took 1.783177088s
Finished defragmenting etcd member[http://localhost:2379]. took 1.805623487s
Finished defragmenting etcd member[http://localhost:2379]. took 1.784464582s
Finished defragmenting etcd member[http://localhost:2379]. took 1.797325653s
Finished defragmenting etcd member[http://localhost:2379]. took 1.724021265s

Peak RSS ~2937mb

Flag set to 1k, db in tmpfs:

Finished defragmenting etcd member[http://localhost:2379]. took 902.344145ms
Finished defragmenting etcd member[http://localhost:2379]. took 895.942189ms
Finished defragmenting etcd member[http://localhost:2379]. took 896.139653ms
Finished defragmenting etcd member[http://localhost:2379]. took 888.574844ms
Finished defragmenting etcd member[http://localhost:2379]. took 910.106993ms
Finished defragmenting etcd member[http://localhost:2379]. took 908.550838ms
Finished defragmenting etcd member[http://localhost:2379]. took 894.697715ms
Finished defragmenting etcd member[http://localhost:2379]. took 893.926867ms
Finished defragmenting etcd member[http://localhost:2379]. took 888.552991ms
Finished defragmenting etcd member[http://localhost:2379]. took 889.956509ms

Peak RSS ~1038mb

@cenkalti (Member) commented

Key size varies between different workloads. Does it make more sense to set the limit as a transaction size value instead of a key count?

Similar to the bbolt.Compact implementation?
https://pkg.go.dev/go.etcd.io/bbolt#Compact
https://github.com/etcd-io/bbolt/blob/v1.3.7/compact.go#L24

@ahrtr (Member) commented Mar 22, 2023

Thanks @sjdot for sharing the test data. The smaller peak RSS at 1K (the flag value) compared to 10K aligns with my understanding, but the shorter duration at 1K is interesting; it may be related to the key size (50kb in your example).

I agree that we should set a transaction size instead of a key count. I think we may then get consistent test results, independent of the key size.

@sjdot (Author) commented Mar 22, 2023

@ahrtr @cenkalti sure, I'll do some further local testing comparing size-based vs key-count-based limits and share my results.

@sjdot (Author) commented Mar 23, 2023

Ok @ahrtr @cenkalti, I made a branch that uses size as the limit instead of the number of keys, and it looks like it's more consistent regardless of the key size/count distribution.

I tested two different 1GB DBs: one made up of 20k 50kb keys as per the above tests, and another made up of 2k 500kb keys.

I kept the size limit and the key count limit the same in both test cases, and the size-based limit is clearly more consistent for performance here. It seems it's the number of bytes to get through, rather than the number of keys, that influences the performance.

Tests below:

20k 50kb keys, size based txn commit at 50MB:

Finished defragmenting etcd member[http://localhost:2379]. took 1.21319986s
Finished defragmenting etcd member[http://localhost:2379]. took 1.153966878s
Finished defragmenting etcd member[http://localhost:2379]. took 1.182014535s
Finished defragmenting etcd member[http://localhost:2379]. took 1.159080576s
Finished defragmenting etcd member[http://localhost:2379]. took 1.137828246s
Finished defragmenting etcd member[http://localhost:2379]. took 1.160709211s
Finished defragmenting etcd member[http://localhost:2379]. took 1.169285698s
Finished defragmenting etcd member[http://localhost:2379]. took 1.131717019s
Finished defragmenting etcd member[http://localhost:2379]. took 1.160219651s
Finished defragmenting etcd member[http://localhost:2379]. took 1.123005736s

Peak RSS ~1692mb

20k 50kb keys, count based txn commit at 1k keys:

Finished defragmenting etcd member[http://localhost:2379]. took 1.119592168s
Finished defragmenting etcd member[http://localhost:2379]. took 1.054284303s
Finished defragmenting etcd member[http://localhost:2379]. took 1.083724043s
Finished defragmenting etcd member[http://localhost:2379]. took 1.073588978s
Finished defragmenting etcd member[http://localhost:2379]. took 1.118112596s
Finished defragmenting etcd member[http://localhost:2379]. took 1.081543095s
Finished defragmenting etcd member[http://localhost:2379]. took 1.094735675s
Finished defragmenting etcd member[http://localhost:2379]. took 1.080572673s
Finished defragmenting etcd member[http://localhost:2379]. took 1.072687346s
Finished defragmenting etcd member[http://localhost:2379]. took 1.06313357s

Peak RSS ~1695mb

2k 500kb keys, size based txn commit at 50MB (consistent with above 50MB test):

Finished defragmenting etcd member[http://localhost:2379]. took 1.093779666s
Finished defragmenting etcd member[http://localhost:2379]. took 1.045045904s
Finished defragmenting etcd member[http://localhost:2379]. took 1.056863083s
Finished defragmenting etcd member[http://localhost:2379]. took 1.043255168s
Finished defragmenting etcd member[http://localhost:2379]. took 1.077573313s
Finished defragmenting etcd member[http://localhost:2379]. took 1.071873754s
Finished defragmenting etcd member[http://localhost:2379]. took 1.087684704s
Finished defragmenting etcd member[http://localhost:2379]. took 1.083757851s
Finished defragmenting etcd member[http://localhost:2379]. took 1.21139664s
Finished defragmenting etcd member[http://localhost:2379]. took 1.102637689s

Peak RSS ~1708mb

2k 500kb keys, count based txn commit at 1k keys (slow down in comparison to 1k limit on above 1k test):

Finished defragmenting etcd member[http://localhost:2379]. took 2.003168456s
Finished defragmenting etcd member[http://localhost:2379]. took 1.876306178s
Finished defragmenting etcd member[http://localhost:2379]. took 1.783715993s
Finished defragmenting etcd member[http://localhost:2379]. took 1.859801159s
Finished defragmenting etcd member[http://localhost:2379]. took 1.750165853s
Finished defragmenting etcd member[http://localhost:2379]. took 1.761676024s
Finished defragmenting etcd member[http://localhost:2379]. took 1.739160742s
Finished defragmenting etcd member[http://localhost:2379]. took 1.776034929s
Finished defragmenting etcd member[http://localhost:2379]. took 1.821699408s
Finished defragmenting etcd member[http://localhost:2379]. took 1.729124961s

Peak RSS ~2895mb

@ahrtr (Member) commented Apr 26, 2023

After both etcd-io/bbolt#422 (comment) and #15470 are resolved, we can continue working on this PR. But I suggest setting the defragLimit to a transaction max bytes value instead of a number of keys, something like below:

// DefragBatchTxMaxBytes limits the transaction size of the defragmentation process and may trigger intermittent commits
DefragBatchTxMaxBytes int

@stale stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 12, 2023
@stale stale bot closed this Sep 17, 2023
@ahrtr ahrtr reopened this Sep 17, 2023
@ahrtr ahrtr added stage/tracked and removed stale labels Sep 17, 2023
@k8s-ci-robot commented

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


6 participants