For discussion: "slack" compression option #15215

Draft · robn wants to merge 1 commit into master from the slack-compression branch
Conversation

@robn (Member) commented Aug 28, 2023

Motivation and Context

We have a customer that deals in very large, generally uncompressible data. They run with compression=off and lately have been considering recordsize=16M. In testing, we found that this presented a real bottleneck for files that only fill a small amount of the last record, eg, a 17M file.

The main reason for this is simply that we’re doing far more write IO than we need to - we write out all those trailing zeroes that we don’t even need. There’s also substantial extra CPU load, eg a lot of time spent checksumming those zeroes.

So, we’ve been experimenting with various ways of ignoring the “slack space” of trailing zeroes at the end of each record. We call this “slack compression”.

There was some interest in this at the last OpenZFS Leadership call, including in possibly doing it outside of the compression system entirely. So, I’m posting this PR as a starting point for discussion!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Method

The method is pretty simple no matter how it’s hooked up:

  • In zio_write_compress(), call a function that searches the data buffer for the last non-zero byte. That offset becomes the psize for the write.
  • In zio_read_bp_init(), push a transform that expands psize back out to lsize.

Yes, this is basically describing compression 😄

The slack_compress() function shows a fairly simple implementation of the search. It runs backwards through the data buffer as an array of uint64_t. It exits at the first non-zero value it finds. This implementation is loosely modeled on zio_compress_zeroed_cb() (indeed, our earliest experiments were simply exposing ZIO_COMPRESS_EMPTY). There are obvious optimisation opportunities available here, including vectorisation and letting it work on a scatter ABD to avoid the need to linearise the incoming buffer.
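To make that concrete, here is a minimal standalone sketch of the backwards search; the function name, signature and rounding behaviour are illustrative assumptions, not the actual slack_compress() from this patch:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch only; not the slack_compress() in this patch.
 * Scan backwards through the buffer as an array of uint64_t and return
 * the offset just past the last non-zero word. A buffer of all zeroes
 * returns 0; the caller would round up to whatever minimum psize it
 * needs. Assumes s_len is a multiple of sizeof (uint64_t), which holds
 * for power-of-two record sizes.
 */
static size_t
slack_search(const void *src, size_t s_len)
{
	const uint64_t *p = src;
	size_t words = s_len / sizeof (uint64_t);

	/* Walk backwards a word at a time until we hit real data. */
	while (words > 0 && p[words - 1] == 0)
		words--;

	return (words * sizeof (uint64_t));
}
```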

The interesting part is how and when to use this functionality.

compression=slack (this PR)

This PR shows the simplest form: adding a compression=slack option. This works just fine as a substitute for compression=empty, and the ARC gets the benefit as well.

Early on, I plumbed this method directly into zio_write_compress rather than hooking it up to zio_compress_table, because we’re not actually doing any compression, so we don’t need a destination buffer or a copy; the source buffer can be reused. As it turns out it only made a small difference to performance because the search function still requires a linear buffer.

This works well enough that it could be merged today (if we had any spare DMU_BACKUP_FEATURE_* bits), but it’s not really how I would like to do it.

compression=none and force psize to lsize

I am probably wrong about this, but as far as I could tell, there is only one situation outside of compression where we store psize < lsize, and that is when the ZIL is stored on a non-raidz vdev. Before issuing IO, zio_shrink() is called to set psize as small as possible. On read, the data is read into an lsize'd buffer, so it naturally falls out right. I haven’t tried hard to find the code, but my intuition is that this doesn’t work for raidz because the block gets sliced up further, and that goes wrong if the sizes aren’t right coming in.

If that’s right, that should mean that we could set a transform in zio_read_bp_init() for any uncompressed block with psize < lsize and it would “just work”, without requiring a separate compression option. It would still need a read feature flag as older software won’t understand it (perhaps only active if written to a raidz vdev), but possibly won’t require a stream feature flag.
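To illustrate the idea (with made-up names; the real hook would presumably go through a zio_push_transform() callback), the read-side work amounts to nothing more than zero-filling the tail of the lsize'd buffer beyond psize:

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical read-side transform body: the block was written with only
 * psize bytes of real data, so after reading it into an lsize'd buffer,
 * "decompression" is just zeroing the tail. Function name and signature
 * are illustrative, not from the patch.
 */
static void
slack_expand(void *buf, size_t psize, size_t lsize)
{
	assert(lsize >= psize);
	memset((char *)buf + psize, 0, lsize - psize);
}
```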

We wouldn’t have to, but we could also teach the ARC about this, so that even though compression is “off” it could still keep only the real data portion in memory rather than larger buffers of zeroes.

Slack with content compression?

We didn’t try this, but it might be possible to use slack with data compression. This might be worthwhile, as collapsing zeroes could perhaps be made faster than a general compressor and may still have less overhead.

I’ve not thought hard about this, but I think the method should be as simple as collapsing zeroes before invoking the real compressor and, on read, expanding back out after decompression. However, I think it would require that compression algorithms are able to know their decompressed size from the compressed state, since we can’t tell them ahead of time.
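As a rough sketch of that write-side ordering, assuming the hypothetical slack_search() from the earlier sketch and a generic compressor callback (these names are illustrative, not the zio_compress_table interface):

```c
/*
 * Illustrative only: trim trailing zeroes first, then hand the shorter
 * length to the real compressor. compress_func_t and slack_then_compress
 * are made up for this example.
 */
typedef size_t (*compress_func_t)(const void *src, void *dst, size_t s_len);

static size_t
slack_then_compress(const void *src, void *dst, size_t s_len,
    compress_func_t compress)
{
	/* Only the bytes up to the last non-zero word are interesting. */
	size_t data_len = slack_search(src, s_len);

	/*
	 * The compressor only sees the real data. On read, after
	 * decompression, the buffer would need expanding back out to the
	 * full logical size - the open question discussed above.
	 */
	return (compress(src, dst, data_len));
}
```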

If that’s no good, then I can see taking a spare bit from somewhere in the BP (maybe bit 7 of the checksum type?) and using that as a signal that a further transform is necessary after decompression. Hmm, or maybe not, because where is the size? Do we even need the final size for read? Ok, I would really need to think about that more.

Slack compress the last block only?

We also spent some time trying to see if we could trigger slack compression only for the last block in an object, since we know the “inner” blocks are full and therefore pointless to search, but it seems to be a non-starter. If a file is written in order from start to finish and the ZIL isn’t used, then it’s usually possible to have a clear notion of “last block”, which can be plumbed down through the write policy. If the block is written out to the ZIL though (and later hooked up with WR_INDIRECT), then there’s no way to know that it was the “last” block. Since our customer has an fsync()-heavy workload, most writes were of this type. For workloads that aren’t calling fsync() a lot it can work better, but there are also problems if holes or zero-fills are introduced into “inner” blocks: those blocks could then benefit from slack compression but are no longer able to. Taken together it ends up being a very situational gain, so we haven’t pursued it further.

Make ZLE faster?

We considered that perhaps we could avoid all this by optimising ZLE instead. The main difference is that ZLE is a true compression algorithm, and so copies the data. It could certainly be optimised, but regardless, it would require the entire block to be scanned; we’d be going from just a full scan for the checksum to a full compression scan, copying some amount of data, and then another, shorter scan for the checksum. We think the best case for slack is no copies at all: a backwards scan to find the end of the data, then a forward scan to compute the checksum, so the entire block is only seen once and no copies are made.

(That said, ZLE is ripe for some optimising work, though I doubt many people are using it directly these days)

Performance

We have a test scaffold available that runs a representative simulation of the customer’s workload: open(), write(), fsync(), close(). We use this to test the effects of config changes, patches, etc.
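For reference, the per-file pattern being simulated is roughly the following; this is a simplified sketch with placeholder names, not the actual scaffold:

```c
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/*
 * Simplified sketch of the simulated workload: write one file of `size`
 * bytes from a caller-supplied buffer, then fsync() before closing.
 * Names and error handling are illustrative only.
 */
static int
write_one_file(const char *path, const char *buf, size_t buflen, size_t size)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	for (size_t off = 0; off < size; off += buflen) {
		size_t n = (size - off < buflen) ? size - off : buflen;
		if (write(fd, buf, n) != (ssize_t)n) {
			close(fd);
			return (-1);
		}
	}

	(void) fsync(fd);	/* the latency reported in the table below */
	return (close(fd));
}
```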

Below are results for varying combinations of recordsize and compression, at different file sizes. The test is done on three 12-wide raidz3 pools, each with 10 writing threads, for 10 minutes. Note that there is a lot of other tuning involved (standard for the customer workload), so these are useful for ballpark comparison only.

The “throughput” column is showing the “real” amount of file data, not the total written, which helps compare the total loss due to trailing zeroes. The “fsync” column is the average time for the fsync() call to return, which is a useful proxy for the total IO time (and an important number for our customer as their clients are blocked until this returns).

These results have mostly given us confidence that even this quite naive version helps bring us closer to where we were with none, and can even rival lz4 some of the time. Our next steps are likely to see what we can gain by removing or minimising the number of allocations and copies required, and by improving the performance of the searching function.

| recordsize | compression | file size | files written | throughput (MB/s) | fsync (ms) |
|------------|-------------|-----------|---------------|-------------------|------------|
| 128k | none  | 15m | 84892 | 707 | 203.000 |
| 128k | none  | 16m | 84054 | 747 | 202.667 |
| 128k | none  | 17m | 77221 | 729 | 223.000 |
| 128k | none  | 24m | 58366 | 778 | 295.000 |
| 128k | none  | 31m | 47749 | 822 | 360.333 |
| 128k | lz4   | 15m | 84937 | 707 | 203.000 |
| 128k | lz4   | 16m | 83768 | 745 | 203.333 |
| 128k | lz4   | 17m | 77605 | 733 | 222.333 |
| 128k | lz4   | 24m | 57284 | 764 | 300.667 |
| 128k | lz4   | 31m | 46755 | 805 | 368.000 |
| 128k | zle   | 15m | 82360 | 686 | 209.000 |
| 128k | zle   | 16m | 82078 | 729 | 207.000 |
| 128k | zle   | 17m | 75475 | 713 | 228.333 |
| 128k | zle   | 24m | 54590 | 728 | 315.333 |
| 128k | zle   | 31m | 43946 | 757 | 391.333 |
| 128k | slack | 15m | 85562 | 713 | 201.000 |
| 128k | slack | 16m | 84296 | 749 | 202.000 |
| 128k | slack | 17m | 77982 | 736 | 220.333 |
| 128k | slack | 24m | 58677 | 782 | 293.333 |
| 128k | slack | 31m | 47734 | 822 | 359.333 |
| 16m  | none  | 15m | 85691 | 714 | 176.333 |
| 16m  | none  | 16m | 89569 | 796 | 159.333 |
| 16m  | none  | 17m | 47338 | 447 | 306.333 |
| 16m  | none  | 24m | 47172 | 629 | 304.667 |
| 16m  | none  | 31m | 46861 | 807 | 304.333 |
| 16m  | lz4   | 15m | 92176 | 768 | 158.667 |
| 16m  | lz4   | 16m | 88380 | 785 | 161.000 |
| 16m  | lz4   | 17m | 66540 | 628 | 187.667 |
| 16m  | lz4   | 24m | 58349 | 778 | 225.333 |
| 16m  | lz4   | 31m | 45889 | 790 | 299.333 |
| 16m  | zle   | 15m | 86731 | 723 | 165.000 |
| 16m  | zle   | 16m | 86580 | 769 | 163.000 |
| 16m  | zle   | 17m | 61562 | 581 | 196.333 |
| 16m  | zle   | 24m | 57622 | 768 | 232.000 |
| 16m  | zle   | 31m | 46719 | 804 | 301.333 |
| 16m  | slack | 15m | 90687 | 756 | 157.667 |
| 16m  | slack | 16m | 88439 | 786 | 161.333 |
| 16m  | slack | 17m | 62187 | 587 | 190.667 |
| 16m  | slack | 24m | 56066 | 747 | 216.667 |
| 16m  | slack | 31m | 46534 | 801 | 306.000 |

Feedback

We’ll keep working on this, but we’re interested in feedback on any or all of the above. If you’ve got workloads you can test this on, or might like us to try, let us know. If you have thoughts about what this might mean if it wasn’t a compression option but “always on”, great! (especially in “weird” IO situations, like gang blocks, or metadata, or whatever). I’d like to hear more about what it all might mean for block pointer and IO pipeline changes. And I dunno what else; if you were interested in the dev call, please throw some thoughts down too!

@robn marked this pull request as draft on August 28, 2023
@behlendorf added the “Status: Design Review Needed (Architecture or design is under discussion)” label on Oct 20, 2023
@allanjude (Contributor):

@rincebrain Have you had a look at this?

@shodanshok (Contributor):

For what it is worth, my opinion is that optimizing ZLE should be preferred. After all, if LZ4 is so fast, why should zero elimination be slower? I understand the concerns about buffer copies - but if the end result is faster/simpler, I cannot see any real issue.

@robn force-pushed the slack-compression branch 3 times, most recently from 0f4861a to 3dc96e0 on July 3, 2024
@amotin (Member) left a comment:

The presence of such a trivial algorithm makes me wish even more for the ability to run compressor/decompressor code directly on an ABD instead of a linear buffer. In this trivial case even one copy looks excessive, but two are just wasteful. Our benchmarks show a NAS currently using 10-20 times more memory bandwidth than the amount of data read/written. We should care more about that.

Just a thought about a possible alternative: couldn't we make all of our decompressors report how much data they've actually filled, and then just fill the rest with zeroes up to the logical size? That way we could run the slack check before any normal compressor and tell the compressor to ignore the zeroes. Slack handling would then be done on the ABD, without needing to linearise the zero part of the buffer or to run a more advanced compressor on it.

{
(void) level;
ASSERT3U(d_len, >=, s_len);
memcpy(dst, src, s_len);
@amotin (Member) commented on Jul 3, 2024:
We should wipe the rest with zeroes here.

The "slack" option simply searches from the end of the block backwards
to the last non-zero byte, and sets that position as the "compressed"
size.

This patch is highly experimental; please see the associated PR for
discussion.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.