Explode zstd into separate upstream files #12978
Conversation
@rincebrain: thank you. Is there a test in ZFS to verify that compressed data matches the sums of known compressed data with the prior implementation? My understanding is that this is the existing zstd code just moved to another location, so it should produce the same buffers... but that's been coming up a bunch lately with compressor-related issues, so I figured I'd ask. |
Not aware of one, no, and it'd be pretty limited if one added one. (One also couldn't add one right now for, say, gzip, given the QAT and Linux/FBSD differences, and it'd fail if you didn't include differing versions for different endiannesses...) |
Yike. So basically deduplicated pools are not portable in OpenZFS. |
I believe, if memory serves, that was true before OpenZFS, even - BE and LE hosts would have resulted in different blocks, nevermind which compressor you picked.
I am familiar, yes. My point remains that it'd be a combinatoric nightmare to test them all on any one platform, and then we'd be only testing subsets on any given run... to say nothing of opaque implementations we can't test against, like QAT's gzip, or the accelerator cards someone mentioned in a recent OZFS Leadership Meeting.
Not to say that we shouldn't still test things, just that I'm not sure where the right balance of cost/benefit would be.
…On Sun, Jan 16, 2022 at 1:40 PM RageLtMan ***@***.***> wrote:
Yike. So basically deduplicated pools are not portable in OpenZFS. Crypto implementations sometimes use well-known vectors against which they test their computational product, so I figure that having a well-known buffer of bytes that compresses to X, Y, and Z on lz4, gzip, and zstd for comparison might be handy... point taken though, it's a bit of a free-for-all right now.
|
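As a rough illustration of the known-vector idea discussed above, one could pin a zstd version and level and compare the compressed output against a recorded digest. This sketch uses the userland zstd CLI rather than ZFS's in-kernel path; the input file name and the recorded digest are placeholders, not values from this PR.

    # Known-vector check against the userland zstd CLI (not the ZFS code path).
    # "testvector.bin" and EXPECTED are placeholders for a pinned input/digest.
    EXPECTED="<digest recorded from a known-good build>"
    ACTUAL="$(zstd -3 -c testvector.bin | sha256sum | awk '{print $1}')"
    [ "$ACTUAL" = "$EXPECTED" ] || { echo "zstd-3 compressed output changed" >&2; exit 1; }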
d5675be to b74ddda (Compare)
@BrainSlayer you might want to go over this |
Compiling zstd as a monolithic file does take longer, but it also has a great benefit: it allows the compiler to optimize the code much better for performance. Such monolithic compiling approaches are also called "the poor man's LTO". This should be considered. Splitting the zstd code will decrease performance significantly; the monolithic zstd code was introduced exactly for that performance reason. Future spoiler: you will find a 1.5.1-based ZFS variant on my GitHub account already. |
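For readers unfamiliar with the term, "poor man's LTO" here means compiling the whole library as one translation unit so the compiler can see, and inline, across what would otherwise be file boundaries. A rough sketch of the two build styles follows; the directory and file names are placeholders, not the actual zstd layout.

    # Monolithic build: one big translation unit, so cross-file inlining is possible.
    cat common/*.c compress/*.c decompress/*.c > zstd_all.c
    cc -O2 -c zstd_all.c -o zstd_all.o

    # Split build: each file is optimized in isolation; inlining across files
    # only happens later if the toolchain's LTO is usable (see the sketch further down).
    for f in common/*.c compress/*.c decompress/*.c; do
        cc -O2 -c "$f" -o "${f%.c}.o"
    done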
lib/libzstd/Makefile.am (Outdated)
# SSE register return with SSE disabled if -march=znverX is passed
AM_CFLAGS += -U__BMI__
Removing this will lead to compile errors on some x86-64 targets, depending on your -march option. I added this workaround since I was running into that issue. I know that the origin of the problem is an upstream toolchain bug which might be resolved with newer compiler versions, but we should consider all supported toolchains.
Sure, that being in this rev was just an oversight, I had removed it at one point and didn't revert it before pushing.
If only compilers had some form of LTO flags to get LTO optimizations without concatenating every source file. Do you have benchmarks showing that the performance is markedly worse? |
The compiler has LTO optimization flags, but they do not work for kernel builds. |
I have no benchmark at hand yet; consider that I finished the work on the zstd patch a long time ago. But @Ornias1993 has a good working benchmark script; you can use it to run performance tests with and without your patch. I can only run tests on my Ryzen 16-core machine using GCC 11. |
Great, so LTO flags won't work in the kernel. I come back to "do you have benchmarks showing this is markedly worse", because I don't think "let's put everything in one file for performance" is the balance that the entire rest of the source tree takes. e: Ah, good, replies criss-crossing. I have a modified version of that script with some functionality I missed, like setting different recordsizes, that I can indeed use to benchmark this. But if your argument is both "this is slower" and "using separate source files is bad", that only covers one of those objections, and I don't particularly see me convincing you that using multiple source files is a reasonable choice if that's your starting position. |
I don't have an issue with splitting the source files back into their original form as such, but zstd - compression in general - is performance critical. The only argument I have is purely about performance. I think you understand how a compiler works, and that compiling everything as a single source allows the compiler to do more aggressive optimization. The best example is inlining strategy: the compiler cannot auto-inline code that is placed in different object files; that is only possible if the compiler sees everything in one source. Combining the object files into the resulting binary is a later job for the linker, and the linker cannot inline the affected code anymore. |
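For context on the inlining point: in a plain split build the compiler indeed cannot inline across object files, but with -flto the objects carry the compiler's intermediate representation and inlining is deferred to link time. A sketch with placeholder file names, keeping in mind the kernel-module caveat raised above:

    # Plain split build: a.o and b.o are final machine code, so a call from
    # a.c into a function defined in b.c cannot be inlined.
    cc -O2 -c a.c b.c
    cc -O2 a.o b.o -o prog

    # With -flto, inlining across the two files can happen at link time
    # (userland builds; kernel modules are the caveat discussed above).
    cc -O2 -flto -c a.c b.c
    cc -O2 -flto a.o b.o -o prog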
I'm familiar with the fact that you can't do certain optimizations with the loss of information from compiling the object files unless you embed additional information to reconstruct that state at a later stage, yes. I remain unconvinced it makes a notable difference in this case. It's funny you say that...I did do that experiment, actually, on one test platform (Debian bullseye x86_64).
So about 60 KiB. Not nothing, but probably not the difference between fitting in a cache line and not, for example. |
The compiler does know the cache line size, depending on which -march is set, so it optimizes the code in a way that runs best for the target cache lines. If you hack the kernel to forcefully enable the OPTIMIZE_INLINING option (it cannot be enabled on most architectures), you get even better results (in the kernel source, init/Kconfig, remove the dependency for the OPTIMIZE_INLINING option). |
And consider that the default kernel option for Debian is "optimize for size", if I remember correctly. I personally optimize for speed in my local environment. You can also override the default options in the CFLAGS in the zstd directory; I believe I did this in my original patch, it might still be in there. |
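One way to check which of these options a given distribution kernel was actually built with is to grep the installed config; the Debian-style config path below is an assumption and varies by distro.

    # Inspect the running kernel's build config for the size-vs-speed and
    # inlining options mentioned above.
    grep -E 'CONFIG_CC_OPTIMIZE_FOR_(SIZE|PERFORMANCE)|CONFIG_OPTIMIZE_INLINING' \
        /boot/config-"$(uname -r)"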
@Ornias1993 I did run your bench script, but it's not happy with the current fio version. |
No idea, not working on it or ZFS anymore. |
One is from this PR, one is from stock git master, both compiled without --enable-debug on Debian buster, run on my Ryzen 5900X, which was idling except for this. There are differences (which may be runtime variation, I didn't run this more than once and average or anything, at the moment), but neither of them seems to be a strict win over the other? e: These are all from the compression-bench.sh script mentioned before, I just modified it to take a recordsize option to configure before running its dd over, in this case, the mediawiki-based test file. edit 2: Also, Debian compiles its kernels with CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y, and on buster also uses CONFIG_OPTIMIZE_INLINING=y. |
Which chart is which? I mean, your benchmark shows that your PR is overall much faster; this doesn't look sane, but it could also be caused by the BMI flag change. Since the benchmark script doesn't work on my openSUSE system and just spits errors, I cannot do a counter-benchmark right now. |
The first chart is the PR. Where are you looking? The only case I see being noticeably different like that is zstd-fast-50 decompression. Also, the flag difference you mention is only in the userland bit - if you look in module/zstd/Makefile.in, it does include -U__BMI__, and that should be the one we care about here. |
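A quick way to see which build files carry the workaround, using the paths referenced in this thread:

    # Confirm where the BMI workaround is (still) defined in the tree.
    grep -n -- '-U__BMI__' lib/libzstd/Makefile.am module/zstd/Makefile.in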
Basically every single value in your first chart is faster than in the second chart. |
And basically every single value in your chart is 10 times slower than on my Ryzen 5950X, just as a side note. Can you provide me the script you're using, for verification? |
Well, my 5900X's VM could be 10 times slower than your 5950X on bare metal, but that seems at least minorly improbable. As an experiment, I quickly tried a couple of data points on my 8700K bare metal, which is running stock 2.0.6, and did not get a factor-of-10 speedup.
The speeds don't seem particularly surprising to me, zstd in ZFS has never been fast, just faster than gzip. https://gist.github.com/rincebrain/2c39e981104a4f50d6d8364e1c17b6a7 is what I was using, I gave it a larger mediawiki dump file to copy around as I was concerned 1 GB wasn't running long enough (and increased the ramdisk used appropriately), you can find that file compressed here or just take the first few GB out of a mediawiki dump. |
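For reference, the single-threaded measurement being described boils down to something like the following; the pool, dataset, file names, and recordsize are placeholders, and the actual script linked above does more setup and teardown than this.

    # Single-threaded write/read pass over one test file, roughly what the
    # modified compression-bench.sh does. Names, sizes, and the recordsize
    # value are placeholders.
    zfs create -o compression=zstd-3 -o recordsize=128k testpool/bench
    time dd if=/var/tmp/mediawiki-dump.xml of=/testpool/bench/testfile bs=1M conv=fsync
    time dd if=/testpool/bench/testfile of=/dev/null bs=1M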
I think your recordsize math or your modified benchmark is broken. It looks like your benchmark is not running threaded but is just doing single writes, so yours is not using real-world writes. [Charts: write, read, and readwrite (de)compression results for zstd-3.] zstd is lightning fast (with the default zstd-3 level); it normally outperforms standard I/O speed, so it writes in real time. |
So your claim is that doing single-threaded benchmarks is unrealistic because everybody uses multiple threads for their IO in the "real world"? How, precisely, would you like this tested? You appear to be unhappy with anything I do, so why don't you just specify what to do, and I can go do it. Seems more efficient for everyone. |
Sorry to interrupt, but this number looks like a cached read from ARC, @BrainSlayer. According to https://github.com/facebook/zstd , zstd-3 decompression, for example:
+1 for desired method of benchmark, would be interesting to see it too. |
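On the cached-read point: a simple way to make sure a read benchmark is not just measuring the ARC is to export and re-import the pool between the write and the read; pool, dataset, and file names below are placeholders.

    # Drop ARC contents for the test pool before measuring reads, so the
    # decompression path is actually exercised.
    zpool export testpool && zpool import testpool
    time dd if=/testpool/bench/testfile of=/dev/null bs=1M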
The test is running with 16 threads, if that makes more sense to you; 2250 MB/s is just single-threaded. |
Not everybody, but ZFS is designed to run in threaded workers, and a sequential write with dd will never be able to test ZFS performance, since you need multiple file operations for that (that is how the real world works by default). In addition, I already told you where you can find the benchmark script, and you said you are already using it, just with some modifications for the record size. In fact, I see now that you also replaced the whole benchmarking process. ZFS has defined performance test scripts in the tests folder; those are used for the benchmark. |
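For comparison, a multi-threaded run of the kind being asked for could be expressed with fio directly; the job parameters below are illustrative, not the settings of the referenced benchmark script.

    # Illustrative 16-thread sequential write job against a zstd-compressed
    # dataset; directory and sizes are examples.
    fio --name=zstd-write --directory=/testpool/bench \
        --rw=write --bs=1M --size=1G --numjobs=16 --group_reporting \
        --ioengine=psync --end_fsync=1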
I'm aware that ZFS runs with multiple workers handling multiple reads/writes. Every instance I've previously seen of benchmarking compression performance has provided single-threaded benchmark numbers as the baseline, so that's what I provided here. Can you please provide more concrete details for what, specifically, you want? I tried doing something reasonable the last time you said "there's a script, go use it" without further specification and that did not end constructively. I'd like to come to a consensus on this, either convincing me that it's a bad idea or convincing you it's not, but it's proving difficult to do that when you won't be more specific. |
It's much nicer to import from upstream this way, and compiles faster too. Everything in lib/ is unmodified 1.4.5. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12978
Motivation and Context
A few things.
Ryzen 5900X, make -j12 (6c VM):
Broadwell-DE, make -j12 (8c16t, native):
Raspberry Pi 4, make -j8 (4c, native): (I find this one confusing too - to the point that I reran it. I guess it's bottlenecking elsewhere? Maybe -j8 is too much.)
T5140, make -j100 (12c96t, native)
Description
As the README now says, I got a list of the files to include, moved them over, grabbed the #defines from the old zstd.c, pared off a few included files that obviously weren't relevant, and fixed compile errors.
(And reapplied a78f19d.)
Tada.
How Has This Been Tested?
Built on the above platforms, tested on Linux/amd64 and FBSD/amd64 to produce identical results on a test file as before the PR (enwik9, if curious to reproduce).
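One coarse way to reproduce the "identical results" check described here is to write the same test file to a zstd dataset under both builds and compare the physical space consumed; the pool and dataset names are placeholders.

    # Coarse check that compressor output hasn't changed between builds:
    # physical space used for the same input should match.
    zfs create -o compression=zstd testpool/check
    cp enwik9 /testpool/check/ && sync
    zfs get -p used,logicalused,compressratio testpool/check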
Otherwise, the CI can go to town.
Types of changes
Checklist:
Signed-off-by.