Explode zstd into separate upstream files #12978
Conversation
@rincebrain: thank you. Is there a test in ZFS to verify that compressed data matches the sums of known compressed data with the prior implementation? My understanding is that this is the existing zstd code just moved to another location, so it should produce the same buffers... but that's been coming up a bunch lately with compressor-related issues, so I figured I'd ask. |
Not aware of one, no, and it'd be pretty limited if one added one. (One also couldn't add one right now for, say, gzip, given the QAT and Linux/FBSD differences, and it'd fail if you didn't include differing versions for different endiannesses...) |
Yike. So basically deduplicated pools are not portable in OpenZFS. |
I believe, if memory serves, that was true before OpenZFS, even - BE and LE hosts would have resulted in different blocks, nevermind which compressor you picked.
I am familiar, yes. My point remains that it'd be a combinatoric nightmare to test them all on any one platform, and then we'd be only testing subsets on any given run... to say nothing of opaque implementations we can't test against, like QAT's gzip, or the accelerator cards someone mentioned in a recent OZFS Leadership Meeting.
Not to say that we shouldn't still test things, just that I'm not sure where the right balance of cost/benefit would be.
…On Sun, Jan 16, 2022 at 1:40 PM RageLtMan ***@***.***> wrote:
Yike. So basically deduplicated pools are not portable in OpenZFS. Crypto implementations sometimes use well-known vectors against which they test their computational product, so I figure that having a well-known buffer of bytes that compresses to X, Y, and Z on lz4, gzip, and zstd for comparison might be handy... point taken though, it's a bit of a free-for-all right now.
|
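As a rough illustration of the known-vector idea discussed above, one could pin a zstd version and level and compare the compressed output against a recorded digest. This sketch uses the userland zstd CLI rather than ZFS's in-kernel path; the input file name and the recorded digest are placeholders, not values from this PR.

    # Known-vector check against the userland zstd CLI (not the ZFS code path).
    # "testvector.bin" and EXPECTED are placeholders for a pinned input/digest.
    EXPECTED="<digest recorded from a known-good build>"
    ACTUAL="$(zstd -3 -c testvector.bin | sha256sum | awk '{print $1}')"
    [ "$ACTUAL" = "$EXPECTED" ] || { echo "zstd-3 compressed output changed" >&2; exit 1; }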
d5675be to b74ddda (Compare)
@BrainSlayer you might want to go over this |
Compiling zstd as a monolithic file does take longer, but it also has a great benefit: it allows the compiler to optimize the code much better for performance. Such monolithic compiling approaches are also called "the poor man's LTO". This should be considered. Splitting the zstd code will decrease performance significantly; the monolithic zstd code was introduced exactly for that performance reason. Future spoiler: you will find a 1.5.1-based ZFS variant on my GitHub account already. |
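For readers unfamiliar with the term, "poor man's LTO" here means compiling the whole library as one translation unit so the compiler can see, and inline, across what would otherwise be file boundaries. A rough sketch of the two build styles follows; the directory and file names are placeholders, not the actual zstd layout.

    # Monolithic build: one big translation unit, so cross-file inlining is possible.
    cat common/*.c compress/*.c decompress/*.c > zstd_all.c
    cc -O2 -c zstd_all.c -o zstd_all.o

    # Split build: each file is optimized in isolation; inlining across files
    # only happens later if the toolchain's LTO is usable (see the sketch further down).
    for f in common/*.c compress/*.c decompress/*.c; do
        cc -O2 -c "$f" -o "${f%.c}.o"
    done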
lib/libzstd/Makefile.am (Outdated)
# SSE register return with SSE disabled if -march=znverX is passed
AM_CFLAGS += -U__BMI__
Removing this will lead to compile errors on some x86-64 targets, depending on your -march option. I added this workaround since I was running into that issue. I know that the origin of the problem is an upstream toolchain bug which might be resolved with newer compiler versions, but we should consider all supported toolchains.
Sure, that being in this rev was just an oversight, I had removed it at one point and didn't revert it before pushing.
If only compilers had some form of LTO flags to get LTO optimizations without concatenating every source file. Do you have benchmarks showing that the performance is markedly worse? |
The compiler has LTO optimization flags, but they do not work for kernel builds. |
I have no benchmark at hand yet; consider that I finished the work on the zstd patch a long time ago. But @Ornias1993 has a good working benchmark script; you can use it to run performance tests with and without your patch. I can only run tests on my Ryzen 16-core machine using GCC 11. |
Great, so LTO flags won't work in the kernel. I come back to "do you have benchmarks showing this is markedly worse", because I don't think "let's put everything in one file for performance" is the balance that the entire rest of the source tree takes. e: Ah, good, replies criss-crossing. I have a modified version of that script with some functionality I missed, like setting different recordsizes, that I can indeed use to benchmark this. But if your argument is both "this is slower" and "using separate source files is bad", that only covers one of those objections, and I don't particularly see me convincing you that using multiple source files is a reasonable choice if that's your starting position. |
I don't have an issue with splitting the source files back into their original form as such, but zstd - compression in general - is performance critical. The only argument I have is purely about performance. I think you understand how a compiler works, and that compiling everything as a single source allows the compiler to do more aggressive optimization. The best example is inlining strategy: the compiler cannot auto-inline code that is placed in different object files; that is only possible if the compiler sees everything in one source. Combining the object files into the resulting binary is a later job for the linker, and the linker cannot inline the affected code anymore. |
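For context on the inlining point: in a plain split build the compiler indeed cannot inline across object files, but with -flto the objects carry the compiler's intermediate representation and inlining is deferred to link time. A sketch with placeholder file names, keeping in mind the kernel-module caveat raised above:

    # Plain split build: a.o and b.o are final machine code, so a call from
    # a.c into a function defined in b.c cannot be inlined.
    cc -O2 -c a.c b.c
    cc -O2 a.o b.o -o prog

    # With -flto, inlining across the two files can happen at link time
    # (userland builds; kernel modules are the caveat discussed above).
    cc -O2 -flto -c a.c b.c
    cc -O2 -flto a.o b.o -o prog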
I'm familiar with the fact that you can't do certain optimizations with the loss of information from compiling the object files unless you embed additional information to reconstruct that state at a later stage, yes. I remain unconvinced it makes a notable difference in this case. It's funny you say that...I did do that experiment, actually, on one test platform (Debian bullseye x86_64).
So about 60 KiB. Not nothing, but probably not the difference between fitting in a cache line and not, for example. |
The compiler does know the cache line size, depending on which -march is set, so it optimizes the code in a way that runs best for the target cache lines. If you hack the kernel to forcefully enable the OPTIMIZE_INLINING option (it cannot be enabled on most architectures), you get even better results (in the kernel source, init/Kconfig, remove the dependency for the OPTIMIZE_INLINING option). |
And consider that the default kernel option for Debian is "optimize for size", if I remember correctly. I personally optimize for speed in my local environment. You can also override the default options in the CFLAGS in the zstd directory; I believe I did this in my original patch, it might still be in there. |
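One way to check which of these options a given distribution kernel was actually built with is to grep the installed config; the Debian-style config path below is an assumption and varies by distro.

    # Inspect the running kernel's build config for the size-vs-speed and
    # inlining options mentioned above.
    grep -E 'CONFIG_CC_OPTIMIZE_FOR_(SIZE|PERFORMANCE)|CONFIG_OPTIMIZE_INLINING' \
        /boot/config-"$(uname -r)"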
@Ornias1993 I did run your bench script, but it's not happy with the current fio version. |
No idea, not working on it or ZFS anymore. |
One is from this PR, one is from stock git master, both compiled without --enable-debug on Debian buster, run on my Ryzen 5900X, which was idling except for this. There are differences (which may be runtime variation, I didn't run this more than once and average or anything, at the moment), but neither of them seems to be a strict win over the other? e: These are all from the compression-bench.sh script mentioned before, I just modified it to take a recordsize option to configure before running its dd over, in this case, the mediawiki-based test file. edit 2: Also, Debian compiles its kernels with CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y, and on buster also uses CONFIG_OPTIMIZE_INLINING=y. |
Which chart is which? I mean, your benchmark shows that your PR is overall much faster; this doesn't look sane, but it could also be caused by the BMI flag change. Since the benchmark script doesn't work on my openSUSE system and just spits errors, I cannot do a counter-benchmark right now. |
The first chart is the PR. Where are you looking? The only case I see being noticeably different like that is zstd-fast-50 decompression. Also, the flag difference you mention is only in the userland bit - if you look in module/zstd/Makefile.in, it does include -U__BMI__, and that should be the one we care about here. |
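A quick way to see which build files carry the workaround, using the paths referenced in this thread:

    # Confirm where the BMI workaround is (still) defined in the tree.
    grep -n -- '-U__BMI__' lib/libzstd/Makefile.am module/zstd/Makefile.in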
Basically every single value in your first chart is faster than in the second chart. |
And basically every single value in your chart is 10 times slower than on my Ryzen 5950X, just as a side note. Can you provide me the script you're using, for verification? |
Well, my 5900X's VM could be 10 times slower than your 5950X on bare metal, but that seems at least minorly improbable. As an experiment, I quickly tried a couple of data points on my 8700K bare metal, which is running stock 2.0.6, and did not get a factor-of-10 speedup.
The speeds don't seem particularly surprising to me, zstd in ZFS has never been fast, just faster than gzip. https://gist.github.com/rincebrain/2c39e981104a4f50d6d8364e1c17b6a7 is what I was using, I gave it a larger mediawiki dump file to copy around as I was concerned 1 GB wasn't running long enough (and increased the ramdisk used appropriately), you can find that file compressed here or just take the first few GB out of a mediawiki dump. |
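For reference, the single-threaded measurement being described boils down to something like the following; the pool, dataset, file names, and recordsize are placeholders, and the actual script linked above does more setup and teardown than this.

    # Single-threaded write/read pass over one test file, roughly what the
    # modified compression-bench.sh does. Names, sizes, and the recordsize
    # value are placeholders.
    zfs create -o compression=zstd-3 -o recordsize=128k testpool/bench
    time dd if=/var/tmp/mediawiki-dump.xml of=/testpool/bench/testfile bs=1M conv=fsync
    time dd if=/testpool/bench/testfile of=/dev/null bs=1M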
I think your recordsize math or your modified benchmark is broken. It looks like your benchmark is not running threaded but is just doing single writes, so yours is not using real-world writes. [Charts: write, read, and readwrite (de)compression results for zstd-3.] zstd is lightning fast (with the default zstd-3 level); it normally outperforms standard I/O speed, so it writes in real time. |
So your claim is that doing single-threaded benchmarks is unrealistic because everybody uses multiple threads for their IO in the "real world"? How, precisely, would you like this tested? You appear to be unhappy with anything I do, so why don't you just specify what to do, and I can go do it. Seems more efficient for everyone. |
Sorry to interrupt, but this number looks like a cached read from ARC, @BrainSlayer. According to https://github.com/facebook/zstd , zstd-3 decompression, for example:
+1 for desired method of benchmark, would be interesting to see it too. |
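On the cached-read point: a simple way to make sure a read benchmark is not just measuring the ARC is to export and re-import the pool between the write and the read; pool, dataset, and file names below are placeholders.

    # Drop ARC contents for the test pool before measuring reads, so the
    # decompression path is actually exercised.
    zpool export testpool && zpool import testpool
    time dd if=/testpool/bench/testfile of=/dev/null bs=1M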
The test is running with 16 threads, if that makes more sense to you; 2250 MB/s is just single-threaded. |
Not everybody, but ZFS is designed to run in threaded workers, and a sequential write with dd will never be able to test ZFS performance, since you need multiple file operations for that (that is how the real world works by default). In addition, I already told you where you can find the benchmark script, and you said you are already using it, just with some modifications for the record size. In fact, I see now that you also replaced the whole benchmarking process. ZFS has defined performance test scripts in the tests folder; those are used for the benchmark. |
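For comparison, a multi-threaded run of the kind being asked for could be expressed with fio directly; the job parameters below are illustrative, not the settings of the referenced benchmark script.

    # Illustrative 16-thread sequential write job against a zstd-compressed
    # dataset; directory and sizes are examples.
    fio --name=zstd-write --directory=/testpool/bench \
        --rw=write --bs=1M --size=1G --numjobs=16 --group_reporting \
        --ioengine=psync --end_fsync=1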
I'm aware that ZFS runs with multiple workers handling multiple reads/writes. Every instance I've previously seen of benchmarking compression performance has provided single-threaded benchmark numbers as the baseline, so that's what I provided here. Can you please provide more concrete details for what, specifically, you want? I tried doing something reasonable the last time you said "there's a script, go use it" without further specification and that did not end constructively. I'd like to come to a consensus on this, either convincing me that it's a bad idea or convincing you it's not, but it's proving difficult to do that when you won't be more specific. |
It's much nicer to import from upstream this way, and compiles faster too. Everything in lib/ is unmodified 1.4.5. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12978
Motivation and Context
A few things.
Ryzen 5900X, make -j12 (6c VM):
Broadwell-DE, make -j12 (8c16t, native):
Raspberry Pi 4, make -j8 (4c, native): (I find this one confusing too - to the point that I reran it. I guess it's bottlenecking elsewhere? Maybe -j8 is too much.)
T5140, make -j100 (12c96t, native)
Description
As the README now says, I got a list of the files to include, moved them over, grabbed the #defines from the old zstd.c, pared off a few included files that obviously weren't relevant, and fixed compile errors.
(And reapplied a78f19d.)
Tada.
How Has This Been Tested?
Built on the above platforms, tested on Linux/amd64 and FBSD/amd64 to produce identical results on a test file as before the PR (enwik9, if curious to reproduce).
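One coarse way to reproduce the "identical results" check described here is to write the same test file to a zstd dataset under both builds and compare the physical space consumed; the pool and dataset names are placeholders.

    # Coarse check that compressor output hasn't changed between builds:
    # physical space used for the same input should match.
    zfs create -o compression=zstd testpool/check
    cp enwik9 /testpool/check/ && sync
    zfs get -p used,logicalused,compressratio testpool/check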
Otherwise, the CI can go to town.
Types of changes
Checklist:
Signed-off-by.