-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dmu_objset: replace dnode_hash
impl with cityhash4
#16483
Conversation
a74282e
to
45b276f
Compare
I have no particular objection, I'm just wondering if you have a test case where you saw this showing up a lot in your performance bottlenecks? |
@rincebrain I did not run a benchmark. But I can do a micro-benchmark if needed. And there is no obvious evidence (e.g. profiling result) indicating this hashing is a bottleneck either. I am doing this just because @amotin mentioned it, and it did make the code cleaner. As for performance, the replacement would eliminate all memory access to the CRC constant table, as well as emitting fewer instructions. So intuitively I think it would provide some (maybe slight) boost, if observable. |
This seems like a fascinating question to ask @allanjude with #16486 and #16487. (Mostly because I'm suspecting that with that work he's hitting workloads with enough scale to notice a difference.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks much cleaner for me, but I slightly worry about number of 64bit multiplications in cityhash4()
and their time on older CPUs. I wonder if we could create reduced versions of cityhash4()
for smaller number of arguments, since I see few other use cases not using all 4.
Thanks for your review. Do you mean performing some manual "specialization" on |
Yes.
Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad. |
Just have a quick (and rough) try on compiler explorer. On x86_64 with gcc 14, specializing So I think we indeed need a specialized |
I don't think the code size should be very important. But effect on stack size of the calling function could be verified. |
Per my understanding, inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined. Inlining only has (potential) negative effects on compilation speed, code size and i-cache. |
It may be true in some cases due to compiler's better ability to optimize things. But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls. |
Thanks for explaining, I understand what you mean now. It seems not that easy to diagnosis stack usage in compiling kernel modules, especially due to the inability to add
If so, I can just create and export specialized |
I would not object. And may be |
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callsites invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 Same tendency applies to the count of instructions. Therefore 1-arg version is defined as a macro to the 2-arg one. On all 64-bit arches, the differences are negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
45b276f
to
82abe67
Compare
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callsites invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 Same tendency applies to the count of instructions. Therefore 1-arg version is defined as a macro to the 2-arg one. On all 64-bit arches, the differences are negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
82abe67
to
c9072e2
Compare
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callsites invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 Same tendency applies to the count of instructions. Therefore 1-arg version is defined as a macro to the 2-arg one. On all 64-bit arches, the differences are negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
c9072e2
to
e0f7ed0
Compare
@amotin I modified the PR to include specialization versions of |
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
e0f7ed0
to
2f96aa2
Compare
@amotin After re-evaluating, I decided to add all 1/2/3-arg versions of Also @behlendorf would you please review and consider merging? |
module/zfs/dmu_objset.c
Outdated
crc ^= (osv>>14) ^ (obj>>24); | ||
|
||
return (crc); | ||
return (cityhash2((uint64_t)os, obj)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (cityhash2((uint64_t)os, obj)); | |
return (cityhash2((uintptr_t)os, obj)); |
This is what done at dbuf_hash()
. I worry compiler may not like conversion from 32bit pointer to 64bit integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we keep it uintptr_t
, it would be cast to uint64_t
when passing to cityhash2
. And compilers will not issue any warning on this cast even with -Wall -Wextra -pedantic
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are architectures with 128bit pointers. And I wonder if this code may produce there "cast from pointer to integer of different size". Though I won't argue more about this, not my strongest area. Leave to somebody else. Just think coherency with dbuf_hash()
would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out the 128b pointer case. I have changed it back and left two explicit casts in the code.
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
2f96aa2
to
af3228c
Compare
I like the idea, but without benchmarks I would not say yes here. I will test compile this then on PPC64 and AARCH64 also and publish the results. |
@mcmilk I wrote a really small & simple micro-benchmark here: https://gist.github.com/Harry-Chen/0d1eee1a4d8a7685b50c31316ca75671. And here's some rough test results (order: crc64, cityhash2, cityhash4, unit: us) on some of my devices:
To me it is safe to conclude cityhash does make hashing faster (if observable). Unfortunately I do not have more platforms (e.g. armhf, mipsel, riscv64) to test, on which specializing cityhash with <4 arguments might help further. |
Just for fun ran it on
|
I wonder why is the difference between city2 and city4 is so insignificant. It would be good to verify that inlining optimization does what is expected, now that all the functions are in the same file and compiler has ability to inline more. May be some |
Yea. At least building with Clang on FreeBSD
The difference may be not huge, but reproducible. And on 32bits is predictably even bigger:
|
Thank you for the benchmark 👍 On my local Ryzen 7 PRO 5850U, fixed to 1,4GHz it looks like this:
On ARM Neoverse-N1 it looks like this:
On PPC64LE (POWER10, pvr 0080 0200) - these is no big difference:
So yes, it looks okay. |
Just for fun:
|
Thanks! I did not read the emitted assembly of different tests very carefully. So yes, specializing cityhash does some good stuff. |
Good stuff, thanks for providing the micro benchmark for this. If you can rebase this on master to resolve the conflict it should be good to go. |
As mentioned in PR openzfs#16131, replacing CRC-based hash with cityhash4 could slightly improve the performance by eliminating memory access. Replacing algorightm is safe since the hash result is not persisted. See: openzfs#16131 Signed-off-by: Shengqi Chen <[email protected]>
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>
af3228c
to
e958069
Compare
I've rebased. Seems now CI takes much longer time... |
Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #16131 Closes #16483
So that we can get actual benefit from last commit. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #16131 Closes #16483
Motivation and Context
As mentioned in PR #16131, replacing CRC-based hash with cityhash4 could slightly improve the performance by eliminating memory access. Replacing the algorithm is safe since the hash result is not persisted.
Moreover,
cityhash4
yields more instructions and larger stack frame on 32-bit arches. Replacing it with specialized versions (i.e. setting some arguments to 0) can also provide boost.See: #16131
Description
cityhash4
.cityhash2
andcityhash3
.cityhash4
with specialized versions whenever possible.Please see commit message for more details.
How Has This Been Tested?
Compiled on my machine. Let's see the ZTS results on CI.
Types of changes
Checklist:
Signed-off-by
.