-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support 64-bit limbs on no-asm platforms #109
base: master
Are you sure you want to change the base?
Conversation
For all platforms that don't support native 64x64->128 multiply (does MIPS or Riscv5 for example?) is the GCC/Clang fallback constant-time or does it use ternary or comparisons for carries? |
The 64-bit versions of MIPS and RISC-V support a native 64x64->128 multiply (either as a single instruction or as a pair of instructions). I can find only one 64-bit target supported by LLVM that doesn't have a native implementation (sparc64), and this uses a call to the __multi3 library routine. I guess it depends on where this library is taken from, but at least the LLVM (compiler-rt) default implementation of it doesn't appear to use conditional code paths: If this is a concern, we could make use of 64-bit limbs conditional on the architecture, with s390x (and possibly others where it makes sense) opting in? |
Ping? Any update on this? |
I know this PR is over a year old now, but I recently was porting Chia to a VisionFive2 riscv64 SBC (StarFive JH7110 with RISC-V quad-core CPU with 2 MB L2 cache and a monitor core, supporting RV64GC ISA, working up to 1.5 GHz ) and I decided to try this patch to see if there were any speed improvements. Unfortunately, our wrapper library and test code crashes and dumps core when using this patch. (https://github.com/Chia-Network/bls-signatures) |
I was able to reproduce this on s390x. The problem is this
which was added by commit 7cc9267 after I originally created and tested this PR. This assumes that
so If I simply remove the I think removing this part of the assumption (or even the whole assumption) should be fine on other platforms too, as this function will likely be inlined into its two call sites where |
Took me some time to get back to this. but YES, applying your suggestion and all tests pass in the VisionFive2 SBC and it is considerably faster at roughly 2.5 times faster for signing and verification - pretty much everything is 2.5 times faster I would LOVE for this to get merged in - until we get someone to write some actual riscv assembler this is likely as good as it can get. |
Currently, platforms without assembler support always use 32-bit limbs, but the Rust bindings always assume 64-bit limbs. This breaks on big-endian platforms like our IBM Z (s390x). This patch enables 64-bit limbs on 64-bit platforms even if there is no hand-written assembler, by using a 128-bit integer type in the C implementation (this is an extension that is widely supported on 64-bit platforms with GCC or LLVM). Note that this means that the argument "n" to quot_rem_n is no longer guaranteed to always be a multiple of 2, so the corresponding assertion needs to be removed as well. This fixes the broken Rust bindings on IBM Z, and also improves performance by a factor or 3 or more, because compiler-generated code handling __int128 already uses the 64x64->128 multiply instruction our ISA provides. To improve performance of compiler-generated code a bit more, this also switches to the -O3 optimization level, which helps with unrolling of the Montgomery multiply core loop.
0333c0e
to
c2407b5
Compare
Thanks for verifying! I've now updated this to fix the incorrect assertion. |
Any chance this can get in with the next major release? |
Currently, platforms without assembler support always use 32-bit limbs,
but the Rust bindings always assume 64-bit limbs. This breaks on
big-endian platforms like our IBM Z (s390x).
This patch enables 64-bit limbs on 64-bit platforms even if there is
no hand-written assembler, by using a 128-bit integer type in the
C implementation (this is an extension that is widely supported on
64-bit platforms with GCC or LLVM).
This fixes the broken Rust bindings on IBM Z, and also improves
performance by a factor or 3 or more, because compiler-generated
code handling __int128 already uses the 64x64->128 multiply
instruction our ISA provides.
To improve performance of compiler-generated code a bit more, this
also switches to the -O3 optimization level, which helps with
unrolling of the Montgomery multiply core loop.