Introduce Risc-V Vector Intrinsic Support #642
Conversation
The Risc-V ISA has an optional "V" extension for vector support. This patch introduces vector-accelerated routines for the following methods:
+ local_lpc_compute_autocorrelation
+ local_lpc_compute_residual_from_qlp_coefficients

Building with Risc-V Vector support is disabled by default; it can be enabled using `--enable-riscv-vector-optimizations` in autotools or `-DRISCV_VECTOR=ON` using cmake. It is off by default because the patch has so far been tested only for correctness on QEMU.

Limitations:
+ Risc-V vector support is limited to very modern compilers (Clang 16 or later) for the time being
+ The width of each vector register on Risc-V is configurable by the silicon vendor. This patch assumes a reasonable width of at least 128 bits per vector register for now.

Future Work:
+ Only local_lpc_compute_residual_from_qlp_coefficients has been optimized for now, and the implementation was based heavily on the Intel AVX implementation. A more idiomatic Risc-V implementation is likely feasible; see the sketch below.
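For a sense of the shape of these routines, here is a minimal vector-length-agnostic RVV sketch of the dot-product pattern at the heart of one autocorrelation lag. It is illustrative only, not the patch's code: `dot_lag` is a made-up helper, the `__riscv_`-prefixed (v0.12-era) intrinsics are assumed, and a tuned version would keep a vector accumulator across iterations instead of reducing every strip.

```c
/* Vector-length-agnostic RVV dot product: the core pattern behind one
 * autocorrelation lag. Sketch only; per-strip reduction chosen for clarity. */
#include <riscv_vector.h>
#include <stddef.h>

static float dot_lag(const float *data, size_t len, size_t lag)
{
    float sum = 0.0f;
    size_t n = len - lag;
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m4(n - i);   /* strip-mining: hw picks vl */
        vfloat32m4_t a = __riscv_vle32_v_f32m4(data + i, vl);
        vfloat32m4_t b = __riscv_vle32_v_f32m4(data + i + lag, vl);
        vfloat32m4_t prod = __riscv_vfmul_vv_f32m4(a, b, vl);
        /* Reduce the vl active lanes into a scalar and accumulate. */
        vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
        vfloat32m1_t red  = __riscv_vfredusum_vs_f32m4_f32m1(prod, zero, vl);
        sum += __riscv_vfmv_f_s_f32m1_f32(red);
        i += vl;
    }
    return sum;
}
```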
Hi Flac maintainers, I hope I'm not overstepping my bounds with this PR. I've taken the liberty of porting a handful of functions into their vectorized equivalents in Risc-V. I've started with just the two methods to get some early feedback from you folks, to make sure that these are patches that you'd be interested in and that I'm on the right track. Any feedback and comments would be appreciated!
Hi, thanks for wanting to contribute to FLAC! I do have a few questions, though: …
The main problem I have with this PR is that I cannot test it. Also, it is not covered by CI nor by fuzzing. So, the performance improvement must be pretty good over autovectorized code for this to get merged. I would be much more inclined to merge something with only a little platform-specific code, like this: https://github.com/xiph/flac/blob/master/src/libFLAC/lpc_intrin_fma.c But that would only make sense with run-time detection. With that approach, most of the code is 'shared' and thus covered by CI and fuzzing.
Thanks for the prompt and thoughtful response! To answer your questions inline:
Yes, good point. I'll implement that and send you a follow up patch.
Unfortunately not directly. I don't have access to any hardware that supports the Risc-V Vector extension at the moment. The best I've been able to do so far is estimating performance based on instruction counts. I haven't had a chance to compare how well it does against Clang's auto-vectorization but I can certainly look into it.
So far I've been testing my changes using QEMU. The FLAC tests seem to pass with my changes. I also implemented a test harness that compares the results from my vectorized routines against the canonical C implementations. The results also seem to match up. I'm happy to share that test harness with you if that'd be helpful. Is emulation an acceptable test bench?
That sounds reasonable to me. Let me get back to you with run-time detection and some considerations for minimizing platform-specific code. Thanks again!
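For what it's worth, a differential harness of the kind described above can be quite small. A sketch under assumptions: the `autoc_fn` signature and the `scalar_c`/`vector_rvv` names are placeholders, not FLAC's actual symbols.

```c
/* Differential test sketch: feed identical random input to the portable C
 * routine and the vectorized one, then compare outputs. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*autoc_fn)(const float *data, unsigned len, unsigned lag,
                         double autoc[]);

static int autoc_matches(autoc_fn scalar_c, autoc_fn vector_rvv,
                         unsigned len, unsigned lag /* <= 32 */)
{
    double out_c[32], out_rvv[32];
    float *data = malloc(len * sizeof *data);
    for (unsigned i = 0; i < len; i++)
        data[i] = (float)rand() / (float)RAND_MAX - 0.5f;

    scalar_c(data, len, lag, out_c);
    vector_rvv(data, len, lag, out_rvv);

    int ok = 1;
    for (unsigned i = 0; i < lag; i++)
        if (fabs(out_c[i] - out_rvv[i]) > 1e-9 * fabs(out_c[i])) {
            fprintf(stderr, "lag %u: %g != %g\n", i, out_c[i], out_rvv[i]);
            ok = 0;
        }
    free(data);
    return ok;
}
```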
pro tip: it turns out that …
I have considered buying a MangoPi some time ago, just for fun. I read, however, that that CPU (Allwinner D1 C906) was designed prior to the vector ISA extension being finalized. Is there any similar cheap hardware available that is 'up-to-date'?
@enh-google Thanks for the tip!
Okay, then we will have a bit of a problem here, I guess. Recently, some ARM64 intrinsics were merged, and they were slower than plain C on the first try and only slightly faster on the second. Furthermore, that 75% speed increase is no longer applicable, as Clang 16 had some improvements specific to that function. I'm seriously considering benchmarking again and dropping some functions there. I consider code that is not an improvement a liability, really, especially if it is not covered by oss-fuzz.

In other words: using intrinsics does not automatically result in improvements. I really need some measurements before being able to merge this. I've dropped functions specific to ppc64 in the last release because, after spending a lot of time getting hold of someone with barely-working ppc64 hardware, it turned out these functions were actually no improvement at all. And for these measurements, someone needs hardware to measure...

edit: just to be clear, I feel no need to make these measurements myself, as long as they seem reasonably reliable. I've used gprof in the past with seemingly good results.
i don't think riscv64 auto-vectorization is anything like as well developed as arm64 atm, so it's quite possible that the optimal strategy will be to use intrinsics in the short term and delete them when the compilers[1] catch up.
This patch dynamically detects whether the RiscV vector unit is available and only enables the intrinsic routines if it is. Tested by launching QEMU with the Vector Extensions enabled and disabled and observed that the intrinsic routines were only patched in when vector was enabled.
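For reference, one common way to perform this detection on riscv64 Linux is to test the 'V' bit in the ELF auxiliary vector; newer kernels also expose a `riscv_hwprobe` syscall. A minimal sketch (the patch itself may probe differently):

```c
/* Runtime check for the RISC-V "V" extension on Linux: single-letter ISA
 * extensions are reported as bits of AT_HWCAP, indexed by letter. */
#include <sys/auxv.h>

static int cpu_has_rvv(void)
{
    return (getauxval(AT_HWCAP) >> ('V' - 'A')) & 1;
}
```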
That makes sense. I've converted this patch to a draft until I can obtain some measurements. I suspect the best I'll be able to do in the immediate term is a comparison of retired instruction counts using an emulator. I think that might be a strong proxy for performance gains, especially if the difference is significant -- however, I suppose that's a discussion we can have once I have some data :) In the meantime I've added support for dynamically detecting the presence of the vector unit on the CPU. Cheers!
The main problem fixed with clang 16 for arm64 (and probably x86 as well, but I haven't checked) was resolving dependencies and generating proper IR. AFAIK that is mostly architecture independent and the hardest part of autovectorization. Anyway, we'll see how this turns out.
(yeah, that lgtm.)
i'm not an expert (but can connect you with folks who are if you like), but aiui riscv64 autovectorization still has two competing proposals for reasons i never understood :-( actually, @appujee because i know autovectorization is one of his favorite subjects, and he might be interested to have a real-world example to test with (with at least the eventual goal of being able to delete the intrinsic implementation, even if that takes a couple more years and we need the intrinsics in the meantime!).
The llvm vectorizer is target independent (for the most part) so any improvements, like dependency analysis, translate to all architectures. However, RISCV vectorization is slightly different because the vectors are variable length (similar to SVE of AArch64). So questions like … remain unanswered. (Side note: I'm not sure if we have tested aarch64-sve for flac.) Even though RISCV autovectorization in clang has made decent progress, it is a WIP with some interesting optimizations still in review. Things also break in weird ways as the vectorizer evolves with contributions from multiple parties. Because of limited availability of hardware, it is difficult to make a case for autovectorization without measuring meaningful workloads. I agree with @enh-google that it is better to use intrinsics for now and move to autovectorization later when it makes sense.
(fwiw i did try that earlier this year [https://android-review.googlesource.com/q/topic:%22armv9%22] but all i got for my troubles was an lld crash. someone should try that again, though i suspect that the flac maintainer here is claiming that arm64 autovectorization works well for arm64 ASIMD, not arm64 SVE?)
I tried building this MR so I could collect some performance statistics. Unfortunately, the patch does not build with gcc-13.2.0: …
I believe the RISC-V intrinsics are still evolving and this may be a discrepancy between version 0.11 and 0.12. |
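Given that drift, a compile-time guard on the intrinsics spec version turns a page of baffling errors into one clear message. A hedged sketch, assuming the `__riscv_v_intrinsic` macro encodes the version as major * 1000000 + minor * 1000 (so v0.12 reads as 12000); one widely visible change across recent revisions was the addition of the `__riscv_` prefix to every intrinsic name.

```c
/* Guard against intrinsic-spec drift: fail early with a clear message.
 * Version encoding assumed: major*1000000 + minor*1000, so v0.12 == 12000. */
#include <riscv_vector.h>

#if !defined(__riscv_v_intrinsic) || __riscv_v_intrinsic < 12000
#error "This file needs rvv-intrinsic spec v0.12+ (__riscv_-prefixed names)"
#endif
```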
This was causing a build failure if riscv_vector was not available on the system.
Thanks @negge -- looks like I was calling …
i think Android will say the equivalent of … i wouldn't hard-code any vector size, but especially not 64, which i wouldn't expect to be common. (128 seems like the likely sweet spot for the foreseeable future, based on arm64 experience, so for qemu, where we did have to hard-code something, that's the vector length we've told qemu to assume for now.)
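Echoing that, RVV code never needs to hard-code the width at all; it can be queried at run time. A small sketch, assuming the prefixed intrinsics:

```c
/* Query the vector register width instead of hard-coding it.
 * vsetvlmax with SEW=8, LMUL=1 returns VLEN/8, i.e. VLENB. */
#include <riscv_vector.h>
#include <stddef.h>

static size_t rvv_vlen_bits(void)
{
    return __riscv_vsetvlmax_e8m1() * 8;
}
```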
When detecting "riscv_vector.h" using autotools and cmake, invoke the toolchain with -march=rv64gcv.
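For reference, the conftest program such a check compiles with `-march=rv64gcv` could be as small as the following; this is a guess at the shape of the probe, not the patch's actual test program:

```c
/* Configure-time probe: compiles only if <riscv_vector.h> and the
 * __riscv_-prefixed intrinsics are available under -march=rv64gcv. */
#include <riscv_vector.h>

int main(void)
{
    return (int)(__riscv_vsetvlmax_e32m1() == 0);
}
```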
@enh-google Thanks, that's helpful!
Now that there are two RVV 1.0 devboards available, this can probably be revisited (the CanMV K230 with the XuanTie C908, and the Banana Pi BPI-F3 with the SpacemiT X60; I've got both and can help benchmark). I've got some comments on the current implementation: …
A general note: I think instead of the prefix … benchmark …
As predicted, the vectorized autocorrelation doesn't perform well yet. Without it, the performance seems to match between RVV and scalar. The RVV implementation on the C908 isn't very powerful; I'll try to run it on the X60, which has twice the VLEN and execution-unit width, once I've got some more time. Do you have a standard benchmark suite? I'd like to look into some alternative RVV implementations, once I have the time.