Preliminary PR to merge ARM SVE branch #315

Open
wants to merge 13 commits into
base: master

rdolbeau
Contributor

@rdolbeau rdolbeau commented Mar 2, 2023

This is primarily for discussion and probably needs some cleaning up.

The current scheme using SVE (and RISC-V V) doesn't leverage the scalability; instead, it produces sets of codelets for all possible power-of-2 sizes, using masking. Only sets of a size less than or equal to the hardware implementation width are enabled. See here.

It also auto-generates the simd-support/vtw.h file that defines the various VTW macros. This is useful because, while SVE is limited to 2048-bit-wide vectors, RISC-V V is more or less unbounded and there's at least one 16384-bit-wide implementation being developed.

So using SVE (or V) creates a fairly large, though versatile, library.
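
To illustrate the selection rule described above, here is a minimal sketch in C. It is not code from this PR: the names `sve_set_bits` and `enabled_sets` are hypothetical, and the list of widths simply assumes one codelet set per power of two from 128 to 2048 bits.

```c
/* Hypothetical list of the fixed widths (in bits) for which masked codelet
 * sets are generated: every power of two from 128 to 2048. */
enum { N_SVE_SETS = 5 };
static const unsigned sve_set_bits[N_SVE_SETS] = {128, 256, 512, 1024, 2048};

/* Fill out[] with the set widths usable on hardware that is hw_bits wide
 * and return how many there are. A set is enabled when its width is less
 * than or equal to the hardware width. */
static unsigned enabled_sets(unsigned hw_bits, unsigned out[N_SVE_SETS])
{
    unsigned n = 0;
    for (unsigned i = 0; i < N_SVE_SETS; i++) {
        if (sve_set_bits[i] <= hw_bits)
            out[n++] = sve_set_bits[i];
    }
    return n;
}
```

With `hw_bits` set to 512 (the A64FX case), this sketch would enable the 128-, 256- and 512-bit sets and leave the 1024- and 2048-bit sets disabled.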

It has been tested in QEMU, with the Arm Instruction Emulator, and on real hardware (Fujitsu A64FX, AWS Graviton 3).

@rdolbeau
Contributor Author

rdolbeau commented Mar 3, 2023

I've created a release package for testing this without the 'maintainer mode' requirements. See there: https://github.com/rdolbeau/fftw3/releases/tag/sve-test-release-001

@rdolbeau
Contributor Author

@stevengj @matteo-frigo Can you comment on this? In particular on the "use fixed-width implementations and enable all those equal to or narrower than the hardware width" approach. It's a bit hackish but it's the best we have at this time. RISC-V is using the same principle for now.

@ggouaillardet
Contributor

I would like to encourage the FFTW folks to consider this PR so it can be included (after some cleanup) into the next FFTW release.

At this stage, my main interest is only about ARM SVE.
SVE-enabled processors are currently offered by several vendors:

  • Fujitsu A64fx (512 bits)
  • Amazon Graviton 3 (256 bits)
  • NVIDIA Grace (128 bits)

so there is naturally a growing interest in upstream SVE support in FFTW.

Though SVE vector length can scale from 128 to 2048 bits (via 128 bits increments), only 3 sizes are available today (128, 256 and 512. Fujitsu A64fx does not support 384 bits). So even if a vector-length-agnostic (VLA) FFTW might be achievable, I do not see it as mandatory, given the currently limited set of available vector lengths and the fact that VLA code is (generally) slightly slower than vector-length-specific code. If needed, I can provide an option to set a maximum supported vector length in order to reduce the size of the FFTW library.

Bottom line, I approve the approach of the PR and hope it can make its way upstream.

@rdolbeau
Contributor Author

Though SVE vector length can scale from 128 to 2048 bits (via 128 bits increments),
only 3 sizes are available today (128, 256 and 512. Fujitsu A64fx does not support 384 bits)

Also, in its current incarnation SVE no longer allows non-power-of-2 multiples of 128 bits; only power-of-2 multiples of 128 bits are allowed. This is visible for instance in DDI0487J for the register ZCR_EL1:
"The Non-streaming SVE vector length can be any power of two from 128 bits to 2048 bits inclusive."

So this implementation should cover all cases, and I agree with you that 1024 and 2048 bits are currently only for emulators like QEMU and could (should?) be optional. 512 and 256 could also be optional for site-specific deployment.

@rdolbeau
Contributor Author

rdolbeau commented Mar 4, 2024

Did some minor clean-ups and minor improvements, and rebased onto current HEAD.

@ggouaillardet if you have the time to confirm this is still OK for you

@rdolbeau rdolbeau mentioned this pull request Mar 4, 2024
@rdolbeau
Contributor Author

rdolbeau commented Mar 4, 2024

I've created a new package with pre-generated files, so maintainer mode is not required to test this on Arm+SVE, see https://github.com/rdolbeau/fftw3/releases/tag/sve-test-release-002

@ggouaillardet
Contributor

Thanks @rdolbeau and my apologies for the late reply.

I noted configure --enable-sve works but make fails out of the box with Arm compilers, so I added a test in order to abort at configure instead of failing at make. Feel free to cherry-pick ggouaillardet/fftw3@c32583c

@rdolbeau
Contributor Author

I noted configure --enable-sve works but make fails out of the box with Arm compilers

That is probably the case with all compilers that do not enable SVE by default (all of them, probably!). Currently configure doesn't enable it explicitly, so it has to be done in {C/CXX/F}FLAGS.

so I added a test in order to abort at configure instead of failing at make. Feel free to cherry-pick ggouaillardet/fftw3@c32583c

Good idea. I'll try that and add it to the PR ASAP.
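
As a rough illustration of what such a sample SVE program could look like (this is a sketch, not the test from the commit referenced above), the C fragment below relies on the ACLE macro `__ARM_FEATURE_SVE` and the `svcntb()` intrinsic from `arm_sve.h`. The function name `sve_vector_bits` is hypothetical. A configure-time check that must abort would presumably drop the `#else` branch, so that compilation itself fails when SVE has not been enabled in the compiler flags.

```c
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif

/* Return the SVE vector length in bits, or 0 when the compiler was not
 * told to target SVE (for instance, no SVE-enabling option in CFLAGS). */
static uint64_t sve_vector_bits(void)
{
#ifdef __ARM_FEATURE_SVE
    return svcntb() * 8;   /* svcntb(): number of bytes in one SVE vector */
#else
    return 0;              /* SVE not enabled at compile time */
#endif
}
```

On a build without SVE enabled this returns 0, which matches the situation described above where configure passes but the SVE code cannot be compiled.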

@juntangc

juntangc commented May 3, 2024

Though SVE vector length can scale from 128 to 2048 bits (via 128 bits increments),
only 3 sizes are available today (128, 256 and 512. Fujitsu A64fx does not support 384 bits)

Also, in its current incarnation SVE no longer allows non-power-of-2 multiples of 128 bits; only power-of-2 multiples of 128 bits are allowed. This is visible for instance in DDI0487J for the register ZCR_EL1: "The Non-streaming SVE vector length can be any power of two from 128 bits to 2048 bits inclusive."

So this implementation should cover all cases, and I agree with you that 1024 and 2048 bits are currently only for emulators like QEmu and could (should?) be optional. 512 and 256 could also be optional for site-specific deployment.

I can see performance benefits from extending to 256 bits and above. Do you happen to have performance data for the SVE version compared to the NEON SIMD version?

@ggouaillardet
Contributor

Here are some performance numbers I collected a while ago for fftw_plan_dft_1d(signal_length, original_signal, fft_applied_signal, FFTW_FORWARD, FFTW_ESTIMATE).
They show the performance improvement between NEON and SVE on the A64fx processor (512-bit SVE vectors).

[image: fftw]

Feel free to suggest other/better benchmarks if appropriate.

@antoine-morvan

antoine-morvan commented May 7, 2024

Hello,

I ran a few experiments using the included benchmark tests/bench, built with double support. My commands bound the execution to the 4th core on each of the machines (an arbitrary choice):

numactl -C 3 ./tests/bench -v2 -r 20 -s $PB_DEF

where PB_DEF is one of the benchmark settings ocf2048, icf32x32, ocf128x128x128, icf256x256x256.

[image: benchmark charts]

Baseline stands for the current release 3.3.10 built with --enable-neon, and the SVE gain is this PR built with --enable-neon --enable-sve (on Neoverse N1 only --enable-neon was used). The charts are comparing the "mflops" value output by the benchmark utility.
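
For readers unfamiliar with the "mflops" figure, the sketch below shows how such a value is conventionally derived for a complex transform, assuming the usual benchFFT convention of 5 N log2(N) floating-point operations divided by the time in microseconds. The helper names `ilog2` and `complex_mflops` are hypothetical, and the integer logarithm only handles power-of-two sizes such as the ones used in these runs.

```c
/* Integer log2 of n, exact when n is a power of two (n >= 1). */
static unsigned ilog2(unsigned long n)
{
    unsigned r = 0;
    while (n > 1) {
        n >>= 1;
        r++;
    }
    return r;
}

/* Conventional "mflops" for a complex transform of n_total points that
 * takes t_us microseconds: 5 * N * log2(N) / t. For a multi-dimensional
 * problem, n_total is the product of the dimensions. */
static double complex_mflops(unsigned long n_total, double t_us)
{
    return 5.0 * (double)n_total * (double)ilog2(n_total) / t_us;
}
```

Because the operation count is a fixed convention rather than a measured number of instructions, the value is mainly useful for comparing run times of the same problem across builds, which is how it is used here.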

We can observe a significant gain on all SVE machines (A64FX, Neoverse V1 and V2), and no degradation on the NEON-only machine (Neoverse N1).

Best regards.

@jlinford

These results are very exciting. FFTW performance is extremely important to users of NVIDIA Grace (Neoverse V2), which is now the dominant architecture on the Green500. A great many people would be glad to see this commit in the next FFTW release.


@keeranroth keeranroth left a comment

Just a couple of nits. Feel free to address them or not. The structure looks good overall, and thanks for the benchmarking results as well. They speak for themselves.

@@ -0,0 +1,84 @@
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

You gate the include later in this file, so remove this

Comment on lines +29 to +30
unsigned int size = rp2(osize);
if (osize != size)

Not really important, but checking for a power of two can be done with:

if (!(osize & (osize - 1)))

just a nice tidbit from Hacker's Delight
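
A minimal, self-contained version of that check might look like the sketch below. The function name `is_power_of_two` is hypothetical. Note that the bare expression is also true for zero, so the sketch adds an explicit zero test. Assuming `rp2` rounds up to a power of two, the condition `osize != size` in the reviewed code corresponds to the negation of this check.

```c
#include <stdbool.h>

/* True when x has exactly one bit set. Clearing the lowest set bit with
 * x & (x - 1) leaves zero only for powers of two; zero is rejected
 * explicitly because the expression would otherwise accept it. */
static bool is_power_of_two(unsigned int x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```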

This is not to be included directly (accessed through, e.g. simd-maskedsve1024.h), so should it be named something else, e.g. simd-maskedsve.h.template?

rdolbeau and others added 13 commits July 3, 2024 07:07
and add simd-support/{generate_vtw.sh,vtw.h} into the dist tarball
ignore all files automatically generated for SVE support
ADD/SUB/MUL are three-address in SVE, but the masked form is only
two-address. And there's a lot of reuse in FFTW3 (and complex
arithmetic). But ACLE/SVE (i.e. intrinsics) don't have the non-masked
form :-(
So use inline ASM to force the non-masked version to be used.
Masked-out lanes should be mostly zero, and are never stored anyway, so
computing on them should be fine.

This one will be reversed if it's not a performance win.
…mpiler/hardware dependent, and more tests are needed before settling on some defaults.
When configure'd with --enable-sve, try to build a sample SVE program
and abort on failure; otherwise configure succeeds but make will fail.
@rdolbeau
Copy link
Contributor Author

rdolbeau commented Jul 3, 2024

I've created a new package with pre-generated files, so maintainer mode is not required to test this on Arm+SVE, see https://github.com/rdolbeau/fftw3/releases/tag/sve-test-release-003

@rdolbeau
Copy link
Contributor Author

rdolbeau commented Jul 3, 2024

@stevengj @matteo-frigo any comment on this? Is this OK to think about merging ?


6 participants