Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEON: more fp16 using intrinsics supported by architecture v7 #1075

Closed
wants to merge 44 commits into from

Conversation

yyctw
Copy link
Contributor

@yyctw yyctw commented Oct 13, 2023

Hi all, this is Eric from Andes Technology Corporation. This PR includes

Add the types simde_float16x4x{3/4}_t and simde_float16x8x{3/4}_t
Add 351 initial implementations and corresponding test cases in 63 families which are listed below:

  • abal, abal_high, cale, calt, create, cvt, cvt_n, cvtn, dup_lane, ext
  • fma, fma_lane, fma_n, fms, fms_lane, fms_n, get_lane
  • ld1_dup, ld1_lane, ld1_x2, ld1_x3, ld1_x4, ld1q_x2, ld1q_x3, ld1q_x4
  • ld2, ld2_dup, ld2_lane, ld3, ld3_dup, ld3_lane, ld4, ld4_dup, ld4_lane
  • mla_lane, mlal_high_lane, mls_lane, mlsl_high_lane, mul_lane, neg
  • qdmlal, qdmlal_high, qdmlal_high_lane, qdmlal_high_n, qdmlal_lane, qdmlal_n
  • qdmlsl, qdmlsl_high, qdmlsl_high_lane, qdmlsl_high_n, qdmlsl_lane, qdmlsl_n
  • qdmull, qdmull_high, qdmull_high_lane, qdmull_high_n, qdmull_lane, qdmull_n
  • qdmulh, qdmulh_lane, qshl, reinterpret, sqrt

"macOS (version 14.2, macos-13)" was the only test that failed on my fork, and it occurred during the "Install Homebrew Dependencies" stage, but all the other CI tests passed smoothly.
Thanks for reading and any recommendations are welcome!

@mr-c
Copy link
Collaborator

mr-c commented Oct 13, 2023

Thank you @yyctw !

Please review https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

@mr-c
Copy link
Collaborator

mr-c commented Oct 13, 2023

Looks like the msvc build also has compliants: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

@mr-c
Copy link
Collaborator

mr-c commented Oct 13, 2023

Test errors on Fedora i386 (ignore the avx512 failures)
https://download.copr.fedorainfracloud.org/results/packit/simd-everywhere-simde-1075/fedora-rawhide-i386/06522193-simde/builder-live.log.gz (source)

CircleCI got the x86 32 bit build finished, but experienced test failures:
https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

@yyctw
Copy link
Contributor Author

yyctw commented Oct 13, 2023

Thank you @yyctw !

Please review https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

I attempted to build these two failing test cases using the "aarch64-linux-gnu-g++" toolchain with the same compile options that "circleci: i686-gcc11-O2" uses. However, I observed that these two test cases passed successfully on my x86 machine.

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

@mr-c mr-c changed the title NEON: Implements some intrinsics supported by architecture v7. NEON: more fp16 using intrinsics supported by architecture v7 Oct 13, 2023
@mr-c
Copy link
Collaborator

mr-c commented Oct 13, 2023

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

Copy link
Collaborator

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review of the 1st 50 files changed

simde/arm/neon/cvt.h Show resolved Hide resolved
simde/arm/neon/qdmull.h Show resolved Hide resolved
simde/arm/neon/qshl.h Show resolved Hide resolved
@@ -37,7 +37,7 @@ SIMDE_FUNCTION_ATTRIBUTES
simde_float16
simde_vsqrth_f16(simde_float16 a) {
#if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
return vsqrth_f16(a, b);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. This is worrisome, means we are missing this architecture from our testing matrix..

simde/arm/neon/create.h Outdated Show resolved Hide resolved
Comment on lines -1185 to -1186
// Round to Nearest with Ties to Away (a.k.a Rounding away from zero) rounding mode.
// For example, 23.2 gets rounded to 24, and −23.2 gets rounded to −24.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not true anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of the rounding mode is correct, but it only occurs when the number is at the midpoint. Otherwise, it rounds to the nearest integer. The example below is incorrect. For example, 23.2 gets rounded to 23, and 23.5 gets rounded to 24.

test/arm/neon/abal.c Show resolved Hide resolved
@@ -276,7 +276,7 @@ simde_vgetq_lane_f16(simde_float16x8_t v, const int lane)
simde_float16_t r;

#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
SIMDE_CONSTIFY_8_(vget_lane_f16, r, (HEDLEY_UNREACHABLE(), SIMDE_FLOAT16_VALUE(0.0)), lane, v);
SIMDE_CONSTIFY_8_(vgetq_lane_f16, r, (HEDLEY_UNREACHABLE(), SIMDE_FLOAT16_VALUE(0.0)), lane, v);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, are we missing a test case for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even though the HEDLEY_UNREACHABLE() function is encountered during the test case, it still passes successfully.

@yyctw
Copy link
Contributor Author

yyctw commented Oct 13, 2023

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

Sure, I will report it as soon as possible.

Looks like the msvc build also has compliants: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSITIFY and simde_mla_lane_*. I've manually expanded SIMDE_CONSITIFY and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

@mr-c
Copy link
Collaborator

mr-c commented Oct 13, 2023

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSITIFY and simde_mla_lane_*. I've manually expanded SIMDE_CONSITIFY and resolved the issue. Many other implementations, like {qq}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

Yeah, the SIMDE_CONSTIFY_ macros work in the headers, but for MSVC they cause problems in the tests.

@mr-c
Copy link
Collaborator

mr-c commented Oct 13, 2023

FYI, see simd-everywhere/implementation-status@916de72 for the status of f16 type NEON intrinsics prior to this PR

(I updated the script that generates the implementation status, as it was ignoring functions that use 16-bit floating point types)

@yyctw
Copy link
Contributor Author

yyctw commented Oct 17, 2023

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

Sure, I will report it as soon as possible.

I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.

Looks like the msvc build also has compliants: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSITIFY and simde_mla_lane_*. I've manually expanded SIMDE_CONSITIFY and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

Solved.

@yyctw yyctw requested a review from mr-c October 17, 2023 01:15
@yyctw yyctw closed this Oct 17, 2023
@yyctw yyctw reopened this Oct 17, 2023
Copy link
Collaborator

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.

This is a good workaround to document in the README for x86 (32-bit) users, but it is still a compiler bug if different -O optimizations levels produce different math. So we'll need to get a minimal reproducer and file a bug with GCC. Hopefully the failing tests cases will make developing a minimal reproducer easier. Let me know if you need help with that.

As for a workaround, perhaps one of the following applied only for the problematic GCC versions will help:
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-sseregparm-function-attribute_002c-x86
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 with one or more of no-mmx, no-fancy-math-387, fpmath=sse

If you want to get the bulk of this merged first, feel free to open a new PR that skips the functions that triggers the compiler bug. Then this PR can be rebased and kept until we implement a workaround.

@yyctw
Copy link
Contributor Author

yyctw commented Oct 17, 2023

I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.

This is a good workaround to document in the README for x86 (32-bit) users, but it is still a compiler bug if different -O optimizations levels produce different math. So we'll need to get a minimal reproducer and file a bug with GCC. Hopefully the failing tests cases will make developing a minimal reproducer easier. Let me know if you need help with that.

As for a workaround, perhaps one of the following applied only for the problematic GCC versions will help: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-sseregparm-function-attribute_002c-x86 https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 with one or more of no-mmx, no-fancy-math-387, fpmath=sse

If you want to get the bulk of this merged first, feel free to open a new PR that skips the functions that triggers the compiler bug. Then this PR can be rebased and kept until we implement a workaround.

Sure, I'll start by opening a new PR without the functions that trigger the compile errors. After that, I'll report this compilation bug to GCC and look for a workaround for SIMDe.

@mr-c
Copy link
Collaborator

mr-c commented Oct 18, 2023

@yyctw Now that #1081 is merged,: do you want to keep this PR to develop the workaround, or will you open a new one?

@yyctw
Copy link
Contributor Author

yyctw commented Oct 19, 2023

@yyctw Now that #1081 is merged,: do you want to keep this PR to develop the workaround, or will you open a new one?

I'll open the new one, this PR can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants