NEON: more fp16 using intrinsics supported by architecture v7 #1075

yyctw · 2023-10-13T00:51:44Z

Hi all, this is Eric from Andes Technology Corporation. This PR:

- Adds the types simde_float16x4x{3/4}_t and simde_float16x8x{3/4}_t
- Adds 351 initial implementations and corresponding test cases in 63 families, listed below:

abal, abal_high, cale, calt, create, cvt, cvt_n, cvtn, dup_lane, ext
fma, fma_lane, fma_n, fms, fms_lane, fms_n, get_lane
ld1_dup, ld1_lane, ld1_x2, ld1_x3, ld1_x4, ld1q_x2, ld1q_x3, ld1q_x4
ld2, ld2_dup, ld2_lane, ld3, ld3_dup, ld3_lane, ld4, ld4_dup, ld4_lane
mla_lane, mlal_high_lane, mls_lane, mlsl_high_lane, mul_lane, neg
qdmlal, qdmlal_high, qdmlal_high_lane, qdmlal_high_n, qdmlal_lane, qdmlal_n
qdmlsl, qdmlsl_high, qdmlsl_high_lane, qdmlsl_high_n, qdmlsl_lane, qdmlsl_n
qdmull, qdmull_high, qdmull_high_lane, qdmull_high_n, qdmull_lane, qdmull_n
qdmulh, qdmulh_lane, qshl, reinterpret, sqrt
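
For readers unfamiliar with the x3/x4 aggregate types and the ld3/ld4 families, here is a minimal sketch of a de-interleaving load in the style of vld3. The names `demo_u16x4x3_t` and `demo_ld3_u16` are hypothetical, and `uint16_t` lanes stand in for raw fp16 bit patterns; this is not SIMDe's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 4-lane x 3-vector aggregate. uint16_t lanes
 * hold raw fp16 bit patterns so the sketch needs no fp16 type. */
typedef struct {
  uint16_t val[3][4];
} demo_u16x4x3_t;

/* De-interleaving load in the style of vld3: member i of the structure at
 * index `lane` ends up in val[i][lane]. */
static demo_u16x4x3_t demo_ld3_u16(const uint16_t ptr[12]) {
  demo_u16x4x3_t r;
  for (size_t lane = 0; lane < 4; lane++) {
    for (size_t i = 0; i < 3; i++) {
      r.val[i][lane] = ptr[lane * 3 + i];
    }
  }
  return r;
}
```

Given twelve interleaved values x0, y0, z0, x1, y1, z1, ..., the sketch gathers all x values into `val[0]`, all y values into `val[1]`, and all z values into `val[2]`.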

"macOS (version 14.2, macos-13)" was the only test that failed on my fork, and it occurred during the "Install Homebrew Dependencies" stage, but all the other CI tests passed smoothly.
Thanks for reading and any recommendations are welcome!

mr-c · 2023-10-13T06:10:08Z

Thank you @yyctw !

Please review https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

mr-c · 2023-10-13T06:48:50Z

Looks like the msvc build also has complaints: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

mr-c · 2023-10-13T06:51:34Z

Test errors on Fedora i386 (ignore the avx512 failures)
https://download.copr.fedorainfracloud.org/results/packit/simd-everywhere-simde-1075/fedora-rawhide-i386/06522193-simde/builder-live.log.gz (source)

CircleCI finished the 32-bit x86 build, but the tests failed:
https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

yyctw · 2023-10-13T07:06:46Z

> Thank you @yyctw !

> Please review https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

I attempted to build these two failing test cases using the "aarch64-linux-gnu-g++" toolchain with the same compile options that "circleci: i686-gcc11-O2" uses. However, I observed that these two test cases passed successfully on my x86 machine.

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

mr-c · 2023-10-13T08:00:14Z

> Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe.

mr-c

Initial review of the 1st 50 files changed

simde/arm/neon/cvt.h

simde/arm/neon/qdmull.h

simde/arm/neon/qshl.h

mr-c · 2023-10-13T06:29:43Z

simde/arm/neon/sqrt.h

@@ -37,7 +37,7 @@ SIMDE_FUNCTION_ATTRIBUTES
 simde_float16
 simde_vsqrth_f16(simde_float16 a) {
  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
-    return vsqrth_f16(a, b);


Whoops. This is worrisome; it means we are missing this architecture from our testing matrix.

simde/arm/neon/create.h

mr-c · 2023-10-13T08:11:25Z

simde/arm/neon/cvt.h

-      // Round to Nearest with Ties to Away (a.k.a Rounding away from zero) rounding mode.
-      // For example, 23.2 gets rounded to 24, and −23.2 gets rounded to −24.


Is this not true anymore?

The definition of the rounding mode is correct, but the ties-away rule only applies when the number is exactly at the midpoint; otherwise the value rounds to the nearest integer. The example in the comment is incorrect: 23.2 gets rounded to 23, and 23.5 gets rounded to 24.
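
To make the rounding rule concrete, here is a minimal scalar sketch of round-to-nearest with ties away from zero. The function name `demo_round_ties_away` is hypothetical, the sketch assumes the input fits in a `long`, and it is not the code from cvt.h.

```c
/* Round to nearest; exact midpoints (x.5) go away from zero.
 * Assumes x is finite and fits in a long. */
static long demo_round_ties_away(double x) {
  long truncated = (long) x;              /* conversion truncates toward zero */
  double frac = x - (double) truncated;   /* fractional part keeps x's sign   */
  if (frac >= 0.5)  return truncated + 1; /* positive midpoint or above       */
  if (frac <= -0.5) return truncated - 1; /* negative midpoint or below       */
  return truncated;
}
```

With this sketch, 23.2 gives 23 and 23.5 gives 24, while -23.2 gives -23 and -23.5 gives -24. Only the exact midpoints differ from round-half-to-even, which would turn 2.5 into 2 rather than 3.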

test/arm/neon/abal.c

mr-c · 2023-10-13T08:17:23Z

simde/arm/neon/get_lane.h

@@ -276,7 +276,7 @@ simde_vgetq_lane_f16(simde_float16x8_t v, const int lane)
  simde_float16_t r;

  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
-    SIMDE_CONSTIFY_8_(vget_lane_f16, r, (HEDLEY_UNREACHABLE(), SIMDE_FLOAT16_VALUE(0.0)), lane, v);
+    SIMDE_CONSTIFY_8_(vgetq_lane_f16, r, (HEDLEY_UNREACHABLE(), SIMDE_FLOAT16_VALUE(0.0)), lane, v);


Huh, are we missing a test case for this?

Yes. Even though HEDLEY_UNREACHABLE() is reached during the test case, the test still passes.

yyctw · 2023-10-13T08:54:28Z

> Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

> Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

Sure, I will report it as soon as possible.

> Looks like the msvc build also has complaints: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSTIFY_ and simde_mla_lane_*. I've manually expanded SIMDE_CONSTIFY_ and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

mr-c · 2023-10-13T09:54:03Z

> It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSTIFY_ and simde_mla_lane_*. I've manually expanded SIMDE_CONSTIFY_ and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

Yeah, the SIMDE_CONSTIFY_ macros work in the headers, but for MSVC they cause problems in the tests.
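
For context, a constify-style macro turns a run-time lane index into a switch with one call per constant lane value. The following is a simplified sketch under that assumption; `DEMO_CONSTIFY_4`, `DEMO_GET_LANE`, and `demo_i16x4_t` are hypothetical names, not SIMDe's actual definitions.

```c
#include <stdint.h>

typedef struct { int16_t v[4]; } demo_i16x4_t;

/* Stand-in for an intrinsic whose lane argument must be a compile-time
 * constant; a function-like macro keeps the sketch portable. */
#define DEMO_GET_LANE(vec, lane) ((vec).v[(lane)])

/* Hypothetical constify-style macro: the switch maps a run-time index onto
 * one call per constant lane value, forwarding the other arguments. */
#define DEMO_CONSTIFY_4(func, result, default_case, imm, ...) \
  do { \
    switch (imm) { \
      case 0: result = func(__VA_ARGS__, 0); break; \
      case 1: result = func(__VA_ARGS__, 1); break; \
      case 2: result = func(__VA_ARGS__, 2); break; \
      case 3: result = func(__VA_ARGS__, 3); break; \
      default: result = default_case; break; \
    } \
  } while (0)

static int16_t demo_get_lane(demo_i16x4_t vec, int lane) {
  int16_t r;
  /* DEMO_GET_LANE is itself a macro, so this is a nested expansion. */
  DEMO_CONSTIFY_4(DEMO_GET_LANE, r, 0, lane, vec);
  return r;
}
```

Because `func` is itself a function-like macro here, the call relies on the preprocessor rescanning `func(__VA_ARGS__, 0)` after substitution. Expanding the switch by hand, as described above, removes that dependency.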

mr-c · 2023-10-13T10:33:21Z

FYI, see simd-everywhere/implementation-status@916de72 for the status of f16 type NEON intrinsics prior to this PR

(I updated the script that generates the implementation status, as it was ignoring functions that use 16-bit floating point types)


yyctw · 2023-10-17T01:15:03Z

> Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

> Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

> Sure, I will report it as soon as possible.

I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.
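
As an illustration of the precision effect described here: on 32-bit x86, GCC may keep intermediates in x87 registers with more precision than a 64-bit double, and `-ffloat-store` forces floating-point variables out to memory. The sketch below mimics that for a single variable with `volatile`; the function name `demo_sum_minus_stored` is hypothetical and the code is not taken from the failing tests.

```c
/* Returns (a + b) - a with the intermediate sum forced through a 64-bit
 * double in memory, so any extra register precision is discarded first. */
static double demo_sum_minus_stored(double a, double b) {
  volatile double sum = a + b; /* the store rounds to double precision */
  return sum - a;
}
```

For a = 1e16 and b = 1.0, the sum is not representable as a double and rounds back to 1e16, so the function returns 0.0. Without the forced store, an x87 build may keep the exact sum in an extended-precision register and produce 1.0 instead, which is the kind of optimization-dependent difference reported above.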

> Looks like the msvc build also has complaints: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

> It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSTIFY_ and simde_mla_lane_*. I've manually expanded SIMDE_CONSTIFY_ and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

Solved.

mr-c

> I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.

This is a good workaround to document in the README for x86 (32-bit) users, but it is still a compiler bug if different -O optimization levels produce different math. So we'll need to get a minimal reproducer and file a bug with GCC. Hopefully the failing test cases will make developing a minimal reproducer easier. Let me know if you need help with that.

As for a workaround, perhaps one of the following applied only for the problematic GCC versions will help:
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-sseregparm-function-attribute_002c-x86
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 with one or more of no-mmx, no-fancy-math-387, fpmath=sse
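
A sketch of how such a per-function attribute could be applied, assuming GCC on 32-bit x86. The macro name `DEMO_I386_FP_WORKAROUND` and the function `demo_scale` are hypothetical, the version gate is left out because the affected GCC versions are not established in this thread, and whether this actually avoids the failures is untested.

```c
/* Hypothetical macro: expands to nothing unless GCC targets 32-bit x86. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
  /* Ask for SSE2 and SSE-based scalar math for this one function. */
  #define DEMO_I386_FP_WORKAROUND __attribute__((target("sse2,fpmath=sse")))
#else
  #define DEMO_I386_FP_WORKAROUND
#endif

DEMO_I386_FP_WORKAROUND
static float demo_scale(float a, float b) {
  return a * b;
}
```

One trade-off to keep in mind: GCC may refuse to inline a function into a caller compiled with different target options, so an attribute like this can cost performance on the affected build.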

If you want to get the bulk of this merged first, feel free to open a new PR that skips the functions that trigger the compiler bug. Then this PR can be rebased and kept until we implement a workaround.

yyctw · 2023-10-17T10:51:14Z

> I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.

> This is a good workaround to document in the README for x86 (32-bit) users, but it is still a compiler bug if different -O optimization levels produce different math. So we'll need to get a minimal reproducer and file a bug with GCC. Hopefully the failing test cases will make developing a minimal reproducer easier. Let me know if you need help with that.

> As for a workaround, perhaps one of the following applied only for the problematic GCC versions will help: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-sseregparm-function-attribute_002c-x86 https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 with one or more of no-mmx, no-fancy-math-387, fpmath=sse

> If you want to get the bulk of this merged first, feel free to open a new PR that skips the functions that trigger the compiler bug. Then this PR can be rebased and kept until we implement a workaround.

Sure, I'll start by opening a new PR without the functions that trigger the compile errors. After that, I'll report this compilation bug to GCC and look for a workaround for SIMDe.

mr-c · 2023-10-18T21:35:09Z

@yyctw Now that #1081 is merged, do you want to keep this PR to develop the workaround, or will you open a new one?

yyctw · 2023-10-19T01:52:58Z

> @yyctw Now that #1081 is merged, do you want to keep this PR to develop the workaround, or will you open a new one?

I'll open a new one; this PR can be closed.

yyctw force-pushed the v7 branch from ed76fb8 to 6129b87 Compare October 13, 2023 05:27

yyctw force-pushed the v7 branch from 6129b87 to 4e3445f Compare October 13, 2023 07:49

mr-c changed the title ~~NEON: Implements some intrinsics supported by architecture v7.~~ NEON: more fp16 using intrinsics supported by architecture v7 Oct 13, 2023

yyctw force-pushed the v7 branch from 4e3445f to e6060b0 Compare October 13, 2023 07:58

mr-c requested changes Oct 13, 2023


mr-c force-pushed the v7 branch from e6060b0 to dbe0d24 Compare October 13, 2023 22:53

yyctw force-pushed the v7 branch 2 times, most recently from 65c2ee8 to eb580cc Compare October 16, 2023 03:39

yyctw added 14 commits October 16, 2023 14:08

[NEON] Add vabal_{s/u}{8/16/32}

8856f0e

[NEON] Add vabal_high_{s/u}{8/16/32}

f8c778a

[NEON] Add all vcale* intrinsics (9)

ff2ec4d

[NEON] Add all vcalt intrinsics (9)

755cb1d

[NEON] Add vcreate_f16

65847b2

[NEON] Add vreinterpret_u64_f16

c51141b

[NEON] Add vcvth_f16_s16 and vcvth_f16_u16

c3372b2

[NEON] Add vduph_lane_f16, vdup_lane_f16, and vdupq_lane_f16

1f0b8ff

[NEON] Add vext_f16

e655a5c

[NEON] Add 16 vcvt{q}_n_* intrinsics

a6edcd7

[Fix] Correct function input parameters

71ad453

[NEON] Add 6 vcvtn_{s/u}{16/32/64}_f{*} intrinsics

dfe46c2

[Fix] Correct vdup_lane_f16 and vdupq_lane_f16.

978095e

[Fix] Correct function input parameters.

0c6df69

yyctw added 16 commits October 16, 2023 14:08

[Refactor] Remove redundant functions.

a0be5a2

[NEON] Add 45 ld2 related intrinsics

077d5ff

one ld2_f16, twenty-two ld2_lane series, and twenty-two ld2_dup series.

[NEON] Add ld3_dup, ld3_lane, and ld4_dup

19c5191

[NEON] Add vld3_f16 and vld4_f16.

e10760e

[NEON] Add vld{3/4}_{dup/lane} series intrinsics

e10263b

[NEON] Add mla_{high}_lane series intrinsics

5bf20e9

[NEON] Add qdmlal_{high}_{lane} series intrinsics.

03df636

[NEON] Add qdmlal_lane and qdmlal_n series intrinsics

b1b4a1e

[NEON] Add mls_lane and mlsl_high_lane series intrinsics

def93bf

[NEON] Add 22 qdmlsl series intrinsics

72fbd7d

[NEON] Add 10 qdmull_* series intrinsics

670aafd

[NEON] Add 3 qdmulh series intrinsics

e54669b

[Fix] Fix wrong function name.

999c394

[Fix] Correct the wrong alias function name.

79cda85

[NEON] Add qdmullh_lane{q}_s{16/32} related intrinsics

cd400ad

[NEON] Add qdmull_n and qdmull_high_lane series intrinsics

36e40d3

yyctw force-pushed the v7 branch from eb580cc to 434cfa1 Compare October 16, 2023 06:08

[Fix] Add conditions for fp16 intrinsics

675c697

yyctw force-pushed the v7 branch from 434cfa1 to 675c697 Compare October 17, 2023 00:46

yyctw requested a review from mr-c October 17, 2023 01:15

yyctw closed this Oct 17, 2023

yyctw reopened this Oct 17, 2023

mr-c requested changes Oct 17, 2023


yyctw mentioned this pull request Oct 17, 2023

NEON: more fp16 using intrinsics supported by architecture v7 (skip version) #1081

Merged

yyctw closed this Oct 19, 2023

yyctw mentioned this pull request Nov 3, 2023

[NEON] Add the functions which will trigger the i686 compiler error. #1101

Merged
