-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEON: more fp16 using intrinsics supported by architecture v7 #1075
Conversation
Looks like the msvc build also has compliants: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856 |
Test errors on Fedora i386 (ignore the avx512 failures) CircleCI got the x86 32 bit build finished, but experienced test failures: |
I attempted to build these two failing test cases using the "aarch64-linux-gnu-g++" toolchain with the same compile options that "circleci: i686-gcc11-O2" uses. However, I observed that these two test cases passed successfully on my x86 machine. Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option. |
Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review of the 1st 50 files changed
@@ -37,7 +37,7 @@ SIMDE_FUNCTION_ATTRIBUTES | |||
simde_float16 | |||
simde_vsqrth_f16(simde_float16 a) { | |||
#if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) | |||
return vsqrth_f16(a, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. This is worrisome, means we are missing this architecture from our testing matrix..
// Round to Nearest with Ties to Away (a.k.a Rounding away from zero) rounding mode. | ||
// For example, 23.2 gets rounded to 24, and −23.2 gets rounded to −24. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not true anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of the rounding mode is correct, but it only occurs when the number is at the midpoint. Otherwise, it rounds to the nearest integer. The example below is incorrect. For example, 23.2
gets rounded to 23
, and 23.5
gets rounded to 24
.
@@ -276,7 +276,7 @@ simde_vgetq_lane_f16(simde_float16x8_t v, const int lane) | |||
simde_float16_t r; | |||
|
|||
#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) | |||
SIMDE_CONSTIFY_8_(vget_lane_f16, r, (HEDLEY_UNREACHABLE(), SIMDE_FLOAT16_VALUE(0.0)), lane, v); | |||
SIMDE_CONSTIFY_8_(vgetq_lane_f16, r, (HEDLEY_UNREACHABLE(), SIMDE_FLOAT16_VALUE(0.0)), lane, v); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, are we missing a test case for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, even though the HEDLEY_UNREACHABLE()
function is encountered during the test case, it still passes successfully.
Sure, I will report it as soon as possible.
It appears that there are some bugs when expanding nested macros, such as |
Yeah, the |
FYI, see simd-everywhere/implementation-status@916de72 for the status of (I updated the script that generates the implementation status, as it was ignoring functions that use 16-bit floating point types) |
65c2ee8
to
eb580cc
Compare
one ld2_f16, twenty-two ld2_lane series, and twenty-two ld2_dup series.
I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the
Solved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the
-ffloat-store
flag in thei686-gcc-11-qemu.cross
file.
This is a good workaround to document in the README for x86 (32-bit) users, but it is still a compiler bug if different -O
optimizations levels produce different math. So we'll need to get a minimal reproducer and file a bug with GCC. Hopefully the failing tests cases will make developing a minimal reproducer easier. Let me know if you need help with that.
As for a workaround, perhaps one of the following applied only for the problematic GCC versions will help:
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-sseregparm-function-attribute_002c-x86
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 with one or more of no-mmx
, no-fancy-math-387
, fpmath=sse
If you want to get the bulk of this merged first, feel free to open a new PR that skips the functions that triggers the compiler bug. Then this PR can be rebased and kept until we implement a workaround.
Sure, I'll start by opening a new PR without the functions that trigger the compile errors. After that, I'll report this compilation bug to GCC and look for a workaround for SIMDe. |
Hi all, this is Eric from Andes Technology Corporation. This PR includes
Add the types simde_float16x4x{3/4}_t and simde_float16x8x{3/4}_t
Add 351 initial implementations and corresponding test cases in 63 families which are listed below:
abal
,abal_high
,cale
,calt
,create
,cvt
,cvt_n
,cvtn
,dup_lane
,ext
fma
,fma_lane
,fma_n
,fms
,fms_lane
,fms_n
,get_lane
ld1_dup
,ld1_lane
,ld1_x2
,ld1_x3
,ld1_x4
,ld1q_x2
,ld1q_x3
,ld1q_x4
ld2
,ld2_dup
,ld2_lane
,ld3
,ld3_dup
,ld3_lane
,ld4
,ld4_dup
,ld4_lane
mla_lane
,mlal_high_lane
,mls_lane
,mlsl_high_lane
,mul_lane
,neg
qdmlal
,qdmlal_high
,qdmlal_high_lane
,qdmlal_high_n
,qdmlal_lane
,qdmlal_n
qdmlsl
,qdmlsl_high
,qdmlsl_high_lane
,qdmlsl_high_n
,qdmlsl_lane
,qdmlsl_n
qdmull
,qdmull_high
,qdmull_high_lane
,qdmull_high_n
,qdmull_lane
,qdmull_n
qdmulh
,qdmulh_lane
,qshl
,reinterpret
,sqrt
"macOS (version 14.2, macos-13)" was the only test that failed on my fork, and it occurred during the "Install Homebrew Dependencies" stage, but all the other CI tests passed smoothly.
Thanks for reading and any recommendations are welcome!