Clang optimizing SSE2NEON_PRECISE_MINMAX incorrectly #606

markreidvfx · 2023-08-13T05:35:23Z

This might be a bug in Clang, but I figured I'd report it here first.

I have a technique I use to clamp NaN values to zero.
It's pretty simple: you exploit the fact that nan > 0.0f == false.

#define MIN(a,b) ((a) > (b) ? (b) : (a))
#define MAX(a,b) ((a) > (b) ? (a) : (b))
MIN(amax, MAX(a, amin));

The MAX is done first on purpose.

The SSE2 code is this

_mm_min_ps(amax, _mm_max_ps(a, amin));

I'm having issues with Clang's optimizer breaking this behaviour, so NaNs still propagate.

The NEON min/max instructions propagate NaNs and the SSE2 ones don't (ish), so I've been defining SSE2NEON_PRECISE_MINMAX to 1.
With that, the _mm_max_ps intrinsic becomes

vbslq_f32(vcgtq_f32(a, b), a, b);

This looks perfectly correct to me, but Clang is optimizing it to the fmaxnm instruction. The fmaxnm instruction only handles quiet NaNs; signalling NaNs still propagate. :(

NaNs are handled according to the IEEE 754-2008 standard. If one vector element is numeric and the other is a quiet NaN, the result placed in the vector is the numerical value, otherwise the result is identical to FMAX (scalar).

https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMAXNM--vector---Floating-point-Maximum-Number--vector--

Here is a small program that illustrates this:
https://godbolt.org/z/eE1G3Gcov

I'm currently working around this by using inline assembly.


Cuda-Chen · 2023-08-14T13:01:37Z

Hi @markreidvfx ,

From my point of view, this may be an issue in Clang.
GCC with the -O3 flag uses fcmgt, and, and bsl.
Here is a small program (modified from your example) for illustration: https://godbolt.org/z/sfrKbx1e8

One more thing: kindly leave a link to the discussion on the Clang forum if possible.

markreidvfx · 2023-08-14T16:16:12Z

Yes, that's my opinion too, especially since the code works if you compile in debug.
I'll report it to Clang and see what they say.

The same thing can also happen with scalar code.
https://godbolt.org/z/d4j9418Kx

I can trick the compiler by subtly changing the clamp function, but who knows how long that will last...
https://godbolt.org/z/rq36Trb4d

jserv · 2024-11-14T23:19:29Z

I am closing this issue since SSE2NEON recently added warning alerts for potential compiler misoptimizations. Unless we can find a better way to overcome these misoptimizations, no further action will be taken.

jserv assigned Cuda-Chen Aug 13, 2023

markreidvfx mentioned this issue Aug 19, 2023

Add AVX2/AVX/SSE2 SIMD accelerated 1D/3D LUTS AcademySoftwareFoundation/OpenColorIO#1687

Merged

jserv closed this as completed Nov 14, 2024
