Clang optimizing SSE2NEON_PRECISE_MINMAX incorrectly #606

Closed
markreidvfx opened this issue Aug 13, 2023 · 3 comments

@markreidvfx

This might be a bug in clang, but I figured I'd report it here first.

I have a technique I use to clamp NaN values to zero.
It's pretty simple: you exploit the fact that nan > 0.0f == false.

#define MIN(a,b) ((a) > (b) ? (b) : (a))
#define MAX(a,b) ((a) > (b) ? (a) : (b))
MIN(amax, MAX(a, amin));

The MAX is done first on purpose.

The SSE2 code is this

_mm_min_ps(amax, _mm_max_ps(a, amin));

I'm having issues with clang's optimizer breaking this behaviour, so NaNs still propagate.

The NEON min/max instructions propagate NaNs and the SSE2 ones don't (ish), so I've been defining SSE2NEON_PRECISE_MINMAX to 1.
With that, the _mm_max_ps intrinsic becomes

vbslq_f32(vcgtq_f32(a, b), a, b);

This looks perfectly correct to me, but clang is optimizing it into the fmaxnm instruction. The fmaxnm instruction only deals with quiet NaNs; signalling NaNs still propagate. :(

NaNs are handled according to the IEEE 754-2008 standard. If one vector element is numeric and the other is a quiet NaN, the result placed in the vector is the numerical value, otherwise the result is identical to FMAX (scalar).

https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMAXNM--vector---Floating-point-Maximum-Number--vector--

Here is a small program illustrating this happening
https://godbolt.org/z/eE1G3Gcov

I'm currently working around this by using inline assembly.

@Cuda-Chen
Collaborator

Hi @markreidvfx ,

From my point of view, this may be an issue with Clang.
GCC with the -O3 flag uses the fcmgt, and, and bsl instructions.
Here is a small program (modified from your example) for illustration: https://godbolt.org/z/sfrKbx1e8

One more thing: please leave a link to the discussion on the Clang forum if possible.

@markreidvfx
Author

Yes, that's my opinion too, especially since the code works if you compile in debug.
I'll report it to clang and see what they say.

The same thing can also happen with scalar code.
https://godbolt.org/z/d4j9418Kx

I can trick the compiler by subtly changing the clamp function, but who knows how long that will last...
https://godbolt.org/z/rq36Trb4d

@jserv
Member

jserv commented Nov 14, 2024

I am closing this issue since SSE2NEON recently added warning alerts for potential compiler misoptimizations. Unless we can find a better way to overcome these misoptimizations, no further action will be taken.

@jserv jserv closed this as completed Nov 14, 2024