Improving dynamic dispatch for multiple targets for x86-64/AArch64/PPC64 #1782
Here is an example of Highway dynamic dispatch code updated to support multi-phase compilation (compiled more than once with different compiler options for the different compilation phases):
The same example is on Compiler Explorer, compiled with different options for HWY_IN_PER_TARGET_PHASE/HWY_IN_DYN_DISPATCH_PHASE: https://gcc.godbolt.org/z/63xTfh1bj
Nice, I understand we want to compile with differing compile flags. This makes sense for MSVC, and it seems reasonable to support something like this, at least as a stopgap. But one very important constraint:
PiperOrigin-RevId: 579157014
PiperOrigin-RevId: 580161818
Is it possible to do dynamic dispatch across all targets in one step in Visual Studio when compiling with clang-cl, or does it have the same restrictions as the MSVC compiler when it comes to VEX code, and thus require multiple compilation phases as described above?
Hi @Pflugshaupt, we differentiate between HWY_COMPILER_MSVC and HWY_COMPILER_CLANGCL. I believe runtime dispatch would work with the latter, regardless of whether it is invoked via Visual Studio.
Thank you for your time and quick answer. It made me keep trying, and I was able to find the true problem. I can confirm things work fine in general with Visual Studio driving clang-cl, but there appears to be an issue with templates. The problems I am seeing come from using templates for DRY and for avoiding branches inside loops. It appears Visual Studio insists on always creating instantiations for templates even if they are fully inlined. Often these would be removed during linking, but in this special case they just don't compile. I keep getting "always_inline function 'Load' requires target feature 'ssse3' but would be inlined into function (..) that is compiled without support for ssse3" as soon as I use templates inside the HWY_NAMESPACE inside my own namespace and instantiate them from other functions inside the same namespace. The kind of template I'm using should be 100% inlined; these are just shortcuts for writing less code. Maybe I'll find some magical compiler trick to get rid of the instantiation, but if not, I'd probably still have to split everything into multiple compile units, and then I might as well not use clang-cl.
Update: Just got it to work thanks to this: https://stackoverflow.com/questions/71720201/why-does-msvc-compiler-put-template-instantiation-binaries-in-assembly
However, my solution (MSVC 2022 + clang-cl) so far is somewhat inelegant and seems to defy logic. It requires
This seems to get rid of the troublesome instances as long as the template is only used in the same compile unit. Hopefully there's a simpler way.
Hm, the "requires target feature" error usually means we are missing a pragma. It is important for all of your SIMD-using code to be between HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE: these set up a pragma that covers all 'functions' between them. Also, any lambdas require an extra HWY_ATTR before the opening { because lambdas do not count as 'functions'. Could this be an easier solution to the problem?
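For readers who hit the same error, here is a minimal sketch of where the attribute goes on a lambda. It does not use Highway itself: `DEMO_ATTR`, `TransformInPlace`, and `DoubleAll` are hypothetical names, and `DEMO_ATTR` is only a stand-in for what HWY_ATTR is described as doing above (attaching target features to something the namespace-level pragma does not cover). On GCC/Clang for x86 it expands to a per-function `target` attribute; elsewhere it expands to nothing.

```cpp
#include <cassert>
#include <cstddef>

// DEMO_ATTR is a hypothetical stand-in for HWY_ATTR, not a Highway macro.
// On x86 with GCC/Clang it requests SSSE3 for the annotated function only.
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define DEMO_ATTR __attribute__((target("ssse3")))
#else
#define DEMO_ATTR
#endif

// Hypothetical DRY helper: applies func to every element in place.
template <class Func>
DEMO_ATTR void TransformInPlace(float* data, size_t count, const Func& func) {
  for (size_t i = 0; i < count; ++i) {
    data[i] = func(data[i]);
  }
}

// Doubles every element. The lambda is a separate function object, so it
// carries its own attribute, placed after the parameter list and before {.
DEMO_ATTR inline void DoubleAll(float* data, size_t count) {
  assert(data != nullptr || count == 0);
  TransformInPlace(data, count, [](float v) DEMO_ATTR { return v + v; });
}
```

In real Highway code the enclosing functions would sit between HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE rather than carrying an attribute each, and only the lambda would need the explicit HWY_ATTR.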
Wow - thanks heaps! That was it! I was aware of HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE, but I was mixing lambdas and templates with lambda arguments to get as DRY as possible, and adding HWY_ATTR to all lambdas has fixed the issues I was seeing on msvs + clang-cl. Looking at the docs again, I see that there's a HWY_ATTR in the Transform1 example on the main readme (which is similar to what I'm doing), and I unfortunately missed that. Hopefully this conversation helps someone else in the future. Things already compiled fine on macOS without HWY_ATTR.
Nice, glad to hear that was it :)
There are some dynamic dispatch scenarios that require compiling the same C++ source files more than once (but with different C++ flags for some of the compilation phases), such as x86-64 with MSVC if AVX2/AVX3 targets are enabled, AArch64 if SVE/SVE2 targets are enabled, or PPC if PPC8/PPC9/PPC10 targets are enabled.
Here are the compilation phases for multi-phase compilation with MSVC on x86-64:
Here are the compilation phases for multi-phase compilation for AArch64 with SVE/SVE2 enabled:
Here are the compilation phases for multi-phase compilation for PPC64:
There are real-world use cases for multiple compilation dynamic dispatch, including improved performance on PPC9/PPC10/AArch64.
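The general idea of compiling one source file in several phases can be sketched as follows. This is not Highway's implementation: `DEMO_PER_TARGET_PHASE`, `DEMO_DISPATCH_PHASE`, `DEMO_TARGET`, `DemoCpuHasWide`, and the `SumArray` functions are all hypothetical names chosen for illustration, and the CPU check is a placeholder. If no phase macro is given, the file builds in a single pass with only a baseline target so that it can be tried as-is.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical phase macros, normally set on the compiler command line:
//   phase 1: -DDEMO_PER_TARGET_PHASE -DDEMO_TARGET=Wide (plus wider ISA flags)
//   phase 2: -DDEMO_PER_TARGET_PHASE -DDEMO_TARGET=Baseline
//   phase 3: -DDEMO_DISPATCH_PHASE
// With none of them set, both phases run in one pass with a single target.
#if !defined(DEMO_PER_TARGET_PHASE) && !defined(DEMO_DISPATCH_PHASE)
#define DEMO_PER_TARGET_PHASE
#define DEMO_DISPATCH_PHASE
#define DEMO_SINGLE_PASS
#endif
#ifndef DEMO_TARGET
#define DEMO_TARGET Baseline
#endif

#define DEMO_CONCAT2(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT2(a, b)

#ifdef DEMO_PER_TARGET_PHASE
// Per-target phase: one definition whose symbol name carries the target, so
// object files from different phases can be linked together.
float DEMO_CONCAT(SumArray_, DEMO_TARGET)(const float* data, size_t count) {
  float sum = 0.0f;
  for (size_t i = 0; i < count; ++i) sum += data[i];
  return sum;
}
#endif

#ifdef DEMO_DISPATCH_PHASE
// Dispatch phase: only declarations of the per-target functions are needed.
float SumArray_Baseline(const float* data, size_t count);
#ifndef DEMO_SINGLE_PASS
float SumArray_Wide(const float* data, size_t count);
#endif

// Placeholder CPU check (assumption); a real build would query the CPU or OS.
inline bool DemoCpuHasWide() { return false; }

// Chooses an implementation at runtime.
float SumArray(const float* data, size_t count) {
  assert(data != nullptr || count == 0);
#ifndef DEMO_SINGLE_PASS
  if (DemoCpuHasWide()) return SumArray_Wide(data, count);
#endif
  return SumArray_Baseline(data, count);
}
#endif
```

The point of the split is that the dispatch phase is compiled with baseline flags only, so it never contains instructions from the wider targets, while each per-target phase can be built with whatever flags that target needs.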