Any chance for other language bindings? #1738
Hi @oshaboy, it's an interesting topic. I personally do not know Rust, Go and Ada very well. What seems very feasible is to have C bindings for larger groups of code, for example VQSort(). As non-inlined functions, those could be called from any language, after compiling as C++. That would be a very easy update. What seems harder is to expose individual ops (e.g. …)
Yeah, after taking a look at the header file I can see it's not something an FFI can easily wrap. As for the other programming languages: Rust and Ada can inline across modules, while with Go you're at the mercy of the compiler. Of course, with all of these you have to rewrite all the inline code, to the point where it's a Ship of Theseus of the original library.
Yes indeed :) Ship of Theseus is fine by me. The main value-add of Highway is (1) shielding user code from compiler bugs, and (2) finding a useful subset of all instruction sets, and filling in the gaps where required. (1) would have to be repeated/maintained for other languages, but adopting the Highway ops and polyfills (2) would save the authors of new ships/libraries a lot of work.
I have a Nim version of the single-threaded vectorized quicksort from 2022. It's really nice, thx, and I'll publish it soon, maybe alongside a Python module.
Cool :) Do you mean a variant/port of vqsort? |
It's based on the genuine C++ version from the "Fast and Robust" repo, so it does not yet include all the nice additions (8/16/64-bit-wide types) that you have added recently. And it's AVX2-only; since I don't have access to an ARM machine it will stay that way, unless an M3 MacBook Air drops from heaven ;). But realistically, others will be able to help out on this issue. My primary concern is to show that Nim can reach 80-100% of C++ performance as soon as we embrace SIMD intrinsics, and thus commit to saving energy :)
Got it. FYI we also substantially changed the pivot sampling step since then, which can make a big difference in perf/robustness. Cool goal 👍 I'm curious about the Nim plans for intrinsics, is there a writeup/design doc for it? |
Thank you, that's good to know. I'll see if I can find it in your implementation.
There is the vectorized RLU cache from IBM Research Tokyo. Regards, Andreas
Thanks for the pointer to the vectorized RLU cache, hadn't seen it yet! (The author name does ring a bell, I believe they also worked on sorting.)

SIMDe has a different focus: given code written for a particular set of nonportable intrinsics, try to make it run on other platforms. This can be useful for existing codebases, but I'd argue that we want something else for forward-looking languages/new codebases. For example, slightly bending the contract (Reorder) in ReorderWidenMulAccumulate allows us to implement the op efficiently on all platforms, whereas the SIMDe approach would be more expensive because it must faithfully match the platform's quirks.

I agree it would be super useful to get SIMD into the language. For example, in C++ we resort to pragmas to enable codegen for a particular target, but this could be done more elegantly if integrated. I would strongly advocate a 'vector length agnostic' model, where users can only ask for <= 128 bit, OR [1/8, 1/1] of the native length, which must be queried at runtime. This allows SVE/RVV support, and also avoids poor codegen from users asking for 1024 (or worse: 384) bit vectors. Would be happy to discuss with anyone who wants to start a proposal for Nim.
Once you have rewritten your caches and you get the first energy bill, consider sending some flowers & cake to your colleagues at IBM Research Tokyo :)
Yeah, agree - so it's gonna be Highway.
Well, here I'm afraid of C++'s post-traumatic template disorder :) Thinking of 'getting it into the language', I tried a DSL using basic overloads and Nim templates - it is flexible and works, but the performance penalty is prohibitive. One could use a macro to revert the templates/overloads back into intrinsic calls, or use Nim's powerful macros (they are type-checked, so no PTTD here) from the beginning. Looking at the discussions on SO, I gather that (when things go wrong) compiler transformations need to be considered. What are the compiler makers' plans regarding user code containing intrinsics? Do they regard it as their task to optimize things/algorithms here?
That will be (and in the case of the latest ARM SIMD already is) a challenge in itself. Such runtime requirements exclude the compiler makers and demand somewhat new strategies (in the area of supercomputing it has been done, so one could learn from them). But this is not my primary concern, since even AVX-512 won't show up in commodity machines soon. We'll see if Intel's AVX10 with variable runtime register widths will be advertised as a consumer or server technique - I expect rather server.
To clarify, the main bit that is required in the language is the "compiler and programmer are allowed to call this intrinsic", i.e. setting the target attributes. The way this is done in clang unfortunately does not compose with templates. It would be nice if we could have something like template specialization, but in C++ what is required is that PLUS pragmas that change the target attributes. The rest can hopefully be a library, though I am not familiar with Nim macros.
Clang is quite enthusiastic about optimizing intrinsics. It's definitely not a 1:1 mapping. Often this is actually helpful; I have seen it basically rename memory (actually registers), thus optimizing out a permutation, which is very cool and would be very difficult to do by hand.
This can be done, but I remember a colleague's comment that what should have been a 20-minute patch took almost a day in assembly.
Unfortunately the solutions I have seen usually assume that the HPC cluster is running a certain known CPU, so they tell the compiler to hard-code a certain SVE vector size.
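For instance, GCC and Clang expose a flag to hard-code the SVE width at build time (a real flag, shown here purely as illustration of that hard-coding):

```sh
# Fix the SVE vector length at compile time (AArch64 GCC/Clang).
# The resulting binary is only valid on CPUs with 512-bit SVE.
gcc -O2 -march=armv8-a+sve -msve-vector-bits=512 kernel.c -o kernel
```

This turns the "scalable" extension back into a fixed-width one, which is exactly the trade-off HPC sites accept when they know their cluster's CPUs.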
Aren't there consumer Zen4 machines?
AVX10 is actually static, in the sense that you generate either 128-bit, 256-bit, or 512-bit instructions. It is just that the CPU advertises which of those will raise faults, i.e. should not be used.
C, Rust, Go and Ada bindings would be nice to have. I understand that templates are important for the library to work but a lot of programmers who can benefit from a portable SIMD library simply can't use it.
Is this a planned feature already?