Predefined functions Parallelization (e.g. IntelVectorMath functions) #68
If all we want to do is add code that doesn't otherwise mix with LoopVectorization, I think it should go into its own standalone library. Option 3 will be much less efficient than both:

```julia
julia> using IntelVectorMath, LoopVectorization, BenchmarkTools

julia> x = rand(1000); y1 = similar(x); y2 = similar(x); y3 = similar(x);

julia> @benchmark $y1 .= sin.($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.937 μs (0.00% GC)
  median time:      5.100 μs (0.00% GC)
  mean time:        5.106 μs (0.00% GC)
  maximum time:     7.883 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark IntelVectorMath.sin!($y2, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.720 μs (0.00% GC)
  median time:      1.731 μs (0.00% GC)
  mean time:        1.735 μs (0.00% GC)
  maximum time:     2.960 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark @avx $y3 .= sin.($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     637.274 ns (0.00% GC)
  median time:      642.571 ns (0.00% GC)
  mean time:        643.363 ns (0.00% GC)
  maximum time:     894.708 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     168
```

I would be in favor of adding support for special functions that aren't already supported, and I'm a fan of performance improvements in general. I do think a better solution for improving special function performance is to work on improving or replacing the existing implementations. |
We could also use the functions that https://github.com/tkf/ThreadsX.jl provides. |
I am not sure about the way we should take. On one side, @KristofferC says we should keep libraries dumb, and on the other side, I totally get your point about loop fusion and low-level parallelization. I can think of two main issues with manual low-level parallelization for everything:
```julia
# An arbitrary example:
@avx begin
    # x, y, z have the same size
    y .= sin.(x)
    z .= cos.(y)
    # a, b have the same size
    a .= log.(b)
    w .= z .* y .* a
end
```
|
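The fusion argument above can be illustrated without any vendor library. In the sketch below (the function names are illustrative, not from any package), the fused loop makes one pass over memory, while the unfused version makes one pass per elementwise call, the way chained VML-style calls would:

```julia
# Fused: one traversal of x, no intermediate passes.
function fused!(z, y, x)
    @inbounds for i in eachindex(x, y, z)
        y[i] = sin(x[i])
        z[i] = cos(y[i])
    end
    return z
end

# Unfused: two traversals, like two separate library calls.
function unfused!(z, y, x)
    y .= sin.(x)
    z .= cos.(y)
    return z
end
```

Both compute the same result; the difference only shows up in memory traffic once the arrays no longer fit in cache.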
I think when you have such an efficient looping machine there is no need to use VML, which only adds overhead. It would be great to be able to use elementwise (SIMD-element) libraries in a loop (Intel SVML, Sleef, and, as you showed, some functions in GLIBC). Do you think this will be part of this package, or should it be a dedicated package? |
This package uses SLEEFPirates. More "elementwise" (short-vector, i.e., vectors of SIMD-vector width) functions, and better implementations, are always welcome. |
This is nice. I wasn't aware of that. Can you also compare it to the latest Sleef? Anyhow, my point was that if one generates an efficient loop, there is no reason to use VML (when talking about a single thread), though probably one day multithreading + SIMD will be merged. By the way, it would be nice to annotate |
It would be nice if we provided a macro that replaces functions with their vectorized versions. For example,

`@ivm @. sin(x)`

would replace this with the IntelVectorMath function, and

`@applacc @. sin(x)`

would call AppleAccelerate. We could provide such macros from IntelVectorMath.jl too, or else have all of them in one place, like inside LoopVectorization.jl.
@chriselrod quotes:

The major improvement these provide is that they're vectorized. If `x` is a scalar, then there isn't much benefit, if there is any at all. An earlier version of LoopVectorization provided a `@vectorize` macro (since removed) which naively swapped calls and made loops incremented (i.e., instead of `1:N`, it would be `1:W:N`, plus some code to handle the remainder). `@avx` does this better. If the arguments are vectors, calling `@avx sin.(x)` or `IntelVectorMath.sin(x)` works (although a macro could search a whole block of code and swap calls to use `IntelVectorMath`).

I've been planning on adding "loop splitting" support in LoopVectorization for a little while now (splitting one loop into several). It would be possible to extend this to moving special functions into their own "loop" (a single vectorized call) and using VML (or some other library).
I would prefer "short vector" functions in general. They wouldn't require any changes to the library to support, nor would they require special casing. E.g., this works well with AVX2:
Presumably, VML does not handle vectors with a stride other than 1, which would force me to copy the elements, take their logs, and then sum them if I wanted to use it there.
Assuming it's able to use some pre-allocated buffer...
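The stride concern can be sketched as follows. `contig_log!` below is a hypothetical contiguous-only kernel standing in for a VML-style call (not the real API); using it on a strided view forces a gather into a pre-allocated scratch buffer first:

```julia
# Hypothetical contiguous-only kernel, stand-in for a VML-style call.
contig_log!(out, x) = (out .= log.(x); out)

# Summing the logs of a strided row requires copying into contiguous scratch.
function strided_log_sum(A::Matrix{Float64}, buf::Vector{Float64})
    row = view(A, 1, :)   # stride along this dimension is size(A, 1), not 1
    copyto!(buf, row)     # gather into the pre-allocated buffer
    contig_log!(buf, buf)
    return sum(buf)
end

A = rand(4, 8)
buf = Vector{Float64}(undef, 8)
strided_log_sum(A, buf)
```

A short-vector function, by contrast, could be fed SIMD-width chunks loaded directly from the strided source, with no extra pass over memory.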
It looks like all that effort would pay off, so I'm open to it.
Long term I would still be in favor of implementing more of these special functions in Julia or LLVM, but this may be the better short term move. I also don't see many people jumping at the opportunity to implement SIMD versions of special functions (myself included).
Too bad VML isn't more expansive. Adding it wouldn't do much to increase the number of special functions currently supported by SLEEFPirates/LoopVectorization.
I've been wanting a digamma function, for example. I'll probably try the approach suggested by Wikipedia.
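One common reading of the Wikipedia suggestion is the asymptotic series combined with the recurrence ψ(x) = ψ(x + 1) − 1/x; the sketch below is that approach (my assumption, not anyone's actual implementation), valid for real x > 0:

```julia
# Digamma sketch: shift x upward via ψ(x) = ψ(x + 1) - 1/x until the
# asymptotic series is accurate, then apply
# ψ(x) ≈ log(x) - 1/(2x) - 1/(12x²) + 1/(120x⁴) - 1/(252x⁶).
# Only intended for x > 0.
function digamma_sketch(x::Float64)
    r = 0.0
    while x < 6.0          # recurrence region
        r -= inv(x)
        x += 1.0
    end
    t = inv(x * x)
    series = t * (1 / 12 - t * (1 / 120 - t / 252))
    return r + log(x) - 0.5 / x - series
end
```

The branchy `while` loop is exactly what makes a SIMD version awkward; a vectorized implementation would typically shift all lanes by a fixed amount instead.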
How well does VML perform on AMD? Is that something I'd have to worry about?
EDIT:
With AVX512, it uses this log definition. I'd be more inclined to add something similar for AVX2. For this benchmark, the Intel compilers produce faster code.
My response:
I just wanted to clarify what I mean in this issue, so everyone is on the same page.

We can consider 3 kinds of syntax for the macro (I use `@ivm` to avoid confusion):

1. Writing functions with `IVM.` before their name, which should be translated to:
2. A syntax which, similar to 1, is translated to the same thing. But in this case other functions can use a `for` loop with `@avx` on them, which should be translated to:
3. When someone uses `@ivm`, that means they want to transform `sin` to `IVM.sin`.

Multiple lib usage:

So which one is the syntax that we want to consider?
Related:
Came up in: JuliaMath/IntelVectorMath.jl#22 (comment), JuliaMath/IntelVectorMath.jl#43, JuliaMath/IntelVectorMath.jl#42.