-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean CUDA operators #108
base: main
Are you sure you want to change the base?
Clean CUDA operators #108
Conversation
kernel = if (j,k) == (0,1) | ||
dec_cu_ker_c_wedge_product_01! | ||
elseif (j,k) == (0,2) | ||
dec_cu_ker_c_wedge_product_02! | ||
elseif (j,k) == (1,1) | ||
dec_cu_ker_c_wedge_product_11! | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a (1, 0) and a (2, 0) case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there tests for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't correspond to any of the p
wedge products and is not how this function is used internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True but this function is exposed publicly. At the very least we can assert that the larger form appears on the right. If we wanna do more, we can check the order of the forms and adjust the inputs as needed like what happens internally in Decapodes.
I don't think there are any tests for the flipped cases.
function dec_cu_ker_c_wedge_product_02!(res, f, α, wedge_cache) | ||
p, c = wedge_cache[1], wedge_cache[2] | ||
i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x | ||
stride = gridDim().x * blockDim().x | ||
i = index | ||
|
||
@inbounds while i <= Int32(length(wedge_terms)) | ||
wedge_terms[i] = α[i] * (coeffs[Int32(1), i] * f[pv[Int32(1), i]] + coeffs[Int32(2), i] * f[pv[Int32(2), i]] | ||
+ coeffs[Int32(3), i] * f[pv[Int32(3), i]] + coeffs[Int32(4), i] * f[pv[Int32(4), i]] | ||
+ coeffs[Int32(5), i] * f[pv[Int32(5), i]] + coeffs[Int32(6), i] * f[pv[Int32(6), i]]) | ||
@inbounds while i <= Int32(length(res)) | ||
c1, c2, c3, c4, c5, c6 = c[Int32(1),i], c[Int32(2),i], c[Int32(3),i], c[Int32(4),i], c[Int32(5),i], c[Int32(6),i] | ||
p1, p2, p3, p4, p5, p6 = p[Int32(1),i], p[Int32(2),i], p[Int32(3),i], p[Int32(4),i], p[Int32(5),i], p[Int32(6),i] | ||
res[i] = α[i] * (c1*f[p1] + c2*f[p2] + c3*f[p3] + c4*f[p4] + c5*f[p5] + c6*f[p6]) | ||
i += stride | ||
end | ||
return nothing | ||
nothing | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has the performance of this kernel been impacted by these changes, specifically benchmark times and register counts?
function dec_cu_ker_c_wedge_product_11!(res, α, β, wedge_cache) | ||
e, c = wedge_cache[1], wedge_cache[2] | ||
i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x | ||
stride = gridDim().x * blockDim().x | ||
i = index | ||
|
||
@inbounds while i <= Int32(length(wedge_terms)) | ||
@inbounds while i <= Int32(length(res)) | ||
e0, e1, e2 = e[Int32(1), i], e[Int32(2), i], e[Int32(3), i] | ||
c1, c2, c3 = c[Int32(1), i], c[Int32(2), i], c[Int32(3), i] | ||
ae0, ae1, ae2 = α[e0], α[e1], α[e2] | ||
be0, be1, be2 = β[e0], β[e1], β[e2] | ||
|
||
c1, c2, c3 = coeffs[Int32(1), i], coeffs[Int32(2), i], coeffs[Int32(3), i] | ||
|
||
wedge_terms[i] = (c1 * (ae2 * be1 - ae1 * be2) | ||
+ c2 * (ae2 * be0 - ae0 * be2) | ||
+ c3 * (ae1 * be0 - ae0 * be1)) | ||
res[i] = | ||
(c1 * (ae2 * be1 - ae1 * be2) + | ||
c2 * (ae2 * be0 - ae0 * be2) + | ||
c3 * (ae1 * be0 - ae0 * be1)) | ||
i += stride | ||
end | ||
|
||
return nothing | ||
nothing | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same performance questions as the other kernel.
Please also let me know if the CUDA tests are passing after these changes. |
In advance of adding more CUDA versions and wrappers of DEC operators (possibly based off of KernelAbstractions.jl), it is necessary to clean up the existing code.
For example, the restrictions on dispatching CUDA.jl kernel functions led to some redundant code, now deleted. The dispatch of the regular and inverse Hodge stars was also not optimal (, not exploiting the abstract
DiscreteHodge
type for example). Docstrings have also been cleaned. The docstrings for the internal functions to the wedge product are no longer exposed.