Add support of CUDA builtins #1092
clang-tidy review says "All clean, LGTM! 👍"
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1092      +/-   ##
==========================================
- Coverage   94.26%   94.24%   -0.03%
==========================================
  Files          55       55
  Lines        8445     8447       +2
==========================================
  Hits         7961     7961
- Misses        484      486       +2
Looks great!
test/CUDA/GradientKernels.cu
Outdated
cudaMalloc(&d_in, 5 * sizeof(int));

auto add = clad::gradient(add_kernel, "in, out");
add.execute_kernel(dim3(1), dim3(5, 1, 1), dummy_out, dummy_in, d_out, d_in);
Shouldn't the dummy_out and dummy_in pointers be of size 5 ints? Currently, they are of size 1 int.
cudaMemcpy(d_out, out, 5 * sizeof(int), cudaMemcpyHostToDevice);

int *d_in;
cudaMalloc(&d_in, 5 * sizeof(int));
Please initialize the d_in values to 0 to avoid any undefined behavior.
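The two review points above (adjoint buffers sized to match the 5-thread launch, and zero-initialized before use) both follow from how reverse mode accumulates into derivative buffers. The following is a minimal host-only C++ sketch, not Clad's generated code: the loop index stands in for threadIdx.x, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written CPU stand-in for the adjoint of `out[i] += in[i]`.
// Assumption: this mirrors the accumulation pattern only; it is not
// the code Clad emits. `i` plays the role of threadIdx.x.
void add_kernel_grad_host(const std::vector<int>& d_out,
                          std::vector<int>& d_in) {
  // Every "thread" indexes both buffers, so their lengths must match.
  assert(d_out.size() == d_in.size());
  for (std::size_t i = 0; i < d_in.size(); ++i) {
    // Reverse mode accumulates with +=, so d_in must start from a
    // known value; garbage in the buffer would leak into the result.
    d_in[i] += d_out[i];
  }
}

// Allocate a zero-filled adjoint buffer of the right size, then accumulate.
std::vector<int> gradient_wrt_in(const std::vector<int>& d_out) {
  std::vector<int> d_in(d_out.size(), 0);  // zero-initialized adjoint
  add_kernel_grad_host(d_out, d_in);
  return d_in;
}
```

On the device side, the equivalent of the zero-filled vector would be a call such as cudaMemset(d_in, 0, 5 * sizeof(int)) right after the cudaMalloc shown above.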
Looks good to me.
Can you please squash all the commits into one?
This will be done automatically when merging this PR.
The primary goal of squashing commits manually in this case is to properly write and structure the commit message. For example, the commit message should describe which CUDA builtins are enabled by this pull request.
  out[threadIdx.x] += in[threadIdx.x];
}

// CHECK: void add_kernel_2_grad(int *out, int *in, int *_d_out, int *_d_in) {
Can you investigate why we don't have __attribute__((device)) here?
Clang prints __attribute__((device)) if we replace the clang::CUDADeviceAttr::CreateImplicit call in
m_Derivative->addAttr(clang::CUDADeviceAttr::CreateImplicit(m_Context));
with
m_Derivative->addAttr(clang::CUDADeviceAttr::Create(m_Context));
Yes, I've seen that. With the implicit variant the attribute is "hidden": only the compiler sees it. I can switch to explicit creation (Create instead of CreateImplicit) for clarity in another PR if necessary.
My concern is that it may confuse the user into thinking that it is not a kernel actually being executed (as we don't print the overload)
Please create an issue for using Create instead of CreateImplicit.
> My concern is that it may confuse the user into thinking that it is not a kernel actually being executed (as we don't print the overload)
I think it is more misleading to not show any attribute at all. Having the attribute present is also necessary for customers wanting to independently use Clad-generated derivatives.
> Having the attribute present is also necessary for customers wanting to independently use Clad-generated derivatives.
But the attribute is not the correct one if users want to execute the function themselves, as they would expect the multi-threaded execution offered by a global kernel. The attribute should be __global__ instead of __device__. A workaround would be to prepend the string "__global__ " before printing or dumping to the source file, if we want this to be intuitive.
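The workaround described above amounts to a string prefix applied to the printed declaration. A minimal C++ sketch of that idea follows; with_global_qualifier is a hypothetical helper for illustration and is not part of Clad or Clang.

```cpp
#include <string>

// Hypothetical helper (not a Clad or Clang API): prefix a printed
// derivative declaration with "__global__ " unless the text already
// starts with that qualifier.
std::string with_global_qualifier(const std::string& printed_decl) {
  const std::string qualifier = "__global__ ";
  // compare() clamps the length, so short inputs are handled safely.
  if (printed_decl.compare(0, qualifier.size(), qualifier) == 0)
    return printed_decl;
  return qualifier + printed_decl;
}
```

Applied to the signature from the CHECK line earlier in the review, this would yield "__global__ void add_kernel_2_grad(int *out, int *in, int *_d_out, int *_d_in)". Note that this only changes the printed text; it does not change which attribute is attached to the declaration in the AST.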
2824439 to 9a6924e
9a6924e to aaf593c
The commit message says:
The pull-request does not seem to have tests for
Added support of CUDA grid configuration builtin variables. Builtins tested: threadIdx, blockIdx, blockDim, gridDim, warpSize
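For readers unfamiliar with the builtins listed in the commit message, the sketch below shows on the host how they are typically combined inside a kernel. It is an illustration only: Vec3 is a stand-in for CUDA's three-component types, the parameter names avoid the real builtin identifiers on purpose, and none of this is code from the pull request.

```cpp
#include <cassert>

// Illustrative stand-in for CUDA's uint3/dim3 (x, y, z components).
struct Vec3 {
  unsigned x, y, z;
};

// Global 1D index of a thread: blockIdx.x * blockDim.x + threadIdx.x.
unsigned global_index_x(Vec3 thread_idx, Vec3 block_idx, Vec3 block_dim) {
  assert(thread_idx.x < block_dim.x);  // a thread index lies inside its block
  return block_idx.x * block_dim.x + thread_idx.x;
}

// Total number of threads along x in a launch: gridDim.x * blockDim.x.
unsigned total_threads_x(Vec3 grid_dim, Vec3 block_dim) {
  return grid_dim.x * block_dim.x;
}

// Lane of a thread within its warp for a 1D block: threadIdx.x % warpSize.
unsigned lane_id(Vec3 thread_idx, unsigned warp_size) {
  assert(warp_size > 0);
  return thread_idx.x % warp_size;
}
```

With the launch configuration used in the test discussed earlier, dim3(1) blocks of dim3(5, 1, 1) threads, the global index equals threadIdx.x, which is why the kernel can index out and in with threadIdx.x alone.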
15881d9 to 6cd24b3
No description provided.