
Add support of CUDA builtins #1092

Merged (2 commits) on Sep 24, 2024
Conversation

kchristin22 (Collaborator)

No description provided.

@kchristin22 kchristin22 self-assigned this Sep 12, 2024
Contributor

clang-tidy review says "All clean, LGTM! 👍"


codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.24%. Comparing base (e2b8e35) to head (6cd24b3).
Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
lib/Differentiator/ReverseModeVisitor.cpp | 0.00% | 2 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1092      +/-   ##
==========================================
- Coverage   94.26%   94.24%   -0.03%     
==========================================
  Files          55       55              
  Lines        8445     8447       +2     
==========================================
  Hits         7961     7961              
- Misses        484      486       +2     
Files with missing lines | Coverage Δ
include/clad/Differentiator/ReverseModeVisitor.h | 98.18% <ø> (ø)
lib/Differentiator/ReverseModeVisitor.cpp | 97.77% <0.00%> (-0.08%) ⬇️

@parth-07 (Collaborator) left a comment

Looks great!

Review thread on test/CUDA/GradientKernels.cu (outdated, resolved)

cudaMalloc(&d_in, 5 * sizeof(int));

auto add = clad::gradient(add_kernel, "in, out");
add.execute_kernel(dim3(1), dim3(5, 1, 1), dummy_out, dummy_in, d_out, d_in);
Collaborator

Shouldn't dummy_out and dummy_in pointers be of size 5 ints? Currently, they are of size 1 int.
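
For illustration, a minimal sketch of the suggested fix, assuming dummy_in and dummy_out are plain device allocations set up alongside d_in and d_out (the rest of the test setup is assumed):

// Size the dummy buffers to match the 5-element launch, as suggested above.
int *dummy_in = nullptr, *dummy_out = nullptr;
cudaMalloc(&dummy_in, 5 * sizeof(int));   // previously sized for a single int
cudaMalloc(&dummy_out, 5 * sizeof(int));  // previously sized for a single int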

cudaMemcpy(d_out, out, 5 * sizeof(int), cudaMemcpyHostToDevice);

int *d_in;
cudaMalloc(&d_in, 5 * sizeof(int));
Collaborator

Please initialize d_in values to 0 to avoid any undefined behavior.
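
A minimal sketch of the requested initialization, assuming d_in is the derivative buffer later passed to execute_kernel as in the snippet above:

// Zero-initialize the freshly allocated derivative buffer before the launch,
// per the review comment, so the accumulated gradients start from a known state.
cudaMemset(d_in, 0, 5 * sizeof(int));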

@parth-07 (Collaborator) left a comment

Looks good to me.

@parth-07 (Collaborator) left a comment

Can you please squash all the commits into one?

@kchristin22 (Collaborator Author)

Can you please squash all the commits into one?

This will be done automatically when merging this PR

@parth-07 (Collaborator) commented Sep 14, 2024

This will be done automatically when merging this PR

The primary goal of squashing the commits manually in this case is to properly write and structure the commit message. For example, the commit message should describe which CUDA builtins are enabled by this pull request.

out[threadIdx.x] += in[threadIdx.x];
}

// CHECK: void add_kernel_2_grad(int *out, int *in, int *_d_out, int *_d_in) {
Collaborator

Can you investigate why we don't have __attribute__((device)) here?

Clang prints __attribute__((device)) if we replace the clang::CUDADeviceAttr::CreateImplicit call, m_Derivative->addAttr(clang::CUDADeviceAttr::CreateImplicit(m_Context));, with m_Derivative->addAttr(clang::CUDADeviceAttr::Create(m_Context));.
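
For reference, the two variants side by side, copied from the comment above (the surrounding ReverseModeVisitor code is assumed):

// Implicit attribute: the compiler still treats the derivative as a __device__
// function, but the AST printer does not emit the attribute in the generated source.
m_Derivative->addAttr(clang::CUDADeviceAttr::CreateImplicit(m_Context));

// Explicit attribute: the printer emits __attribute__((device)).
m_Derivative->addAttr(clang::CUDADeviceAttr::Create(m_Context));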

Collaborator Author

Yes, I've seen that. When the attribute is created implicitly, it is "hidden": only the compiler sees it and it is not printed. I can change it to the explicit creation (Create instead of CreateImplicit) for clarity in another PR if necessary.

Collaborator Author

My concern is that it may confuse the user into thinking that it is not a kernel actually being executed (as we don't print the overload)

Collaborator

Please create an issue for using Create instead of CreateImplicit.

My concern is that it may confuse the user into thinking that it is not a kernel actually being executed (as we don't print the overload)

I think it is more misleading to not show any attribute at all. Having the attribute present is also necessary for customers wanting to use Clad-generated derivatives independently.

Collaborator Author

Having the attribute present is also necessary for customers wanting to use Clad-generated derivatives independently.

But the attribute is not the correct one if the users want to execute the function themselves, as they would expect the multi-threaded execution offered by a global kernel. The attribute should be __global__ instead of __device__. A workaround would be to prepend the string "__global__ " when printing or dumping the derivative to a source file, if we want this to be intuitive.
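
To illustrate the point with the signature from the test's CHECK line (a hypothetical rendering of the workaround, not clad's actual output):

// With the "__global__ " prefix prepended at dump time, the derivative reads
// as a launchable kernel rather than a device-only function.
__global__ void add_kernel_2_grad(int *out, int *in, int *_d_out, int *_d_in);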

@parth-07 (Collaborator)

The commit message says:

Added support of CUDA grid configuration builtin variables. Builtins tested: threadIdx, blockIdx, blockDim, gridDim, warpSize

The pull request does not seem to have tests for gridDim and warpSize. Can you either remove them from the commit message or add tests for them?
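
For illustration only, a rough sketch of what such a test kernel could look like, following the add_kernel pattern already used in test/CUDA/GradientKernels.cu (the kernel name and the particular use of gridDim and warpSize here are assumptions, not the PR's actual tests):

__global__ void add_kernel_dims(int *out, int *in) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Read gridDim and warpSize so the generated reverse pass has to handle them too.
  if (idx < gridDim.x * blockDim.x && idx < warpSize)
    out[idx] += in[idx];
}

// Differentiating and launching it would follow the existing tests, e.g.:
//   auto grad = clad::gradient(add_kernel_dims, "in, out");
//   grad.execute_kernel(dim3(1), dim3(5, 1, 1), out, in, d_out, d_in);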

Added support of CUDA grid configuration builtin variables. Builtins tested: threadIdx, blockIdx, blockDim, gridDim, warpSize

@vgvassilev vgvassilev merged commit 2e5560e into vgvassilev:master Sep 24, 2024
88 of 90 checks passed

3 participants