Add support of CUDA device pullbacks #1111

Merged: 4 commits merged into vgvassilev:master on Oct 15, 2024

Conversation

@kchristin22 (Collaborator) commented Oct 7, 2024:

Add support for device pullback functions.

  1. If the function is a global kernel, we know that all of its args reside in global memory. These are stored in a container (m_GlobalArgs).
  2. If the function is a device function, it was either called directly inside a kernel or it is a nested device call (the outermost device function was called by a kernel).
    2.1. If the user called clad::gradient on the kernel, the m_GlobalArgs set is not empty. When a call is encountered, we iterate over the call's args and identify which of them reside in global memory. If such args are found, we store their indexes in another container (m_GlobalArgsIndexes) and include it in the differentiation request of the device function. When the latter is derived later on using DerivePullback(), we check whether m_GlobalArgsIndexes is non-empty (in other words, whether any of the call args are global) and, if so, match them with the parameters of this function's signature (adding them to the local m_GlobalArgs container). The same procedure is followed for nested calls. A sketch of this case is shown after the note below.
    2.2. If the user didn't call clad::gradient on the kernel, but only on the device function, then we can't tell at compile time which runtime args are global. As a result, the user should copy the data to local memory first and copy the results back afterwards:
    __global__ void kernel(double *in, double *out, double *_d_in, double *_d_out) {
      auto test = clad::gradient(device);
      test.execute(in, out, _d_in, _d_out); // use local vars here and then assign the result back to the global args
    }

NOTE: Currently Clad doesn't support variables in shared memory, which would be considered global as well.
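
For illustration, a minimal sketch of case 2.1 (hypothetical user code; the exact execute/launch API for kernel gradients is omitted):

    #include "clad/Differentiator/Differentiator.h"

    __device__ double add(double *a, double *b) { return a[0] + b[0]; }

    __global__ void kernel(double *in, double *out) {
      out[0] = add(in, in + 1); // device call whose args are the kernel's (global) args
    }

    int main() {
      // clad::gradient is requested on the kernel itself, so Clad knows at
      // compile time that 'in' and 'out' reside in global memory and can
      // record their indexes for the pullback of 'add'.
      auto kernel_grad = clad::gradient(kernel);
      // kernel_grad.execute(...); // launch with the usual args; omitted here
    }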

@kchristin22 kchristin22 self-assigned this Oct 7, 2024

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

/// the derivative (gradient) is being computed. This is separate from the
/// m_Variables map because all other intermediate variables will
/// not be stored here.
std::unordered_map<const clang::ValueDecl*, clang::Expr*> m_ParamVariables;

warning: member variable 'm_ParamVariables' has protected visibility [cppcoreguidelines-non-private-member-variables-in-classes]

    std::unordered_map<const clang::ValueDecl*, clang::Expr*> m_ParamVariables;
                                                              ^
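
For reference, a generic illustration of what this check asks for (hypothetical class and accessor names; a visitor hierarchy may legitimately prefer protected members and silence the check instead):

    #include <unordered_map>

    namespace clang { class ValueDecl; class Expr; }

    class VisitorSketch {
    public:
      // Expose a narrow accessor instead of the map itself.
      clang::Expr*& paramVariable(const clang::ValueDecl* VD) {
        return m_ParamVariables[VD];
      }

    private:
      std::unordered_map<const clang::ValueDecl*, clang::Expr*> m_ParamVariables;
    };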

  if (OverloadedDerivedFn && asGrad) {
    // Derivative was found.
    FunctionDecl* fnDecl =
        dyn_cast<CUDAKernelCallExpr>(OverloadedDerivedFn)->getDirectCallee();

warning: Called C++ object pointer is null [clang-analyzer-core.CallAndMessage]

        dyn_cast<CUDAKernelCallExpr>(OverloadedDerivedFn)->getDirectCallee();
        ^
Additional context

lib/Differentiator/ErrorEstimator.cpp:444: Assuming 'OverloadedDerivedFn' is non-null

  if (OverloadedDerivedFn && asGrad) {
      ^

lib/Differentiator/ErrorEstimator.cpp:444: Left side of '&&' is true

  if (OverloadedDerivedFn && asGrad) {
      ^

lib/Differentiator/ErrorEstimator.cpp:444: Assuming 'asGrad' is true

  if (OverloadedDerivedFn && asGrad) {
                             ^

lib/Differentiator/ErrorEstimator.cpp:444: Taking true branch

  if (OverloadedDerivedFn && asGrad) {
  ^

lib/Differentiator/ErrorEstimator.cpp:447: Assuming 'OverloadedDerivedFn' is not a 'CastReturnType'

        dyn_cast<CUDAKernelCallExpr>(OverloadedDerivedFn)->getDirectCallee();
        ^

lib/Differentiator/ErrorEstimator.cpp:447: Called C++ object pointer is null

        dyn_cast<CUDAKernelCallExpr>(OverloadedDerivedFn)->getDirectCallee();
        ^
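
A hedged sketch of the usual remedies for this class of warning, reusing the names from the excerpt above (whether the unchecked variant is safe depends on invariants not visible in this snippet):

      // Option 1: if OverloadedDerivedFn is guaranteed to be a kernel call,
      // make the invariant explicit; cast<> asserts instead of returning null.
      auto* KernelCall = cast<CUDAKernelCallExpr>(OverloadedDerivedFn);
      FunctionDecl* fnDecl = KernelCall->getDirectCallee();

      // Option 2: otherwise, guard the access on the dyn_cast result.
      if (auto* KernelCall = dyn_cast<CUDAKernelCallExpr>(OverloadedDerivedFn)) {
        FunctionDecl* fnDecl = KernelCall->getDirectCallee();
        // ... use fnDecl ...
      }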

    const clang::CUDAKernelCallExpr*& KCE, clang::Expr*& OverloadedDerivedFn,
    llvm::SmallVectorImpl<clang::Expr*>& derivedCallArgs,
    llvm::SmallVectorImpl<clang::Expr*>& ArgResult, bool asGrad) {
  for (auto source : m_Sources) {

warning: 'auto source' can be declared as 'auto *source' [llvm-qualified-auto]

Suggested change
for (auto source : m_Sources) {
for (auto *source : m_Sources) {

          DerivedCallArgs.front()->getType(), m_Context, 1));
      OverloadedDerivedFn = m_Builder.BuildCallToCustomDerivativeKernel(
          customPushforward, pushforwardCallArgs, getCurrentScope(),
          const_cast<DeclContext*>(FD->getDeclContext()), config);

warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]

          const_cast<DeclContext*>(FD->getDeclContext()), config);
          ^

          clad::utils::ComputeEffectiveFnName(FD) + "_pullback";
      OverloadedDerivedFn = m_Builder.BuildCallToCustomDerivativeKernel(
          customPullback, pullbackCallArgs, getCurrentScope(),
          const_cast<DeclContext*>(FD->getDeclContext()), config);

warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]

          const_cast<DeclContext*>(FD->getDeclContext()), config);
          ^

@kchristin22 kchristin22 changed the title Add support of CUDA nested calls: Kernel call inside a host function and device function call inside kernel Draft: Add support of CUDA nested calls: Kernel call inside a host function and device function call inside kernel Oct 7, 2024
@kchristin22 kchristin22 marked this pull request as draft October 7, 2024 12:57

@vgvassilev (Owner) commented:

In a separate PR we should probably move the CUDA-specific builtins into a separate header.

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

@codecov bot commented Oct 8, 2024:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.26%. Comparing base (f86eede) to head (8accef3).
Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1111      +/-   ##
==========================================
+ Coverage   94.24%   94.26%   +0.02%     
==========================================
  Files          48       48              
  Lines        8164     8198      +34     
==========================================
+ Hits         7694     7728      +34     
  Misses        470      470              
Files with missing lines Coverage Δ
include/clad/Differentiator/DiffPlanner.h 66.66% <ø> (ø)
include/clad/Differentiator/ExternalRMVSource.h 25.00% <ø> (ø)
include/clad/Differentiator/ReverseModeVisitor.h 97.22% <ø> (ø)
lib/Differentiator/ReverseModeVisitor.cpp 95.53% <100.00%> (+0.06%) ⬆️
lib/Differentiator/VisitorBase.cpp 97.14% <100.00%> (ø)

@parth-07 (Collaborator) left a comment:

We should have separate PRs for enabling kernel calls in a host function and device function calls inside a kernel.

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

@@ -46,6 +46,8 @@ struct DiffRequest {
clang::CallExpr* CallContext = nullptr;
/// Args provided to the call to clad::gradient/differentiate.
const clang::Expr* Args = nullptr;
/// Indexes of global args of function as a subset of Args.
std::unordered_set<size_t> GlobalArgsIndexes;

warning: member variable 'GlobalArgsIndexes' has public visibility [cppcoreguidelines-non-private-member-variables-in-classes]

  std::unordered_set<size_t> GlobalArgsIndexes;
                             ^

/// the derivative (gradient) is being computed. This is separate from the
/// m_Variables map because all other intermediate variables will
/// not be stored here.
std::unordered_set<const clang::ValueDecl*> m_ParamVarsWithDiff;

warning: member variable 'm_ParamVarsWithDiff' has protected visibility [cppcoreguidelines-non-private-member-variables-in-classes]

    std::unordered_set<const clang::ValueDecl*> m_ParamVarsWithDiff;
                                                ^

@@ -51,6 +56,8 @@
/// that will be put immediately in the beginning of derivative function
/// block.
Stmts m_Globals;
/// Global args of the function.
std::unordered_set<const clang::ParmVarDecl*> m_GlobalArgs;

warning: member variable 'm_GlobalArgs' has protected visibility [cppcoreguidelines-non-private-member-variables-in-classes]

    std::unordered_set<const clang::ParmVarDecl*> m_GlobalArgs;
                                                  ^

@kchristin22 kchristin22 requested a review from parth-07 October 11, 2024 11:51
@kchristin22 kchristin22 changed the title Draft: Add support of CUDA nested calls: Kernel call inside a host function and device function call inside kernel Add support of CUDA nested calls: Kernel call inside a host function and device function call inside kernel Oct 12, 2024
@kchristin22 kchristin22 marked this pull request as ready for review October 12, 2024 13:15

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

@parth-07 (Collaborator) commented:

Can you please rebase the pull-request on top of origin/master?

@vgvassilev (Owner) commented:

> Can you please rebase the pull-request on top of origin/master?

Done. @kchristin22 we need a better commit message.

@kchristin22 kchristin22 changed the title Add support of CUDA nested calls: Kernel call inside a host function and device function call inside kernel Add support of CUDA device pullbacks Oct 13, 2024

@kchristin22 (Collaborator, Author) commented:

> Done. @kchristin22 we need a better commit message.

On it!

For this purpose, a deeper look into atomic ops had to be taken. Atomic ops can only be applied to global or shared GPU memory.
Hence, we needed to identify which call args of the device function pullback are actually kernel args and, thus, global.
The indexes of those args are stored in the differentiation request of the inner device function (GlobalArgsIndexes) and appended to the name of the pullback function.
Later on, when deriving the encountered device function, the global call args are matched with the function's params based on their stored indexes.
This way, the atomic ops are kept to the minimum necessary and no error arises.
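
To make the effect concrete, a hypothetical sketch of the resulting code shape (illustrative names, not Clad's exact output; atomicAdd on double requires sm_60 or newer):

    __device__ double fn(double *in, double x) { return in[0] * x; }

    // Pullback sketch for fn: 'in' and '_d_in' are kernel args and therefore
    // point to global memory shared by all threads, so their adjoints must be
    // accumulated atomically; '_d_x' points to a thread-local adjoint, so a
    // plain update suffices.
    __device__ void fn_pullback(double *in, double x, double _d_y,
                                double *_d_in, double *_d_x) {
      atomicAdd(&_d_in[0], x * _d_y); // global-memory adjoint: needs an atomic
      *_d_x += in[0] * _d_y;          // thread-local adjoint: no atomic needed
    }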

@parth-07 (Collaborator) left a comment:

Looks good, except for one nitpick comment.

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

@kchristin22 (Collaborator, Author) commented:

Done!

@vgvassilev (Owner) left a comment:

LGTM!

@vgvassilev vgvassilev merged commit fc6d311 into vgvassilev:master Oct 15, 2024
90 checks passed