
Handle write-race conditions in CUDA kernels: Add atomic operation #1104

Merged: 1 commit merged into vgvassilev:master from cuda-atomic on Oct 5, 2024

Conversation

@kchristin22 (Collaborator) commented on Sep 24, 2024:

When two or more threads read from the same address, the reverse-mode derivative translates this into two or more threads writing to the same position:

__global__ void kernel(int *a, int *b) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  a[2 * index] = b[0];
  a[2 * index + 1] = b[0];
}

void kernel_grad(int *a, int *b, int *_d_a, int *_d_b) {
    unsigned int _t1 = blockIdx.x;
    unsigned int _t0 = blockDim.x;
    int _d_index = 0;
    int index0 = threadIdx.x + _t1 * _t0;
    int _t2 = a[2 * index0];
    a[2 * index0] = b[0];
    int _t3 = a[2 * index0 + 1];
    a[2 * index0 + 1] = b[0];
    {
        a[2 * index0 + 1] = _t3;
        int _r_d1 = _d_a[2 * index0 + 1];
        _d_a[2 * index0 + 1] = 0;
        _d_b[0] += _r_d1;
    }
    {
        a[2 * index0] = _t2;
        int _r_d0 = _d_a[2 * index0];
        _d_a[2 * index0] = 0;
        _d_b[0] += _r_d0;
    }
}

A simple solution to this problem is to make the final add-assignment atomic:

__global__ void kernel(int *a, int *b) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  a[2 * index] = b[0];
  a[2 * index + 1] = b[0];
}

void kernel_grad(int *a, int *b, int *_d_a, int *_d_b) {
    unsigned int _t1 = blockIdx.x;
    unsigned int _t0 = blockDim.x;
    int _d_index = 0;
    int index0 = threadIdx.x + _t1 * _t0;
    int _t2 = a[2 * index0];
    a[2 * index0] = b[0];
    int _t3 = a[2 * index0 + 1];
    a[2 * index0 + 1] = b[0];
    {
        a[2 * index0 + 1] = _t3;
        int _r_d1 = _d_a[2 * index0 + 1];
        _d_a[2 * index0 + 1] = 0;
        atomicAdd(&_d_b[0], _r_d1);
    }
    {
        a[2 * index0] = _t2;
        int _r_d0 = _d_a[2 * index0];
        _d_a[2 * index0] = 0;
        atomicAdd(&_d_b[0], _r_d0);
    }
}
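The hazard the atomic fixes can be illustrated on the host with `std::thread`: many threads accumulate into one shared gradient slot, just like the concurrent `_d_b[0] += ...` writes in the reverse pass. This is an illustration only, not clad output; `std::atomic::fetch_add` plays the role of `atomicAdd`:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side sketch of the write race: num_threads threads each add
// per_thread_adds contributions into a single shared accumulator,
// mimicking many threads updating _d_b[0] in the generated gradient.
int accumulate_gradient(int num_threads, int per_thread_adds) {
  std::atomic<int> d_b{0};  // shared gradient slot
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t)
    threads.emplace_back([&] {
      for (int i = 0; i < per_thread_adds; ++i)
        d_b.fetch_add(1);  // atomic add; a plain += here could lose updates
    });
  for (auto& th : threads)
    th.join();
  return d_b.load();
}
```

With the atomic add the result is always `num_threads * per_thread_adds`; with a plain non-atomic `+=` on a shared `int`, concurrent read-modify-write interleavings can silently drop contributions.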

@kchristin22 self-assigned this on Sep 24, 2024

codecov bot commented Sep 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.22%. Comparing base (844d9a3) to head (4c7bea2).
Report is 1 commit behind head on master.

Additional details and impacted files


@@            Coverage Diff             @@
##           master    #1104      +/-   ##
==========================================
+ Coverage   94.20%   94.22%   +0.01%     
==========================================
  Files          48       48              
  Lines        8104     8132      +28     
==========================================
+ Hits         7634     7662      +28     
  Misses        470      470              
Files with missing lines                              Coverage Δ
include/clad/Differentiator/ReverseModeVisitor.h      97.22% <ø> (ø)
lib/Differentiator/ReverseModeVisitor.cpp             95.44% <100.00%> (+0.05%) ⬆️

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions


FunctionDecl* atomicAddFunc = nullptr;
for (LookupResult::iterator it = lookupResult.begin();
     it != lookupResult.end(); it++) {
warning: use range-based for loop instead [modernize-loop-convert]

Suggested change:
-    it != lookupResult.end(); it++) {
+    for (auto decl : lookupResult) {

lib/Differentiator/ReverseModeVisitor.cpp:1497:

-           NamedDecl* decl = *it;
-           // FIXME: check for underlying types of the pointers
+           // FIXME: check for underlying types of the pointers


FunctionDecl* atomicAddFunc = nullptr;
for (LookupResult::iterator it = lookupResult.begin();
     it != lookupResult.end(); it++) {
warning: use range-based for loop instead [modernize-loop-convert]

Suggested change:
-    it != lookupResult.end(); it++) {
+    for (auto decl : lookupResult) {

lib/Differentiator/ReverseModeVisitor.cpp:2317:

-               NamedDecl* decl = *it;
-               // FIXME: check for underlying types of the pointers
+               // FIXME: check for underlying types of the pointers

@github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

m_Context.getTranslationUnitDecl());

FunctionDecl* atomicAddFunc = nullptr;
for (auto decl : lookupResult) {
warning: 'auto decl' can be declared as 'auto *decl' [llvm-qualified-auto]

Suggested change:
-for (auto decl : lookupResult) {
+for (auto *decl : lookupResult) {

m_Context.getTranslationUnitDecl());

FunctionDecl* atomicAddFunc = nullptr;
for (auto decl : lookupResult) {
warning: 'auto decl' can be declared as 'auto *decl' [llvm-qualified-auto]

Suggested change:
-for (auto decl : lookupResult) {
+for (auto *decl : lookupResult) {
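The two clang-tidy hints combine into a single idiom: a range-based loop spelled with an explicit pointer. A self-contained illustration (toy types, not clad code — `Decl` and `sumIds` are made up for the example):

```cpp
#include <vector>

// Toy stand-in for clang's NamedDecl / LookupResult, just to show the
// loop style the warnings ask for.
struct Decl { int id; };

int sumIds(const std::vector<Decl*>& lookupResult) {
  int sum = 0;
  // Range-based loop (modernize-loop-convert) with the pointer made
  // explicit (llvm-qualified-auto): `auto *decl`, not `auto decl`.
  for (auto *decl : lookupResult)
    sum += decl->id;
  return sum;
}
```

The `auto *` spelling deduces the same `Decl*` type as plain `auto`, but makes it visible to the reader that `decl` is a pointer, which is the LLVM style the check enforces.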

@vgvassilev force-pushed the cuda-atomic branch 2 times, most recently from 7a1d405 to 6d6f853 on September 24, 2024 15:12
@parth-07 (Collaborator) left a comment:

Overall looks good.

auto* add_assign = BuildOp(BO_AddAssign, result, dfdx());
// Add it to the body statements.
addToCurrentBlock(add_assign, direction::reverse);
if (m_DiffReq->hasAttr<clang::CUDAGlobalAttr>()) {
parth-07: We also need to go in this if path when differentiating device functions as well, right?

// Add it to the body statements.
addToCurrentBlock(add_assign, direction::reverse);
if (m_DiffReq->hasAttr<clang::CUDAGlobalAttr>()) {
DeclarationName atomicAddId = &m_Context.Idents.get("atomicAdd");
parth-07: Please create a separate function for finding the appropriate atomicAdd function.

m_Context.getTranslationUnitDecl());

FunctionDecl* atomicAddFunc = nullptr;
for (auto decl : lookupResult) {
parth-07: Also, can we use unresolved lookup in Sema::ActOnCallExpr, as we do in DerivativeBuilder::BuildCallToCustomDerivativeOrNumericalDiff, instead of explicitly finding the correct atomicAdd declaration?

Expr* atomicCall = BuildCallExprToFunction(atomicAddFunc, atomicArgs);

// Add it to the body statements.
addToCurrentBlock(atomicCall, direction::reverse);
parth-07: Please shift this entire if path into a separate function such as CudaAddAssign.

github-actions bot commented:
clang-tidy review says "All clean, LGTM! 👍"

@@ -104,6 +104,39 @@ Expr* getArraySizeExpr(const ArrayType* AT, ASTContext& context,
return CladTapeResult{*this, PushExpr, PopExpr, TapeRef};
}

clang::Expr* ReverseModeVisitor::BuildCallToCudaAtomicAdd(clang::Expr* LHS,
                                                          clang::Expr* RHS) {
parth-07: Why is the function not using RHS?

kchristin22 (Author): Great observation! It was left over from the copy-paste. I'll fix it.

auto* add_assign = BuildOp(BO_AddAssign, result, dfdx());
// Add it to the body statements.
addToCurrentBlock(add_assign, direction::reverse);
if (m_DiffReq->hasAttr<clang::CUDAGlobalAttr>() ||
parth-07: The same if-condition is repeated in two places. Can you please create a separate function for this, such as shouldUseCUDAAtomicOps()?

kchristin22 (Author): Should I make this a variable or a function of ReverseModeVisitor?

parth-07: A function sounds better to me.
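A minimal sketch of the helper being discussed. This is hypothetical, not the merged clad code: in clad it would be a method on ReverseModeVisitor querying the clang AST, roughly `m_DiffReq->hasAttr<clang::CUDAGlobalAttr>() || m_DiffReq->hasAttr<clang::CUDADeviceAttr>()`; here the attribute checks are modeled with plain bools so the sketch is self-contained:

```cpp
// Hypothetical model of the differentiation request; in clad the two
// flags correspond to clang's CUDAGlobalAttr and CUDADeviceAttr.
struct DiffRequest {
  bool hasCudaGlobalAttr = false;  // function is a __global__ kernel
  bool hasCudaDeviceAttr = false;  // function is a __device__ function
};

// Atomic ops are needed whenever the generated derivative itself runs on
// the GPU, i.e. for kernels and device functions alike.
bool shouldUseCUDAAtomicOps(const DiffRequest& req) {
  return req.hasCudaGlobalAttr || req.hasCudaDeviceAttr;
}
```

Centralizing the check in one function keeps the two call sites mentioned above from drifting apart when, say, device-function support is added.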


@parth-07 (Collaborator) left a comment:

Looks good.

@vgvassilev force-pushed the cuda-atomic branch 2 times, most recently from 20aef9d to cd74405 on October 4, 2024 06:21

@vgvassilev merged commit 4ac4f77 into vgvassilev:master on Oct 5, 2024
90 checks passed