[DON'T MERGE] Revert hip cooperative groups workaround [REDUX] #22

Draft · wants to merge 79 commits into base: master

79 commits
513f861
1. Disable third_party_install so flow doesn't automatically build AP…
jithunnair-amd May 5, 2020
ed421e9
Update setup.py to hipify before building extension
jithunnair-amd May 6, 2020
e82fa34
Cooperative groups is not supported by HIP yet, so replace with worka…
jithunnair-amd May 6, 2020
bdb8421
Use ROCm APEX
jithunnair-amd May 8, 2020
7c0e6ac
Update ROCm APEX commit to get FusedLayerNorm and hipification fixes
jithunnair-amd May 8, 2020
23200d4
Update requirements to use tensorflow-rocm package instead of tensorf…
jithunnair-amd May 8, 2020
c10bdcb
Use DeepSpeedExamples fork
jithunnair-amd May 8, 2020
43212b3
Typo
jithunnair-amd May 8, 2020
54ad8a5
Use changes_for_rocm_build branch for jithunnair-amd fork of DeepSpee…
jithunnair-amd May 16, 2020
453d501
Update ROCm APEX commit
jithunnair-amd May 28, 2020
4454bc2
Update DeepSpeedExamples commit
jithunnair-amd May 29, 2020
db28f75
Update ROCm Apex commit
jithunnair-amd Jun 25, 2020
30f661e
Merge branch upstream into master
jithunnair-amd Sep 15, 2020
077638d
Enable cooperative groups for ROCm
jithunnair-amd Sep 15, 2020
66c135e
Update setup.py to build lamb extension for ROCm
jithunnair-amd Sep 15, 2020
9379918
Do not install torch and torchvision for ROCm using pip
jithunnair-amd Sep 15, 2020
b5866a6
Use ROCm fork of DeepSpeedExamples
jithunnair-amd Sep 16, 2020
9c624c2
Update DeepSpeedExamples commit to use ROCm fork master branch
jithunnair-amd Sep 16, 2020
ab6aca1
Update DeepSpeedExamples commit
jithunnair-amd Sep 26, 2020
884f08e
ROCm PyTorch can be installed in the user local area in some cases
jithunnair-amd Sep 26, 2020
17febe5
Remove requirements.txt since upstream moved it to requirements folder
jithunnair-amd Sep 29, 2020
46d64e2
Add Dockerfile for ROCm
jithunnair-amd Sep 30, 2020
c2d4cc0
Add skips for unit tests that fail on ROCm. Current status: 72 passed…
jithunnair-amd Sep 30, 2020
9f0c80d
Enable CPU adam extension for ROCm
jithunnair-amd Oct 19, 2020
cb3f83a
Install requirements as appropriate for ROCm
jithunnair-amd Oct 27, 2020
617027f
Skip additional unit tests that fail on CI (but not locally)
jithunnair-amd Oct 28, 2020
a508e62
Do not skip unit tests which pass with latest PyTorch
jithunnair-amd Nov 3, 2020
3dd5e2d
Modify include files to build CPU Adam extension
jithunnair-amd Nov 3, 2020
77cd5c3
Update setup.py for latest hipify
jithunnair-amd Dec 16, 2020
7f9bbeb
Update CPU Adam header files to remove ifdefing unnecessary with late…
jithunnair-amd Dec 16, 2020
ea71005
Hipified transformer kernel extensions
jithunnair-amd Dec 23, 2020
fbddd93
Cooperative Groups workaround for transformer kernels extension
jithunnair-amd Dec 23, 2020
9091b20
Update apex commit
jithunnair-amd Jan 7, 2021
3edda06
Merge from upstream; resolve conflicts; checkout 'theirs' for tests/u…
jithunnair-amd Mar 26, 2021
5e6bb85
Integrate op_builder from upstream and update for ROCm
jithunnair-amd Mar 26, 2021
67ed124
Update Dockerfile.rocm
jithunnair-amd Mar 27, 2021
c4fe427
Temporary hacks to workaround: 1) setup.py issues on ROCm wrt. absolu…
jithunnair-amd Mar 27, 2021
74ebc97
torch.version.cuda doesn't exist for ROCm PyTorch
jithunnair-amd Mar 27, 2021
1bb74d0
Add hip_version
jithunnair-amd Mar 29, 2021
3d4e19d
Check hip version for ROCm builds
jithunnair-amd Mar 29, 2021
9939bd7
Remove unused dir
jithunnair-amd Mar 31, 2021
99571e5
Skipped the tests with the error,
rraminen Apr 8, 2021
9d8ad53
Updated Dockerfile.rocm
rraminen Apr 9, 2021
e323eab
Merge pull request #5 from ROCmSoftwarePlatform/Dockerfile.rocm_PR
jithunnair-amd Apr 10, 2021
529ebcd
Update skipIfRocm to add customizable reason string (#6)
jithunnair-amd Apr 12, 2021
37651f3
Disable AVX512 for ROCm to enable same build of DeepSpeed to work on …
jithunnair-amd Apr 13, 2021
7be71d3
Update headers and include_dirs to enable transformer extension (#8)
jithunnair-amd Apr 19, 2021
1c69737
Add patched CG headers to rocm install path (#9)
jithunnair-amd Apr 19, 2021
ac4f8d5
Update DeepSpeedExamples commit (#10)
jithunnair-amd Apr 19, 2021
14204ab
Update DeepSpeedExamples commit
jithunnair-amd Apr 21, 2021
827ebfb
Update DeepSpeedExamples commit
jithunnair-amd Apr 22, 2021
2f77a87
v0.3.15 IFU
rraminen Apr 28, 2021
0d06e02
Merge pull request #12 from rraminen/IFU
jeffdaily Apr 28, 2021
3f2657f
Add Github Actions ifu.yml
jithunnair-amd May 11, 2021
9b41aa7
Update ifu.yml to ignore DeepSpeedExamples
jithunnair-amd May 12, 2021
497f5a1
Merge remote-tracking branch 'upstream/master' into IFU-master-2021-0…
jithunnair-amd May 12, 2021
b1563d6
Merge pull request #13 from ROCmSoftwarePlatform/IFU-master-2021-05-12
jithunnair-amd May 12, 2021
2066405
Update DeepSpeedExamples commit
jithunnair-amd May 12, 2021
0a87051
Merge remote-tracking branch 'upstream/master'
invalid-email-address May 17, 2021
e827515
Use branch name in PR title/branch name
jithunnair-amd May 17, 2021
ae10359
Merge pull request #14 from ROCmSoftwarePlatform/IFU-master-2021-05-17
jithunnair-amd May 17, 2021
4c7a252
Add email functionality
jithunnair-amd May 21, 2021
7b900de
IFU-master-2021-05-27
rraminen May 28, 2021
5de081e
Pointed DeepSpeedExamples to latest commit after IFU
rraminen Jun 4, 2021
1850f88
Merge pull request #17 from rraminen/IFU_5_27
jithunnair-amd Jun 4, 2021
d296665
Revert "Add patched CG headers to rocm install path (#9)"
rraminen Jun 23, 2021
f50fa7b
Revert "Update headers and include_dirs to enable transformer extensi…
rraminen Jun 23, 2021
2585f29
Added back the required code from the commits, 1c69737e1a8a8ae5ed9d29…
rraminen Jun 23, 2021
0be9645
Revert "Cooperative Groups workaround for transformer kernels extension"
rraminen Jun 23, 2021
f428da5
Added defined(__HIP_PLATFORM_HCC__) to kernels code
rraminen Jun 23, 2021
ed2ee34
Revert "Enable cooperative groups for ROCm"
rraminen Jun 23, 2021
742fd64
Enable cooperative groups for ROCm
rraminen Jun 23, 2021
1d20b14
Added CuPy installation from source
rraminen Jun 24, 2021
f6c79ae
Added h5py installation
rraminen Jun 24, 2021
0cf3306
Merge pull request #20 from rraminen/PR_Update_Dockerfile
jithunnair-amd Jun 28, 2021
81b744e
hip cooperative groups functionality for coalesced_group in fused_lam…
rraminen Jun 28, 2021
bf2979b
Merge pull request #21 from rraminen/PR_Revert_HIP_Cooperative_Groups…
jithunnair-amd Jun 29, 2021
5b0fac7
Revert "Merge pull request #21 from rraminen/PR_Revert_HIP_Cooperativ…
jithunnair-amd Jun 29, 2021
909f9bf
Re-Merge pull request #21 from rraminen/PR_Revert_HIP_Cooperative_Gro…
jithunnair-amd Jun 29, 2021
63 changes: 63 additions & 0 deletions .github/workflows/ifu.yml
@@ -0,0 +1,63 @@
name: IntegrateFromUpstream
on:
# schedule:
# # verified via crontab.guru website. “At 06:55 on Monday.”
# - cron: '55 6 * * 1'
workflow_dispatch:
inputs:
message:
description: 'Reason for manual trigger'
required: false
default: 'refresh branch'
jobs:
IntegrateFromUpstream:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 0
- name: Get Current Date
id: date
run: echo "::set-output name=date::$(date +'%Y-%m-%d')"
- name: Extract branch name
id: extract_branch
shell: bash
run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
- name: Fetch and Merge
id: fetch_and_merge
run: |
echo "Reason for trigger: ${{ github.event.inputs.message }}"
echo "Actor for trigger: ${{ github.actor }}"
git config user.name github-actions
git config user.email [email protected]
git remote add upstream https://github.com/microsoft/DeepSpeed
git fetch upstream master
git merge upstream/master
# Since we use our own fork of DeepSpeedExamples, ignore theirs
git checkout HEAD DeepSpeedExamples
- name: Create Pull Request
id: create_pull_request
uses: jithunnair-amd/create-pull-request@v3
with:
# token: ${{ secrets.PAT }}
branch: IFU-${{ steps.extract_branch.outputs.branch }}-${{ steps.date.outputs.date }}
title: IFU-${{ steps.extract_branch.outputs.branch }}-${{ steps.date.outputs.date }}
assignees: rraminen
reviewers: jithunnair-amd
delete-branch: true
- name: Send email
uses: jithunnair-amd/[email protected]
if: always()
with:
server_address: smtp.gmail.com
server_port: 465
secure: true
username: ${{ secrets.GMAIL_USERNAME }}
password: ${{ secrets.GMAIL_PASSWORD }}
subject: IFU to ${{ steps.extract_branch.outputs.branch }} branch of ${{ github.repository }}
to: [email protected], [email protected]
from: ${{ secrets.GMAIL_USERNAME }}
html_body: |
<b>Fetch and Merge</b>: ${{ steps.fetch_and_merge.outcome }} <br/>
<b>Create Pull Request</b>: ${{ steps.create_pull_request.outcome }} <br/>
<b>Pull request</b>: <a href="${{ steps.create_pull_request.outputs.pull-request-url }}">${{ steps.create_pull_request.outputs.pull-request-url }}</a> <br/>
3 changes: 1 addition & 2 deletions .gitmodules
@@ -1,4 +1,3 @@
[submodule "DeepSpeedExamples"]
path = DeepSpeedExamples
url = https://github.com/microsoft/DeepSpeedExamples
branch = master
url = https://github.com/ROCmSoftwarePlatform/DeepSpeedExamples.git
2 changes: 1 addition & 1 deletion DeepSpeedExamples
18 changes: 18 additions & 0 deletions csrc/includes/cublas_wrappers.h
@@ -5,7 +5,9 @@
#include <cuda.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#ifndef __HIP_PLATFORM_HCC__
#include <mma.h>
#endif
#include <stdio.h>

int cublas_gemm_ex(cublasHandle_t handle,
@@ -19,7 +21,11 @@ int cublas_gemm_ex(cublasHandle_t handle,
const float* A,
const float* B,
float* C,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo algo = rocblas_gemm_algo_standard);
#else
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT);
#endif

int cublas_gemm_ex(cublasHandle_t handle,
cublasOperation_t transa,
@@ -32,7 +38,11 @@ int cublas_gemm_ex(cublasHandle_t handle,
const __half* A,
const __half* B,
__half* C,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo algo = rocblas_gemm_algo_standard);
#else
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP);
#endif

int cublas_strided_batched_gemm(cublasHandle_t handle,
int m,
@@ -49,7 +59,11 @@ int cublas_strided_batched_gemm(cublasHandle_t handle,
int stride_B,
int stride_C,
int batch,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo algo = rocblas_gemm_algo_standard);
#else
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT);
#endif

int cublas_strided_batched_gemm(cublasHandle_t handle,
int m,
@@ -66,4 +80,8 @@ int cublas_strided_batched_gemm(cublasHandle_t handle,
int stride_B,
int stride_C,
int batch,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo algo = rocblas_gemm_algo_standard);
#else
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP);
#endif
4 changes: 4 additions & 0 deletions csrc/includes/custom_cuda_layers.h
@@ -5,7 +5,11 @@
#include <stdio.h>
#include <stdlib.h>

#ifdef __HIP_PLATFORM_HCC__
#include <hip/hip_cooperative_groups.h>
#else
#include <cooperative_groups.h>
#endif
#include <curand_kernel.h>

#include "context.h"
12 changes: 12 additions & 0 deletions csrc/includes/feed_forward.h
@@ -43,7 +43,11 @@ class FeedForward {
weights,
input_ptr,
out,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(config_.gemm_algos[0]));
#else
cublasGemmAlgo_t(config_.gemm_algos[0]));
#endif
}
void Backward(int bsz,
const T* out_grad,
@@ -68,7 +72,11 @@
input_ptr,
out_grad,
weights_grad,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(config_.gemm_algos[1]));
#else
cublasGemmAlgo_t(config_.gemm_algos[1]));
#endif

cublas_gemm_ex(_cublasHandle,
CUBLAS_OP_N,
@@ -81,7 +89,11 @@
weights,
out_grad,
inp_grad_out,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(config_.gemm_algos[2]));
#else
cublasGemmAlgo_t(config_.gemm_algos[2]));
#endif

launch_fuse_transpose_bias_kernel<T>(out_grad, bias_grad, bsz, config_.outputSize, stream);
}
36 changes: 36 additions & 0 deletions csrc/includes/gemm_test.h
@@ -2,7 +2,9 @@
#pragma once

#include <cuda_fp16.h>
#ifndef __HIP_PLATFORM_HCC__
#include <cuda_profiler_api.h>
#endif
#include <array>
#include <cstdio>
#include <cstdlib>
@@ -58,7 +60,11 @@ class GemmTest {
B,
A,
C,
#ifdef __HIP_PLATFORM_HCC__
static_cast<rocblas_gemm_algo>(algo));
#else
static_cast<cublasGemmAlgo_t>(algo));
#endif
});

int algo_bw1 = Run(loops, [=](int algo) {
@@ -73,7 +79,11 @@
A,
C,
B,
#ifdef __HIP_PLATFORM_HCC__
static_cast<rocblas_gemm_algo>(algo));
#else
static_cast<cublasGemmAlgo_t>(algo));
#endif
});

int algo_bw2 = Run(loops, [=](int algo) {
@@ -88,7 +98,11 @@
B,
C,
A,
#ifdef __HIP_PLATFORM_HCC__
static_cast<rocblas_gemm_algo>(algo));
#else
static_cast<cublasGemmAlgo_t>(algo));
#endif
});

return std::array<int, 3>({algo_fw, algo_bw1, algo_bw2});
@@ -100,8 +114,13 @@ class GemmTest {
float fast_latency = (std::numeric_limits<float>::max)();
int fast_algo = 0;

#ifdef __HIP_PLATFORM_HCC__
for (int algo = (int)rocblas_gemm_algo_standard;
algo <= (int)rocblas_gemm_algo_standard;
#else
for (int algo = (int)CUBLAS_GEMM_DEFAULT_TENSOR_OP;
algo <= (int)CUBLAS_GEMM_ALGO15_TENSOR_OP;
#endif
algo++) {
int warm_up = 5;
for (int i = 0; i < warm_up; ++i) f(algo);
@@ -186,7 +205,11 @@ class StridedGemmTest {
stride_b,
stride_c,
bsz,
#ifdef __HIP_PLATFORM_HCC__
static_cast<rocblas_gemm_algo>(algo));
#else
static_cast<cublasGemmAlgo_t>(algo));
#endif
});

int algo_bw1 = Run(loops, [=](int algo) {
@@ -216,7 +239,11 @@
stride_b,
stride_c,
bsz,
#ifdef __HIP_PLATFORM_HCC__
static_cast<rocblas_gemm_algo>(algo));
#else
static_cast<cublasGemmAlgo_t>(algo));
#endif
});

int algo_bw2 = Run(loops, [=](int algo) {
@@ -243,7 +270,11 @@
stride_b,
stride_c,
bsz,
#ifdef __HIP_PLATFORM_HCC__
static_cast<rocblas_gemm_algo>(algo));
#else
static_cast<cublasGemmAlgo_t>(algo));
#endif
});

return std::array<int, 3>({algo_fw, algo_bw1, algo_bw2});
@@ -255,8 +286,13 @@ class StridedGemmTest {
float fast_latency = (std::numeric_limits<float>::max)();
int fast_algo = 0;

#ifdef __HIP_PLATFORM_HCC__
for (int algo = (int)rocblas_gemm_algo_standard;
algo <= (int)rocblas_gemm_algo_standard;
#else
for (int algo = (int)CUBLAS_GEMM_DEFAULT_TENSOR_OP;
algo <= (int)CUBLAS_GEMM_ALGO15_TENSOR_OP;
#endif
algo++) {
int warm_up = 5;
for (int i = 0; i < warm_up; ++i) f(algo);
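
Assembled for readability, the algorithm-search loops above reduce to the following shape (a sketch, not the literal code, since the surrounding Run() plumbing is elided in this diff): on ROCm the sweep collapses to the single generic rocBLAS algorithm, while on CUDA it still times the tensor-op algorithms ALGO0 through ALGO15.

```cpp
#ifdef __HIP_PLATFORM_HCC__
const int algo_begin = (int)rocblas_gemm_algo_standard;
const int algo_end   = (int)rocblas_gemm_algo_standard;    // one candidate
#else
const int algo_begin = (int)CUBLAS_GEMM_DEFAULT_TENSOR_OP;
const int algo_end   = (int)CUBLAS_GEMM_ALGO15_TENSOR_OP;  // sweep 0..15
#endif
for (int algo = algo_begin; algo <= algo_end; algo++) {
    // warm up, then time f(algo) and keep the fastest, as Run() does above
}
```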
4 changes: 4 additions & 0 deletions csrc/includes/general_kernels.h
@@ -3,7 +3,11 @@
#include <stdio.h>
#include <stdlib.h>

#ifdef __HIP_PLATFORM_HCC__
#include <hip/hip_cooperative_groups.h>
#else
#include <cooperative_groups.h>
#endif
#include <curand_kernel.h>

#include "context.h"
16 changes: 16 additions & 0 deletions csrc/includes/strided_batch_gemm.h
@@ -72,7 +72,11 @@ class StridedBatchGemm {
stride_b,
stride_c,
bsz,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(_config.gemm_algos[0]));
#else
cublasGemmAlgo_t(_config.gemm_algos[0]));
#endif
}

void ForwardPlusSave(T* output, const T* _buffer_a, const T* _buffer_b, cublasHandle_t handle)
@@ -96,7 +100,11 @@
stride_b,
stride_c,
_config.batch_size,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(_config.gemm_algos[0]));
#else
cublasGemmAlgo_t(_config.gemm_algos[0]));
#endif

k_buf = _buffer_a;
q_buf = _buffer_b;
@@ -136,7 +144,11 @@
stride_b,
stride_c,
bsz,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(_config.gemm_algos[1]));
#else
cublasGemmAlgo_t(_config.gemm_algos[1]));
#endif

// A need to transpose.
cublasOperation_t op_a = (_config.op_A == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T);
@@ -161,7 +173,11 @@
stride_b,
stride_c,
bsz,
#ifdef __HIP_PLATFORM_HCC__
rocblas_gemm_algo(_config.gemm_algos[2]));
#else
cublasGemmAlgo_t(_config.gemm_algos[2]));
#endif
}

inline int GetN() const { return _config.k; }
4 changes: 4 additions & 0 deletions csrc/lamb/fused_lamb_cuda_kernel.cu
@@ -14,7 +14,11 @@
#include <iostream>

//#include <helper_functions.h>
#if defined(__HIP_PLATFORM_HCC__) && HIP_VERSION > 305
#include <hip/hip_cooperative_groups.h>
#else
#include <cooperative_groups.h>
#endif
#include <cuda_runtime_api.h>
#include <stdio.h>

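
A note on the version gate above, assuming the common HIP convention that HIP_VERSION encodes HIP_VERSION_MAJOR * 100 + HIP_VERSION_MINOR: the check `HIP_VERSION > 305` then reads as "newer than HIP/ROCm 3.5", so older ROCm installs fall back to the CUDA-style cooperative-groups header while newer ones use the native HIP one. A sketch of the same gate with that reading spelled out (the DS_HAVE_* macro is hypothetical):

```cpp
// Assumption: HIP_VERSION == HIP_VERSION_MAJOR * 100 + HIP_VERSION_MINOR,
// so "HIP_VERSION > 305" means a HIP release newer than 3.5.
#if defined(__HIP_PLATFORM_HCC__) && HIP_VERSION > 305
#define DS_HAVE_HIP_COOPERATIVE_GROUPS 1   // use hip/hip_cooperative_groups.h
#else
#define DS_HAVE_HIP_COOPERATIVE_GROUPS 0   // fall back to cooperative_groups.h
#endif
```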
8 changes: 4 additions & 4 deletions csrc/quantization/quantizer.cu
@@ -5,7 +5,7 @@ namespace cg = cooperative_groups;

__global__ void qunatize_kernel(__half* vals, int group_size, int num_bits)
{
#if __CUDA_ARCH__ >= 700
#if __CUDA_ARCH__ >= 700 || defined(__HIP_PLATFORM_HCC__)

cg::thread_block b = cg::this_thread_block();
cg::thread_block_tile<32> g = cg::tiled_partition<32>(b);
@@ -206,7 +206,7 @@ __global__ void sr_qunatize_kernel(__half* vals,
int num_bits,
std::pair<uint64_t, uint64_t> seed)
{
#if __CUDA_ARCH__ >= 700
#if __CUDA_ARCH__ >= 700 || defined(__HIP_PLATFORM_HCC__)

cg::thread_block b = cg::this_thread_block();
cg::thread_block_tile<32> g = cg::tiled_partition<32>(b);
@@ -484,7 +484,7 @@ template void launch_sr_qunatize_kernel(__half* vals,

__global__ void qunatize_kernel_asym(__half* vals, int group_size, int num_bits)
{
#if __CUDA_ARCH__ >= 700
#if __CUDA_ARCH__ >= 700 || defined(__HIP_PLATFORM_HCC__)

cg::thread_block b = cg::this_thread_block();
cg::thread_block_tile<32> g = cg::tiled_partition<32>(b);
@@ -729,7 +729,7 @@ __global__ void sr_qunatize_kernel_asym(__half* vals,
int num_bits,
std::pair<uint64_t, uint64_t> seed)
{
#if __CUDA_ARCH__ >= 700
#if __CUDA_ARCH__ >= 700 || defined(__HIP_PLATFORM_HCC__)

cg::thread_block b = cg::this_thread_block();
cg::thread_block_tile<32> g = cg::tiled_partition<32>(b);
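
To make the guard change concrete, below is a minimal, self-contained kernel in the same style as the quantization kernels in this file. It is a sketch, not code from this PR, and it assumes the HIP cooperative-groups implementation supports 32-wide thread_block_tile partitions, which is exactly the area the workarounds in this PR's history dealt with. The original `#if __CUDA_ARCH__ >= 700` gate presumably targets Volta-class GPUs for the half-precision paths; on ROCm, `__CUDA_ARCH__` is generally not defined, so without the added `|| defined(__HIP_PLATFORM_HCC__)` clause the kernel bodies would compile to empty stubs.

```cpp
#ifdef __HIP_PLATFORM_HCC__
#include <hip/hip_fp16.h>
#include <hip/hip_cooperative_groups.h>
#else
#include <cuda_fp16.h>
#include <cooperative_groups.h>
#endif

namespace cg = cooperative_groups;

// Warp-level sum of 32 half values (launch with one 32-thread block),
// written with the same guard as the quantization kernels above.
__global__ void warp_sum_sketch(const __half* in, float* out)
{
#if __CUDA_ARCH__ >= 700 || defined(__HIP_PLATFORM_HCC__)
    cg::thread_block b = cg::this_thread_block();
    cg::thread_block_tile<32> g = cg::tiled_partition<32>(b);

    float v = __half2float(in[b.thread_rank()]);
    // Tree reduction over the 32-wide tile using cooperative-groups shuffles.
    for (int i = g.size() / 2; i > 0; i /= 2) v += g.shfl_down(v, i);
    if (g.thread_rank() == 0) *out = v;
#endif
}
```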