ROCm 6.0 prep changes #4537
Conversation
@@ -38,10 +38,4 @@ def include_paths(self):
            CUDA_INCLUDE = []
        elif not self.is_rocm_pytorch():
            CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
        else:
@rraminen @jithunnair-amd @BLOrange-AMD - this is a re-implementation of PR #4172, since I'm not able to re-run the tests on that PR due to the permissions on the ROCm fork it comes from. Do these changes make sense, and should we go ahead with merging them?
It looks like CUDA_INCLUDE is not initialized for ROCm. The original PR handles this here: https://github.com/ROCmSoftwarePlatform/DeepSpeed/blob/00120efcc7796d302c395f6d1f0e9007335ea5c1/op_builder/cpu_adagrad.py#L42
Please apply this to the other locations as well.
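For reference, a minimal sketch of the shape of that fix (illustrative only, not the exact code from the original PR; the ROCM_HOME-based path and the build_for_cpu guard are assumptions):

```python
# Illustrative sketch: initialize CUDA_INCLUDE in the ROCm branch as well,
# modeled on the change linked above. ROCM_HOME usage and the build_for_cpu
# guard are assumptions here, not the merged DeepSpeed code.
import os
import torch.utils.cpp_extension


def include_paths(self):
    if self.build_for_cpu:  # hypothetical guard for the CUDA_INCLUDE = [] branch
        CUDA_INCLUDE = []
    elif not self.is_rocm_pytorch():
        CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
    else:
        # ROCm builds: point at the HIP/ROCm include root instead of CUDA_HOME.
        CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.ROCM_HOME, "include")]
    return ['csrc/includes'] + CUDA_INCLUDE
```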
@loadams "Since I'm not able to re-run the tests on that PR due to the permissions on the ROCm fork that it is from" ... Can you please elaborate what we need to change on the permissions for our fork, so we can enable CI running on PRs from our fork?
@BLOrange-AMD - good catch, thanks. I'm also happy to merge your original PR; it's just that I'm not able to re-run the transient test or merge the latest changes into it. If you could do that, we could just merge that one instead?
@loadams Since you're able to kick off CI on MI200 using this PR, let's target this PR for the merge. I wouldn't want to merge a PR that we cannot run CI on.
@jithunnair-amd - CI is running here on the updated PR - let me know if this looks good to merge.
@loadams The above run failed with "5 failed, 42 passed, 10 skipped, 16 warnings", whereas one of your earlier runs with the apex fixes ran many more tests: "79 failed, 735 passed, 129 skipped, 40 warnings". It looks like the failures in the above job terminated the run early?
Yes, @jithunnair-amd - that's due to environment issues on our side that we are still working to resolve.
But if the overall PR is fine (and doesn't appear to be the cause of those failures on our side), then we should be fine to go ahead and complete the PR?
Yes, sounds good!
Btw, we have taken an internal action item to look at the DeepSpeed unit test status upstream so we can get this workflow to green status. If there are certain failures that you know are due to CI environment issues, please let us know so we can exclude them from our investigation.
* ROCm 6.0 prep changes
* PR feedback
* Try updating apex
Reimplementation of #4172.
These changes should be backward-compatible with multiple previous ROCm versions; please refer to https://rocm.docs.amd.com/en/latest/understand/file_reorg.html#wrapper-header-files
Co-authored-by: BLOrange-AMD [email protected]
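As a rough illustration of what that backward compatibility means in practice (this is a sketch, not code from this PR; the ROCM_PATH fallback and directory names are assumptions), a build helper can prefer the consolidated post-reorg include layout and fall back to the legacy component path on older ROCm releases:

```python
# Illustrative sketch only: pick the ROCm include directory in a way that
# works both before and after the ROCm file reorganization. The default
# ROCM_PATH and directory layout below are assumptions.
import os


def rocm_include_dirs():
    rocm_home = os.environ.get("ROCM_PATH", "/opt/rocm")
    new_layout = os.path.join(rocm_home, "include")         # consolidated post-reorg layout
    old_layout = os.path.join(rocm_home, "hip", "include")  # legacy per-component path
    return [new_layout] if os.path.isdir(new_layout) else [old_layout]
```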