Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove PP Grad Tail Check #2538

Merged
merged 14 commits into from
Oct 26, 2023
Merged

Remove PP Grad Tail Check #2538

merged 14 commits into from
Oct 26, 2023

Conversation

Quentin-Anthony
Copy link
Contributor

After the patch in #1400 for BigScience, the final element of the inputs tuple is conditional on whether its grad is null (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L995).

This will always fail if elt.grad is None in (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1026) with a IndexError: tuple index out of range because the index_grad_tail in (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1006) doesn't exist.

We're facing this issue intermittently whenever the grad tail is full of nan. I propose we just always include the grad tail.

@dashstander @jeffra @ShadenSmith

Quentin-Anthony added a commit to EleutherAI/DeeperSpeed that referenced this pull request Mar 9, 2023
Remove PP Grad Tail Check (until microsoft#2538 is merged to upstream)
@loadams loadams self-assigned this Aug 22, 2023
@loadams
Copy link
Contributor

loadams commented Sep 1, 2023

Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

@Quentin-Anthony
Copy link
Contributor Author

Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

Yes I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.

@loadams
Copy link
Contributor

loadams commented Sep 1, 2023

Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

Yes I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.

Thanks! Other than a current known failure, it is passing all tests at least, but we will just want to figure out who would be best to review.

@RUAN-ZX
Copy link
Contributor

RUAN-ZX commented Oct 26, 2023

Hi, I have met the same problem as well. I wonder if we can move forwards and merge this solution ASAP?

@loadams loadams enabled auto-merge October 26, 2023 16:06
@loadams loadams added this pull request to the merge queue Oct 26, 2023
Merged via the queue into microsoft:master with commit 3589cad Oct 26, 2023
15 checks passed
delock pushed a commit to delock/DeepSpeedSYCLSupport that referenced this pull request Oct 30, 2023
* Remove PP Grad Tail Check (microsoft#2538)

* Only communicate grad tail if it exists

Co-authored-by: Dashiell Stander <[email protected]>

* Revert previous patch and just always send the grad tail

* Formatting

---------

Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>

* Added __HIP_PLATFORM_AMD__=1 (microsoft#4570)

* fix multiple definition while building evoformer (microsoft#4556)

Current builder for evoformer use the same name for `attention.cpp` and
`attention.cu`, leading to same intermediate filename `attention.o`:
```shell
march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -
isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem 
/home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention_back.o
```
and
```
`attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
      tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
      /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
```

I use following to reproduce and confirm my fix works:
```
git clone https://github.com/NVIDIA/cutlass --depth 1
CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
```

![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

Co-authored-by: Conglong Li <[email protected]>

* Update ccl.py

---------

Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Xie Zejian <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
* Only communicate grad tail if it exists

Co-authored-by: Dashiell Stander <[email protected]>

* Revert previous patch and just always send the grad tail

* Formatting

---------

Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
* Only communicate grad tail if it exists

Co-authored-by: Dashiell Stander <[email protected]>

* Revert previous patch and just always send the grad tail

* Formatting

---------

Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants