Remove PP Grad Tail Check #2538
Conversation
Remove PP Grad Tail Check (until microsoft#2538 is merged to upstream)
Co-authored-by: Dashiell Stander <[email protected]>
Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close them.
Yes, I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.
Thanks! Other than a currently known failure, it is passing all tests at least, but we will just want to figure out who would be best to review it.
Hi, I have run into the same problem as well. I wonder if we can move forward and merge this solution ASAP?
* Remove PP Grad Tail Check (microsoft#2538)
  * Only communicate grad tail if it exists
    Co-authored-by: Dashiell Stander <[email protected]>
  * Revert previous patch and just always send the grad tail
  * Formatting
  ---------
  Co-authored-by: Dashiell Stander <[email protected]>
  Co-authored-by: Olatunji Ruwase <[email protected]>
  Co-authored-by: Logan Adams <[email protected]>
* Added __HIP_PLATFORM_AMD__=1 (microsoft#4570)
* fix multiple definition while building evoformer (microsoft#4556)

  The current builder for evoformer uses the same name for `attention.cpp` and `attention.cu`, leading to the same intermediate filename `attention.o`:

  ```shell
  march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention_back.o
  ```

  and

  ```
  `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
  tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)';
  build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
  /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id';
  build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
  ```

  I used the following to reproduce and confirm my fix works:

  ```
  git clone https://github.com/NVIDIA/cutlass --depth 1
  CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
  ```

  ![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

  Co-authored-by: Conglong Li <[email protected]>
* Update ccl.py
---------
Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Xie Zejian <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
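As a side note on the evoformer fix quoted above: the root cause is that two sources sharing the same stem (`attention.cpp` and `attention.cu`) compile to the same `attention.o`, which then produces the multiple-definition errors shown. Below is a small, purely illustrative Python check (not DeepSpeed's builder code; the source list is hypothetical) that would flag such clashes before a build:

```python
from collections import Counter
from pathlib import Path

def find_basename_clashes(sources):
    """Return source-file stems that appear more than once.

    Sources with the same stem compile to the same object filename
    (e.g. attention.cpp and attention.cu both become attention.o),
    which is what triggers the linker errors quoted above.
    """
    stems = Counter(Path(s).stem for s in sources)
    return [stem for stem, count in stems.items() if count > 1]

# Example with the evoformer sources named in the commit message:
sources = [
    "csrc/deepspeed4science/evoformer_attn/attention.cpp",
    "csrc/deepspeed4science/evoformer_attn/attention.cu",
    "csrc/deepspeed4science/evoformer_attn/attention_back.cu",
]
print(find_basename_clashes(sources))  # ['attention']
```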
* Only communicate grad tail if it exists
  Co-authored-by: Dashiell Stander <[email protected]>
* Revert previous patch and just always send the grad tail
* Formatting
---------
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
After the patch in #1400 for BigScience, the final element of the `inputs` tuple is only communicated if its grad is not null (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L995). Whenever `elt.grad is None`, the receive side (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1026) will always fail with an `IndexError: tuple index out of range`, because the grad tail index set up in (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1006) doesn't exist. We're facing this issue intermittently whenever the grad tail is full of `nan`. I propose we just always include the grad tail.

@dashstander @jeffra @ShadenSmith
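For context, here is a minimal sketch of the mismatch described above. It is deliberately simplified: the helper names and the send/receive plumbing are hypothetical, not the actual `PipelineEngine` code linked above.

```python
import torch

def send_input_grads(inputs, send_fn):
    # After the #1400 patch, the final ("tail") grad is only appended when it
    # exists, so the number of tensors sent depends on runtime state.
    grads = [t.grad for t in inputs[:-1]]
    if inputs[-1].grad is not None:
        grads.append(inputs[-1].grad)
    for g in grads:
        send_fn(g)

def recv_input_grads(inputs, received):
    # The receiver always expects one grad per input, tail included, so it
    # indexes past the end whenever the sender skipped the tail.
    for i, t in enumerate(inputs):
        t.grad = received[i]  # IndexError: tuple index out of range

# Demo: two inputs where the tail's grad is None, so only one grad is "sent".
a = torch.randn(2, requires_grad=True)
a.grad = torch.ones(2)
b = torch.randn(2, requires_grad=True)  # b.grad stays None
sent = []
send_input_grads((a, b), sent.append)
recv_input_grads((a, b), tuple(sent))  # raises IndexError, as described above
```

The change proposed in this PR is effectively to drop the conditional and always communicate the grad tail, so both ranks agree on the number of tensors exchanged.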