Pull in microsoft-main-fpdt branch from argonne-lcf #13
base: main
Conversation
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1
* add FPDT support; add Ulysses rotary position embedding support
* remove unnecessary files
* set the warmup length to be FPDT chunk size if enabled

Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
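As a rough illustration of the first item above, the sketch below shows `batch_dim_idx` being forwarded into DeepSpeed's sequence-parallel `DistributedAttention`; `local_attn` and `seq_parallel_group` are placeholders built elsewhere, and this is not the PR's actual call site.

```python
# Hedged sketch: forwarding batch_dim_idx into DeepSpeed's DistributedAttention
# so tensor layouts with batch size > 1 are permuted correctly in the all-to-all.
from deepspeed.sequence.layer import DistributedAttention

# local_attn: the per-rank attention module; seq_parallel_group: the sequence-parallel
# process group. Both are assumed to be constructed elsewhere.
dist_attn = DistributedAttention(local_attn, seq_parallel_group)

# batch_dim_idx identifies which dimension of q/k/v is the batch:
# 1 for [seq, batch, heads, head_dim] layouts, 0 for batch-first layouts.
context_layer = dist_attn(query_layer, key_layer, value_layer, batch_dim_idx=1)
```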
* [tools] GQA convert support
* fix readme
Previously, `deepspeed_to_megatron.py` would raise an import error because of a relative import. This commit fixes the issue by switching from the relative import to an absolute import, matching the style used in `deepspeed_to_transformers.py`.
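For reference, a minimal before/after of the import style; the imported module and symbol names here are assumed, and the actual names in `deepspeed_to_megatron.py` may differ.

```python
# Before (relative import): fails with "attempted relative import with no known
# parent package" when the script is executed directly.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint

# After (absolute import), matching deepspeed_to_transformers.py.
# Module name assumed for illustration.
from deepspeed_checkpoint import DeepSpeedCheckpoint
```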
Reviewer's Guide by Sourcery

This pull request integrates FPDT and Flash Attention support into the transformer layer.

Sequence diagram for FPDT attention forward pass

sequenceDiagram
participant Input
participant FPDT_Attention
participant Flash_Attention
participant Memory
Input->>FPDT_Attention: hidden_states
activate FPDT_Attention
FPDT_Attention->>FPDT_Attention: Split into chunks
loop For each chunk
FPDT_Attention->>Flash_Attention: Process chunk
alt Memory offloading enabled
Flash_Attention->>Memory: Offload intermediate results
Memory->>Flash_Attention: Load when needed
end
Flash_Attention-->>FPDT_Attention: Chunk results
end
FPDT_Attention->>FPDT_Attention: Merge chunk results
FPDT_Attention-->>Input: output, attention_bias
deactivate FPDT_Attention
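A minimal PyTorch sketch of the loop in the diagram above, assuming the sequence dimension comes first and treating each chunk's attention independently; `qkv_proj`, `out_proj`, and `flash_attention` are placeholders, and the real FPDT kernel additionally coordinates attention across chunks.

```python
import torch
import torch.nn.functional as F

def flash_attention(q, k, v):
    # Stand-in for the real FlashAttention kernel: plain scaled dot-product attention.
    return F.scaled_dot_product_attention(q, k, v)

def fpdt_attention_forward(hidden_states, qkv_proj, out_proj, chunk_size,
                           enable_offloading=False):
    results = []
    # Split the sequence dimension into FPDT chunks.
    for chunk in torch.split(hidden_states, chunk_size, dim=0):
        q, k, v = qkv_proj(chunk).chunk(3, dim=-1)
        chunk_out = flash_attention(q, k, v)
        if enable_offloading:
            # Park this chunk's result on the CPU until the merge step needs it.
            chunk_out = chunk_out.to("cpu", non_blocking=True)
        results.append(chunk_out)
    # Load any offloaded chunks back and merge them into a full-sequence output.
    merged = torch.cat([r.to(hidden_states.device) for r in results], dim=0)
    return out_proj(merged)
```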
Class diagram for updated transformer components

classDiagram
class ParallelTransformerLayer {
-input_layernorm
-self_attention
-post_attention_layernorm
-mlp
+forward()
}
class FPDT_Attention {
-qkv_linear_weight
-qkv_linear_bias
-qkv_dense_weight
-qkv_dense_bias
-chunk_size
-enable_offloading
+forward()
}
class FPDT_FFN {
-dense_h_to_4h
-dense_4h_to_h
-fpdt_FFN_chunk_size
+forward()
}
ParallelTransformerLayer --> FPDT_Attention
ParallelTransformerLayer --> FPDT_FFN
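An illustrative sketch of how FPDT_FFN could chunk its forward pass to bound activation memory; the attribute names mirror the class diagram above, but the implementation details are assumed rather than taken from the PR.

```python
import torch
import torch.nn.functional as F

class FPDT_FFN(torch.nn.Module):
    def __init__(self, hidden_size, fpdt_FFN_chunk_size):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)
        self.fpdt_FFN_chunk_size = fpdt_FFN_chunk_size

    def forward(self, hidden_states):
        # Process the sequence in chunks so the 4h intermediate activation
        # never materializes for the full sequence at once.
        chunks = torch.split(hidden_states, self.fpdt_FFN_chunk_size, dim=0)
        return torch.cat(
            [self.dense_4h_to_h(F.gelu(self.dense_h_to_4h(c))) for c in chunks],
            dim=0,
        )
```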
File-Level Changes
Summary by Sourcery
Integrate Flash Attention and FPDT support for improved performance and memory efficiency.
New Features:
Tests: