Pull in microsoft-main-fpdt branch from argonne-lcf #13

Draft · wants to merge 6 commits into main

Conversation

@saforem2 (Owner) commented Dec 25, 2024

Summary by Sourcery

Integrate Flash Attention and FPDT support for improved performance and memory efficiency.

New Features:

  • Introduce Flash Attention and FPDT (Fully Pipelined Distributed Transformer) support.

Tests:

  • Update tests to cover Flash Attention and FPDT integration.

YJHMITWEB and others added 4 commits December 4, 2024 17:34
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

* add FPDT support; add Ulysses rotary position embedding support

* remove unnecessary files

* set the warmup length to be FPDT chunk size if enabled

---------

Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
* [tools]GQA convert support

* fix readme

Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes the issue by changing the relative import to an
absolute import, as in `deepspeed_to_transformers.py`.
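The fix described above follows this general pattern (a minimal sketch; the module and symbol names below are illustrative and may not match the actual imports in `deepspeed_to_megatron.py`):

```python
# Before: a relative import fails when the script is executed directly
# (e.g. `python deepspeed_to_megatron.py ...`), because the file then runs
# as a top-level module rather than as part of a package.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint   # hypothetical name

# After: an absolute import works in both cases and matches the style
# already used by deepspeed_to_transformers.py.
from deepspeed_checkpoint import DeepSpeedCheckpoint       # hypothetical name
```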

sourcery-ai bot commented Dec 25, 2024

Reviewer's Guide by Sourcery

This pull request integrates the microsoft-main-fpdt branch from the argonne-lcf repository, introducing significant performance enhancements through Flash Attention and FPDT (Fully Pipelined Distributed Transformer) optimizations. Key changes include refactoring the QKV projection, MLP layers, and attention mechanisms to leverage these DeepSpeed features. The code also adds logging and device context management for improved debugging and portability.

Sequence diagram for FPDT attention forward pass

sequenceDiagram
    participant Input
    participant FPDT_Attention
    participant Flash_Attention
    participant Memory

    Input->>FPDT_Attention: hidden_states
    activate FPDT_Attention

    FPDT_Attention->>FPDT_Attention: Split into chunks
    loop For each chunk
        FPDT_Attention->>Flash_Attention: Process chunk
        alt Memory offloading enabled
            Flash_Attention->>Memory: Offload intermediate results
            Memory->>Flash_Attention: Load when needed
        end
        Flash_Attention-->>FPDT_Attention: Chunk results
    end

    FPDT_Attention->>FPDT_Attention: Merge chunk results
    FPDT_Attention-->>Input: output, attention_bias
    deactivate FPDT_Attention
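Read as code, the chunked forward pass above amounts to roughly the following (a minimal PyTorch-style sketch, not the actual FPDT implementation: `attn_fn` stands in for the flash-attention kernel, and the real path additionally combines chunks with an online softmax and sequence-parallel all-to-all communication; only the chunking and offloading control flow is shown):

```python
import torch

def fpdt_attention_forward(hidden_states, chunk_size, attn_fn, enable_offloading=False):
    """Sketch of the FPDT chunk loop: split the sequence, run attention on
    each chunk, optionally park intermediate results on the host, then merge."""
    outputs = []
    for chunk in hidden_states.split(chunk_size, dim=0):      # assume [seq, batch, hidden]
        out = attn_fn(chunk)                                   # stand-in for a flash-attention call
        if enable_offloading:
            out = out.to("cpu", non_blocking=True)             # free device memory until the merge
        outputs.append(out)
    outputs = [o.to(hidden_states.device) for o in outputs]   # reload any offloaded chunks
    return torch.cat(outputs, dim=0)                           # merge along the sequence dimension
```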

Class diagram for updated transformer components

classDiagram
    class ParallelTransformerLayer {
        -input_layernorm
        -self_attention
        -post_attention_layernorm
        -mlp
        +forward()
    }

    class FPDT_Attention {
        -qkv_linear_weight
        -qkv_linear_bias
        -qkv_dense_weight
        -qkv_dense_bias
        -chunk_size
        -enable_offloading
        +forward()
    }

    class FPDT_FFN {
        -dense_h_to_4h
        -dense_4h_to_h
        -fpdt_FFN_chunk_size
        +forward()
    }

    ParallelTransformerLayer --> FPDT_Attention
    ParallelTransformerLayer --> FPDT_FFN
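Of the two FPDT modules, FPDT_FFN follows the simpler pattern, since the MLP is position-wise and each chunk is independent. A hedged sketch using the attribute names from the diagram above (the activation choice, shapes, and the omitted tensor/sequence parallelism are assumptions):

```python
import torch
import torch.nn.functional as F

class FPDT_FFN(torch.nn.Module):
    """Chunked feed-forward block: process the sequence in slices of
    fpdt_FFN_chunk_size tokens to bound activation memory."""

    def __init__(self, hidden_size, ffn_hidden_size, fpdt_FFN_chunk_size):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, ffn_hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(ffn_hidden_size, hidden_size)
        self.fpdt_FFN_chunk_size = fpdt_FFN_chunk_size

    def forward(self, hidden_states):                          # assume [seq, batch, hidden]
        outputs = []
        for chunk in hidden_states.split(self.fpdt_FFN_chunk_size, dim=0):
            outputs.append(self.dense_4h_to_h(F.gelu(self.dense_h_to_4h(chunk))))
        return torch.cat(outputs, dim=0)
```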

File-Level Changes

Change | Details | Files
Integrated Flash Attention and FPDT (Fully Pipelined Distributed Transformer) optimizations for performance enhancements.
  • Refactored QKV projection in transformer.py to support GQA (Grouped Query Attention).
  • Added FPDT support for MLP layers in transformer.py.
  • Modified attention mechanism in transformer.py to use Flash Attention and FPDT.
  • Updated gpt_model.py to handle FPDT logits loss.
  • Added FPDT input construction in pretrain_gpt.py.
  • Modified initialization in initialize.py to warm up FPDT functions.
  • Added device context to rotary embedding in rotary_pos_embedding.py.
  • Added FPDT arguments in arguments.py (sketched after this table).
  • Updated language_model.py to use rotary position embedding with device context and handle FPDT sequence lengths.
  • Added ds_sequence_parallel_fpdt flag to control FPDT usage.
  • Added ds_sequence_parallel_fpdt_chunk_size argument to control chunk size in FPDT attention.
  • Added ds_sequence_parallel_fpdt_offloading flag to enable offloading in FPDT attention.
  • Added logging for rank and log level in transformer.py.
  • Updated finetune_llama.sh to support conversion between Hugging Face and Megatron-Deepspeed formats.
  • Updated hf2megads_weight_converter.py to handle QKV refactoring for GQA.
  • Updated finetune_llama.sh to use an empty ds_config during conversion.
  • Added a new shell script ds_pretrain_gpt_6.7B_fpdt_32k.sh for pretraining with FPDT.
  • Added example data and vocabulary files for testing.
  • Updated documentation in README.md to reflect the changes for FPDT and conversion scripts.
megatron/model/transformer.py
megatron/model/gpt_model.py
pretrain_gpt.py
megatron/initialize.py
megatron/model/rotary_pos_embedding.py
megatron/arguments.py
megatron/model/language_model.py
Refactored weight conversion scripts to handle GQA (Grouped Query Attention).
  • Modified _qkv_refactor and _qkv_refactor_to_hf functions in hf2megads_weight_converter.py to handle the updated QKV projection format for GQA.
  • Removed the use_gqa flag as GQA is now handled directly by the refactoring functions.
tools/hf2megads_weight_converter.py
Updated example scripts and documentation.
  • Added new arguments to finetune_llama.sh for controlling FPDT and conversion processes.
  • Updated README.md with instructions for converting weights and fine-tuning with FPDT.
  • Added an empty ds_config.json file for use during conversion.
examples_deepspeed/finetune_hf_llama/finetune_llama.sh
examples_deepspeed/finetune_hf_llama/README.md
examples_deepspeed/finetune_hf_llama/ds_config.json
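For reference, the FPDT arguments listed above would be registered in `megatron/arguments.py` along these lines (a sketch only: the flag names mirror the option names in this PR, while the defaults, help strings, and group title are assumptions):

```python
import argparse

def _add_fpdt_args(parser):
    group = parser.add_argument_group(title="FPDT")
    # Enable the Fully Pipelined Distributed Transformer path.
    group.add_argument("--ds-sequence-parallel-fpdt", action="store_true",
                       help="Use FPDT chunked attention/MLP with DeepSpeed sequence parallelism.")
    # Sequence-chunk size used by FPDT attention (default value is illustrative).
    group.add_argument("--ds-sequence-parallel-fpdt-chunk-size", type=int, default=65536,
                       help="Chunk size, in tokens, for FPDT attention.")
    # Offload per-chunk intermediates to host memory between chunks.
    group.add_argument("--ds-sequence-parallel-fpdt-offloading", action="store_true",
                       help="Enable host offloading of FPDT intermediates.")
    return parser

# Example usage:
parser = _add_fpdt_args(argparse.ArgumentParser())
args = parser.parse_args(["--ds-sequence-parallel-fpdt",
                          "--ds-sequence-parallel-fpdt-chunk-size", "32768"])
```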
