Pull in microsoft-main-fpdt branch from argonne-lcf #13

Draft · wants to merge 6 commits into main

Conversation

@saforem2 (Owner) commented Dec 25, 2024

Summary by Sourcery

Integrate Flash Attention and FPDT support for improved performance and memory efficiency.

New Features:

  • Introduce Flash Attention and FPDT (Fully Pipelined Distributed Transformer) support.

Tests:

  • Update tests to cover Flash Attention and FPDT integration.

YJHMITWEB and others added 4 commits December 4, 2024 17:34
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

* add FPDT support; add Ulysses rotary position embedding support

* remove unnecessary files

* set the warmup length to be FPDT chunk size if enabled

---------

Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
* [tools]GQA convert support

* fix readme

Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes the issue by changing the relative import to an
absolute import, as in `deepspeed_to_transformers.py`.
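The fix described above follows this general pattern (a minimal sketch; the module and symbol names below are illustrative and may not match the actual imports in `deepspeed_to_megatron.py`):

```python
# Before: a relative import fails when the script is executed directly
# (e.g. `python deepspeed_to_megatron.py ...`), because the file then runs
# as a top-level module rather than as part of a package.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint   # hypothetical name

# After: an absolute import works in both cases and matches the style
# already used by deepspeed_to_transformers.py.
from deepspeed_checkpoint import DeepSpeedCheckpoint       # hypothetical name
```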

sourcery-ai bot commented Dec 25, 2024

Reviewer's Guide by Sourcery

This pull request integrates the microsoft-main-fpdt branch from the argonne-lcf repository, introducing significant performance enhancements through Flash Attention and FPDT (Fully Pipelined Distributed Transformer) optimizations. Key changes include refactoring the QKV projection, MLP layers, and attention mechanisms to leverage these DeepSpeed features. The code also adds logging and device context management for improved debugging and portability.

Sequence diagram for FPDT attention forward pass

sequenceDiagram
    participant Input
    participant FPDT_Attention
    participant Flash_Attention
    participant Memory

    Input->>FPDT_Attention: hidden_states
    activate FPDT_Attention

    FPDT_Attention->>FPDT_Attention: Split into chunks
    loop For each chunk
        FPDT_Attention->>Flash_Attention: Process chunk
        alt Memory offloading enabled
            Flash_Attention->>Memory: Offload intermediate results
            Memory->>Flash_Attention: Load when needed
        end
        Flash_Attention-->>FPDT_Attention: Chunk results
    end

    FPDT_Attention->>FPDT_Attention: Merge chunk results
    FPDT_Attention-->>Input: output, attention_bias
    deactivate FPDT_Attention
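Read as code, the chunked forward pass above amounts to roughly the following (a minimal PyTorch-style sketch, not the actual FPDT implementation: `attn_fn` stands in for the flash-attention kernel, and the real path additionally combines chunks with an online softmax and sequence-parallel all-to-all communication; only the chunking and offloading control flow is shown):

```python
import torch

def fpdt_attention_forward(hidden_states, chunk_size, attn_fn, enable_offloading=False):
    """Sketch of the FPDT chunk loop: split the sequence, run attention on
    each chunk, optionally park intermediate results on the host, then merge."""
    outputs = []
    for chunk in hidden_states.split(chunk_size, dim=0):      # assume [seq, batch, hidden]
        out = attn_fn(chunk)                                   # stand-in for a flash-attention call
        if enable_offloading:
            out = out.to("cpu", non_blocking=True)             # free device memory until the merge
        outputs.append(out)
    outputs = [o.to(hidden_states.device) for o in outputs]   # reload any offloaded chunks
    return torch.cat(outputs, dim=0)                           # merge along the sequence dimension
```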

Class diagram for updated transformer components

classDiagram
    class ParallelTransformerLayer {
        -input_layernorm
        -self_attention
        -post_attention_layernorm
        -mlp
        +forward()
    }

    class FPDT_Attention {
        -qkv_linear_weight
        -qkv_linear_bias
        -qkv_dense_weight
        -qkv_dense_bias
        -chunk_size
        -enable_offloading
        +forward()
    }

    class FPDT_FFN {
        -dense_h_to_4h
        -dense_4h_to_h
        -fpdt_FFN_chunk_size
        +forward()
    }

    ParallelTransformerLayer --> FPDT_Attention
    ParallelTransformerLayer --> FPDT_FFN
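Of the two FPDT modules, FPDT_FFN follows the simpler pattern, since the MLP is position-wise and each chunk is independent. A hedged sketch using the attribute names from the diagram above (the activation choice, shapes, and the omitted tensor/sequence parallelism are assumptions):

```python
import torch
import torch.nn.functional as F

class FPDT_FFN(torch.nn.Module):
    """Chunked feed-forward block: process the sequence in slices of
    fpdt_FFN_chunk_size tokens to bound activation memory."""

    def __init__(self, hidden_size, ffn_hidden_size, fpdt_FFN_chunk_size):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, ffn_hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(ffn_hidden_size, hidden_size)
        self.fpdt_FFN_chunk_size = fpdt_FFN_chunk_size

    def forward(self, hidden_states):                          # assume [seq, batch, hidden]
        outputs = []
        for chunk in hidden_states.split(self.fpdt_FFN_chunk_size, dim=0):
            outputs.append(self.dense_4h_to_h(F.gelu(self.dense_h_to_4h(chunk))))
        return torch.cat(outputs, dim=0)
```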

File-Level Changes

Change | Details | Files
Integrated Flash Attention and FPDT (Fully Pipelined Distributed Transformer) optimizations for performance enhancements.
  • Refactored QKV projection in transformer.py to support GQA (Grouped Query Attention).
  • Added FPDT support for MLP layers in transformer.py.
  • Modified attention mechanism in transformer.py to use Flash Attention and FPDT.
  • Updated gpt_model.py to handle FPDT logits loss.
  • Added FPDT input construction in pretrain_gpt.py.
  • Modified initialization in initialize.py to warm up FPDT functions.
  • Added device context to rotary embedding in rotary_pos_embedding.py.
  • Added FPDT arguments in arguments.py (sketched after this table).
  • Updated language_model.py to use rotary position embedding with device context and handle FPDT sequence lengths.
  • Added ds_sequence_parallel_fpdt flag to control FPDT usage.
  • Added ds_sequence_parallel_fpdt_chunk_size argument to control chunk size in FPDT attention.
  • Added ds_sequence_parallel_fpdt_offloading flag to enable offloading in FPDT attention.
  • Added logging for rank and log level in transformer.py.
  • Updated finetune_llama.sh to support conversion between Hugging Face and Megatron-Deepspeed formats.
  • Updated hf2megads_weight_converter.py to handle QKV refactoring for GQA.
  • Updated finetune_llama.sh to use an empty ds_config during conversion.
  • Added a new shell script ds_pretrain_gpt_6.7B_fpdt_32k.sh for pretraining with FPDT.
  • Added example data and vocabulary files for testing.
  • Updated documentation in README.md to reflect the changes for FPDT and conversion scripts.
megatron/model/transformer.py
megatron/model/gpt_model.py
pretrain_gpt.py
megatron/initialize.py
megatron/model/rotary_pos_embedding.py
megatron/arguments.py
megatron/model/language_model.py
Refactored weight conversion scripts to handle GQA (Grouped Query Attention).
  • Modified _qkv_refactor and _qkv_refactor_to_hf functions in hf2megads_weight_converter.py to handle the updated QKV projection format for GQA.
  • Removed the use_gqa flag as GQA is now handled directly by the refactoring functions.
tools/hf2megads_weight_converter.py
Updated example scripts and documentation.
  • Added new arguments to finetune_llama.sh for controlling FPDT and conversion processes.
  • Updated README.md with instructions for converting weights and fine-tuning with FPDT.
  • Added an empty ds_config.json file for use during conversion.
examples_deepspeed/finetune_hf_llama/finetune_llama.sh
examples_deepspeed/finetune_hf_llama/README.md
examples_deepspeed/finetune_hf_llama/ds_config.json
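For reference, the FPDT arguments listed above would be registered in `megatron/arguments.py` along these lines (a sketch only: the flag names mirror the option names in this PR, while the defaults, help strings, and group title are assumptions):

```python
import argparse

def _add_fpdt_args(parser):
    group = parser.add_argument_group(title="FPDT")
    # Enable the Fully Pipelined Distributed Transformer path.
    group.add_argument("--ds-sequence-parallel-fpdt", action="store_true",
                       help="Use FPDT chunked attention/MLP with DeepSpeed sequence parallelism.")
    # Sequence-chunk size used by FPDT attention (default value is illustrative).
    group.add_argument("--ds-sequence-parallel-fpdt-chunk-size", type=int, default=65536,
                       help="Chunk size, in tokens, for FPDT attention.")
    # Offload per-chunk intermediates to host memory between chunks.
    group.add_argument("--ds-sequence-parallel-fpdt-offloading", action="store_true",
                       help="Enable host offloading of FPDT intermediates.")
    return parser

# Example usage:
parser = _add_fpdt_args(argparse.ArgumentParser())
args = parser.parse_args(["--ds-sequence-parallel-fpdt",
                          "--ds-sequence-parallel-fpdt-chunk-size", "32768"])
```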
