
[ShardFormer] Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM #5854

Closed · wants to merge 38 commits

Conversation

@GuangyaoZhang (Contributor) commented on Jun 25, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

resolved #5853

📝 What does this PR do?

Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM
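For context, a minimal sketch of how Ulysses-style sequence parallelism is typically enabled for these models through the `HybridParallelPlugin` (illustrative only, not taken from this PR's diff; the sizes and flag values below are assumptions):

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# Illustrative sketch: select the Ulysses (all-to-all) sequence parallel scheme.
# sp_size and the other parallel sizes are example values, not mandated by this PR.
plugin = HybridParallelPlugin(
    tp_size=1,
    pp_size=1,
    sp_size=4,                               # sequence parallel degree
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",  # "all_to_all" selects the Ulysses scheme
    enable_flash_attention=True,
)
booster = Booster(plugin=plugin)
# model, optimizer, criterion, dataloader, _ = booster.boost(
#     model, optimizer, criterion=criterion, dataloader=dataloader
# )
```

In the Ulysses scheme, inputs are split along the sequence dimension and redistributed across attention heads via all-to-all inside attention, which is the behavior the new Command-R, Qwen2 and ChatGLM policies need to handle.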

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@GuangyaoZhang requested a review from a team as a code owner on June 25, 2024 09:13
GuangyaoZhang and others added 3 commits June 25, 2024 17:17
* update to fully overlap, still debugging

* improve interface

* fixed deadlock bug

* debug NaN loss

* (experimental) use one comm group for send_fw_recv_fw to fix NaN

* cleaned up interfaces; use one batch p2p for all

* clean up; removed the double p2p batch case

* p2p test passed

* improve overlap: send fwd before backward

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tentatively use 2 p2p batches

* remove two p2p batches

* fix typos

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove pp.sh

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: root <root@notebook-c55824c0-7742-45e8-9591-c855bb77ad29-0.notebook-c55824c0-7742-45e8-9591-c855bb77ad29.colossal-ai.svc.cluster.local>
* [gemini] fix missing return

* [gemini] fix missing arg pass

* [gemini] use gather tensor instead of list

* [test] enable flash attention for benchmark by default

* [test] enable flash attention for benchmark by default

---------

Co-authored-by: genghaozhe <[email protected]>
@GuangyaoZhang changed the title from "Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM" to "[ShardFormer] Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM" on Jun 26, 2024
ver217 and others added 9 commits June 27, 2024 16:34
* [zero] use bucket during allgather

* [zero] rename api

* t5 token, still pytest fail

* Resolve T5 Pytest Failure

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typos

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* lazy init support

* lazy init llama support

* lazy init support for baichuan

* align rpc

* add note for baichuan

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* delete xformers

* fix

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [moe] remove openmoe-coupled code and rectify mixtral code (hpcaitech#5471)

* [Feature] MoE refactor; integration with Mixtral (hpcaitech#5682)

* cherry pick from refractor-moe branch

* tests passed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support ep + zero

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add mixtral auto policy & move pipeline forward code to modeling folder

* [moe refactor] modify kernel test without Route Class

* [moe refactor] add moe tensor test path environment variable to github workflow

* fix typos

* fix moe test bug due to the code rebase

* [moe refactor] fix moe zero test, and little bug in low level zero

* fix typo

* add moe tensor path to github workflow

* remove some useless code

* fix typo & unify global variable XX_AXIS logic without using -1

* fix typo & prettify the code

* remove print code & support zero 2 test

* remove useless code

* rename function

* fix typo

* fix typo

* Further improve the test code

* remove print code

* [moe refactor] change test model from fake moe model to mixtral moe layer and remove useless test

* [moe refactor] skip some unit test which will be refactored later

* [moe refactor] fix unit import error

* [moe refactor] fix circular import issues

* [moe refactor] remove debug code

* [moe refactor] update github workflow

* [moe/zero] refactor low level optimizer (hpcaitech#5767)

* [zero] refactor low level optimizer

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] MoE refactor with newest version of ZeRO (hpcaitech#5801)

* [zero] remove redundant members in BucketStore (hpcaitech#5802)

* [zero] align api with previous version

* [Moe/Zero] Update MoeHybridParallelPlugin with refactored ZeRO and Fix Zero bug (hpcaitech#5819)

* [moe refactor] update unit test with the refactored ZeRO and remove useless test

* move moe checkpoint to checkpoint folder and exchange global axis to class member

* update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug

* fix zero unit test

* Add an assertion to prevent users from using it incorrectly

* [hotfix]Solve the compatibility issue of zero refactor (hpcaitech#5823)

* [moe refactor] update unit test with the refactored ZeRO and remove useless test

* move moe checkpoint to checkpoint folder and exchange global axis to class member

* update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug

* fix zero unit test

* Add an assertion to prevent users from using it incorrectly

* Modify function parameter names to resolve compatibility issues

* [zero] fix missing hook removal (hpcaitech#5824)

* [MoE] Resolve .github conflict (hpcaitech#5829)

* [Fix/Example] Fix Llama Inference Loading Data Type (hpcaitech#5763)

* [fix/example] fix llama inference loading dtype

* revise loading dtype of benchmark llama3

* [release] update version (hpcaitech#5752)

* [release] update version

* [devops] update compatibility test

* [devops] update compatibility test

* [devops] update compatibility test

* [devops] update compatibility test

* [test] fix ddp plugin test

* [test] fix gptj and rpc test

* [devops] fix cuda ext compatibility

* [inference] fix flash decoding test

* [inference] fix flash decoding test

* fix (hpcaitech#5765)

* [test] Fix/fix testcase (hpcaitech#5770)

* [fix] branch for fix testcase;

* [fix] fix test_analyzer & test_auto_parallel;

* [fix] remove local change about moe;

* [fix] rm local change moe;

* [Hotfix] Add missing init file in inference.executor (hpcaitech#5774)

* [CI/tests] simplify some test case to reduce testing time (hpcaitech#5755)

* [ci/tests] simplify some test case to reduce testing time

* [ci/tests] continue to remove test case to reduce ci time cost

* restore some test config

* [ci/tests] continue to reduce ci time cost

* [misc] update dockerfile (hpcaitech#5776)

* [misc] update dockerfile

* [misc] update dockerfile

* [devops] fix docker ci (hpcaitech#5780)

* [Inference]Add Streaming LLM (hpcaitech#5745)

* Add Streaming LLM

* add some parameters to llama_generation.py

* verify streamingllm config

* add test_streamingllm.py

* modified according to the opinions of review

* add Citation

* change _block_tables tolist

* [hotfix] fix llama flash attention forward (hpcaitech#5777)

* [misc] Accelerate CI for zero and dist optim (hpcaitech#5758)

* remove fp16 from lamb

* remove d2h copy in checking states

---------

Co-authored-by: Edenzzzz <[email protected]>

* [Test/CI] remove test cases to reduce CI duration (hpcaitech#5753)

* [test] smaller gpt2 test case

* [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py

* [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py

* [test] reduce test cases tests/test_zero/test_gemini/test_optim.py

* Revert "[test] smaller gpt2 test case"

Some tests might depend on the size of model (num of chunks)

This reverts commit df705a5.

* [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py

* [CI] smaller test model for the two modified cases

* [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there

* [hotfix] fix testcase in test_fx/test_tracer (hpcaitech#5779)

* [fix] branch for fix testcase;

* [fix] fix test_analyzer & test_auto_parallel;

* [fix] remove local change about moe;

* [fix] rm local change moe;

* [fix] fix test_deepfm_model & test_dlrf_model;

* [fix] fix test_hf_albert & test_hf_gpt;

* [gemini] optimize reduce scatter d2h copy (hpcaitech#5760)

* [gemini] optimize reduce scatter d2h copy

* [fix] fix missing reduce variable

* [refactor] remove legacy async reduce scatter code

* [gemini] missing sync

* Revert "[refactor] remove legacy async reduce scatter code"

This reverts commit 58ad76d.

* [gemini] further optimize with async all reduce

* [fix] pass flag from manager to chunk

* Allow building cuda extension without a device. (hpcaitech#5535)

Added FORCE_CUDA environment variable support, to enable building extensions where a GPU device is not present but cuda libraries are.
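As a hedged sketch of the pattern such a flag usually follows in a setup script (the variable names below are illustrative, not copied from this PR):

```python
import os

import torch

# Illustrative only: build CUDA extensions when a GPU is visible, or when the
# user explicitly opts in with FORCE_CUDA=1 (e.g. a CPU-only build node that
# still has the CUDA toolkit installed).
FORCE_CUDA = os.environ.get("FORCE_CUDA", "0") == "1"
BUILD_CUDA_EXT = torch.cuda.is_available() or FORCE_CUDA

if FORCE_CUDA and not torch.cuda.is_available():
    print("FORCE_CUDA=1: building CUDA extensions without a visible GPU device.")
```

A typical invocation would then be something like `FORCE_CUDA=1 pip install .` on a build node without GPUs.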

* [misc] fix dist logger (hpcaitech#5782)

* [install]fix setup (hpcaitech#5786)

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [misc] update requirements (hpcaitech#5787)

* [shardformer] fix import (hpcaitech#5788)

* upgrade colossal-chat to support tp_group>1, add sp for sft

* upgrade ppo dpo rm script

* run pre-commit

* update ci tests, sft ci test cases passed, tp failed in generation for ppo, sp is buggy

* fix training script

* fix ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix transformers version

* remove duplicated test

* fix datasets version

* remove models that require huggingface auth from ci

* remove local data path

* update ci

* remove baichuan from template test due to transformer version conflict

* merge

* Refactor modeling by adding attention backend

Signed-off-by: char-1ee <[email protected]>

* Fix tests and naming

Signed-off-by: char-1ee <[email protected]>

* Pass inference model shard configs for module init

Signed-off-by: char-1ee <[email protected]>

* Clean up

Signed-off-by: char-1ee <[email protected]>

* replace the customized dataloader setup with the build-in one

* replace the customized dataloader setup with the build-in one

* Remove flash attention backend

Signed-off-by: char-1ee <[email protected]>

* fix readme

* Fix test import

Signed-off-by: char-1ee <[email protected]>

* update sft training script

* [Inference]refactor baichuan (hpcaitech#5791)

* refactor baichuan

* remove unused code and add TODO for lazyinit

* [test] fix chatglm test kit (hpcaitech#5793)

* [shardformer] fix modeling of bloom and falcon (hpcaitech#5796)

* [test] fix qwen2 pytest distLarge (hpcaitech#5797)

* [Inference] Fix flash-attn import and add model test (hpcaitech#5794)

* Fix torch int32 dtype

Signed-off-by: char-1ee <[email protected]>

* Fix flash-attn import

Signed-off-by: char-1ee <[email protected]>

* Add generalized model test

Signed-off-by: char-1ee <[email protected]>

* Remove exposed path to model

Signed-off-by: char-1ee <[email protected]>

* Add default value for use_flash_attn

Signed-off-by: char-1ee <[email protected]>

* Rename model test

Signed-off-by: char-1ee <[email protected]>

---------

Signed-off-by: char-1ee <[email protected]>

* [Gemini] Use async stream to prefetch and h2d data moving (hpcaitech#5781)

* use async stream to prefetch and h2d data moving

* Remove redundant code
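
The async-stream prefetch mentioned in the Gemini commit above generally follows the pattern below (a hedged sketch assuming pinned host memory; the helper name is illustrative, not the PR's API):

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to host-to-device copies

def prefetch_h2d(cpu_tensor: torch.Tensor) -> torch.Tensor:
    """Issue an async H2D copy on a side stream so it can overlap with compute."""
    pinned = cpu_tensor.pin_memory()  # pinned memory is required for truly async copies
    with torch.cuda.stream(copy_stream):
        return pinned.to("cuda", non_blocking=True)

# Before the default stream consumes the prefetched tensor, synchronize with
# the copy stream so the data is guaranteed to have arrived:
#   torch.cuda.current_stream().wait_stream(copy_stream)
```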

* [gemini] quick fix on possible async operation (hpcaitech#5803)

* [gemini] quick fix on possible async operation

* [gemini] quick fix on possible async operation

* [shardformer] upgrade transformers to 4.39.3 (hpcaitech#5815)

* [shardformer]upgrade transformers for gpt2/gptj/whisper (hpcaitech#5807)

* [shardformer] fix modeling of gpt2 and gptj

* [shardformer] fix whisper modeling

* [misc] update requirements

---------

Co-authored-by: ver217 <[email protected]>

* [shardformer]upgrade transformers for mistral (hpcaitech#5808)

* upgrade transformers for mistral

* fix

* fix

* [shardformer]upgrade transformers for llama (hpcaitech#5809)

* update transformers

fix

* fix

* fix

* [inference] upgrade transformers (hpcaitech#5810)

* update transformers

fix

* fix

* fix

* fix

* fix

* [gemini] update transformers for gemini (hpcaitech#5814)

---------

Co-authored-by: ver217 <[email protected]>

* Support 4d parallel + flash attention (hpcaitech#5789)

* support tp + sp + pp

* remove comments

---------

Co-authored-by: Edenzzzz <[email protected]>

---------

Signed-off-by: char-1ee <[email protected]>
Co-authored-by: Yuanheng Zhao <[email protected]>
Co-authored-by: Hongxin Liu <[email protected]>
Co-authored-by: flybird11111 <[email protected]>
Co-authored-by: duanjunwen <[email protected]>
Co-authored-by: yuehuayingxueluo <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: botbw <[email protected]>
Co-authored-by: Charles Coulombe <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: char-1ee <[email protected]>
Co-authored-by: Runyu Lu <[email protected]>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: Guangyao Zhang <[email protected]>

* [zero] fix hook bug

* [zero] add low level optimizer back (hpcaitech#5839)

* [zero] fix param & refactor

* [zero] add back original low level opt

* [zero] remove moe related

* [zero] pass zero tests

* [zero] refactor

* [chore] add del func back

* [zero] comments and naming (hpcaitech#5840)

* [zero] modify api (hpcaitech#5843)

* [zero] modify api

* [test] remove _grad_store access in tests

* [test] fix (hpcaitech#5857)

* [CI] skip openmoe CI check

* [CI] fix pre-commit

* [zero] remove redundant member init (hpcaitech#5862)

* [misc] remove useless code, modify the pg mesh implementation

* [misc] remove useless code, modify the pg mesh implementation

* [misc] use tempfile

* resolve conflict with main branch

* [misc] use tempfile in test_moe_checkpoint.py

* [misc] remove useless code, add assertion about sequence parallel, move logger into function

* [misc] remove useless code

---------

Signed-off-by: char-1ee <[email protected]>
Co-authored-by: Frank Lee <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: botbw <[email protected]>
Co-authored-by: Yuanheng Zhao <[email protected]>
Co-authored-by: Hongxin Liu <[email protected]>
Co-authored-by: flybird11111 <[email protected]>
Co-authored-by: duanjunwen <[email protected]>
Co-authored-by: yuehuayingxueluo <[email protected]>
Co-authored-by: Charles Coulombe <[email protected]>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: char-1ee <[email protected]>
Co-authored-by: Runyu Lu <[email protected]>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: Guangyao Zhang <[email protected]>
Edenzzzz and others added 12 commits July 1, 2024 17:07
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/PyCQA/autoflake: v2.2.1 → v2.3.1](PyCQA/autoflake@v2.2.1...v2.3.1)
- [github.com/pycqa/isort: 5.12.0 → 5.13.2](PyCQA/isort@5.12.0...5.13.2)
- [github.com/psf/black-pre-commit-mirror: 23.9.1 → 24.4.2](psf/black-pre-commit-mirror@23.9.1...24.4.2)
- [github.com/pre-commit/mirrors-clang-format: v13.0.1 → v18.1.7](pre-commit/mirrors-clang-format@v13.0.1...v18.1.7)
- [github.com/pre-commit/pre-commit-hooks: v4.3.0 → v4.6.0](pre-commit/pre-commit-hooks@v4.3.0...v4.6.0)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [quant] fix bitsandbytes version check

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.7 → v18.1.8](pre-commit/mirrors-clang-format@v18.1.7...v18.1.8)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] deepseek moe expert parallel implementation

* [misc] fix typo, remove redundant file (hpcaitech#5867)

* [misc] fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] deepseek support & unit test

* [misc] remove debug code & useless print

* [misc] fix typos (hpcaitech#5872)

* [Feature] remove modeling file, use auto config. (hpcaitech#5884)

* [misc] fix typos

* [Feature] deepseek support via auto model, remove modeling file

* [misc] delete useless file

* [misc] fix typos

* [Deepseek] remove redundant code (hpcaitech#5888)

* [misc] fix typos

* [Feature] deepseek support via auto model, remove modeling file

* [misc] delete useless file

* [misc] fix typos

* [misc] remove redundant code

* [Feature/deepseek] resolve comment. (hpcaitech#5889)

* [misc] fix typos

* [Feature] deepseek support via auto model, remove modeling file

* [misc] delete useless file

* [misc] fix typos

* [misc] remove redundant code

* [misc] mv module replacement into if branch

* [misc] add some warning message and modify some code in unit test

* [misc] fix typos

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…ch#5838)

* Diffusion Model Inference support

* Stable Diffusion 3 Support

* pixartalpha support
@GuangyaoZhang deleted the sp branch on July 9, 2024 08:05
@GuangyaoZhang restored the sp branch on July 9, 2024 08:05
Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM
9 participants