
Long sequence parallelism (Ulysses) integration with HuggingFace #5774

Merged
30 commits merged Aug 21, 2024

Conversation

samadejacobs
Contributor

@samadejacobs samadejacobs commented Jul 16, 2024

This PR extends DeepSpeed long-sequence (context) parallelism (a.k.a. DeepSpeed Ulysses) with support for HuggingFace models (and, by extension, other frameworks). With the HF integration, users can apply sequence parallelism to model pre-/mid-/post-training, fine-tuning, etc. Usage requires both torch >= 2.2.2 and flash-attention. ZeRO-1 and ZeRO-2 are supported; ZeRO-3 and SDPA support are in progress. The corresponding change on the HF side is PR32305.
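Conceptually, Ulysses shards the sequence dimension across ranks and performs an all-to-all before attention so that each rank ends up with the full sequence but only a subset of attention heads; a second all-to-all after attention restores the sequence-sharded layout. Below is a minimal, single-process sketch of just that data movement (plain Python, no torch; the rank/shard bookkeeping is illustrative, not DeepSpeed's actual implementation):

```python
# Single-process simulation of the Ulysses all-to-all data movement.
# Illustrative only: real Ulysses performs torch.distributed all-to-all
# on GPU activation tensors inside the attention block.

P = 4        # sequence-parallel degree (number of ranks)
SEQ = 8      # total sequence length, divisible by P
HEADS = 4    # total attention heads, divisible by P

# Each "rank" starts with a shard of the sequence: SEQ//P tokens, all heads.
# Represent each activation slice by its (token_index, head_index) pair.
rank_shards = [
    [(t, h) for t in range(r * SEQ // P, (r + 1) * SEQ // P)
            for h in range(HEADS)]
    for r in range(P)
]

def seq_to_head_alltoall(shards, num_ranks, num_heads):
    """All-to-all: sequence-sharded/all-heads -> full-sequence/head-sharded."""
    heads_per_rank = num_heads // num_ranks
    out = [[] for _ in range(num_ranks)]
    for shard in shards:
        for t, h in shard:
            out[h // heads_per_rank].append((t, h))  # route by head range
    return out

after = seq_to_head_alltoall(rank_shards, P, HEADS)
for r, shard in enumerate(after):
    # Every rank now sees all SEQ tokens ...
    assert {t for t, _ in shard} == set(range(SEQ))
    # ... but only its own HEADS // P heads, so attention runs locally per head.
    assert {h for _, h in shard} == set(
        range(r * HEADS // P, (r + 1) * HEADS // P))
```

Because attention is independent across heads, each rank can compute exact (not approximate) attention over the full sequence for its head subset, which is what lets Ulysses scale context length with modest communication.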

@samadejacobs samadejacobs added the enhancement New feature or request label Jul 16, 2024
@samadejacobs samadejacobs marked this pull request as ready for review July 20, 2024 00:27
@samadejacobs samadejacobs enabled auto-merge August 13, 2024 20:16
@loadams loadams disabled auto-merge August 15, 2024 20:35
@loadams loadams enabled auto-merge August 19, 2024 17:52
@loadams loadams added this pull request to the merge queue Aug 21, 2024
Merged via the queue into microsoft:master with commit 8b191d7 Aug 21, 2024
14 checks passed
@glowwormX

How do I enable this feature? Is there any documentation? I updated to DeepSpeed 0.15.1 and manually modified some transformers code according to your PR. During startup, I get the error "No sequence parallel group found".

@loadams
Contributor

loadams commented Sep 26, 2024

@glowwormX - can you please open an issue with your questions? That's more likely to get traction than a comment here.

@Lzhang-hub
Contributor

I tested Ulysses with:
torchrun --nproc_per_node=8 test_ulysses.py
and got this error:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/data1/nfs15/nfs/zhanglei335/mlsys/train/long-context-train/llm-train/uly_sp_test.py", line 166, in <module>
[rank6]:     get_loss(model, data_loader, DS_CONFIG)
[rank6]:   File "/data1/nfs15/nfs/zhanglei335/mlsys/train/long-context-train/llm-train/uly_sp_test.py", line 112, in get_loss
[rank6]:     model, _, _, _ = deepspeed.initialize(model=model,
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank6]:     engine = DeepSpeedEngine(args=args,
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank6]:     self._configure_distributed_model(model)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1188, in _configure_distributed_model
[rank6]:     self.data_parallel_group = groups._get_data_parallel_group()
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 405, in _get_data_parallel_group
[rank6]:     return mesh_device.get_group(mesh_dim="data_parallel")
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/device_mesh.py", line 423, in get_group
[rank6]:     _find_pg_by_ranks_and_tag(*self._dim_group_infos[mesh_dim][:2])
[rank6]: IndexError: list index out of range
[rank7]: Traceback (most recent call last):
[rank7]:   File "/data1/nfs15/nfs/zhanglei335/mlsys/train/long-context-train/llm-train/uly_sp_test.py", line 166, in <module>
[rank7]:     get_loss(model, data_loader, DS_CONFIG)
[rank7]:   File "/data1/nfs15/nfs/zhanglei335/mlsys/train/long-context-train/llm-train/uly_sp_test.py", line 112, in get_loss
[rank7]:     model, _, _, _ = deepspeed.initialize(model=model,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank7]:     engine = DeepSpeedEngine(args=args,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank7]:     self._configure_distributed_model(model)
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1188, in _configure_distributed_model
[rank7]:     self.data_parallel_group = groups._get_data_parallel_group()
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 405, in _get_data_parallel_group
[rank7]:     return mesh_device.get_group(mesh_dim="data_parallel")
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/device_mesh.py", line 423, in get_group
[rank7]:     _find_pg_by_ranks_and_tag(*self._dim_group_infos[mesh_dim][:2])
[rank7]: IndexError: list index out of range
