[BUG] DeepSpeed Ulysses zero3 compatibility #6582

Open
Xirid opened this issue Sep 27, 2024 · 1 comment
Labels: bug, training

Comments


Xirid commented Sep 27, 2024

Describe the bug
Training a Hugging Face model (Llama 3.1 with PEFT) on long context with sequence_parallel_size > 1 only works up to ZeRO stage 2.
If I set "stage" to 3, I get the following error:

[rank1]:   File "/root/miniconda3/envs/finetuning/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1464, in partition_grads
[rank1]:     grad_buffer = self.__param_id_to_grad_partition[param.ds_id].narrow(0, 0, grad_partition.numel())
[rank1]: RuntimeError: start (0) + length (8388608) exceeds dimension size (4194304).

I also had to disable this assertion when switching from ZeRO-1 to ZeRO-3:

assert train_batch == micro_batch * grad_acc * self.world_size

So maybe there is an issue with how world_size is defined when running ZeRO-3 (though even after fixing it to the correct world size and device_mesh, the same error occurs)?
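
To illustrate what I mean about world_size, here is a rough sketch of the batch-size arithmetic I would expect once sequence-parallel ranks share the same samples. All names and numbers below are mine, purely illustrative, not taken from DeepSpeed. Note also that the failing narrow() asks for exactly twice the pre-allocated partition (8388608 = 2 * 4194304), which looks like the same kind of size mismatch.

# Purely illustrative sketch (my names, not DeepSpeed's) of the batch-size
# relation I would expect under sequence parallelism.
world_size = 8                    # total GPUs in the job
sp_size = 2                       # sequence_parallel_size
dp_size = world_size // sp_size   # ranks that actually see distinct samples

micro_batch = 1
grad_acc = 4

# What the current assertion enforces:
#   train_batch == micro_batch * grad_acc * world_size   -> 32
# What I would expect with Ulysses, since sp ranks consume the same batch:
train_batch = micro_batch * grad_acc * dp_size            # -> 16
print(train_batch)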

To Reproduce
Running the example from:
DeepSpeedExamples/post_training/sequence_parallelism/test_ulysses.py
with:

 "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": True
    },

on top of the HF PR huggingface/transformers#32305.
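
In case it helps, this is roughly the shape of the call where that config ends up in my run. It is a simplified sketch, not a verbatim copy of test_ulysses.py: a toy model stands in for Llama 3.1 + PEFT, only part of the config is shown, and the batch-size values are illustrative.

# Simplified sketch of how the stage-3 config is handed to DeepSpeed in my setup.
# Run under the deepspeed launcher so distributed init happens as usual.
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,                 # illustrative values
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

model = torch.nn.Linear(4096, 4096)        # stand-in for the real model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)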

Expected behavior
ZeRO-3 should work, as stated in the official blog post.

ds_report output

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/finetuning/lib/python3.10/site-packages/torch']
torch version .................... 2.4.1+cu121
deepspeed install path ........... ['/root/miniconda3/envs/finetuning/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 321.31 GB

System info:

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version
  • Any other relevant info about your setup

Launcher context
I am using the deepspeed launcher.

Thanks for the help!
Even if this is not officially supported, I would be thankful for some pointers so I can implement something on my own.
For context:
We want to train a 70B model at a sequence length of 60k. 8B already works with Ulysses, but without ZeRO-3 I think 70B is impossible on a single node.
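
Rough napkin math behind that claim (my numbers, assuming the base weights stay in bf16 and most parameters are frozen via PEFT):

# Why I think 70B needs ZeRO-3 parameter partitioning / offload on one node:
params = 70e9
bf16_weights_gb = params * 2 / 1e9    # ~140 GB of frozen base weights alone
print(bf16_weights_gb)
# With ZeRO-1/2 every rank still holds a full copy of those weights, which
# (plus activations for a ~60k-token sequence) does not fit on a single GPU;
# only ZeRO-3 shards the parameters themselves across the node.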

Xirid added the bug and training labels on Sep 27, 2024
samadejacobs (Contributor) commented

@Xirid, ZeRO stage 3 is currently not supported in DeepSpeed long-context parallelism (Ulysses). ZeRO-3 support is on our roadmap; contributions are welcome!
