Add APIs to offload states of model, optimizer, and engine #6011
Conversation
Hi @tohtana, thank you for your work. I've been trying the new APIs to test model offloading in a multi-model deployment (e.g., deepspeed-chat) as part of #5620. Although the API initially works in offloading a model and reducing GPU memory, after bringing the model back and completing the first training iteration (i.e., once the optimiser states have been updated), I get an error.
Thank you for reporting, @kfertakis! I have an example script showing the usage of the APIs. Can you try this?
So I tested the issue again with various models, and the problem seems to be model-size related: it does not occur for smaller models (i.e., <= 1B params, e.g., gpt2, gpt2-medium) but it does for bigger ones (i.e., OPT-1.3B, Mistral-7B). Is there anything I could do to investigate it further and debug it? By the way, I should mention that I'm testing this in a single-node, single-GPU configuration (i.e., a single worker), so ZeRO3 should not have to partition data across other workers. I will also test the benchmark you referenced with an artificially larger model size setting. Thanks again.
Hi @kfertakis, I tried this example with a 4B model and it worked. Can you try this in your environment?
@tohtana, I wonder if it would be useful to expose helper functions that report which device each offloadable state currently resides on, so users can verify offload/reload behavior. @kfertakis, would love to get your thoughts as well on whether any of the above would be useful? Thanks!
Hey, thanks for the comments. @tohtana, I've tried the example you provided and it does seem to work, so I'm sharing a fork of the DeepSpeed-Examples repo to showcase the problem. I've modified the DeepSpeed-Chat code to use the new offloading APIs; running it should lead to the error. @tjruwase thanks for the reference. Current problem aside, I can see how the helper functions can be useful in the future for ensuring consistency. Thanks.
Hi @kfertakis, thank you for sharing the repro. It seems that the actual issue is related to ZeRO3's prefetching. I opened #6557 as a workaround to address this issue. Can you try the branch from that PR?
Hi @tohtana, thank you for your work. I tried your branch and the issue seems to be fixed. I will continue testing and will raise any new issues, but for now the fix works.
I also wanted to ask if the offloading functionality could be extended to support |
@tjruwase Let me address this in another PR after this one is merged.
Thank you @kfertakis for validating the fix.
Let me consider how to do this. Please feel free to open a new issue to track it as I am going to merge this PR first. |
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when the actual sub-module execution order differs from the previously traced order. As a result, the state of the allgather handle for such a parameter remains `INFLIGHT`, causing functions like `empty_partition_cache` to detect it and throw an error. This PR resolves the issue by ensuring that the communication finishes and the parameters are freed. As this issue was mentioned in #6011, this PR includes the changes from that branch; we need to merge #6011 first.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
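To make the failure mode concrete, here is a small self-contained sketch of the pattern described above (not DeepSpeed's actual patch; `DummyHandle` and `Param` are illustrative stand-ins, while the status names mirror DeepSpeed's `ZeroParamStatus`):

```python
from enum import Enum

class ZeroParamStatus(Enum):
    NOT_AVAILABLE = 1  # partitioned; only the local shard is held
    INFLIGHT = 2       # allgather launched (e.g., by the prefetcher) but not awaited
    AVAILABLE = 3      # fully gathered on this rank

class DummyHandle:
    def wait(self):
        pass  # stands in for waiting on the async allgather

class Param:
    def __init__(self, name, status, handle=None):
        self.name, self.status, self.handle = name, status, handle

def empty_partition_cache(params):
    for p in params:
        if p.status is ZeroParamStatus.INFLIGHT:
            # Before the fix: a prefetched-but-unused parameter still has a
            # pending allgather, so this case raised an error. After the fix:
            # finish the communication so the buffer can be freed safely.
            p.handle.wait()
            p.status = ZeroParamStatus.AVAILABLE
        if p.status is ZeroParamStatus.AVAILABLE:
            p.status = ZeroParamStatus.NOT_AVAILABLE  # release the gathered buffer

empty_partition_cache([Param("w1", ZeroParamStatus.AVAILABLE),
                       Param("w2", ZeroParamStatus.INFLIGHT, DummyHandle())])
```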
This PR adds an API `deepspeed.runtime.zero.offload_states.get_state_devices`, which gets the devices of offload states, as suggested in this [comment](#6011 (comment)). We could lift this up to `deepspeed.utils`, but we would need to resolve a circular import: user code -> `deepspeed.utils` -> `deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` -> `deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`. This will require a significant refactoring as long as we have `OffloadStateTypeEnum` in `deepspeed.runtime.zero`.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
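A usage sketch, assuming `get_state_devices(model, state)` takes the engine and a state type and returns the set of `torch.device`s currently holding that state (the import paths follow the module names given above; verify against your DeepSpeed version):

```python
import torch
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum
from deepspeed.runtime.zero.offload_states import get_state_devices

# After engine.offload_states(), the optimizer states should live on the CPU only.
devices = get_state_devices(engine, OffloadStateTypeEnum.optim_states)
assert devices == {torch.device("cpu")}, f"unexpected devices: {devices}"
```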
This PR adds the following APIs to offload model, optimizer, and engine states.
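The API list itself is truncated in this capture; the engine methods look approximately like this (a sketch of the signatures, not the full implementation):

```python
from typing import Container, Optional

from deepspeed.runtime.zero.offload_config import OffloadDeviceEnum, OffloadStateTypeEnum

class DeepSpeedEngine:  # excerpt for illustration
    def offload_states(self,
                       include: Optional[Container[OffloadStateTypeEnum]] = None,
                       device: OffloadDeviceEnum = OffloadDeviceEnum.cpu,
                       pin_memory: bool = True,
                       non_blocking: bool = False) -> None:
        """Offload the engine's states (model, optimizer, gradients) to `device`.
        `include` selects a subset of states; None offloads all supported states."""

    def reload_states(self, non_blocking: bool = False) -> None:
        """Reload all offloaded states back to their original device."""
```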
Here is the typical usage.
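A minimal sketch of the flow (assuming a ZeRO stage-3 engine created with `deepspeed.initialize`; `model` and `ds_config` are placeholders):

```python
import deepspeed

engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)

# ... train for a while ...

# Move model, optimizer, and engine buffers to CPU to free GPU memory,
# e.g., while another model runs on the same device.
engine.offload_states()

# ... use the freed GPU memory for something else ...

# Bring everything back before resuming training.
engine.reload_states()
```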
You can selectively offload states to balance the offloading overhead and memory saving.
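For example (enum member names follow `OffloadStateTypeEnum`; treat any that differ in your DeepSpeed version as illustrative):

```python
from deepspeed.runtime.zero.offload_config import OffloadDeviceEnum, OffloadStateTypeEnum

# Offload only the optimizer states and low-precision (lp) parameters;
# keep gradients and high-precision params on the GPU.
engine.offload_states(include=[OffloadStateTypeEnum.optim_states,
                               OffloadStateTypeEnum.lp_params],
                      device=OffloadDeviceEnum.cpu,
                      pin_memory=True,     # pinned host buffers make reload faster
                      non_blocking=True)   # overlap copies with host-side work

# ...

engine.reload_states(non_blocking=True)
```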
Performance (4.3B parameters / 4x A100)
`python output_table.py`
TODO: