
Some Demos on How to config to offload tensors to nvme device #6752

Open
niebowen666 opened this issue Nov 15, 2024 · 3 comments
@niebowen666
Dear authors:
Could you share some demo configs for training a language model with ZeRO-Infinity?
I am confused about how to configure the "offload_param" and "offload_optimizer" sections.
Thanks!
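For reference, a minimal sketch of what a ZeRO-Infinity style config might look like, with both parameters and optimizer state offloaded to NVMe rather than CPU. The `nvme_path` value and the `pin_memory` settings below are placeholder assumptions to adapt to your own hardware, not values from this thread:

```python
# Sketch of a ZeRO stage-3 config with NVMe offload (ZeRO-Infinity style).
# "nvme_path" must point at a fast local NVMe mount; the path here is a
# placeholder. "pin_memory" uses pinned host buffers for faster transfers.
ds_nvme_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
        },
    },
}
```

The same dict shape works with `"device": "cpu"` if you only want CPU offload; NVMe offload additionally requires the async I/O (aio) support to be built into your DeepSpeed install.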

@niebowen666
Author

Thanks for your reply!
But I have run into some new issues.
I used the following config while training LLaMA:

ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.00006,
            "betas": [0.9, 0.95],
            "weight_decay": 0.01
        }
    },
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}

I get an "out of memory" error when I set train_batch_size to 64.
I also read your source code, and one thing still confuses me:
In runtime/engine.py, lines 314 to 322, it looks like configuring the "optimizer" as Adam means _configure_zero_optimizer will not run, so the tensors generated by the model will not be offloaded to CPU.
Is my understanding correct?

@jomayeri
Copy link
Contributor

jomayeri commented Dec 6, 2024

The ZeRO optimizer will still be configured, and offload will still occur, even if the Adam optimizer is declared in the config. As for the batch size: OOM can occur when it becomes too large.
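One common way to avoid the OOM while keeping the same effective batch size is gradient accumulation: keep the global batch at 64 but shrink the per-GPU micro batch. This is a sketch, not maintainer guidance; the world size of 4 below is an assumption. DeepSpeed enforces that `train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`:

```python
# Sketch: keep a global batch of 64 while reducing per-step activation memory
# by accumulating gradients over several small micro batches.
# Assumes 4 GPUs (world_size); adjust the factors for your setup.
world_size = 4
ds_config_update = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
}

# DeepSpeed validates this identity at initialization time.
assert (
    ds_config_update["train_micro_batch_size_per_gpu"]
    * ds_config_update["gradient_accumulation_steps"]
    * world_size
    == ds_config_update["train_batch_size"]
)
```

Activation memory scales with the micro batch, so this usually helps with OOM at a given global batch size, at the cost of more optimizer-step latency.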

@jomayeri jomayeri self-assigned this Dec 6, 2024