Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark".
Authors (* Equal Contribution): Yihua Zhang*, Pingzhi Li*, Junyuan Hong*, Jiaxiang Li*, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, and Tianlong Chen
This repo contains the source code and reproduction guide for ZO-LLM. This research endeavor is designed to help researchers better understand the capabilities, limitations, and principles of BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during Large Language Model (LLM) fine-tuning. Our study reveals previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
This project is organized around the following scopes:
- Five LLM families: RoBERTa, OPT, LLaMA, Vicuna, and Mistral.
- Three task complexities: binary classification, question-answering, and commonsense reasoning.
- Four fine-tuning schemes: full fine-tuning, LoRA, prefix tuning, and prompt tuning.
- Six BP-free optimization methods: ZO-SGD, ZO-SGD-Sign, ZO-SGD-MMT, ZO-SGD-Cons, ZO-Adam, and forward gradient.
- Three novel enhancements to ZO optimization: block-wise descent, hybrid training, and gradient sparsity.
This project is structured around hyperparameter sweeps over tasks, models, tuning schemes, and optimization methods. All optimization methods are implemented in zo-bench/trainer.py. Task configurations are defined in zo-bench/tasks.py and zo-bench/templates.py. The main entry point is zo-bench/run.py.
.
├── zo-bench
│ ├── modeling_mistral
│ │ ├── __init__.py
│ │ ├── configuration_mistral.py
│ │ ├── modeling_mistral.py
│ ├── modeling_llama.py
│ ├── modeling_opt.py
│ ├── modeling_roberta.py
│ ├── prefix_tuning.py
│ ├── prompt_tuning.py
│ ├── run.py
│ ├── tasks.py
│ ├── templates.py
│ ├── test_fake_text_memory.py
│ ├── trainer.py
│ ├── utils.py
│ ├── sweeps
│ │ ├── Copa_llama-7b
│ │ │ ├── adam
│ │ │ │ ├── adam_copa_ft.yml
│ │ │ │ ├── adam_copa_lora.yml
│ │ │ │ ├── adam_copa_prefix.yml
│ │ │ │ ├── adam_copa_prompt.yml
│ │ │ ├── forward_grad
│ │ │ │ ├── forward_grad_copa_ft.yml
│ │ │ │ ├── forward_grad_copa_lora.yml
│ │ │ │ ├── forward_grad_copa_prefix.yml
│ │ │ │ ├── forward_grad_copa_prompt.yml
│ │ │ ├── sgd
│ │ │ │ ├── sgd_copa_ft.yml
│ │ │ │ ├── sgd_copa_lora.yml
│ │ │ │ ├── sgd_copa_prefix.yml
│ │ │ │ ├── sgd_copa_prompt.yml
│ │ │ ├── sign_sgd
│ │ │ │ ├── sign_sgd_copa_ft.yml
│ │ │ │ ├── sign_sgd_copa_lora.yml
│ │ │ │ ├── sign_sgd_copa_prefix.yml
│ │ │ │ ├── sign_sgd_copa_prompt.yml
│ │ │ ├── zo_adam
│ │ │ │ ├── zo_adam_copa_ft.yml
│ │ │ │ ├── zo_adam_copa_lora.yml
│ │ │ │ ├── zo_adam_copa_prefix.yml
│ │ │ │ ├── zo_adam_copa_prompt.yml
│ │ │ ├── zo_sgd
│ │ │ │ ├── zo_sgd_copa_ft.yml
│ │ │ │ ├── zo_sgd_copa_lora.yml
│ │ │ │ ├── zo_sgd_copa_prefix.yml
│ │ │ │ ├── zo_sgd_copa_prompt.yml
│ │ │ ├── zo_sgd_conserv
│ │ │ │ ├── zo_sgd_conserv_copa_ft.yml
│ │ │ │ ├── zo_sgd_conserv_copa_lora.yml
│ │ │ │ ├── zo_sgd_conserv_copa_prefix.yml
│ │ │ │ ├── zo_sgd_conserv_copa_prompt.yml
│ │ │ ├── zo_sgd_momen
│ │ │ │ ├── zo_sgd_momen_copa_ft.yml
│ │ │ │ ├── zo_sgd_momen_copa_lora.yml
│ │ │ │ ├── zo_sgd_momen_copa_prefix.yml
│ │ │ │ ├── zo_sgd_momen_copa_prompt.yml
│ │ ├── Copa_llama-13b
│ │ │ ├── ...
│ │ ├── Copa_mistral
│ │ │ ├── ...
│ │ ├── Copa_opt-13b
│ │ │ ├── ...
│ │ ├── Copa_vicuna
│ │ │ ├── ...
│ │ ├── SST2_opt-1.3b
│ │ │ ├── ...
│ │ ├── WinoGrande_llama-7b
│ │ │ ├── ...
│ │ ├── WinoGrande_llama-13b
│ │ │ ├── ...
│ │ ├── WinoGrande_mistral
│ │ │ ├── ...
│ │ ├── WinoGrande_opt-13b
│ │ │ ├── ...
│ │ ├── WinoGrande_vicuna
│ │ │ ├── ...
├── environment.yml
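All ZO optimizers above share a single randomized two-point (SPSA-style) gradient estimator, implemented in zo-bench/trainer.py. For orientation, here is a minimal, self-contained sketch of that estimator; the function and argument names are illustrative and are not the repo's exact API:

import torch

def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-7, seed=0):
    """Two-point (SPSA-style) ZO-SGD step -- illustrative sketch only."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        # Re-seeding regenerates the same random direction z on every call,
        # so z never has to be stored; this is what keeps memory costs low.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                       # restore theta

        projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

        torch.manual_seed(seed)             # reuse the same z as the descent direction
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * projected_grad * z)

The ZO variants (sign, momentum, conservative, Adam) differ mainly in how this projected gradient is turned into an update, not in the forward-only evaluation pattern.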
All you need is:
conda create -n zollm python=3.10
conda activate zollm
pip install -r requirements.txt
We provide detailed hyperparameter settings in sweep configurations, where the sweep for tuning MODEL on TASK under SCHEME with OPTIMIZER is located at zo-bench/sweeps/TASK_MODEL/OPTIMIZER/OPTIMIZER_TASK_SCHEME.yml.
An example of using a sweep for full fine-tuning of LLaMA-7B with ZO-SGD on the COPA task:
~> wandb sweep zo-bench/sweeps/Copa_llama-7b/zo_sgd/zo_sgd_copa_ft.yml
wandb: Creating sweep from: zo-bench/sweeps/Copa_llama-7b/zo_sgd/zo_sgd_copa_ft.yml
wandb: Created sweep with ID: <ID>
wandb: View sweep at: https://wandb.ai/<unique ID>
wandb: Run sweep agent with: wandb agent <unique ID>
~> wandb agent <unique ID>
For the extended study, please check the following:
- Block-wise ZO: add the argument --module_wise_perturbation=True to the command line (a minimal sketch of the idea follows this list). Note that currently only OPT-family models are supported. For example:
python run.py --model_name=facebook/opt-1.3b --task_name=SST2 --output_dir=result/SST2-ft-$TAG --num_train_epochs=5 \
--per_device_train_batch_size=16 --load_best_model_at_end --evaluation_strategy=steps --save_strategy=steps \
--save_total_limit=1 --eval_steps=1000 --max_steps=20000 --logging_steps=10 --num_eval=1000 --num_train=1000 \
--num_dev=500 --train_as_classification --perturbation_mode=two_side --trainer=zo_sgd --train_set_seed=0 \
--lr_scheduler_type=constant --save_steps=1000 --load_float16 --learning_rate=1e-8 --zo_eps=0.001 --momentum=0.9 \
--weight_decay=0 --module_wise_perturbation=True
- Gradient Pruning: the corresponding arguments are sparse_gradient_group, gradient_sparsity, and sparse_gradient_resample_steps; their options are documented in the corresponding comments in run.py. An example sweep configuration is zo-bench/sweeps/SST2_opt-1.3b/zo_sgd_sparse_grad/zo_sgd_sparse_grad_cls_ft.yml. A minimal illustration of the sparse-perturbation idea also follows this list.
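As referenced above, here is a rough sketch of the block-wise (module-wise) perturbation idea: the ZO estimate is formed and applied one module at a time instead of over all parameters jointly. Names and structure below are illustrative only and do not reproduce the exact logic behind --module_wise_perturbation:

import torch

def module_wise_zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-7):
    """Illustrative block-wise ZO-SGD: perturb and update one module at a time."""
    for module in model.children():                 # treat each top-level module as a block
        params = [p for p in module.parameters() if p.requires_grad]
        if not params:
            continue
        zs = [torch.randn_like(p) for p in params]  # random direction for this block only

        with torch.no_grad():
            for p, z in zip(params, zs):
                p.data.add_(eps * z)
            loss_plus = loss_fn(model, batch)
            for p, z in zip(params, zs):
                p.data.add_(-2.0 * eps * z)
            loss_minus = loss_fn(model, batch)
            for p, z in zip(params, zs):
                p.data.add_(eps * z)                # restore this block

            projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
            for p, z in zip(params, zs):
                p.data.add_(-lr * projected_grad * z)   # update only this block

Each perturbation then lives in a much lower-dimensional space, which is the usual intuition for why block-wise descent can reduce the variance of the ZO estimate.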
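And a minimal, hypothetical illustration of the gradient-sparsity idea: only a random fraction of coordinates in the perturbation direction z is kept, so the zeroed coordinates are neither perturbed nor updated. How the masks are grouped and refreshed is governed by sparse_gradient_group and sparse_gradient_resample_steps in the actual code (see the comments in run.py); the snippet below sketches the masking idea only, not that implementation:

import torch

def sparsify_direction(z: torch.Tensor, gradient_sparsity: float = 0.9) -> torch.Tensor:
    """Zero out a `gradient_sparsity` fraction of the ZO perturbation direction (sketch)."""
    mask = (torch.rand_like(z) > gradient_sparsity).to(z.dtype)
    return z * mask

In the two-point estimator sketched earlier, z would simply be replaced by sparsify_direction(z) in both the perturbation and the update.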
@misc{zhang2024revisiting,
title={Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark},
author={Yihua Zhang and Pingzhi Li and Junyuan Hong and Jiaxiang Li and Yimeng Zhang and Wenqing Zheng and Pin-Yu Chen and Jason D. Lee and Wotao Yin and Mingyi Hong and Zhangyang Wang and Sijia Liu and Tianlong Chen},
year={2024},
eprint={2402.11592},
archivePrefix={arXiv},
primaryClass={cs.LG}
}