Efficient 2:4 Sparse Pre-training Examples

This repository contains code for the following papers. For the latest version of our toolkit, please install it from https://github.com/huyz2023/2by4-pretrain.

Accelerating Transformer Pre-training with 2:4 Sparsity [arXiv] [OpenReview] [PDF]

Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu

International Conference on Machine Learning (ICML), 2024

S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training [arXiv] [OpenReview]

Yuezhou Hu, Jun Zhu, Jianfei Chen

Neural Information Processing Systems (NeurIPS), 2024

Installation

From source:

git clone --recursive https://github.com/huyz2023/2by4-pretrain
cd 2by4-pretrain
pip install -e .

Please refer to https://github.com/huyz2023/2by4-pretrain for more details about our toolkit.

Training

The different folders include different methods:

  • original: the baselines used in both papers.
  • v1: transposable SR-STE + dense fine-tuning (from "Accelerating Transformer Pre-training with 2:4 Sparsity").
  • v2: S-STE + FP8 (simulation only; from "S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training").

By default, a minimum-variance unbiased estimator (MVUE) is applied in the backward pass to compute linear.weight.grad.
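The idea, in simplified form: one operand of the weight-gradient matmul is pruned to a 2:4 pattern at random in a way that leaves the expected gradient unchanged, so the matmul can run on sparse tensor cores. Below is a minimal, illustrative sketch of an unbiased 2:4 estimator; it samples uniformly rather than with the minimum-variance probabilities used in the actual sparse_ops.py, and the function name random_2to4 is ours, not the toolkit's.

import torch

def random_2to4(x: torch.Tensor) -> torch.Tensor:
    """Toy unbiased 2:4 estimator: keep 2 random entries of every 4 and rescale.

    E[output] == x, because each entry survives with probability 1/2 and is
    multiplied by 2. The real MVUE chooses the kept pair with probabilities
    that minimize the estimator's variance; this uniform version is only a sketch.
    """
    assert x.numel() % 4 == 0, "grouping assumes the number of elements is a multiple of 4"
    g = x.reshape(-1, 4)
    keep = torch.rand_like(g.float()).argsort(dim=1)[:, :2]   # 2 random slots per group of 4
    mask = torch.zeros_like(g).scatter_(1, keep, 1.0)
    return (g * mask * 2.0).reshape(x.shape)                  # rescale so the estimate is unbiased

In the repository, the sparsified operand feeds a 2:4 sparse matmul so that computing linear.weight.grad is accelerated; the sketch above only shows how unbiasedness is obtained.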

nanoGPT

We recommend running the nanoGPT scripts, since they require the fewest modifications to the original code and are easy to read. Our code is copied from the original https://github.com/karpathy/nanoGPT.

Common instructions:

To replicate pre-training, enter the */nanoGPT folder and run:

torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

For evaluation (GLUE and SQuAD), enter the */nanoGPT/finetune_* folder and run:

sh run.sh

We only provide scripts to replicate GPT-2 124M. To replicate the other sizes in the paper, follow the instructions below.

  1. Model arguments: modify (n_layer, n_head, n_embd) in */nanoGPT/train.py: (12, 12, 768) for GPT-2 124M, (24, 16, 1024) for GPT-2 350M, (36, 20, 1280) for GPT-2 774M, and (48, 25, 1600) for GPT-2 1558M.
  2. (v1 only) Masked decay factor: modify the masked decay factor alpha in v1/nanoGPT/train.py. For GPT-2 124M and 774M it should be 6e-5; for GPT-2 350M and 1558M it should be 2e-4.
  3. Evaluation script: modify hf_model in */nanoGPT/finetune_*/run.sh to match the model size (gpt2, gpt2-medium, gpt2-large, gpt2-xl).
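For example, targeting GPT-2 350M would mean overrides roughly like the following in */nanoGPT/train.py (variable names follow upstream nanoGPT; where exactly they are set may differ in this repository):

# GPT-2 350M, per step 1 above
n_layer = 24
n_head = 16
n_embd = 1024

# (v1 only) masked decay factor, per step 2 above
alpha = 2e-4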

Hyperparameters for pre-training and evaluation:

Pre-training              Value
learning rate             1.5e-4
minimum learning rate     1e-5
batch size                512
sequence length           1024
max iters                 300k
warmup                    3k

Evaluation                Value
learning rate             5e-6
warmup ratio              0.1
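In nanoGPT these map onto config variables; a hypothetical excerpt of */nanoGPT/config/train_gpt2.py consistent with the table might look as follows (names follow upstream nanoGPT, and the split of the 512-sequence batch across GPUs and accumulation steps is our assumption, not necessarily the repository's setting):

learning_rate = 1.5e-4
min_lr = 1e-5
max_iters = 300000
lr_decay_iters = 300000
warmup_iters = 3000
block_size = 1024            # sequence length
# a global batch of 512 sequences, e.g. 8 per GPU x 64 accumulation/DDP steps
batch_size = 8
gradient_accumulation_steps = 8 * 8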

What changes are made from the original nanoGPT repository?

  1. Change hyperparameters.

  2. Use float16 and GradScaler for training stability.

  3. Replace nn.Linear in the FFN block with FP8SparseLinear or SparseLinearTranspose; the relevant code is in sparse_ops.py. (A sketch of this swap follows the list below.)

  4. (v1 only) Masked decay and dense fine-tuning:

     for micro_step in range(gradient_accumulation_steps):
         ...
     # clip the gradient
     if grad_clip != 0.0:
         scaler.unscale_(optimizer)
         torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
     #################### content added ####################
     # masked decay: add alpha-scaled, mask-selected weight values to the gradient
     # so that weights pruned by the 2:4 mask are decayed (SR-STE regularization)
     with torch.no_grad():
         for p in model.parameters():
             if hasattr(p, 'mask') and p.mode == 'sparse':
                 p.grad = p.grad.float()
                 masked_add_(p.grad.data, p.data, p.mask, alpha=alpha)
                 p.cnt = 0
     # at iteration 250k (of 300k total), switch sparse layers to dense fine-tuning
     if iter_num == 250000:
         for p in model.parameters():
             if hasattr(p, 'mask') and p.mode == 'sparse':
                 p.mode = 'dense'
     #################### content added ####################
     # step the optimizer and scaler if training in fp16
     scaler.step(optimizer)
     scaler.update()
     # flush the gradients as soon as we can, no need for this memory anymore
     optimizer.zero_grad(set_to_none=True)
     ...
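
To illustrate change 3, a swap in nanoGPT's MLP block might look roughly like this. This is a sketch under the assumption that SparseLinearTranspose (v1) and FP8SparseLinear (v2) expose an nn.Linear-like (in_features, out_features, bias=...) constructor; check sparse_ops.py for the actual signatures.

import torch.nn as nn
from sparse_ops import SparseLinearTranspose  # v1; v2 would use FP8SparseLinear

class MLP(nn.Module):
    """nanoGPT's feed-forward block with its two nn.Linear layers swapped out."""
    def __init__(self, n_embd: int, bias: bool = True, dropout: float = 0.0):
        super().__init__()
        self.c_fc = SparseLinearTranspose(n_embd, 4 * n_embd, bias=bias)     # was nn.Linear
        self.gelu = nn.GELU()
        self.c_proj = SparseLinearTranspose(4 * n_embd, n_embd, bias=bias)   # was nn.Linear
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))

The attention projections are left as dense nn.Linear layers in this sketch; only the FFN block is sparsified, as described in change 3.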
    

Citation

If you find our work useful, please cite:

@inproceedings{
  hu2024accelerating,
  title={Accelerating Transformer Pre-training with 2:4 Sparsity},
  author={Yuezhou Hu and Kang Zhao and Weiyu Huang and Jianfei Chen and Jun Zhu},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=kTaX87Zn6M}
}
@inproceedings{
  hu2024sste,
  title={S-{STE}: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training},
  author={Yuezhou Hu and Jun Zhu and Jianfei Chen},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=8abNCVJs2j}
}
