forked from cofe-ai/MSG

Masked Structural Growth for 2x Faster Language Model Pre-training

Masked Structural Growth

We grow language models during pre-training using efficient schedules and function-preserving growth operators, yielding a 2x speedup.

MSG paper: https://arxiv.org/abs/2305.02869
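At its core, masked structural growth adds new neurons behind a mask initialized to zero, so the model's function is unchanged at the exact moment of growth; the mask is then ramped up during subsequent training. A minimal NumPy sketch of width growth (function names and initialization here are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

# Illustrative sketch of MSG-style width growth (hypothetical helper,
# not the repo's API). A layer grows from d_old to d_new output units;
# the new units are multiplied by a mask initialized to 0, so the
# layer's function is preserved at the growth step.

def grow_width(W, b, d_new, rng):
    """Expand a (d_old, d_in) weight matrix to (d_new, d_in)."""
    d_old, d_in = W.shape
    W_new = np.vstack([W, rng.normal(0, 0.02, size=(d_new - d_old, d_in))])
    b_new = np.concatenate([b, np.zeros(d_new - d_old)])
    # Mask: 1 for old units, 0 for the newly added ones.
    mask = np.concatenate([np.ones(d_old), np.zeros(d_new - d_old)])
    return W_new, b_new, mask

def layer(x, W, b, mask):
    return mask * (W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
b = rng.normal(size=4)
x = rng.normal(size=8)
out_before = W @ x + b

W2, b2, mask = grow_width(W, b, 6, rng)
out_after = layer(x, W2, b2, mask)

# The first d_old outputs match exactly; the new ones are masked to 0.
assert np.allclose(out_after[:4], out_before)
assert np.allclose(out_after[4:], 0.0)
```

In training, the zero entries of the mask would then be annealed toward 1 so the new capacity comes online gradually.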

Quick Start

The following example shows how to run MSG on public BERT pre-training data.

  1. Pre-processing

    python preprocess_bert_data.py

This generates static masks for the raw data.

  2. Run MSG

For BERT-base:

    sh grow_bert_base.sh

For BERT-large:

    sh grow_bert_large.sh
  3. Evaluation

    cd glue_eval
    sh run_glue_together_with_stat.sh
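As a rough illustration of step 1's static masking (assumed behavior; the repo's actual logic is in preprocess_bert_data.py), masked positions are sampled once at preprocessing time and stored with each example, rather than re-sampled every epoch:

```python
import random

# Illustrative sketch of static MLM masking (an assumption about what
# "static masks" means here; see preprocess_bert_data.py for the real
# implementation). Mask positions are fixed per example by a seed.

MASK_TOKEN = "[MASK]"

def make_static_mask(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # never mask special tokens
        if rng.random() < mask_prob:
            labels[i] = tok          # predict the original token here
            masked[i] = MASK_TOKEN   # BERT's 80/10/10 rule omitted for brevity
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = make_static_mask(tokens, mask_prob=0.3)
```

Because the seed is fixed per example, re-running preprocessing reproduces the same masks, which makes loss comparisons across growth stages deterministic.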

Notes

To verify function preservation, modify configs/*.json and set "attention_probs_dropout_prob" and "hidden_dropout_prob" to 0.0. Depending on the PyTorch version, there may still be negligible differences in loss before and after growth.
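A hedged sketch of such a function-preservation check (illustrative, not the repo's code): with dropout disabled, the output on a fixed input should be identical immediately before and after a depth-growth step, because the new block enters behind a zero mask:

```python
import numpy as np

# Illustrative function-preservation check (hypothetical model, not the
# repo's). Each residual block is gated by a scalar mask; a newly grown
# block starts with mask = 0, making it an identity map.

def forward(x, layers, masks):
    h = x
    for (W, b), m in zip(layers, masks):
        # mask = 0 turns the block into an identity: h + 0 * f(h) == h
        h = h + m * np.tanh(W @ h + b)
    return h

rng = np.random.default_rng(1)
x = rng.normal(size=16)
layers = [(rng.normal(size=(16, 16)) * 0.1, np.zeros(16))]
masks = [1.0]
out_before = forward(x, layers, masks)

# Grow depth: append a new block with its mask set to 0.
layers.append((rng.normal(size=(16, 16)) * 0.1, np.zeros(16)))
masks.append(0.0)
out_after = forward(x, layers, masks)

# Function preserved exactly (with dropout off and no other noise).
assert np.abs(out_after - out_before).max() < 1e-8
```

In the real model, residual dropout would perturb this comparison, which is why the note above suggests zeroing the dropout probabilities before checking.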

References

If this project helps you, please cite us. Thanks!

@article{DBLP:journals/corr/abs-2305-02869,
  author       = {Yiqun Yao and
                  Zheng Zhang and
                  Jing Li and
                  Yequan Wang},
  title        = {2x Faster Language Model Pre-training via Masked Structural Growth},
  journal      = {CoRR},
  volume       = {abs/2305.02869},
  year         = {2023}
}
