This repository is still under development and may contain bugs.
This is the official repository for the paper Inheritune: Training Smaller Yet More Attentive Language Models.
Large Language Models (LLMs) have achieved remarkable performance across various natural language processing tasks, primarily due to the transformer architecture and its self-attention mechanism. However, we observe that in standard decoder-style LLMs, the attention matrices of deeper layers degenerate to a single column. Layers in this state cannot learn anything meaningful and are mostly redundant; we refer to them as lazy layers. The goal of this paper is to train smaller models that eliminate this structural inefficiency without compromising performance.
Motivated by this observation, we propose Inheritune, a simple yet effective training recipe for developing smaller, high-performing language models. Smaller models trained with Inheritune inherit the early transformer layers from a larger pre-trained model, are then retrained, and are progressively expanded until they match or exceed the performance of the larger model. We demonstrate that Inheritune enables training GPT-2 models of various sizes on datasets such as OpenWebText-9B and FineWeb_Edu. Despite having significantly fewer layers, models trained with Inheritune match or even surpass the performance of their larger counterparts. For instance, our 16-layer GPT-2 medium variant achieves performance comparable to the standard 24-layer GPT-2 medium model.
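For intuition, the core inheritance step can be sketched in a few lines of PyTorch. The snippet below is only an illustration using Hugging Face `transformers` GPT-2 checkpoints; it is not this repo's training code (which is adapted from nanoGPT/litgpt), and the function and variable names are ours.

```python
# Illustrative sketch of the layer-inheritance step (not the repo's exact code).
# Assumes Hugging Face `transformers`; the paper's experiments use nanoGPT/litgpt-style code.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

def inherit_first_k_layers(reference_name: str = "gpt2-medium", k: int = 16) -> GPT2LMHeadModel:
    """Build a k-layer GPT-2 whose embeddings and first k blocks
    are copied from a larger pre-trained reference model."""
    reference = GPT2LMHeadModel.from_pretrained(reference_name)

    # Same architecture as the reference, but with only k transformer blocks.
    cfg = GPT2Config.from_pretrained(reference_name, n_layer=k)
    small = GPT2LMHeadModel(cfg)

    # Copy the token/position embeddings and the final layer norm.
    small.transformer.wte.load_state_dict(reference.transformer.wte.state_dict())
    small.transformer.wpe.load_state_dict(reference.transformer.wpe.state_dict())
    small.transformer.ln_f.load_state_dict(reference.transformer.ln_f.state_dict())

    # Inherit the first k transformer blocks from the reference model.
    for i in range(k):
        small.transformer.h[i].load_state_dict(reference.transformer.h[i].state_dict())

    # GPT-2 ties the LM head to the token embeddings, so the head is inherited as well.
    return small

# The inherited model is then retrained on the target data and, if needed,
# progressively expanded with additional blocks.
small_model = inherit_first_k_layers("gpt2-medium", k=16)
```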
An analysis of a 36-layer GPT-2 large model shows the maximum rank of the attention matrices across all layers.
A closer look at the same GPT-2 large model reveals that, particularly in deeper layers, the attention mass of several attention matrices is concentrated in a single column.
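The kind of diagnosis behind these plots can be approximated with a short script: run a batch through the model with attention outputs enabled and, per layer, measure the numerical rank of the attention matrices and the fraction of attention mass that lands in a single key column. The sketch below is only an illustration built on Hugging Face `transformers`; the rank tolerance and the exact metric definitions used in the paper may differ.

```python
# Illustrative diagnostic for "lazy layers": per-layer attention rank and
# single-column mass concentration. The tolerance and metric here are
# assumptions, not the paper's exact definitions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2-large"  # 36-layer GPT-2 large
tok = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, output_attentions=True).eval()

text = "Large language models have achieved remarkable performance across many tasks."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
for layer_idx, attn in enumerate(out.attentions):
    a = attn[0]  # drop the batch dimension -> (heads, seq, seq)

    # Numerical rank of each head's attention matrix, reported as a per-layer max.
    ranks = [torch.linalg.matrix_rank(a[h], atol=1e-4).item() for h in range(a.shape[0])]

    # Fraction of total attention mass captured by the single heaviest key column,
    # averaged over heads. Values near 1.0 mean the matrix is essentially one column.
    col_mass = a.sum(dim=1)                      # (heads, seq): mass received by each key
    top_frac = (col_mass.max(dim=-1).values / col_mass.sum(dim=-1)).mean().item()

    print(f"layer {layer_idx:2d}  max rank = {max(ranks):3d}  top-column mass ~= {top_frac:.2f}")
```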
The model derived using Inheritune converges faster and matches the final validation loss of the full-sized model trained from scratch, despite being smaller. Here, a vanilla GPT-2 xlarge model trained from scratch is compared with our variants, all trained on OpenWebText-9B for 100K steps.
Models derived using Inheritune converge faster and match the final validation loss of the full-sized model despite using fewer layers.
| Models | Recipe | Layers | ARC-E (acc) | PIQA (acc) | SciQ (acc) | HellaSwag (acc norm) | LAMBADA (acc) | Average |
|---|---|---|---|---|---|---|---|---|
| **GPT-2 Medium** | | | | | | | | |
| | rand init | 24 | 51.05 | 61.81 | 74.8 | 30.79 | 20.28 | 47.74 |
| | rand init | 16 | 49.92 | 61.92 | 73.3 | 29.56 | 19.54 | 46.84 |
| | Ours | 16 | 51.26 | 61.81 | 73.8 | 30.55 | 23 | 48.08 |
| **GPT-2 Large†** | | | | | | | | |
| | rand init | 32 | 52.48 | 64.58 | 75.3 | 32.65 | 22.2 | 49.44 |
| | rand init | 16 | 50.34 | 63.11 | 75 | 30.86 | 21.56 | 48.17 |
| | Ours | 16 | 52.9 | 63.55 | 76.1 | 32.14 | 24.06 | 49.75 |
Models trained with Inheritune outperform both their larger and same-sized counterparts trained from scratch in average zero-shot downstream performance. For evaluation, we use accuracy (acc) and normalized accuracy (acc norm) metrics, following the Open LLM Leaderboard. All models are trained on FineWeb_Edu.
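The zero-shot numbers above are of the kind reported by EleutherAI's lm-evaluation-harness. As a rough illustration (the exact harness version, task variants, and settings behind the paper's numbers are not specified here), a local checkpoint could be evaluated as follows; the checkpoint path is a placeholder.

```python
# Hypothetical evaluation sketch using EleutherAI's lm-evaluation-harness
# (pip install lm-eval, assuming a recent 0.4.x release). The task list mirrors
# the table above; the settings used for the paper's numbers may differ.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/inheritune-gpt2-medium-16L",  # placeholder path
    tasks=["arc_easy", "piqa", "sciq", "hellaswag", "lambada_openai"],
    num_fewshot=0,
    batch_size=8,
)

# Print per-task metrics (e.g. acc and acc_norm where available).
for task, metrics in results["results"].items():
    print(task, metrics)
```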
Performance of our 1.5B base LM, derived with Inheritune using only 1B tokens, averaged over 9 different datasets (left) and on the MMLU benchmark (right), which together evaluate commonsense reasoning, truthfulness, natural language inference, and language understanding. We compare our model against its reference model, OpenLLaMA-3B (2x its size), against other small base LMs of 1B-2B parameters pre-trained from scratch, such as MPT-1.3B, OPT-1.3B, and Pythia-1.4B, and against Sheared LLaMA-1.3B (pruned and continually trained from an existing large base LM).
Below, we compare our target model with the reference model and with baselines of similar size that were either pre-trained from scratch or pre-trained with inherited weights and pruning. Although trained with far fewer tokens, our model achieves comparable performance. We highlight all scores where our model reaches at least 90% of its reference LM's score or outperforms at least two of the baselines. All tasks are evaluated 0-shot except MMLU, which is 5-shot. Models marked n/a are trained from scratch.
**Commonsense reasoning**

| Model (# train tokens) | Reference | Winograd | PIQA | BoolQ | WinoGrande | LogiQA |
|---|---|---|---|---|---|---|
| OpenLLaMA-3B (1T) | n/a | 63.46 | 74.97 | 67.18 | 62.27 | 28.4 |
| OPT-1.3B (300B) | n/a | 38.46 | 71.82 | 57.83 | 59.51 | 27.04 |
| Pythia-1.4B (300B) | n/a | 36.54 | 70.89 | 63.12 | 56.99 | 27.65 |
| MPT-1.3B (200B) | n/a | 63.46 | 71.44 | 50.89 | 58.09 | 28.26 |
| Sheared LLaMA-1.3B (50B) | LLaMA2-7B | 36.54 | 73.45 | 62.02 | 58.17 | 27.34 |
| Ours-1.5B (1B) | OpenLLaMA-3B | 50.96 | 56.47 | 61.68 | 51.69 | 25.19 |
**Language understanding, inference & factuality**

| Model (# train tokens) | Reference | MMLU (5-shot) | WNLI | QNLI | MNLI | TruthfulQA |
|---|---|---|---|---|---|---|
| OpenLLaMA-3B (1T) | n/a | 27.21 | 50.7 | 51.3 | 37.3 | 35 |
| OPT-1.3B (300B) | n/a | 24.96 | 42.25 | 51.29 | 35.82 | 38.67 |
| Pythia-1.4B (300B) | n/a | 25.56 | 53.52 | 49.48 | 32.76 | 38.66 |
| MPT-1.3B (200B) | n/a | 25.82 | 40.85 | 50.52 | 35.93 | 38.68 |
| Sheared LLaMA-1.3B (50B) | LLaMA2-7B | 25.71 | 49.3 | 50.98 | 37.94 | 37.14 |
| Ours-1.5B (1B) | OpenLLaMA-3B | 25.67 | 43.66 | 49.41 | 34.42 | 48.61 |
[2024-04-22] We've released the first version of the Inheritune codebase for both the low-data and full-data regimes.
[2024-04-22] We've enabled the Discussions tab at the top for community feedback. Feel free to suggest new experiments and post your results.
If you find this work helpful, please consider citing us:
@inproceedings{Sanyal2024pretraining,
title = {Inheritune: Training Smaller Yet More Attentive Language Models},
author = {Sunny Sanyal and Ravid Shwartz-Ziv and Alexandros G. Dimakis and Sujay Sanghavi},
year = {2024},
url={https://arxiv.org/abs/2404.08634}
}
The training code for the 1B-2B small language models is mainly adapted from litgpt. The code for the GPT-2 experiments is mainly adapted from Sophia and nanoGPT.
The llama image was created using DALL·E.