This repo rebuilds the OpenAI GPT-2 (124M) model.
GPT is a decoder-only Transformer, so its structure follows the decoder stack described in the Attention Is All You Need paper.
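The sketch below shows what one such decoder block looks like in PyTorch. The class names and sizes (n_embd=768, n_head=12 for the 124M config) are illustrative, not the exact code in this repo, and like GPT-2 it uses pre-LayerNorm (normalization before each sub-layer) rather than the original paper's post-LayerNorm.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused query/key/value projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hd)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # masked ("causal") attention: each position attends only to itself and the past
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-LayerNorm decoder block: attention and MLP, each wrapped in a residual."""
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```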
The original OpenAI blog post, Better language models and their implications, links to the paper Language Models are Unsupervised Multitask Learners and to the gpt2 GitHub repo.
Besides the GPT-2 paper, this repo also references the GPT-3 paper, Language Models are Few-Shot Learners.
The model training, optimization, and hyperparameter tuning follow both papers, and the attention computation uses Flash Attention as described in FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness and FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
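In PyTorch, the practical way to get these kernels is torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention backend when the GPU and dtype allow it. The snippet below is a minimal sketch of that causal attention call with shapes matching the 124M config.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, n_head, seq_len, head_dim)
    # is_causal=True applies the autoregressive mask inside the fused kernel;
    # on supported GPUs/dtypes this avoids materializing the full (T, T) attention matrix.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example shapes for the 124M config: 12 heads, head_dim 64, block size 1024.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32  # Flash kernels need half precision
q = k = v = torch.randn(1, 12, 1024, 64, device=device, dtype=dtype)
y = causal_attention(q, k, v)  # (1, 12, 1024, 64)
```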
The dataset used to train the model is the 10BT sample of FineWeb-Edu from Hugging Face.
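A minimal sketch of streaming that sample with the Hugging Face datasets library is shown below; the dataset id HuggingFaceFW/fineweb-edu and config name sample-10BT are the ones published on the Hub and are assumed here, not copied from this repo's data pipeline.

```python
from datasets import load_dataset

# Stream the 10BT sample so the full dataset never has to fit on disk at once.
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for doc in fw:
    text = doc["text"]  # raw document text, to be tokenized into training shards
    break
```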
Model evaluation uses the HellaSwag LLM benchmark; the model achieves 30.68% accuracy, 1.13% higher than the original GPT-2 (124M), while training on a dataset only about 10% the size.
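For a base (non-finetuned) model, HellaSwag is commonly scored by taking each example's four candidate endings and picking the one whose completion tokens receive the lowest average cross-entropy loss under the model. The sketch below illustrates that rule with placeholder model and tokenizer objects; it is an assumption about the evaluation procedure, not this repo's exact code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, tokenizer, context, endings, device="cpu"):
    """Return the index of the candidate ending with the lowest average per-token loss."""
    ctx_ids = tokenizer.encode(context)          # placeholder tokenizer (e.g. GPT-2 BPE)
    avg_losses = []
    for ending in endings:
        end_ids = tokenizer.encode(" " + ending)
        ids = torch.tensor([ctx_ids + end_ids], device=device)
        logits = model(ids)                      # assumed shape: (1, T, vocab_size)
        # shift so that position t predicts token t+1
        shift_logits = logits[:, :-1, :]
        shift_targets = ids[:, 1:]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
            reduction="none",
        ).view(1, -1)
        # only score the tokens belonging to the candidate ending
        ending_loss = loss[:, len(ctx_ids) - 1:]
        avg_losses.append(ending_loss.mean().item())
    return min(range(len(endings)), key=lambda i: avg_losses[i])
```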
Since this repo uses PyTorch, the Hugging Face GPT-2 implementation is also referenced.
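For example, the pretrained GPT-2 weights can be pulled through the transformers library and compared against the from-scratch module layout as a sanity check; the snippet below is a small sketch of that comparison.

```python
from transformers import GPT2LMHeadModel

# Load the 124M checkpoint published by OpenAI via Hugging Face.
hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
sd_hf = hf_model.state_dict()

# Inspect parameter names/shapes to match them against a local implementation.
# Note: the original GPT-2 weights use Conv1D layers, so some weight matrices
# are stored transposed relative to nn.Linear and must be transposed when copied.
for name, tensor in list(sd_hf.items())[:5]:
    print(name, tuple(tensor.shape))
```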