# HW 5 - Large Language Model

## Team Members

  1. Ankit Shibusam - [email protected]
  2. Atharva Anand Joshi - [email protected]
  3. Ketan Ramaneti - [email protected]

## Start Training

1. Install the required dependencies on your local machine:

```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm pytorch-ignite
```
2. Generate the train and val data by running `data/openwebtext/prepare.py`. This script fetches the OpenWebText data, performs a train-val split, and applies sub-word level tokenization using tiktoken. Finally, it saves the processed train and val data in the `data/` folder.

```bash
$ python3 data/openwebtext/prepare.py
```
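The exact contents of `prepare.py` are not reproduced in this README; the sketch below only illustrates a typical tiktoken-based preparation pipeline of this kind. The split ratio, output file names, and dtype are assumptions, not the script's actual values.

```python
# Illustrative sketch only; the repository's data/openwebtext/prepare.py may differ.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE sub-word tokenizer

dataset = load_dataset("openwebtext", split="train")
split = dataset.train_test_split(test_size=0.0005, seed=2357)  # assumed split ratio

for name, ds in [("train", split["train"]), ("val", split["test"])]:
    ids = []
    for example in ds:
        tokens = enc.encode_ordinary(example["text"])
        tokens.append(enc.eot_token)  # delimit documents with the end-of-text token
        ids.extend(tokens)
    np.array(ids, dtype=np.uint16).tofile(f"data/{name}.bin")  # GPT-2 ids fit in uint16
```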
3. Run the pretraining by simply running the train script. The training configuration can be set in `config.py`.

```bash
python3 train.py
```
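This README does not list the fields of `config.py`; the snippet below is a hypothetical example of the kind of hyperparameters such a training config usually exposes. Every name and value here is an assumption, not the repository's real setting.

```python
# Hypothetical config.py contents; the real field names and defaults may differ.
batch_size = 12        # sequences per optimization step
block_size = 1024      # context length in tokens
n_layer = 12           # number of transformer blocks
n_head = 12            # attention heads per block
n_embd = 768           # embedding dimension
learning_rate = 6e-4   # peak learning rate
max_iters = 600000     # total training iterations
wandb_log = False      # enable Weights & Biases logging
```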
4. For the finetuning tasks, set up the data by running the commands below:

```bash
python data/cnn_dailymail/prepare.py
python data/squad/prepare.py
```
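Both prepare scripts presumably fetch their datasets through the HuggingFace `datasets` library, which is in the dependency list; the snippet below only shows what that fetch looks like, and the actual scripts will additionally format and tokenize the examples for finetuning.

```python
# Illustration only: pulling the two finetuning datasets with HuggingFace datasets.
from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0")  # summarization: article -> highlights
squad = load_dataset("squad")                 # extractive QA: context + question -> answer

print(cnn["train"][0]["highlights"][:200])
print(squad["train"][0]["question"])
```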
5. Set the correct file names and the required config variables in `finetune_config.py` and `config.py`. The other fields can be left untouched, but the file paths will need to be modified.

6. Trained model checkpoints can be downloaded from this directory: https://drive.google.com/drive/folders/13nobcjJdx2svWk4mJ8Xj_gO3p9V9I4AZ?usp=sharing
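Below is a minimal sketch of inspecting a downloaded checkpoint, assuming the files are standard PyTorch `.pt` checkpoints; the file name and the dictionary layout (for example a `model` key holding the state dict) are assumptions, so check the keys before loading the weights into the model defined in this repository.

```python
import torch

# Assumes a standard PyTorch checkpoint file; the file name and key layout are
# assumptions, so inspect the contents before loading.
ckpt = torch.load("ckpt.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. model / optimizer / config entries
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
# model.load_state_dict(state_dict)  # with the model constructed as in train.py
```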

## TODO

1. Set up support for distributed training.

2. Write code for sequential unfreezing.

## References

nanoGPT - https://github.com/karpathy/nanoGPT/tree/master. This repository was referenced when creating the LLM model.