NeurIPS 2023 LLM Efficiency Challenge Quickstart Guide

The NeurIPS 2023 LLM Efficiency Challenge is a competition focused on finetuning one LLM for 24 hours on a single GPU. The team with the best LLM gets to present their results at NeurIPS 2023.

This quickstart guide illustrates the main steps for getting started with Lit-GPT, which was selected as the competition's official starter kit.

Competition Facts

Permitted GPUs:

1x A100 (40 GB RAM);
1x RTX 4090 (24 GB RAM).

Permitted models:

All transformer-based LLM base models that are not finetuned yet.

The subset of Lit-GPT models supported in this competition is listed in the table below. These don't include models that have been finetuned or otherwise aligned, as per the rules of the challenge.

Models in Lit-GPT	Reference
Meta AI Llama 2 Base	Touvron et al. 2023
TII UAE Falcon Base	TII 2023
OpenLM Research OpenLLaMA	Geng & Liu 2023
EleutherAI Pythia	Biderman et al. 2023
StabilityAI StableLM Base	Stability AI 2023

Permitted datasets

Any open-source dataset is allowed. Originally, per competition rules, datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were recently softened to also allow LLM-generated datasets if those datasets are made available and if their use does not violate the usage restrictions and guidelines of the LLM that generated them. If you plan to use a dataset that is not explicitly listed on the challenge website, or want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that this is in line with the competition rules.

Examples of permitted datasets are the following:

Databricks-Dolly-15k
OpenAssistant Conversations Dataset (oasst1)
The Flan Collection

You are allowed to create your own datasets if they are made publicly accessible under an open-source license, and they are not generated from other LLMs (even open-source ones).

Helpful competition rules relevant to the dataset choice:

The maximum prompt/completion length the models are expected to handle is 2048 tokens.
The evaluation will be on English texts only.

Submission deadline

October 25, 2023 (Please check the official website in case of updates.)

Lit-GPT Setup

Use the following steps to set up the Lit-GPT repository on your machine.

git clone https://github.com/Lightning-AI/lit-gpt
cd lit-gpt
pip install -r requirements.txt tokenizers sentencepiece huggingface_hub

Downloading Model Checkpoints

This section explains how to download the StableLM 3B Base model, one of the smallest models supported in Lit-GPT (an even smaller supported model is Pythia, which starts at 70M parameters). The downloaded and converted checkpoints will occupy approximately 28 GB of disk space.

python scripts/download.py \
  --repo_id stabilityai/stablelm-base-alpha-3b

python scripts/convert_hf_checkpoint.py \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b

While StableLM 3B Base is useful as a first starter model to set things up, you may want to use the more capable Falcon 7B or Llama 2 7B/13B models later. See the download_* tutorials in Lit-GPT to download other model checkpoints.

After downloading and converting the model checkpoint, you can test the model via the following command:

python generate/base.py \
  --prompt "LLM efficiency competitions are fun, because" \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b

Downloading and Preparing Datasets

The following command will download and preprocess the Dolly15k dataset for the StableLM 3B Base model:

python scripts/prepare_dolly.py \
  --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
  --destination_path data/dolly-stablelm3b
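To give a rough idea of what such a preparation step does before tokenization, here is a minimal sketch. It assumes that a Dolly record (fields instruction, context, response) is mapped to an Alpaca-style instruction/input/output example and wrapped in an Alpaca-style prompt template. The helper names and the sample record are hypothetical, and the actual script may differ in its details.

```python
from typing import Dict


def to_alpaca_style(record: Dict[str, str]) -> Dict[str, str]:
    """Rename Dolly fields to the instruction/input/output layout (assumed mapping)."""
    return {
        "instruction": record["instruction"],
        "input": record.get("context", ""),
        "output": record["response"],
    }


def generate_prompt(example: Dict[str, str]) -> str:
    """Build an Alpaca-style prompt; the input section is omitted when empty."""
    if example["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n### Response:"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n### Response:"
    )


# Hypothetical record, made up for illustration only.
record = {
    "instruction": "Summarize the text.",
    "context": "Lit-GPT is a repository for working with LLMs.",
    "response": "Lit-GPT is an LLM repository.",
    "category": "summarization",
}
example = to_alpaca_style(record)
prompt = generate_prompt(example)
print(prompt)
```

The prompt and the expected output are then tokenized with the tokenizer from the checkpoint directory.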

Note

The preprocessed dataset is specific to the StableLM 3B model. If you use a different model like Falcon or Llama 2 later, you'll need to process the dataset with that model checkpoint directory. This is because each model uses a different tokenizer.

Finetuning

Low-rank Adaptation (LoRA) is a good choice for a first finetuning run. The Dolly dataset has ~15k samples, and the finetuning might take half an hour.
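To make the appeal of LoRA more concrete, the following sketch compares the number of trainable parameters for a single weight matrix under full finetuning and under LoRA, where the weight update is factored into two low-rank matrices. The layer size and rank are illustrative assumptions, not Lit-GPT defaults.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical example: one square 4096 x 4096 projection with rank 8.
d = 4096
r = 8
full = full_params(d, d)
lora = lora_params(d, d, r)
print(f"full finetuning: {full:,} trainable parameters")
print(f"LoRA (r={r}): {lora:,} trainable parameters")
print(f"LoRA trains {100 * lora / full:.2f}% of the matrix")
```

Because only the small factors receive gradients and optimizer state, LoRA typically needs far less memory than updating all weights.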

To accelerate this for testing purposes, edit the ./finetune/lora.py script and change max_iters = 50000 to max_iters = 500 at the top of the file.

Note

The Dolly dataset has a relatively long context length, which could result in out-of-memory issues. The maximum context length that is used for the evaluation, according to the official competition rules, is 2,048 tokens. Hence, it's highly recommended to prepare the dataset with a fixed max length, for example, python scripts/prepare_dolly.py --max_seq_length 2048.
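Conceptually, a fixed maximum length caps every tokenized sample at that many tokens. The sketch below illustrates the idea on placeholder token IDs. The function name is hypothetical, and the preparation script may implement the cap differently.

```python
from typing import List

MAX_SEQ_LENGTH = 2048  # evaluation limit stated in the competition rules


def truncate(token_ids: List[int], max_seq_length: int = MAX_SEQ_LENGTH) -> List[int]:
    """Keep at most max_seq_length tokens from the start of the sequence."""
    return token_ids[:max_seq_length]


# Hypothetical tokenized samples of different lengths (the IDs are placeholders).
samples = [list(range(100)), list(range(2048)), list(range(5000))]
capped = [truncate(s) for s in samples]
print([len(s) for s in capped])
```

Short samples pass through unchanged, while overly long ones no longer inflate peak memory during finetuning.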

The following command finetunes the model:

CUDA_VISIBLE_DEVICES=0 python finetune/lora.py \
  --data_dir data/dolly-stablelm3b \
  --checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b" \
  --out_dir "out/stablelm3b/dolly/lora/experiment1" \
  --precision "bf16-true"

With 500 iterations, this takes approximately 1-2 min on an A100 and uses 26.30 GB GPU memory.

If you are using an RTX 4090, change micro_batch_size=4 to micro_batch_size=1 so that the model will only use 12.01 GB of memory.
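A smaller micro batch lowers peak memory because fewer samples are processed per forward and backward pass. The sketch below assumes the script keeps the effective batch size fixed by accumulating gradients over several micro batches. The value batch_size = 128 is an assumption for illustration, so check the top of finetune/lora.py for the actual settings.

```python
def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro batches whose gradients are summed before one optimizer step."""
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must be divisible by micro_batch_size")
    return batch_size // micro_batch_size


batch_size = 128  # assumed effective batch size, not confirmed by this guide
for micro_batch_size in (4, 1):
    steps = accumulation_steps(batch_size, micro_batch_size)
    print(f"micro_batch_size={micro_batch_size}: {steps} accumulation steps per optimizer step")
```

Under this assumption, the trade-off is runtime rather than model quality: each optimizer step sees the same number of samples but needs more forward and backward passes.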

(More finetuning settings are explained here.)

Local Evaluation

The official competition evaluation will use a small subset of HELM tasks, which includes BigBench (general), MMLU (knowledge), TruthfulQA (knowledge and harm in a multiple-choice format), CNN/DailyMail (news summarization), GSM8K (math), and BBQ (bias).

HELM is currently also being integrated into Lit-GPT to evaluate LLMs before submission.

However, a tool with a more convenient interface is EleutherAI's Evaluation Harness, which contains some tasks that overlap with HELM, for example, BigBench, TruthfulQA, and GSM8K. We can set up the Evaluation Harness as follows:

pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@master

And then we can use it via the following command:

python eval/lm_eval_harness.py \
  --checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b" \
  --precision "bf16-true" \
  --eval_tasks "[truthfulqa_mc,gsm8k]" \
  --save_filepath "results-stablelm-3b.json"

(You can find a full task list in the task table here.)
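Once the evaluation has finished, the saved JSON file can be inspected with a few lines of Python. The sketch below assumes the file contains a top-level "results" mapping of the form {task: {metric: value}}, with standard errors stored under keys ending in "_stderr". That layout is an assumption, and the numbers are placeholders rather than real results.

```python
import json
from typing import Dict


def summarize(results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Flatten {task: {metric: value}} into {"task/metric": value}, skipping stderr entries."""
    flat = {}
    for task, metrics in results.items():
        for name, value in metrics.items():
            if not name.endswith("_stderr"):
                flat[f"{task}/{name}"] = value
    return flat


# Placeholder data in the assumed layout; these values are made up.
example = json.loads("""
{"results": {"truthfulqa_mc": {"mc1": 0.25, "mc1_stderr": 0.01, "mc2": 0.4},
             "gsm8k": {"acc": 0.02, "acc_stderr": 0.004}}}
""")
summary = summarize(example["results"])
for key, value in sorted(summary.items()):
    print(f"{key}: {value}")
```

To read the actual file, load it with json.load from the path passed to --save_filepath and pass its "results" entry to summarize, provided the layout matches.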

To evaluate a LoRA-finetuned model, you need to first merge the LoRA weights with the base model to create a new checkpoint file:

python scripts/merge_lora.py \
  --checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b/" \
  --lora_path "out/stablelm3b/dolly/lora/experiment1/lit_model_lora_finetuned.pth" \
  --out_dir "out/lora_merged/stablelm-base-alpha-3b/"

cp checkpoints/stabilityai/stablelm-base-alpha-3b/*.json \
out/lora_merged/stablelm-base-alpha-3b/

For more information on LoRA weight merging, please see the Merging LoRA Weights section of the LoRA finetuning documentation.
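The idea behind merging can be illustrated with toy matrices. Following the standard LoRA formulation, the low-rank update (alpha / r) * B @ A is added to the frozen weight W once, so that inference needs only a single matrix multiplication. The sketch below uses made-up values and pure Python, and is not taken from the merge script itself.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a: Matrix, s: float) -> Matrix:
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


# Toy values: a 2x2 base weight W, LoRA factors B (2x1) and A (1x2), rank r = 1.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]
A = [[0.5, -1.0]]
alpha, r = 2.0, 1
x = [[1.0], [1.0]]  # column-vector input

# Unmerged: base output plus the scaled low-rank update, computed separately.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged: fold the update into the weight once, then do a single matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths produce the same output, which is why a merged checkpoint can be evaluated like any regular base model checkpoint.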

After merging the weights, we can run lm_eval_harness.py as before. The only difference is that --checkpoint_dir now points to the new folder containing the merged LoRA model:

python eval/lm_eval_harness.py \
  --checkpoint_dir "out/lora_merged/stablelm-base-alpha-3b" \
  --precision "bf16-true" \
  --eval_tasks "[truthfulqa_mc,gsm8k]" \
  --save_filepath "results-stablelm-3b.json"

Submission

You will be required to submit a Docker image for the submission itself. Fortunately, the organizers have a GitHub repository with the exact steps here and a toy-submission setup guide to test your model locally before submission.

Additional Information & Resources

The official NeurIPS 2023 LLM Efficiency Challenge competition website
A more extensive guide, including environment setup tips: The NeurIPS 2023 LLM Efficiency Challenge Starter Guide
Official competition Discord and Lightning AI + Lit-GPT Discord
LoRA vs Adapter vs Adapter v2 comparison in Lit-GPT using Falcon 7B: Finetuning Falcon LLMs More Efficiently With LoRA and Adapters
Dealing with out-of-memory (OOM) errors in Lit-GPT
Introduction to Fabric (an API to access more advanced PyTorch features used in Lit-GPT) and memory saving tips: Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch
