ChatGPT has three stages, based on InstructGPT, which was the last open publication on this topic from OpenAI. The first stage is Supervised Fine-tuning (or Instruction Tuning), where the model is trained on different tasks (InstructGPT was trained on 30K tasks) in a supervised, autoregressive way. In the second stage we train a reward model, and in the last stage we perform RL with human feedback using Proximal Policy Optimization (PPO). Here's a diagram from the InstructGPT paper:
- Add DeepSpeed optimizations
- Serve a chat-based application
src
|_config
  |_configs.py            # Training and model configs
|_experiments
  |_main.py               # some scratch code to quickly test something
|_inference
|_models
  |_gpt.py                # GPT-2 implementation with LoRA
  |_llama.py              # wish I could have more time to test with LLaMA
|_training
  |_train_ppo.py          # training script for PPO
  |_train_rm.py           # training script for the Reward Model
  |_train_sft.py          # training script for the SFT model
  |_trainers.py           # the actual training loops and other trainer utilities, such as saving states
|_utils
  |_dataset.py            # multiple dataset definitions
  |_evaluate.py           # evaluate the generations with ChatGPT
  |_loss.py               # loss functions used in the different training stages
  |_prepare_sft_dataset.py
  |_tokenizer.py          # tokenizers in a unified class
requirements.txt          # dependencies without PyTorch! Install your own PyTorch 2.0 nightly.
ChatGPT (or InstructGPT, which is the last open publication on this topic from OpenAI) has three stages: Supervised Fine-tuning, Reward Model Training, and RL with Human Feedback (using Proximal Policy Optimization). Here's a diagram from the InstructGPT paper:
- You need a GPU with at least 16GB of VRAM, CUDA 11.7+, and Python 3.8 or higher (other versions of CUDA and Python might work, but I didn't test them; PyTorch 2.0 is quite demanding).
- Install PyTorch 2.0
- Install dependencies with
pip install -r requirements.txt
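Optionally, you can sanity-check the environment with a short Python snippet before training (this only checks versions and VRAM; the actual requirements are the ones listed above):

```python
import torch

# Quick sanity check for the setup above: PyTorch 2.0+ and a CUDA-capable GPU.
print(torch.__version__)              # should start with "2."
print(torch.cuda.is_available())      # should be True
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 1024**3, 1), "GB VRAM")  # want >= 16 GB
```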
- The first step is to train an SFT model. Inside the `src` directory, run the command below. You can change the batch size via `-b`; the more VRAM you have, the larger the batch size you can afford. Optionally, you can open tensorboard.ipynb to check the training progress.
python training/train_sft.py -n experiment_name -b 2
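For context, the SFT stage is ordinary supervised autoregressive training (next-token prediction) on the instruction data. A minimal sketch of the loss, with illustrative names rather than the exact loss.py code:

```python
import torch.nn.functional as F

def sft_loss(logits, input_ids):
    # logits: (B, T, vocab_size) from the GPT model; input_ids: (B, T)
    # Shift so that position t predicts token t+1, then take cross-entropy.
    # (A real implementation would also ignore padding tokens.)
    logits = logits[:, :-1, :].contiguous()
    targets = input_ids[:, 1:].contiguous()
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```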
- Once you have finished the SFT stage, you can start to train the reward model. You should have a directory starting with `sft_` in your `runs` directory. Find the final model weights and run the command below. This should start a reward model training for 1 epoch and generate a directory starting with `rm_` with the weights in it.
python training/train_rm.py -b 2 -n experiment_name -p "./runs/path/to/your/weights"
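The reward model is trained with a pairwise ranking loss, as in the InstructGPT paper: the scalar reward of the human-preferred completion should be higher than that of the rejected one. A minimal sketch, with illustrative names rather than the exact loss.py code:

```python
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    # chosen_rewards / rejected_rewards: (B,) scalar scores the reward model
    # assigns to the preferred and rejected completions of the same prompt.
    # Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()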
- Finally, you can start the RLHF stage with the reward model and the SFT model you got from the previous two steps, using the command below. Because the training is sometimes unstable, I stopped early at around 12K steps with a batch size of 1. The final weights will be in a directory starting with `ppo_`.
python training/train_ppo.py -b 2 -n experiment_name -a "./runs/path/to/sft/weights" -c "./runs/path/to/reward_model/weights" -s naive
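The exact objective in train_ppo.py may include extra terms (e.g., a KL penalty against the SFT model and a value loss), but the core of PPO is the clipped surrogate policy loss. A minimal sketch with illustrative names, not the exact trainers.py code:

```python
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_ratio=0.2):
    # logprobs / old_logprobs: (B,) log-probabilities of the sampled tokens under
    # the current and the old (rollout) policy; advantages: (B,) advantage estimates
    # derived from the reward model scores.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # Take the pessimistic (minimum) of the unclipped and clipped objectives.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```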
- You need an OpenAI account with an API key
- Put your key into a JSON file called "openai.key". It should be a dictionary with a key called "OPENAI_API_KEY" whose value is your key.
- Inside `src`, run this:
python utils/evaluate.py -s "/path/to/sft/model" -p "/path/to/ppo/model"
- This should generate the "human" preference over each model's outputs, using ChatGPT as a proxy.
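For reference, the key file is plain JSON, so loading it looks roughly like this (illustrative; evaluate.py may read it differently):

```python
import json

# Example "openai.key" contents: {"OPENAI_API_KEY": "sk-..."}
with open("openai.key") as f:
    api_key = json.load(f)["OPENAI_API_KEY"]
```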
Based on NanoGPT
Structure
GPT
  __init__
    TransformerDecoder
    lm_head
  forward(x)
    return lm_head(TransformerDecoder(x))
  from_checkpoint
  from_pretrained
  generate
  batch_generate

TransformerDecoder
  __init__()
    # Layers
    # token_embedding_layer
    # position_embedding_layer
    # input_dropout
    # decoder_blocks = [TransformerDecoderBlock(cfg) for _ in range(cfg.n_layers)]
    # ln = LayerNorm
  forward(x, attention_mask):
    # pos, token_embeddings, pos_embeddings
    x = input_dropout(token_embeddings + pos_embeddings)
    # Stack N blocks
    for block in self.decoder_blocks:
        x = block(x)
    y = ln(x)
    return y

TransformerDecoderBlock
  __init__()
    # Layers
    # ln1, ln2 = LayerNorm
    # mmsa = MaskedMultiheadSelfAttention
    # ffn = FeedForwardNetworks
  forward(x, attention_mask):
    # identity1 = x  # for residual connection
    # x = identity1 + mmsa(ln1(x))
    # identity2 = x
    # x = identity2 + ffn(ln2(x))
    # return x

MaskedMultiheadSelfAttention
  __init__
    Q, K, V, attention_dropout, output_dropout, mask
  forward(x, attention_mask):
    # x3 <- qkv_projection(x)
    # Q, K, V = x3.split()
    # attention = Q @ K.transpose(2, 3) / sqrt(d_k)
    # apply attention_mask to attention
    # attention_dropout
    # weighted_value = attention @ V
    # project weighted_value linearly to get output
    # y = output_projection(weighted_value)
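To make the outline above concrete, here is a minimal PyTorch sketch of the masked multi-head self-attention module. The actual gpt.py adds LoRA and padding-mask handling; names like `n_embd`, `n_heads`, and `block_size` are illustrative config fields, not necessarily the ones in configs.py.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiheadSelfAttention(nn.Module):
    """Causal multi-head self-attention, following the outline above."""

    def __init__(self, n_embd, n_heads, block_size, dropout=0.1):
        super().__init__()
        assert n_embd % n_heads == 0
        self.n_heads = n_heads
        self.qkv_projection = nn.Linear(n_embd, 3 * n_embd)   # produces Q, K, V in one matmul
        self.output_projection = nn.Linear(n_embd, n_embd)
        self.attention_dropout = nn.Dropout(dropout)
        self.output_dropout = nn.Dropout(dropout)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        # x3 <- qkv_projection(x); Q, K, V = x3.split()
        q, k, v = self.qkv_projection(x).split(C, dim=2)
        # Reshape each to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        # attention = Q @ K^T / sqrt(d_k), then apply the causal mask
        att = (q @ k.transpose(2, 3)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = self.attention_dropout(F.softmax(att, dim=-1))
        # weighted_value = attention @ V, then merge heads and project
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.output_dropout(self.output_projection(y))
```

For example, `MaskedMultiheadSelfAttention(n_embd=768, n_heads=12, block_size=1024)` matches GPT-2 small's dimensions.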