DEMO: minGPT on tinygrad #138

Open · wants to merge 36 commits into base: master
Changes from all commits · 36 commits
bbd2120
rename main code folder
ziliangpeng Apr 26, 2024
149c1b2
rename a bunch of code ref
ziliangpeng Apr 26, 2024
b47cf06
git ignore some files
ziliangpeng Apr 26, 2024
5638752
Merge pull request #1 from ziliangpeng/v--rename
ziliangpeng Apr 26, 2024
eb3aba3
use proxy DataLoader
ziliangpeng Apr 27, 2024
114f2b6
Merge pull request #2 from ziliangpeng/v--dataloader
ziliangpeng Apr 27, 2024
4698a9f
wip
ziliangpeng Apr 27, 2024
885d3ac
some more stupid wip
ziliangpeng Apr 27, 2024
c4abaae
wip
ziliangpeng Apr 27, 2024
29bdc7d
a version that almost runs except a few tensors without grad
ziliangpeng Apr 27, 2024
c24d17a
a little bit of cleanup
ziliangpeng Apr 27, 2024
2feca7f
bring back the dropout
ziliangpeng Apr 28, 2024
a1bf2b4
print total params count
ziliangpeng Apr 28, 2024
63ee708
fixed the tinygrad no-grad issue. moving bias as local var
ziliangpeng Apr 28, 2024
4283874
cancel decay
ziliangpeng Apr 28, 2024
6ee157e
cancel init_w
ziliangpeng Apr 28, 2024
b49cbd9
some cleanup
ziliangpeng Apr 28, 2024
2b1f9ba
some cleanup of adder
ziliangpeng Apr 28, 2024
230ceae
add simple NOTE
ziliangpeng Apr 28, 2024
9568f28
cancel clip_grad_norm_
ziliangpeng Apr 28, 2024
2a22fff
super small cleanup
ziliangpeng Apr 28, 2024
db3301a
tidy import
ziliangpeng Apr 28, 2024
c447ee0
fix the no_grad bias issue
ziliangpeng Apr 28, 2024
8405765
bug fix
ziliangpeng Apr 28, 2024
d81e154
more cleanup
ziliangpeng Apr 28, 2024
45c1a51
move comment
ziliangpeng Apr 28, 2024
146dfb5
nit
ziliangpeng Apr 28, 2024
cf4549e
Merge pull request #3 from ziliangpeng/tinygradify
ziliangpeng Apr 28, 2024
8d17e92
update README
ziliangpeng Apr 28, 2024
5485999
organize top level files and metadata
ziliangpeng Apr 28, 2024
34f36c8
Merge pull request #4 from ziliangpeng/v--README
ziliangpeng Apr 28, 2024
0db60b7
awkward typo
ziliangpeng Apr 28, 2024
b897968
hotz
ziliangpeng Apr 28, 2024
751cc86
remove torch DataLoader
nuro-v Sep 29, 2024
43acada
remove torch in adder
nuro-v Oct 5, 2024
a5bc468
Merge pull request #5 from ziliangpeng/v--remove-dataloader
ziliangpeng Oct 5, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ __pycache__/
*.swp
.env
.pylintrc
tinyGPT.egg-info/
1 change: 1 addition & 0 deletions LICENSE
@@ -1,4 +1,5 @@
The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy
The MIT License (MIT) Copyright (c) 2024 Ziliang Peng

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

143 changes: 15 additions & 128 deletions README.md
@@ -1,146 +1,33 @@
# You like karpathy? You like geohot? You love tinyGPT! ❤️

# minGPT
![tinyGPT](hotzilla.avif)

![mingpt](mingpt.jpg)
tinyGPT is an attempt to port karpathy's minGPT to geohot's tinygrad. It serves a few purposes:
- Demonstrate API compatibility and the differences between PyTorch and tinygrad
- Identify missing features/APIs in tinygrad
- Benchmark and compare performance

A PyTorch re-implementation of [GPT](https://github.com/openai/gpt-2), both training and inference. minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can be a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code (see [mingpt/model.py](mingpt/model.py)). All that's going on is that a sequence of indices feeds into a [Transformer](https://arxiv.org/abs/1706.03762), and a probability distribution over the next index in the sequence comes out. The majority of the complexity is just being clever with batching (both across examples and over sequence length) for efficiency.
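To make that input/output contract concrete, here is a minimal stand-in sketch (a toy embedding-plus-linear model with assumed shapes and names, not the actual `mingpt/model.py` Transformer):

```python
# Toy stand-in for the interface described above (not mingpt code):
# a (B, T) batch of token indices goes in, a distribution over the next index comes out.
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 64
toy = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

idx = torch.randint(0, vocab_size, (4, 16))   # (B, T) integer token indices
logits = toy(idx)                             # (B, T, vocab_size)
probs = logits[:, -1, :].softmax(dim=-1)      # distribution over the next token
print(probs.shape)                            # torch.Size([4, 50257])
```
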
### Library Installation and Test

**note (Jan 2023)**: though I may continue to accept and change some details, minGPT is in a semi-archived state. For more recent developments see my rewrite [nanoGPT](https://github.com/karpathy/nanoGPT). Basically, minGPT became referenced across a wide variety of places (notebooks, blogs, courses, books, etc.) which made me less willing to make the bigger changes I wanted to make to move the code forward. I also wanted to change the direction a bit, from a sole focus on education to something that is still simple and hackable but has teeth (reproduces medium-sized industry benchmarks, accepts some tradeoffs to gain runtime efficiency, etc).

The minGPT library is three files: [mingpt/model.py](mingpt/model.py) contains the actual Transformer model definition, [mingpt/bpe.py](mingpt/bpe.py) contains a mildly refactored Byte Pair Encoder that translates between text and sequences of integers exactly like OpenAI did in GPT, [mingpt/trainer.py](mingpt/trainer.py) is (GPT-independent) PyTorch boilerplate code that trains the model. Then there are a number of demos and projects that use the library in the `projects` folder:

- `projects/adder` trains a GPT from scratch to add numbers (inspired by the addition section in the GPT-3 paper)
- `projects/chargpt` trains a GPT to be a character-level language model on some input text file
- `demo.ipynb` shows a minimal usage of the `GPT` and `Trainer` in a notebook format on a simple sorting example
- `generate.ipynb` shows how one can load a pretrained GPT2 and generate text given some prompt

### Library Installation

If you want to `import mingpt` into your project:
If you want to `import tinygpt` into your project:

```
git clone https://github.com/karpathy/minGPT.git
cd minGPT
git clone https://github.com/ziliangpeng/tinyGPT.git
cd tinyGPT
pip install -e .
```

### Usage
After that, you can run the demo project to see the result:

Here's how you'd instantiate a GPT-2 (124M param version):

```python
from mingpt.model import GPT
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2'
model_config.vocab_size = 50257 # openai's model vocabulary
model_config.block_size = 1024 # openai's model block_size (i.e. input context length)
model = GPT(model_config)
```

And here's how you'd train it:

```python
# your subclass of torch.utils.data.Dataset that emits example
# torch LongTensor of lengths up to 1024, with integers from [0,50257)
train_dataset = YourDataset()

from mingpt.trainer import Trainer
train_config = Trainer.get_default_config()
train_config.learning_rate = 5e-4 # many possible options, see the file
train_config.max_iters = 1000
train_config.batch_size = 32
trainer = Trainer(train_config, model, train_dataset)
trainer.run()
```

See `demo.ipynb` for a more concrete example.

### Unit tests

Coverage is not super amazing just yet but:

cd projects/adder
python adder.py
```
python -m unittest discover tests
```

### todos

- add gpt-2 finetuning demo on arbitrary given text file
- add dialog agent demo
- better docs of outcomes for existing projects (adder, chargpt)
- add mixed precision and related training scaling goodies
- distributed training support
- reproduce some benchmarks in projects/, e.g. text8 or other language modeling
- proper logging instead of print statement amateur hour haha
- i probably should have a requirements.txt file...
- it should be possible to load in many other model weights other than just gpt2-\*

### References

Code:

- [openai/gpt-2](https://github.com/openai/gpt-2) has the model definition in TensorFlow, but not the training code
- [openai/image-gpt](https://github.com/openai/image-gpt) has some more modern gpt-3 like modification in its code, good reference as well
- [huggingface/transformers](https://github.com/huggingface/transformers) has a [language-modeling example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling). It is full-featured but as a result also somewhat challenging to trace. E.g. some large functions have as much as 90% unused code behind various branching statements that is unused in the default setting of simple language modeling

Papers + some implementation notes:

#### Improving Language Understanding by Generative Pre-Training (GPT-1)

- Our model largely follows the original transformer work
- We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
- Adam max learning rate of 2.5e-4. (later GPT-3 for this model size uses 6e-4)
- LR decay: increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule (a code sketch follows this list)
- We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
- Since layernorm is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient
- bytepair encoding (BPE) vocabulary with 40,000 merges
- residual, embedding, and attention dropouts with a rate of 0.1 for regularization.
- modified version of L2 regularization proposed in (37), with w = 0.01 on all non bias or gain weights
- For the activation function, we used the Gaussian Error Linear Unit (GELU).
- We used learned position embeddings instead of the sinusoidal version proposed in the original work
- For finetuning: We add dropout to the classifier with a rate of 0.1. learning rate of 6.25e-5 and a batchsize of 32. 3 epochs. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5.
- GPT-1 model is 12 layers and d_model 768, ~117M params
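
The learning-rate schedule above is described only in words; here is a minimal sketch of that warmup-then-cosine shape (the max LR and the 2000-step warmup come from the notes, while the total step count is an assumed placeholder):

```python
# Sketch of the GPT-1 schedule: linear warmup from zero over the first 2000 updates,
# then cosine annealing to 0. total_steps is an assumed placeholder, not from the paper.
import math

def gpt1_lr(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    if step < warmup_steps:
        return max_lr * step / warmup_steps                      # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))   # cosine anneal to 0

print(gpt1_lr(1000), gpt1_lr(2000), gpt1_lr(100_000))            # mid-warmup, peak, ~0
```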

#### Language Models are Unsupervised Multitask Learners (GPT-2)

- LayerNorm was moved to the input of each sub-block, similar to a pre-activation residual network
- an additional layer normalization was added after the final self-attention block.
- modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers. (weird because in their released code i can only find a simple use of the old 0.02... in their release of image-gpt I found it used for c_proj, and even then only for attn, not for mlp. huh. https://github.com/openai/image-gpt/blob/master/src/model.py)
- the vocabulary is expanded to 50,257
- increase the context size from 512 to 1024 tokens
- larger batchsize of 512 is used
- GPT-2 used 48 layers and d_model 1600 (vs. original 12 layers and d_model 768). ~1.542B params
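
A rough back-of-the-envelope check of that ~1.542B figure, using the standard 12·n_layer·d_model² approximation for the Transformer blocks plus the embedding matrices (an estimate, not something computed by the minGPT code):

```python
# Parameter-count estimate for the GPT-2 numbers above:
# each block has ~4*d^2 attention weights and ~8*d^2 MLP weights, i.e. ~12*d^2 per layer.
n_layer, d_model, vocab, n_ctx = 48, 1600, 50257, 1024
blocks = 12 * n_layer * d_model**2            # ~1.47B
embeddings = (vocab + n_ctx) * d_model        # token + position embeddings, ~82M
print(f"~{(blocks + embeddings) / 1e9:.2f}B parameters")   # ~1.56B, near the quoted ~1.542B
```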

#### Language Models are Few-Shot Learners (GPT-3)

- GPT-3: 96 layers, 96 heads, with d_model of 12,288 (175B parameters).
- GPT-1-like: 12 layers, 12 heads, d_model 768 (125M)
- We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein
- we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer
- we always have the feedforward layer four times the size of the bottleneck layer, dff = 4 ∗ dmodel
- all models use a context window of nctx = 2048 tokens.
- Adam with β1 = 0.9, β2 = 0.95, and eps = 10−8
- All models use weight decay of 0.1 to provide a small amount of regularization. (NOTE: GPT-1 used 0.01 I believe, see above)
- clip the global norm of the gradient at 1.0
- Linear LR warmup over the first 375 million tokens. Then use cosine decay for learning rate down to 10% of its value, over 260 billion tokens.
- gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size.
- full 2048-sized time context window is always used, with a special END OF DOCUMENT token delimiter
tinygrad lets you choose the hardware backend via environment variables such as `CLANG=1`, `CUDA=1`, or `METAL=1`.

#### Generative Pretraining from Pixels (Image GPT)
You can also set a `DEBUG=` level for increasingly verbose debug logging; see the sketch below. Refer to the [tinygrad environment variable docs](https://github.com/tinygrad/tinygrad/blob/master/docs-legacy/env_vars.md) for the full list of controls.
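
For example, here is a minimal way to exercise these settings from Python (which backends are actually available depends on your hardware and tinygrad version; the values below are only illustrative):

```python
# Illustrative only: tinygrad reads these variables when it initializes,
# so set them before importing tinygrad, or export them in your shell instead.
import os
os.environ.setdefault("CLANG", "1")   # CPU backend; swap for METAL=1 or CUDA=1 if available
os.environ.setdefault("DEBUG", "2")   # higher values print more scheduling/kernel detail

from tinygrad.tensor import Tensor

print((Tensor([1.0, 2.0]) + Tensor([3.0, 4.0])).numpy())   # expected: [4. 6.]
```

The shell equivalent is simply prefixing the run, e.g. `DEBUG=2 CLANG=1 python adder.py`.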

- When working with images, we pick the identity permutation πi = i for 1 ≤ i ≤ n, also known as raster order.
- we create our own 9-bit color palette by clustering (R, G, B) pixel values using k-means with k = 512.
- Our largest model, iGPT-XL, contains L = 60 layers and uses an embedding size of d = 3072 for a total of 6.8B parameters.
- Our next largest model, iGPT-L, is essentially identical to GPT-2 with L = 48 layers, but contains a slightly smaller embedding size of d = 1536 (vs 1600) for a total of 1.4B parameters.
- We use the same model code as GPT-2, except that we initialize weights in the layerdependent fashion as in Sparse Transformer (Child et al., 2019) and zero-initialize all projections producing logits.
- We also train iGPT-M, a 455M parameter model with L = 36 and d = 1024
- iGPT-S, a 76M parameter model with L = 24 and d = 512 (okay, and how many heads? looks like the Github code claims 8)
- When pre-training iGPT-XL, we use a batch size of 64 and train for 2M iterations, and for all other models we use a batch size of 128 and train for 1M iterations.
- Adam with β1 = 0.9 and β2 = 0.95
- The learning rate is warmed up for one epoch, and then decays to 0
- We did not use weight decay because applying a small weight decay of 0.01 did not change representation quality.
- iGPT-S lr 0.003
- No dropout is used.

### License

Binary file added hotzilla.avif
Binary file not shown.
Binary file removed mingpt.jpg
Binary file not shown.
1 change: 1 addition & 0 deletions projects/adder/.gitignore
@@ -0,0 +1 @@
out/
67 changes: 36 additions & 31 deletions projects/adder/adder.py
@@ -6,13 +6,15 @@
import sys
import json

import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import random
import tinygrad

from mingpt.model import GPT
from mingpt.trainer import Trainer
from mingpt.utils import set_seed, setup_logging, CfgNode as CN
from tinygrad import dtypes
from tinygpt.tinyloader import TinyDataLoader
from tinygpt.model import GPT
from tinygpt.trainer import Trainer
from tinygpt.utils import set_seed, setup_logging, CfgNode as CN
from tinygrad.tensor import Tensor

# -----------------------------------------------------------------------------

@@ -40,7 +42,7 @@ def get_config():

# -----------------------------------------------------------------------------

class AdditionDataset(Dataset):
class AdditionDataset:
"""
Creates n-digit addition problems. For example, if n=2, then an example
addition problem would be to add 85 + 50 = 135. This problem would be
@@ -79,9 +81,10 @@ def __init__(self, config, split):
ndigit = self.config.ndigit
assert ndigit <= 3, "the lines below would be very memory inefficient, in future maybe refactor to support"
num = (10**ndigit)**2 # total number of possible addition problems with ndigit numbers
rng = torch.Generator()
rng.manual_seed(1337)
perm = torch.randperm(num, generator=rng)
random.seed(1337)
perm = list(range(num))
random.shuffle(perm)
perm = tinygrad.tensor.Tensor(perm)
num_test = min(int(num*0.2), 500) # 20% of the whole dataset, or only up to 500
self.ixes = perm[:num_test] if split == 'test' else perm[num_test:]

@@ -95,6 +98,7 @@ def get_block_size(self):
return 3*self.config.ndigit + 1 - 1

def __len__(self):
# return self.ixes.numel()
return self.ixes.nelement()

def __getitem__(self, idx):
@@ -113,8 +117,8 @@ def __getitem__(self, idx):
render = astr + bstr + cstr
dix = [int(s) for s in render] # convert each character to its token index
# x will be input to GPT and y will be the associated expected outputs
x = torch.tensor(dix[:-1], dtype=torch.long)
y = torch.tensor(dix[1:], dtype=torch.long) # predict the next token in the sequence
x = Tensor(dix[:-1], dtype=dtypes.long)
y = Tensor(dix[1:], dtype=dtypes.long) # predict the next token in the sequence
y[:ndigit*2-1] = -1 # we will only train in the output locations. -1 will mask loss to zero
return x, y

@@ -147,10 +151,10 @@ def eval_split(trainer, split, max_batches=None):
ndigit = config.data.ndigit
results = []
mistakes_printed_already = 0
factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]).to(trainer.device)
loader = DataLoader(dataset, batch_size=100, num_workers=0, drop_last=False)
factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]])
# factors = tinygrad.tensor.Tensor([[10**i for i in range(ndigit+1)][::-1]]).to(tmp_device)
loader = TinyDataLoader(dataset, batch_size=100)
for b, (x, y) in enumerate(loader):
x = x.to(trainer.device)
# isolate the first two digits of the input sequence alone
d1d2 = x[:, :ndigit*2]
# let the model sample the rest of the sequence
@@ -172,6 +176,7 @@
print("GPT claims that %d + %d = %d but gt is %d" % (d1i[i], d2i[i], d3i_pred[i], d3i_gt[i]))
if max_batches is not None and b+1 >= max_batches:
break
# rt = tinygrad.tensor.Tensor(results, dtype=tinygrad.dtypes.float)
rt = torch.tensor(results, dtype=torch.float)
print("%s final score: %d/%d = %.2f%% correct" % (split, rt.sum(), len(results), 100*rt.mean()))
return rt.sum()
@@ -184,22 +189,22 @@ def batch_end_callback(trainer):
if trainer.iter_num % 10 == 0:
print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

if trainer.iter_num % 500 == 0:
# evaluate both the train and test score
train_max_batches = {1: None, 2: None, 3: 5}[config.data.ndigit] # if ndigit=2 we can afford the whole train set, ow no
model.eval()
with torch.no_grad():
train_score = eval_split(trainer, 'train', max_batches=train_max_batches)
test_score = eval_split(trainer, 'test', max_batches=None)
score = train_score + test_score
# save the model if this is the best score we've seen so far
if score > top_score:
top_score = score
print(f"saving model with new top score of {score}")
ckpt_path = os.path.join(config.system.work_dir, "model.pt")
torch.save(model.state_dict(), ckpt_path)
# revert model to training mode
model.train()
# if trainer.iter_num % 500 == 0:
# # evaluate both the train and test score
# train_max_batches = {1: None, 2: None, 3: 5}[config.data.ndigit] # if ndigit=2 we can afford the whole train set, ow no
# # model.eval()
# with torch.no_grad():
# train_score = eval_split(trainer, 'train', max_batches=train_max_batches)
# test_score = eval_split(trainer, 'test', max_batches=None)
# score = train_score + test_score
# # save the model if this is the best score we've seen so far
# if score > top_score:
# top_score = score
# print(f"saving model with new top score of {score}")
# ckpt_path = os.path.join(config.system.work_dir, "model.pt")
# # torch.save(model.state_dict(), ckpt_path)
# # revert model to training mode
# model.train()

trainer.set_callback('on_batch_end', batch_end_callback)

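The diff above swaps torch's `DataLoader` for a `TinyDataLoader` imported from `tinygpt.tinyloader`, a file not shown in this diff. A minimal batching loader along the following lines would satisfy the two ways `adder.py` uses it (`TinyDataLoader(dataset, batch_size=100)` and iterating over `(x, y)` batches); this is a sketch under that assumption, not the PR's actual implementation:

```python
# Hypothetical sketch of a torch-free DataLoader replacement; tinygpt/tinyloader.py is
# not part of this diff and may differ. Dataset items are assumed to be (x, y) pairs of
# 1-D tinygrad Tensors, as produced by AdditionDataset.__getitem__.
import numpy as np
from tinygrad.tensor import Tensor

class TinyDataLoader:
    def __init__(self, dataset, batch_size=32):
        self.dataset, self.batch_size = dataset, batch_size

    def __iter__(self):
        xs, ys = [], []
        for i in range(len(self.dataset)):
            x, y = self.dataset[i]
            xs.append(x.numpy())
            ys.append(y.numpy())
            if len(xs) == self.batch_size:
                yield Tensor(np.stack(xs)), Tensor(np.stack(ys))
                xs, ys = [], []
        if xs:  # emit the final partial batch instead of dropping it
            yield Tensor(np.stack(xs)), Tensor(np.stack(ys))
```
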
6 changes: 3 additions & 3 deletions projects/chargpt/chargpt.py
@@ -9,9 +9,9 @@
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

from mingpt.model import GPT
from mingpt.trainer import Trainer
from mingpt.utils import set_seed, setup_logging, CfgNode as CN
from tinygpt.model import GPT
from tinygpt.trainer import Trainer
from tinygpt.utils import set_seed, setup_logging, CfgNode as CN

# -----------------------------------------------------------------------------

10 changes: 6 additions & 4 deletions setup.py
@@ -1,12 +1,14 @@
from setuptools import setup

setup(name='minGPT',
setup(name='tinyGPT',
version='0.0.1',
author='Andrej Karpathy',
packages=['mingpt'],
description='A PyTorch re-implementation of GPT',
# author='Andrej Karpathy',
author='Ziliang Peng',
packages=['tinygpt'],
description='A tinygrad port of Andrej Karpathy\'s minGPT',
license='MIT',
install_requires=[
'torch',
'tinygrad',
],
)
4 changes: 2 additions & 2 deletions tests/test_huggingface_import.py
@@ -5,8 +5,8 @@
import unittest
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from mingpt.model import GPT
from mingpt.bpe import BPETokenizer
from tinygpt.model import GPT
from tinygpt.bpe import BPETokenizer
# -----------------------------------------------------------------------------

class TestHuggingFaceImport(unittest.TestCase):
File renamed without changes.
2 changes: 1 addition & 1 deletion mingpt/bpe.py → tinygpt/bpe.py
@@ -226,7 +226,7 @@ def get_encoder():
and handles caching of "database" files.
"""
home_dir = os.path.expanduser('~')
cache_dir = os.path.join(home_dir, '.cache', 'mingpt')
cache_dir = os.path.join(home_dir, '.cache', 'tinygpt')
os.makedirs(cache_dir, exist_ok=True)

# load encoder.json that has the raw mappings from token -> bpe index