diff --git a/.gitignore b/.gitignore index f92cfd22..e7de41ad 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,4 @@ __pycache__/ *.swp .env .pylintrc +tinyGPT.egg-info/ diff --git a/LICENSE b/LICENSE index 3d899601..728449ed 100644 --- a/LICENSE +++ b/LICENSE @@ -1,4 +1,5 @@ The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy +The MIT License (MIT) Copyright (c) 2024 Ziliang Peng Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: diff --git a/README.md b/README.md index 9debd177..f22a089b 100644 --- a/README.md +++ b/README.md @@ -1,146 +1,33 @@ +# You like karpathy? You like geohot? You love tinyGPT! ❤️ -# minGPT +![tinyGPT](hotzilla.avif) -![mingpt](mingpt.jpg) +tinyGPT is an attempt to port karpathy's minGPT to geohot's tinygrad. It serves a few purposes: +- demonstrate API compatibility and diff between PyTorch and Tinygrad +- Identify missing features/APIs in Tinygrad +- Benchmark and compare performance -A PyTorch re-implementation of [GPT](https://github.com/openai/gpt-2), both training and inference. minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code (see [mingpt/model.py](mingpt/model.py)). All that's going on is that a sequence of indices feeds into a [Transformer](https://arxiv.org/abs/1706.03762), and a probability distribution over the next index in the sequence comes out. The majority of the complexity is just being clever with batching (both across examples and over sequence length) for efficiency. +### Library Installation and Test -**note (Jan 2023)**: though I may continue to accept and change some details, minGPT is in a semi-archived state. For more recent developments see my rewrite [nanoGPT](https://github.com/karpathy/nanoGPT). Basically, minGPT became referenced across a wide variety of places (notebooks, blogs, courses, books, etc.) which made me less willing to make the bigger changes I wanted to make to move the code forward. I also wanted to change the direction a bit, from a sole focus on education to something that is still simple and hackable but has teeth (reproduces medium-sized industry benchmarks, accepts some tradeoffs to gain runtime efficiency, etc). - -The minGPT library is three files: [mingpt/model.py](mingpt/model.py) contains the actual Transformer model definition, [mingpt/bpe.py](mingpt/bpe.py) contains a mildly refactored Byte Pair Encoder that translates between text and sequences of integers exactly like OpenAI did in GPT, [mingpt/trainer.py](mingpt/trainer.py) is (GPT-independent) PyTorch boilerplate code that trains the model. 
Then there are a number of demos and projects that use the library in the `projects` folder: - -- `projects/adder` trains a GPT from scratch to add numbers (inspired by the addition section in the GPT-3 paper) -- `projects/chargpt` trains a GPT to be a character-level language model on some input text file -- `demo.ipynb` shows a minimal usage of the `GPT` and `Trainer` in a notebook format on a simple sorting example -- `generate.ipynb` shows how one can load a pretrained GPT2 and generate text given some prompt - -### Library Installation - -If you want to `import mingpt` into your project: +If you want to `import tinygpt` into your project: ``` -git clone https://github.com/karpathy/minGPT.git -cd minGPT +git clone https://github.com/ziliangpeng/tinyGPT.git +cd tinyGPT pip install -e . ``` -### Usage +After that, you can run the demo project to see the result: -Here's how you'd instantiate a GPT-2 (124M param version): - -```python -from mingpt.model import GPT -model_config = GPT.get_default_config() -model_config.model_type = 'gpt2' -model_config.vocab_size = 50257 # openai's model vocabulary -model_config.block_size = 1024 # openai's model block_size (i.e. input context length) -model = GPT(model_config) ``` - -And here's how you'd train it: - -```python -# your subclass of torch.utils.data.Dataset that emits example -# torch LongTensor of lengths up to 1024, with integers from [0,50257) -train_dataset = YourDataset() - -from mingpt.trainer import Trainer -train_config = Trainer.get_default_config() -train_config.learning_rate = 5e-4 # many possible options, see the file -train_config.max_iters = 1000 -train_config.batch_size = 32 -trainer = Trainer(train_config, model, train_dataset) -trainer.run() -``` - -See `demo.ipynb` for a more concrete example. - -### Unit tests - -Coverage is not super amazing just yet but: - +cd project/adder +python adder.py ``` -python -m unittest discover tests -``` - -### todos - -- add gpt-2 finetuning demo on arbitrary given text file -- add dialog agent demo -- better docs of outcomes for existing projects (adder, chargpt) -- add mixed precision and related training scaling goodies -- distributed training support -- reproduce some benchmarks in projects/, e.g. text8 or other language modeling -- proper logging instead of print statement amateur hour haha -- i probably should have a requirements.txt file... -- it should be possible to load in many other model weights other than just gpt2-\* - -### References - -Code: - -- [openai/gpt-2](https://github.com/openai/gpt-2) has the model definition in TensorFlow, but not the training code -- [openai/image-gpt](https://github.com/openai/image-gpt) has some more modern gpt-3 like modification in its code, good reference as well -- [huggingface/transformers](https://github.com/huggingface/transformers) has a [language-modeling example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling). It is full-featured but as a result also somewhat challenging to trace. E.g. some large functions have as much as 90% unused code behind various branching statements that is unused in the default setting of simple language modeling - -Papers + some implementation notes: - -#### Improving Language Understanding by Generative Pre-Training (GPT-1) - -- Our model largely follows the original transformer work -- We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). 
For the position-wise feed-forward networks, we used 3072 dimensional inner states. -- Adam max learning rate of 2.5e-4. (later GPT-3 for this model size uses 6e-4) -- LR decay: increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule -- We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. -- Since layernorm is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient -- bytepair encoding (BPE) vocabulary with 40,000 merges -- residual, embedding, and attention dropouts with a rate of 0.1 for regularization. -- modified version of L2 regularization proposed in (37), with w = 0.01 on all non bias or gain weights -- For the activation function, we used the Gaussian Error Linear Unit (GELU). -- We used learned position embeddings instead of the sinusoidal version proposed in the original work -- For finetuning: We add dropout to the classifier with a rate of 0.1. learning rate of 6.25e-5 and a batchsize of 32. 3 epochs. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5. -- GPT-1 model is 12 layers and d_model 768, ~117M params - -#### Language Models are Unsupervised Multitask Learners (GPT-2) - -- LayerNorm was moved to the input of each sub-block, similar to a pre-activation residual network -- an additional layer normalization was added after the final self-attention block. -- modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers. (weird because in their released code i can only find a simple use of the old 0.02... in their release of image-gpt I found it used for c_proj, and even then only for attn, not for mlp. huh. https://github.com/openai/image-gpt/blob/master/src/model.py) -- the vocabulary is expanded to 50,257 -- increase the context size from 512 to 1024 tokens -- larger batchsize of 512 is used -- GPT-2 used 48 layers and d_model 1600 (vs. original 12 layers and d_model 768). ~1.542B params - -#### Language Models are Few-Shot Learners (GPT-3) -- GPT-3: 96 layers, 96 heads, with d_model of 12,288 (175B parameters). -- GPT-1-like: 12 layers, 12 heads, d_model 768 (125M) -- We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein -- we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer -- we always have the feedforward layer four times the size of the bottleneck layer, dff = 4 ∗ dmodel -- all models use a context window of nctx = 2048 tokens. -- Adam with β1 = 0.9, β2 = 0.95, and eps = 10−8 -- All models use weight decay of 0.1 to provide a small amount of regularization. (NOTE: GPT-1 used 0.01 I believe, see above) -- clip the global norm of the gradient at 1.0 -- Linear LR warmup over the first 375 million tokens. Then use cosine decay for learning rate down to 10% of its value, over 260 billion tokens. -- gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size. -- full 2048-sized time context window is always used, with a special END OF DOCUMENT token delimiter +tinygrad allows you to choose hardware via env vars: `CLANG=1`, `CUDA=1`, `METAL=1`. 
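+For example, to run the adder demo on a specific backend (a minimal sketch; use whichever backend your machine actually supports):
+```
+CLANG=1 python adder.py   # CPU via the clang backend
+CUDA=1  python adder.py   # NVIDIA GPU via CUDA
+METAL=1 python adder.py   # Apple GPU via Metal
+```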
-#### Generative Pretraining from Pixels (Image GPT) +And you can choose `DEBUG=` level for increasing amount of debug log. Refer to [tinygrad](https://github.com/tinygrad/tinygrad/blob/master/docs-legacy/env_vars.md) for more env vars control. -- When working with images, we pick the identity permutation πi = i for 1 ≤ i ≤ n, also known as raster order. -- we create our own 9-bit color palette by clustering (R, G, B) pixel values using k-means with k = 512. -- Our largest model, iGPT-XL, contains L = 60 layers and uses an embedding size of d = 3072 for a total of 6.8B parameters. -- Our next largest model, iGPT-L, is essentially identical to GPT-2 with L = 48 layers, but contains a slightly smaller embedding size of d = 1536 (vs 1600) for a total of 1.4B parameters. -- We use the same model code as GPT-2, except that we initialize weights in the layerdependent fashion as in Sparse Transformer (Child et al., 2019) and zero-initialize all projections producing logits. -- We also train iGPT-M, a 455M parameter model with L = 36 and d = 1024 -- iGPT-S, a 76M parameter model with L = 24 and d = 512 (okay, and how many heads? looks like the Github code claims 8) -- When pre-training iGPT-XL, we use a batch size of 64 and train for 2M iterations, and for all other models we use a batch size of 128 and train for 1M iterations. -- Adam with β1 = 0.9 and β2 = 0.95 -- The learning rate is warmed up for one epoch, and then decays to 0 -- We did not use weight decay because applying a small weight decay of 0.01 did not change representation quality. -- iGPT-S lr 0.003 -- No dropout is used. ### License diff --git a/hotzilla.avif b/hotzilla.avif new file mode 100644 index 00000000..8a5193eb Binary files /dev/null and b/hotzilla.avif differ diff --git a/mingpt.jpg b/mingpt.jpg deleted file mode 100644 index 8070bcb8..00000000 Binary files a/mingpt.jpg and /dev/null differ diff --git a/projects/adder/.gitignore b/projects/adder/.gitignore new file mode 100644 index 00000000..89f9ac04 --- /dev/null +++ b/projects/adder/.gitignore @@ -0,0 +1 @@ +out/ diff --git a/projects/adder/adder.py b/projects/adder/adder.py index 55f03ee1..6b14bf6d 100644 --- a/projects/adder/adder.py +++ b/projects/adder/adder.py @@ -6,13 +6,15 @@ import sys import json -import torch -from torch.utils.data import Dataset -from torch.utils.data.dataloader import DataLoader +import random +import tinygrad -from mingpt.model import GPT -from mingpt.trainer import Trainer -from mingpt.utils import set_seed, setup_logging, CfgNode as CN +from tinygrad import dtypes +from tinygpt.tinyloader import TinyDataLoader +from tinygpt.model import GPT +from tinygpt.trainer import Trainer +from tinygpt.utils import set_seed, setup_logging, CfgNode as CN +from tinygrad.tensor import Tensor # ----------------------------------------------------------------------------- @@ -40,7 +42,7 @@ def get_config(): # ----------------------------------------------------------------------------- -class AdditionDataset(Dataset): +class AdditionDataset: """ Creates n-digit addition problems. For example, if n=2, then an example addition problem would be to add 85 + 50 = 135. 
This problem would be @@ -79,9 +81,10 @@ def __init__(self, config, split): ndigit = self.config.ndigit assert ndigit <= 3, "the lines below would be very memory inefficient, in future maybe refactor to support" num = (10**ndigit)**2 # total number of possible addition problems with ndigit numbers - rng = torch.Generator() - rng.manual_seed(1337) - perm = torch.randperm(num, generator=rng) + random.seed(1337) + perm = list(range(num)) + random.shuffle(perm) + perm = tinygrad.tensor.Tensor(perm) num_test = min(int(num*0.2), 500) # 20% of the whole dataset, or only up to 500 self.ixes = perm[:num_test] if split == 'test' else perm[num_test:] @@ -95,6 +98,7 @@ def get_block_size(self): return 3*self.config.ndigit + 1 - 1 def __len__(self): + # return self.ixes.numel() return self.ixes.nelement() def __getitem__(self, idx): @@ -113,8 +117,8 @@ def __getitem__(self, idx): render = astr + bstr + cstr dix = [int(s) for s in render] # convert each character to its token index # x will be input to GPT and y will be the associated expected outputs - x = torch.tensor(dix[:-1], dtype=torch.long) - y = torch.tensor(dix[1:], dtype=torch.long) # predict the next token in the sequence + x = Tensor(dix[:-1], dtype=dtypes.long) + y = Tensor(dix[1:], dtype=dtypes.long) # predict the next token in the sequence y[:ndigit*2-1] = -1 # we will only train in the output locations. -1 will mask loss to zero return x, y @@ -147,10 +151,10 @@ def eval_split(trainer, split, max_batches=None): ndigit = config.data.ndigit results = [] mistakes_printed_already = 0 - factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]).to(trainer.device) - loader = DataLoader(dataset, batch_size=100, num_workers=0, drop_last=False) + factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]) + # factors = tinygrad.tensor.Tensor([[10**i for i in range(ndigit+1)][::-1]]).to(tmp_device) + loader = TinyDataLoader(dataset, batch_size=100) for b, (x, y) in enumerate(loader): - x = x.to(trainer.device) # isolate the first two digits of the input sequence alone d1d2 = x[:, :ndigit*2] # let the model sample the rest of the sequence @@ -172,6 +176,7 @@ def eval_split(trainer, split, max_batches=None): print("GPT claims that %d + %d = %d but gt is %d" % (d1i[i], d2i[i], d3i_pred[i], d3i_gt[i])) if max_batches is not None and b+1 >= max_batches: break + # rt = tinygrad.tensor.Tensor(results, dtype=tinygrad.dtypes.float) rt = torch.tensor(results, dtype=torch.float) print("%s final score: %d/%d = %.2f%% correct" % (split, rt.sum(), len(results), 100*rt.mean())) return rt.sum() @@ -184,22 +189,22 @@ def batch_end_callback(trainer): if trainer.iter_num % 10 == 0: print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}") - if trainer.iter_num % 500 == 0: - # evaluate both the train and test score - train_max_batches = {1: None, 2: None, 3: 5}[config.data.ndigit] # if ndigit=2 we can afford the whole train set, ow no - model.eval() - with torch.no_grad(): - train_score = eval_split(trainer, 'train', max_batches=train_max_batches) - test_score = eval_split(trainer, 'test', max_batches=None) - score = train_score + test_score - # save the model if this is the best score we've seen so far - if score > top_score: - top_score = score - print(f"saving model with new top score of {score}") - ckpt_path = os.path.join(config.system.work_dir, "model.pt") - torch.save(model.state_dict(), ckpt_path) - # revert model to training mode - model.train() + # if trainer.iter_num % 500 == 0: + # # 
evaluate both the train and test score + # train_max_batches = {1: None, 2: None, 3: 5}[config.data.ndigit] # if ndigit=2 we can afford the whole train set, ow no + # # model.eval() + # with torch.no_grad(): + # train_score = eval_split(trainer, 'train', max_batches=train_max_batches) + # test_score = eval_split(trainer, 'test', max_batches=None) + # score = train_score + test_score + # # save the model if this is the best score we've seen so far + # if score > top_score: + # top_score = score + # print(f"saving model with new top score of {score}") + # ckpt_path = os.path.join(config.system.work_dir, "model.pt") + # # torch.save(model.state_dict(), ckpt_path) + # # revert model to training mode + # model.train() trainer.set_callback('on_batch_end', batch_end_callback) diff --git a/projects/chargpt/chargpt.py b/projects/chargpt/chargpt.py index 5de925b0..fb33ba9e 100644 --- a/projects/chargpt/chargpt.py +++ b/projects/chargpt/chargpt.py @@ -9,9 +9,9 @@ from torch.utils.data import Dataset from torch.utils.data.dataloader import DataLoader -from mingpt.model import GPT -from mingpt.trainer import Trainer -from mingpt.utils import set_seed, setup_logging, CfgNode as CN +from tinygpt.model import GPT +from tinygpt.trainer import Trainer +from tinygpt.utils import set_seed, setup_logging, CfgNode as CN # ----------------------------------------------------------------------------- diff --git a/setup.py b/setup.py index 9a2d64f6..2dd958cd 100644 --- a/setup.py +++ b/setup.py @@ -1,12 +1,14 @@ from setuptools import setup -setup(name='minGPT', +setup(name='tinyGPT', version='0.0.1', - author='Andrej Karpathy', - packages=['mingpt'], - description='A PyTorch re-implementation of GPT', + # author='Andrej Karpathy', + author='Ziliang Peng', + packages=['tinygpt'], + description='A tinygrad port of Andrej Karpathy\'s minGPT', license='MIT', install_requires=[ 'torch', + 'tinygrad', ], ) diff --git a/tests/test_huggingface_import.py b/tests/test_huggingface_import.py index dab52a82..146fe8f6 100644 --- a/tests/test_huggingface_import.py +++ b/tests/test_huggingface_import.py @@ -5,8 +5,8 @@ import unittest import torch from transformers import GPT2Tokenizer, GPT2LMHeadModel -from mingpt.model import GPT -from mingpt.bpe import BPETokenizer +from tinygpt.model import GPT +from tinygpt.bpe import BPETokenizer # ----------------------------------------------------------------------------- class TestHuggingFaceImport(unittest.TestCase): diff --git a/mingpt/__init__.py b/tinygpt/__init__.py similarity index 100% rename from mingpt/__init__.py rename to tinygpt/__init__.py diff --git a/mingpt/bpe.py b/tinygpt/bpe.py similarity index 99% rename from mingpt/bpe.py rename to tinygpt/bpe.py index b8468ef9..06e5986d 100644 --- a/mingpt/bpe.py +++ b/tinygpt/bpe.py @@ -226,7 +226,7 @@ def get_encoder(): and handles caching of "database" files. 
""" home_dir = os.path.expanduser('~') - cache_dir = os.path.join(home_dir, '.cache', 'mingpt') + cache_dir = os.path.join(home_dir, '.cache', 'tinygpt') os.makedirs(cache_dir, exist_ok=True) # load encoder.json that has the raw mappings from token -> bpe index diff --git a/mingpt/model.py b/tinygpt/model.py similarity index 52% rename from mingpt/model.py rename to tinygpt/model.py index 83ee22dc..b8ec61c7 100644 --- a/mingpt/model.py +++ b/tinygpt/model.py @@ -10,23 +10,26 @@ import math -import torch -import torch.nn as nn -from torch.nn import functional as F +from tinygrad import dtypes +from tinygrad.tensor import Tensor +import tinygrad.nn as nn -from mingpt.utils import CfgNode as CN +from tinygpt import tinyutils +from tinygpt.utils import CfgNode as CN # ----------------------------------------------------------------------------- -class NewGELU(nn.Module): +# NOTE: Tinygrad layer initialization doesn't support customizing the mean and std. + +class NewGELU: """ Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415 """ - def forward(self, x): - return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0)))) + def __call__(self, x): + return 0.5 * x * (1.0 + (math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3.0))).tanh()) -class CausalSelfAttention(nn.Module): +class CausalSelfAttention: """ A vanilla multi-head masked self-attention layer with a projection at the end. It is possible to use torch.nn.MultiheadAttention here but I am including an @@ -41,15 +44,15 @@ def __init__(self, config): # output projection self.c_proj = nn.Linear(config.n_embd, config.n_embd) # regularization - self.attn_dropout = nn.Dropout(config.attn_pdrop) - self.resid_dropout = nn.Dropout(config.resid_pdrop) + self.attn_pdrop = config.attn_pdrop + self.resid_pdrop = config.resid_pdrop # causal mask to ensure that attention is only applied to the left in the input sequence - self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)) - .view(1, 1, config.block_size, config.block_size)) + self.bias = Tensor.ones(config.block_size, config.block_size).tril().view(1, 1, config.block_size, config.block_size) + self.bias.requires_grad = False self.n_head = config.n_head self.n_embd = config.n_embd - def forward(self, x): + def __call__(self, x): B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd) # calculate query, key, values for all heads in batch and move head forward to be the batch dim @@ -61,16 +64,16 @@ def forward(self, x): # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T) att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))) att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf')) - att = F.softmax(att, dim=-1) - att = self.attn_dropout(att) + att = att.softmax(axis=-1) + att = att.dropout(self.attn_pdrop) y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs) y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side # output projection - y = self.resid_dropout(self.c_proj(y)) + y = self.c_proj(y).dropout(self.resid_pdrop) return y -class Block(nn.Module): +class Block: """ an unassuming Transformer block """ def __init__(self, config): @@ -78,21 +81,21 @@ def __init__(self, config): self.ln_1 = nn.LayerNorm(config.n_embd) self.attn = CausalSelfAttention(config) self.ln_2 = 
nn.LayerNorm(config.n_embd) - self.mlp = nn.ModuleDict(dict( + self.mlp = dict( c_fc = nn.Linear(config.n_embd, 4 * config.n_embd), c_proj = nn.Linear(4 * config.n_embd, config.n_embd), act = NewGELU(), - dropout = nn.Dropout(config.resid_pdrop), - )) + dropout = config.resid_pdrop, + ) m = self.mlp - self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward + self.mlpf = lambda x: m['c_proj'](m['act'](m['c_fc'](x))).dropout(m['dropout']) # MLP forward - def forward(self, x): + def __call__(self, x): x = x + self.attn(self.ln_1(x)) x = x + self.mlpf(self.ln_2(x)) return x -class GPT(nn.Module): +class GPT: """ GPT Language Model """ @staticmethod @@ -141,151 +144,108 @@ def __init__(self, config): 'gpt-nano': dict(n_layer=3, n_head=3, n_embd=48), }[config.model_type]) - self.transformer = nn.ModuleDict(dict( + self.transformer = dict( wte = nn.Embedding(config.vocab_size, config.n_embd), wpe = nn.Embedding(config.block_size, config.n_embd), - drop = nn.Dropout(config.embd_pdrop), - h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), + drop = config.embd_pdrop, # in Tinygrad dropout is a function not a Layer. + h = [Block(config) for _ in range(config.n_layer)], ln_f = nn.LayerNorm(config.n_embd), - )) + ) self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) # init all weights, and apply a special scaled init to the residual projections, per GPT-2 paper - self.apply(self._init_weights) - for pn, p in self.named_parameters(): - if pn.endswith('c_proj.weight'): - torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer)) + # NOTE: Tinygrad doesn't support custom initialization # report number of parameters (note we don't count the decoder parameters in lm_head) - n_params = sum(p.numel() for p in self.transformer.parameters()) + n_params = sum(p.numel() for p in nn.state.get_parameters(self.transformer)) print("number of parameters: %.2fM" % (n_params/1e6,)) - def _init_weights(self, module): - if isinstance(module, nn.Linear): - torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) - if module.bias is not None: - torch.nn.init.zeros_(module.bias) - elif isinstance(module, nn.Embedding): - torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) - elif isinstance(module, nn.LayerNorm): - torch.nn.init.zeros_(module.bias) - torch.nn.init.ones_(module.weight) - - @classmethod - def from_pretrained(cls, model_type): - """ - Initialize a pretrained GPT model by copying over the weights - from a huggingface/transformers checkpoint. - """ - assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'} - from transformers import GPT2LMHeadModel - - # create a from-scratch initialized minGPT model - config = cls.get_default_config() - config.model_type = model_type - config.vocab_size = 50257 # openai's model vocabulary - config.block_size = 1024 # openai's model block_size - model = GPT(config) - sd = model.state_dict() - - # init a huggingface/transformers model - model_hf = GPT2LMHeadModel.from_pretrained(model_type) - sd_hf = model_hf.state_dict() - - # copy while ensuring all of the parameters are aligned and match in names and shapes - keys = [k for k in sd_hf if not k.endswith('attn.masked_bias')] # ignore these - transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight'] - # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla nn.Linear. 
- # this means that we have to transpose these weights when we import them - assert len(keys) == len(sd) - for k in keys: - if any(k.endswith(w) for w in transposed): - # special treatment for the Conv1D weights we need to transpose - assert sd_hf[k].shape[::-1] == sd[k].shape - with torch.no_grad(): - sd[k].copy_(sd_hf[k].t()) - else: - # vanilla copy over the other parameters - assert sd_hf[k].shape == sd[k].shape - with torch.no_grad(): - sd[k].copy_(sd_hf[k]) - - return model + # @classmethod + # def from_pretrained(cls, model_type): + # """ + # Initialize a pretrained GPT model by copying over the weights + # from a huggingface/transformers checkpoint. + # """ + # assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'} + # from transformers import GPT2LMHeadModel + + # # create a from-scratch initialized minGPT model + # config = cls.get_default_config() + # config.model_type = model_type + # config.vocab_size = 50257 # openai's model vocabulary + # config.block_size = 1024 # openai's model block_size + # model = GPT(config) + # sd = model.state_dict() + + # # init a huggingface/transformers model + # model_hf = GPT2LMHeadModel.from_pretrained(model_type) + # sd_hf = model_hf.state_dict() + + # # copy while ensuring all of the parameters are aligned and match in names and shapes + # keys = [k for k in sd_hf if not k.endswith('attn.masked_bias')] # ignore these + # transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight'] + # # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla nn.Linear. + # # this means that we have to transpose these weights when we import them + # assert len(keys) == len(sd) + # for k in keys: + # if any(k.endswith(w) for w in transposed): + # # special treatment for the Conv1D weights we need to transpose + # assert sd_hf[k].shape[::-1] == sd[k].shape + # # with torc.no_grad(): + # Tensor.no_grad = True + # sd[k].copy_(sd_hf[k].t()) + # Tensor.no_grad = False + # else: + # # vanilla copy over the other parameters + # assert sd_hf[k].shape == sd[k].shape + # # with torc.no_grad(): + # Tensor.no_grad = True + # sd[k].copy_(sd_hf[k]) + # Tensor.no_grad = False + + # return model def configure_optimizers(self, train_config): """ This long function is unfortunately doing something very simple and is being very defensive: We are separating out all parameters of the model into two buckets: those that will experience weight decay for regularization and those that won't (biases, and layernorm/embedding weights). - We are then returning the PyTorch optimizer object. + We are then returning the optimizer object. """ - # separate out all parameters to those that will and won't experience regularizing weight decay - decay = set() - no_decay = set() - whitelist_weight_modules = (torch.nn.Linear, ) - blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding) - for mn, m in self.named_modules(): - for pn, p in m.named_parameters(): - fpn = '%s.%s' % (mn, pn) if mn else pn # full param name - # random note: because named_modules and named_parameters are recursive - # we will see the same tensors p many many times. but doing it this way - # allows us to know which parent module any tensor p belongs to... 
- if pn.endswith('bias'): - # all biases will not be decayed - no_decay.add(fpn) - elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules): - # weights of whitelist modules will be weight decayed - decay.add(fpn) - elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules): - # weights of blacklist modules will NOT be weight decayed - no_decay.add(fpn) - - # validate that we considered every parameter - param_dict = {pn: p for pn, p in self.named_parameters()} - inter_params = decay & no_decay - union_params = decay | no_decay - assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), ) - assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \ - % (str(param_dict.keys() - union_params), ) - - # create the pytorch optimizer object - optim_groups = [ - {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay}, - {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0}, - ] - optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas) + # NOTE: Tinygrad's AdamW doesn't support decay. hmmm... + params = list(filter(lambda x: x.requires_grad != False, nn.state.get_parameters(self))) # Do not optimize the requires_grad=False tensors + optimizer = nn.optim.AdamW(params, lr=train_config.learning_rate, b1=train_config.betas[0], b2=train_config.betas[1]) return optimizer - def forward(self, idx, targets=None): - device = idx.device + def __call__(self, idx, targets=None): b, t = idx.size() assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}" - pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t) + pos = Tensor.arange(0, t, dtype=dtypes.long).unsqueeze(0) # shape (1, t) # forward the GPT model itself - tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd) - pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd) - x = self.transformer.drop(tok_emb + pos_emb) - for block in self.transformer.h: + tok_emb = self.transformer['wte'](idx) # token embeddings of shape (b, t, n_embd) + pos_emb = self.transformer['wpe'](pos) # position embeddings of shape (1, t, n_embd) + x = (tok_emb + pos_emb).dropout(self.transformer['drop']) + for block in self.transformer['h']: x = block(x) - x = self.transformer.ln_f(x) + x = self.transformer['ln_f'](x) logits = self.lm_head(x) # if we are given some desired targets also calculate the loss loss = None if targets is not None: - loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1) + loss = logits.view(-1, logits.size(-1)).sparse_categorical_crossentropy( targets.view(-1), ignore_index=-1) return logits, loss - @torch.no_grad() def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None): """ Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete the sequence max_new_tokens times, feeding the predictions back into the model each time. Most likely you'll want to make sure to be in model.eval() mode of operation for this. 
""" + Tensor.no_grad = True for _ in range(max_new_tokens): # if the sequence context is growing too long we must crop it at block_size idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:] @@ -295,16 +255,18 @@ def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k= logits = logits[:, -1, :] / temperature # optionally crop the logits to only the top k options if top_k is not None: - v, _ = torch.topk(logits, top_k) + v, _ = tinyutils.topk(logits, top_k) logits[logits < v[:, [-1]]] = -float('Inf') # apply softmax to convert logits to (normalized) probabilities - probs = F.softmax(logits, dim=-1) + probs = logits.softmax(axis=-1) # either sample from the distribution or take the most likely element if do_sample: - idx_next = torch.multinomial(probs, num_samples=1) + idx_next = Tensor.multinomial(probs, num_samples=1) else: - _, idx_next = torch.topk(probs, k=1, dim=-1) + _, idx_next = tinyutils.topk(probs, k=1, dim=-1) # append sampled index to the running sequence and continue - idx = torch.cat((idx, idx_next), dim=1) + idx = Tensor.cat((idx, idx_next), dim=1) + + Tensor.no_grad = False return idx diff --git a/tinygpt/tinyloader.py b/tinygpt/tinyloader.py new file mode 100644 index 00000000..dd65004a --- /dev/null +++ b/tinygpt/tinyloader.py @@ -0,0 +1,18 @@ +from tinygrad.tensor import Tensor + +class TinyDataLoader: + def __init__(self, dataset, batch_size=1) -> None: + self.dataset = dataset + self.batch_size = batch_size + + def __iter__(self): + while True: + xs = [] + ys = [] + for i in range(self.batch_size): + x, y = self.dataset[i] + xs.append(x) + ys.append(y) + xs = Tensor.stack(*xs) + ys = Tensor.stack(*ys) + yield xs, ys diff --git a/tinygpt/tinyutils.py b/tinygpt/tinyutils.py new file mode 100644 index 00000000..371ef880 --- /dev/null +++ b/tinygpt/tinyutils.py @@ -0,0 +1,22 @@ +import numpy as np +from tinygrad import Tensor + +def clip_grad_norm_(parameters, max_norm, norm_type=2.0, error_if_nonfinite=False, foreach=None): + import torch + return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type, error_if_nonfinite, foreach) + +def topk(input_, k, dim=-1, largest=True, sorted=False): + k = min(k, input_.shape[dim]-1) + input_ = input_.numpy() + if largest: input_ *= -1 + ind = np.argpartition(input_, k, axis=dim) + if largest: input_ *= -1 + ind = np.take(ind, np.arange(k), axis=dim) # k non-sorted indices + input_ = np.take_along_axis(input_, ind, axis=dim) # k non-sorted values + if not sorted: return Tensor(input_), ind + if largest: input_ *= -1 + ind_part = np.argsort(input_, axis=dim) + ind = np.take_along_axis(ind, ind_part, axis=dim) + if largest: input_ *= -1 + val = np.take_along_axis(input_, ind_part, axis=dim) + return Tensor(val), ind \ No newline at end of file diff --git a/mingpt/trainer.py b/tinygpt/trainer.py similarity index 71% rename from mingpt/trainer.py rename to tinygpt/trainer.py index c0d08521..166b306a 100644 --- a/mingpt/trainer.py +++ b/tinygpt/trainer.py @@ -6,17 +6,17 @@ import time from collections import defaultdict -import torch -from torch.utils.data.dataloader import DataLoader -from mingpt.utils import CfgNode as CN +from tinygrad.tensor import Tensor + +from tinygpt.tinyloader import TinyDataLoader +from tinygpt import tinyutils +from tinygpt.utils import CfgNode as CN class Trainer: @staticmethod def get_default_config(): C = CN() - # device to train on - C.device = 'auto' # dataloder parameters C.num_workers = 4 # optimizer parameters @@ -35,14 +35,6 @@ def __init__(self, 
config, model, train_dataset): self.train_dataset = train_dataset self.callbacks = defaultdict(list) - # determine the device we'll train on - if config.device == 'auto': - self.device = 'cuda' if torch.cuda.is_available() else 'cpu' - else: - self.device = config.device - self.model = self.model.to(self.device) - print("running on device", self.device) - # variables that will be assigned to trainer class later for logging and etc self.iter_num = 0 self.iter_time = 0.0 @@ -65,16 +57,10 @@ def run(self): self.optimizer = model.configure_optimizers(config) # setup the dataloader - train_loader = DataLoader( - self.train_dataset, - sampler=torch.utils.data.RandomSampler(self.train_dataset, replacement=True, num_samples=int(1e10)), - shuffle=False, - pin_memory=True, - batch_size=config.batch_size, - num_workers=config.num_workers, - ) - - model.train() + train_loader = TinyDataLoader(self.train_dataset, batch_size=config.batch_size) + + t = Tensor.train() + t.__enter__() self.iter_num = 0 self.iter_time = time.time() data_iter = iter(train_loader) @@ -86,16 +72,15 @@ def run(self): except StopIteration: data_iter = iter(train_loader) batch = next(data_iter) - batch = [t.to(self.device) for t in batch] x, y = batch # forward the model logits, self.loss = model(x, y) # backprop and update the parameters - model.zero_grad(set_to_none=True) + self.optimizer.zero_grad() self.loss.backward() - torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip) + # NOTE: tinygrad does not have clip_grad_norm_ self.optimizer.step() self.trigger_callbacks('on_batch_end') @@ -107,3 +92,4 @@ def run(self): # termination conditions if config.max_iters is not None and self.iter_num >= config.max_iters: break + t.__exit__(0, 0, 0) diff --git a/mingpt/utils.py b/tinygpt/utils.py similarity index 97% rename from mingpt/utils.py rename to tinygpt/utils.py index af864ecb..db081894 100644 --- a/mingpt/utils.py +++ b/tinygpt/utils.py @@ -6,15 +6,14 @@ from ast import literal_eval import numpy as np -import torch +import tinygrad # ----------------------------------------------------------------------------- def set_seed(seed): random.seed(seed) np.random.seed(seed) - torch.manual_seed(seed) - torch.cuda.manual_seed_all(seed) + tinygrad.Tensor.manual_seed(seed) def setup_logging(config): """ monotonous bookkeeping """
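The ported trainer drops `torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)` with a NOTE that tinygrad has no equivalent, and `tinygpt/tinyutils.py` still delegates that call to torch. Below is a rough, untested sketch of a tinygrad-native global-norm clip that could be called between `loss.backward()` and `self.optimizer.step()`. The helper itself is hypothetical; details such as `Tensor.minimum` accepting a Python scalar and a reassigned `p.grad` being picked up by the optimizer would need to be checked against the tinygrad version in use.

```python
from tinygrad.tensor import Tensor

def clip_grad_norm_(parameters, max_norm: float, eps: float = 1e-6) -> Tensor:
    """Scale all gradients so their global L2 norm is at most max_norm."""
    parameters = [p for p in parameters if p.grad is not None]
    if not parameters:
        return Tensor(0.0)
    # global L2 norm across every gradient produced by loss.backward()
    total_norm = sum((p.grad * p.grad).sum() for p in parameters).sqrt()
    # scale factor is 1.0 when the norm is already small enough, < 1.0 otherwise
    scale = (max_norm / (total_norm + eps)).minimum(1.0)
    for p in parameters:
        p.grad = p.grad * scale
    return total_norm
```

If it holds up, it could stand in for the dropped line as `clip_grad_norm_(params, config.grad_norm_clip)`, using the same `params` list that `configure_optimizers` hands to `nn.optim.AdamW`.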