This is a decoder-only transformer, similar in structure to GPT, that generates Shakespeare-like text.
The model follows the decoder architecture described in the paper "Attention Is All You Need".
Beyond the core multi-head self-attention and feed-forward layers, it also implements LayerNorm, residual connections (from "Deep Residual Learning for Image Recognition"), and dropout (from "Dropout: A Simple Way to Prevent Neural Networks from Overfitting").
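For reference, here is a minimal sketch of one such decoder block in PyTorch, assuming the pre-norm layout used in Karpathy's lecture; the class and parameter names (`Block`, `n_embd`, `block_size`, etc.) are illustrative and may differ from the ones in `Final.py`:

```python
# Illustrative sketch only -- names and hyperparameters are assumptions,
# not necessarily those used in Final.py.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # project to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        # causal mask: each position may only attend to itself and earlier positions
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # scaled dot-product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(y))

class Block(nn.Module):
    """Pre-norm decoder block: LayerNorm -> sublayer, with residual additions."""
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual connection around attention
        x = x + self.ffwd(self.ln2(x))  # residual connection around feed-forward
        return x
```

The full model stacks several of these blocks on top of token and position embeddings, then projects to vocabulary logits.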
- `lecture.ipynb` follows Andrej Karpathy's lecture "Let's Build GPT" step by step.
- `Final.py` is the final decoder transformer model, with 10 million parameters.
- `input.txt` contains the complete works of Shakespeare, used as the training input.
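Assuming `Final.py` is self-contained and reads `input.txt` from the same directory (as in the lecture's setup), training and sampling would be a single command:

```
python Final.py
```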