Transformer: https://arxiv.org/pdf/1706.03762.pdf
BERT: https://arxiv.org/pdf/1810.04805.pdf
Transformer-XL: https://arxiv.org/pdf/1901.02860.pdf
Longformer: https://arxiv.org/pdf/2004.05150.pdf
Block-Recurrent Transformer: https://arxiv.org/pdf/2203.07852.pdf
Memorizing Transformers: https://arxiv.org/pdf/2203.08913.pdf
Fast Transformer Decoding: One Write-Head is All You Need: https://arxiv.org/pdf/1911.02150.pdf
Unlimiformer: https://arxiv.org/pdf/2305.01625.pdf
S4D: https://arxiv.org/pdf/2206.11893.pdf
Block State Transformer: https://arxiv.org/pdf/2306.09539.pdf
Blockwise Parallel Transformer: https://arxiv.org/pdf/2305.19370.pdf