Implemented a Vision Transformer from the famous paper 'An Image Is Worth 16x16 Words'. Implemented Multi-Head Attention and an MLP from scratch in PyTorch.

Vision Transformer from Scratch

Coding a Vision Transformer from scratch using Python's PyTorch framework. The Vision Transformer architecture:

[Vision Transformer architecture diagram]

Let's break down the subparts of the transformer (a code sketch of these blocks follows the list):

  • Initial patch embeddings: The input image is split into non-overlapping patches of size n×n (n = 16), and these patches are embedded using convolutional layers.

  • Conversion to projected vectors: The patch embeddings are then projected into key, query and value vectors using linear layers.

  • Attention blocks: Consist of self-attention layers that assign weights and context to the patches.

  • Multi-layer perceptron: A simple network with linear layers for image classification. Uses the Gaussian Error Linear Unit (GELU) activation.

  • Final classification layer: Linear layer which maps the output to respective class probabilities.
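
As a reference, here is a minimal PyTorch sketch of these building blocks. Class names, hyperparameter names and default values (img_size, embed_dim, n_heads, ...) are illustrative and not necessarily the ones used in this repository.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into non-overlapping 16x16 patches and embed them with a conv layer."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # kernel_size = stride = patch_size, so each patch is embedded independently
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, n_patches, embed_dim)

class MultiHeadAttention(nn.Module):
    """Project tokens to keys, queries and values, then apply scaled dot-product attention."""
    def __init__(self, embed_dim=768, n_heads=12, attn_dropout=0.0, proj_dropout=0.0):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)   # q, k, v from one linear layer
        self.attn_drop = nn.Dropout(attn_dropout)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(proj_dropout)

    def forward(self, x):                        # x: (B, N, embed_dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.n_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]         # each: (B, n_heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = self.attn_drop(attn.softmax(dim=-1))
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj_drop(self.proj(out))

class MLP(nn.Module):
    """Feed-forward block with GELU (Gaussian Error Linear Unit) activation."""
    def __init__(self, embed_dim=768, hidden_dim=3072, dropout=0.0):
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return self.drop(self.fc2(self.drop(self.act(self.fc1(x)))))

class Block(nn.Module):
    """One encoder block: pre-norm multi-head attention followed by a pre-norm MLP."""
    def __init__(self, embed_dim=768, n_heads=12, attn_dropout=0.0, proj_dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embed_dim, n_heads, attn_dropout, proj_dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLP(embed_dim, embed_dim * 4, proj_dropout)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))         # residual connection around attention
        x = x + self.mlp(self.norm2(x))          # residual connection around the MLP
        return x

class VisionTransformer(nn.Module):
    """Patch embedding + class token + positional embeddings + encoder blocks + linear head."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768,
                 depth=12, n_heads=12, n_classes=1000, attn_dropout=0.0, proj_dropout=0.0):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.patch_embed.n_patches, embed_dim))
        self.blocks = nn.ModuleList(
            [Block(embed_dim, n_heads, attn_dropout, proj_dropout) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, n_classes)   # final classification layer

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.patch_embed(x)                  # (B, n_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        return self.head(x[:, 0])                # logits of shape (B, n_classes)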

Inference

The Vision Transformer can be run in inference mode from the file inference.py with the necessary arguments. The output has shape (batch_size, nclasses) and can be adapted to suit the application. The model and inputs can be moved to an available GPU using PyTorch's CUDA support.
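
A minimal usage sketch, assuming a VisionTransformer class like the one sketched above (the actual class name, constructor arguments and weights file in this repository may differ):

import torch

model = VisionTransformer(depth=12, n_classes=10, attn_dropout=0.1, proj_dropout=0.1)  # illustrative values
model.load_state_dict(torch.load("weights.pth", map_location="cpu"))  # hypothetical weights file
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"  # use the GPU if one is available
model = model.to(device)

images = torch.randn(8, 3, 224, 224, device=device)  # dummy batch of 8 RGB images
with torch.no_grad():
    logits = model(images)          # shape: (batch_size, nclasses)
probs = logits.softmax(dim=-1)      # convert logits to class probabilities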

Run

python inference.py --depth --proj_dropout --attn_dropout --gpu --weights
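
For example (the argument values and weights path below are illustrative; see inference.py for the exact argument definitions and defaults):

python inference.py --depth 12 --proj_dropout 0.1 --attn_dropout 0.1 --gpu --weights vit_weights.pth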

Dependencies

  • PyTorch (preferably with CUDA support)
  • numpy
