Follows Umar Jamil's lecture and the papers Denoising Diffusion Probabilistic Models and High-Resolution Image Synthesis with Latent Diffusion Models to build a Stable Diffusion model from scratch.
The Stable Diffusion model includes a Variational Autoencoder (VAE) from the paper Auto-Encoding Variational Bayes, Contrastive Language–Image Pre-training (CLIP) proposed by OpenAI in the paper Learning Transferable Visual Models From Natural Language Supervision, a U-Net from the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, self-attention and cross-attention from the paper Attention Is All You Need, and the DDPM sampler from the Denoising Diffusion Probabilistic Models paper.
The weights and tokenizer are loaded from RunwayML's Stable Diffusion v1.5 release on Hugging Face and are used for text-to-image and image-to-image generation.
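
To make the role of the DDPM sampler more concrete, here is a minimal sketch of the forward (noising) process it is built around, written in plain Python on a toy list of numbers rather than a real VAE latent. The function names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`) are hypothetical and chosen for illustration, and the schedule values (1000 steps, betas from 1e-4 to 0.02, linearly spaced) are the defaults reported in the DDPM paper, which are not necessarily the values this project uses.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (DDPM paper defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    Returns both the noisy sample and the noise that was added, since the
    noise is what the U-Net is trained to predict.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return x_t, noise


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)               # fixed seed so the run is repeatable
latent = [1.0, -0.5, 0.25, 0.0]      # toy stand-in for a VAE latent
noisy, eps = add_noise(latent, t=500, alpha_bars=alpha_bars, rng=rng)
```

By the last timestep `alpha_bars` is close to zero, so the sample is almost pure Gaussian noise. Sampling runs this in reverse: starting from noise, the U-Net predicts the noise at each step and the sampler removes a portion of it.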