Computer vision

Contents

  1. Convolutions
  2. LeNet
  3. AlexNet
  4. ResNet
  5. UNet
  6. Diffusion
  7. Conclusion

Input tensor shape

The DataLoader for vision models will often load JPEG images that are decoded on the fly and batched, resulting in input tensors of shape:

[batch_size][height][width][n_channels]  =  e.g. [8][224][224][3]

This shape is denoted as NHWC and referred to as "channels last". Another convention is to use NCHW, referred to as "channels first".
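
To make the difference between the two layouts concrete, here is a minimal sketch in plain Python of where an element lands in row-major memory under each convention. The helper names `offset_nhwc` and `offset_nchw` are made up for illustration and are not part of any library; the shape is the example shape above.

```python
def offset_nhwc(n, h, w, c, shape):
    """Row-major flat offset of element (n, h, w, c) in an NHWC tensor.

    `shape` is (N, H, W, C). Hypothetical helper for illustration only.
    """
    _, H, W, C = shape
    return ((n * H + h) * W + w) * C + c


def offset_nchw(n, c, h, w, shape):
    """Row-major flat offset of element (n, c, h, w) in an NCHW tensor.

    `shape` is (N, C, H, W). Hypothetical helper for illustration only.
    """
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


nhwc_shape = (8, 224, 224, 3)
nchw_shape = (8, 3, 224, 224)

# Offsets of the three channel values of pixel (h=0, w=0) in image 0.
nhwc_offsets = [offset_nhwc(0, 0, 0, c, nhwc_shape) for c in range(3)]
nchw_offsets = [offset_nchw(0, c, 0, 0, nchw_shape) for c in range(3)]

print(nhwc_offsets)  # [0, 1, 2]
print(nchw_offsets)  # [0, 50176, 100352]
```

In channels last, the channel values of one pixel sit next to each other in memory. In channels first, each channel forms its own contiguous height-by-width plane, so the same three values are 224 × 224 = 50176 elements apart.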

Convolutions

  • The most important operation in computer vision is the convolution.
  • It is a matrix multiply that respects spatial symmetry: the same small matrix of weights (the kernel) is applied at every position in the image.

2D convolution with padding (source: https://github.com/vdumoulin/conv_arithmetic)
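
The sliding-kernel idea can be sketched in a few lines of plain Python. This is a minimal single-channel version that assumes stride 1 and zero padding; the function name `conv2d` and the toy inputs are illustrative. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def conv2d(image, kernel, padding=0):
    """Single-channel 2D convolution with stride 1 and zero padding.

    `image` and `kernel` are lists of lists of numbers. The kernel is
    not flipped (cross-correlation), matching deep learning convention.
    """
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])

    # Surround the image with `padding` rows and columns of zeros.
    padded = [[0] * (W + 2 * padding) for _ in range(H + 2 * padding)]
    for i in range(H):
        for j in range(W):
            padded[i + padding][j + padding] = image[i][j]

    out_h = H + 2 * padding - kH + 1
    out_w = W + 2 * padding - kW + 1

    # Apply the same kernel at every spatial position.
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                kernel[a][b] * padded[i + a][j + b]
                for a in range(kH)
                for b in range(kW)
            )
    return out


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
box = [[1, 1, 1] for _ in range(3)]  # 3x3 kernel of ones: sums each neighbourhood

same = conv2d(image, box, padding=1)   # output keeps the 3x3 input size
valid = conv2d(image, box, padding=0)  # output shrinks to 1x1

print(same)   # [[12, 21, 16], [27, 45, 33], [24, 39, 28]]
print(valid)  # [[45]]
```

With a 3x3 kernel, a padding of 1 keeps the output the same size as the input, while no padding shrinks each spatial dimension by 2.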

LeNet

LeNet-5

AlexNet

  • Deng, J. et al. (2009). ImageNet: A large-scale hierarchical image database.
    • ImageNet-1k dataset: 1000 image classes with about 1000 examples each.
  • Other important labeled image datasets are CIFAR-10 and CIFAR-100, which have 10 and 100 classes, respectively.
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.
    • Became known as "AlexNet"
    • A watershed moment for deep learning in computer vision

ResNet

ResNet v1 vs v2:

ResNet v1 vs v2 (source: cv-tricks.com)

UNet

The UNet architecture:

UNet architecture (source: https://arxiv.org/abs/1505.04597)

An example of image segmentation with UNet:

Example of image segmentation with UNet (source: https://arxiv.org/abs/1505.04597)

Diffusion

TODO

Conclusion

TODO