The DataLoader
for vision models will often load jpeg images that are jpeg-decoded
on-the-fly and batched, resulting in input tensors with a shape:
[batch_size][height][width][n_channels] = e.g. [8][224][224][3]
This shape is denoted as NHWC
and referred to as "channels last".
Another convention is to use NCHW
, referred to as "channels first".
- The most import operation in computer vision is a convolution.
- It is a matrix multiply that respects spacial symmetry; the same matrix is applied everywhere.
- Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning.
- LeCun, Y. et al. (1989). Backpropagation applied to handwritten zip code recognition.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition.
- Used the MNIST dataset: handwritten digits 0-9
- Used for Optical Character Recogition (OCR) on checks and mail in the 1990s!
- This is arguably the first impactful application of deep learning?
- Other important labeled image datasets are CIFAR-10 and CIFAR-100 that have 10 and 100 classes, respectively.
- Deng, J. et al. (2009). ImageNet: A large-scale hierarchical image database.
- ImageNet-1k dataset: 1000 image classes with about 1000 examples each.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.
- Became known as "AlexNet"
- Watershed moment in CV with deep learning
ResNet v1 vs v2 (cv-tricks.com):
- He, K. et al. (2015). Deep residual learning for image recognition.
- Still an important MLPerf benchmark.
The UNet architecture:
An example of image segmentation with UNet:
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation.
TODO
TODO
- Up next: Natural language
- Previous: Introduction