Implementation and experimentation of the SRCNN model in TensorFlow 2.0
This repository contains an implementation of the SRCNN super resolution algorithm [1] implemented in TensorFlow 2.0 (v2.3.1). This library is used to construct, train, and evaluate the SRCNN model, as well as to explore variations on architecture and implementation.
The base SRCNN model consists of upscaling using bicubic interpolation three main convolutional layers [1]:
- A layer for mapping low-resolution patches as feature vectors, consisting of n1 x c (where c is the number of channels in the image) filters of size f1 x f1.
- A non-linear mapping layer consisting of n2 filters of size 1 x 1. This layer maps each low-resolution patch onto another high-resolution patch in the following layer. Can be more that one layer to increase non-linearity.
- A reconstruction layer consisting of c x n2 filters of size f3 x f3, where the layers are aggregated to product the final image.
Typical values for the parameters are:
- n1: 64
- n2: 32
- f1: 9
- f3: 5
The authors of [1] use the mean squared error, MSE, as a loss function and note that this favors a high PSNR measurement on performance testing. While PSNR corresponds to closely matching pixel values, it does not necessarily help create images with high perceptual image quality (e.g. satisfying to human vision).
Gains in perceptual quality can be made by using content or texture loss, or a weighted combination of these. This will be added in the future.
This work beings by using the same training and benchmarking data used in the original paper.
- T91
- Training data
- 91 images of varying resolution featuring plants, animals, humans, vehicles, and buildings
- Source: https://www.kaggle.com/ll01dm/t91-image-dataset
- Set5
- Evaluation data for scaling factors of 2, 3 and 4
- 5 images (baby, bird, butterfly, head, woman), varying resolution
- Source: https://www.kaggle.com/ll01dm/set-5-14-super-resolution-dataset
- Set14
- Evaluation data for scaling factor 3
- 14 images, varying resolution, humans, animals, plants
- Source: https://www.kaggle.com/ll01dm/set-5-14-super-resolution-dataset
- CIFAR-10
- 60,000 images, 32 x 32 pixels
- 10 image classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
- Used to evaulate image quality by human judgement. The images is small and there is no high-resolution version to compare, so this is just for "eyeballing" on fresh images.
Training is done only on the T91 datset, but these images require preprocessing.
We will train on a series of 32 x 32 image patches pulled from the T91 dataset by stepping the patch window across each image by a stride of 14 pixels, which will give roughly 90,000 images. These images are downscaled and them blurred using a Gaussian kernel. Upscaling will happen in the model class.
Rotation - Each 32 x 32 pixel image is rotated by 90 degrees to give 4 images per input image.
Channel swap - Swapping the red, green, and blue channels of each image, creating 6 images per input image.
We will use the following to gauge the performance of upscaling:
- PSNR - Peak signal-to-noise ratio, https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
- SSIM - Structural similarity index ratio, https://en.wikipedia.org/wiki/Structural_similarity
After creating and benchmarking the basic SRCNN model described in [1] we will explore the effect of changing several factors in an attempt to maximize the effectiveness of the model. The following will be altered:
- Number of filters in layers 1 and 3
- Size of filters in layers 1 and 3
- Number of non-linear layers
- Loss functions: content/texture/hyrbid loss
- Batch normalization and dropout
- Ensembling: multiple trained SRCNN models with fused prediction output
- Self-ensembling: Predict on separate 90 degree rotations of the same input image and combine the predictions
- C. Dong, C. C. Loy, K. He, X. Tang, "Learning a deep convolutional network for image super-resolution," in ECCV, 2014.
This repository is built on the work of the following projects: