In this report, our main goal is to replicate the results of the paper Colorful Image Colorization (R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," CoRR, vol. abs/1603.08511, 2016) to produce plausible color versions of photographs given a grayscale input. The original model was trained on the whole ImageNet dataset, which contains more than a million images. Given our resource and time constraints, we decided to use only 50,000 images to speed up training, at the expense of suboptimal results. The results we obtained were satisfactory considering the limited time and resources available, although they are not close to those obtained by the authors. We also explored different network structures to see how the results could be improved. Finally, we performed what the authors refer to as the "Colorization Turing Test", in which people are asked to choose between the original image and the one generated by the network, pointing to the one they think is real.
We show the results we obtained when replicating the paper by recreating the same CNN architecture implemented by the authors, using TensorFlow Keras. The model takes as input the L channel of the image, encoded in the Lab color space, and learns a probability distribution over the quantized ab channels, from which it returns a prediction for the ab channels.
Fig: Top: examples of L channels used as input for the network; bottom: real versions of the images.
Fig: Structure of the CNN network.
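For concreteness, the sketch below shows one way an input/target pair can be built from an RGB training image. It assumes scikit-image for the Lab conversion, and it quantizes the full ab plane into 10-unit bins for simplicity, whereas the paper keeps only the 313 bins that are in gamut; all function and variable names here are ours.

```python
import numpy as np
from skimage.color import rgb2lab

BIN_SIZE = 10            # width of each ab bin; the paper uses a 10-unit grid
AB_RANGE = 110           # ab values roughly lie in [-110, 110]
BINS_PER_AXIS = 2 * AB_RANGE // BIN_SIZE  # 22 bins per axis on the full grid

def make_training_pair(rgb_image):
    """Convert an RGB image (H, W, 3, floats in [0, 1]) into the network's
    input (L channel) and target (quantized ab bin index per pixel)."""
    lab = rgb2lab(rgb_image)
    L = lab[:, :, :1]                    # lightness: the grayscale input
    ab = lab[:, :, 1:]                   # the two chroma channels to predict
    # Map each ab value to its bin index on the 22x22 grid.
    ab_idx = np.floor((ab + AB_RANGE) / BIN_SIZE).astype(np.int32)
    ab_idx = np.clip(ab_idx, 0, BINS_PER_AXIS - 1)
    # Flatten the 2-D grid position into a single class label per pixel.
    target = ab_idx[:, :, 0] * BINS_PER_AXIS + ab_idx[:, :, 1]
    return L, target
```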
Compared to previous colorization methods, this model is designed to encourage the production of vibrant and realistic colors, as opposed to dull, desaturated ones. This is achieved through class rebalancing, which assigns greater importance to colors that are rarer in natural images; these weights are then used in a custom loss function that takes color rarity into account.
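Below is a minimal sketch of how such a rebalanced loss could be written for Keras: a per-pixel cross-entropy over the quantized ab bins, weighted by the rarity of each pixel's true color. The paper mixes the empirical color distribution with a uniform one (with a mixing weight of 1/2) before inverting it; `make_rebalanced_loss`, `bin_priors`, and `lam` are our names, and the details here are illustrative rather than the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf

def make_rebalanced_loss(bin_priors, lam=0.5):
    """Cross-entropy over the quantized ab bins, where each pixel is weighted
    by the rarity of its ground-truth color (class rebalancing).

    bin_priors: empirical frequency of each ab bin over the training set,
    shape (Q,), summing to 1. lam smooths the weights toward uniform.
    """
    Q = bin_priors.shape[0]
    w = 1.0 / ((1.0 - lam) * bin_priors + lam / Q)  # rare colors get large weights
    w = w / np.sum(w * bin_priors)                  # normalize so E[w] = 1
    w = tf.constant(w, dtype=tf.float32)

    def loss(y_true, y_pred):
        # y_true: one-hot bins, (batch, H, W, Q); y_pred: predicted distribution.
        ce = -tf.reduce_sum(y_true * tf.math.log(y_pred + 1e-8), axis=-1)
        pixel_w = tf.reduce_sum(y_true * w, axis=-1)  # weight of each pixel's true bin
        return tf.reduce_mean(pixel_w * ce)

    return loss
```

Such a loss can then be passed to `model.compile(loss=make_rebalanced_loss(priors), ...)` like any other Keras loss.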
Fig: Brief description of the loss function and its components.
Once these components are put together, we obtain a realistic representation of how the image could have looked in color. The model is trained on only a subset of the original dataset, given our time constraints, but it is still able to obtain remarkable results.
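One step worth spelling out is how the per-pixel distribution over bins becomes an actual image: the paper takes the "annealed mean" of the distribution, sharpening it with a temperature (0.38 in the paper) before taking the expectation over the bin centers. A minimal NumPy sketch, with our own names and a `bin_centers` array mapping each bin to its ab value, could look like this:

```python
import numpy as np

def decode_annealed_mean(pred, bin_centers, T=0.38):
    """Turn the per-pixel distribution over ab bins into concrete ab values.

    pred: (H, W, Q) predicted probabilities; bin_centers: (Q, 2) ab value at
    the center of each bin; T is the annealing temperature (0.38 in the paper).
    """
    # Sharpen the distribution: exponentiate the log-probabilities by 1/T.
    logits = np.log(pred + 1e-8) / T
    logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    # Expected ab value under the sharpened distribution, shape (H, W, 2).
    return p @ bin_centers
```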
Fig: Some of the best output samples we obtained, next to the ground-truth versions of the images.
One of the tasks of the project was also to try out different implementations of the network, with the purpose of improving the results. These were combinations of the following techniques (a sketch of how they can be combined follows the list):
- using a regularization technique (e.g. L2)
- changing the optimizer used for gradient descent (e.g. Adam, Adadelta)
- reducing the number of layers
- adding dropout
- reducing the resolution of the output
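As a rough illustration of how these variants can be toggled from a single Keras model builder, consider the sketch below. The layer stack, names (`build_variant`, `filters`, etc.), and hyperparameter values are ours and purely illustrative, not the paper's architecture or our exact experimental code.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_variant(num_bins=313, l2=1e-4, dropout=0.2, filters=(64, 128, 256)):
    """Illustrative colorization CNN where each experimental technique can be
    toggled: L2 regularization, dropout, and a shallower stack via `filters`."""
    reg = regularizers.l2(l2) if l2 else None
    inputs = tf.keras.Input(shape=(None, None, 1))   # the L channel
    x = inputs
    for f in filters:                                # fewer entries = fewer layers
        x = layers.Conv2D(f, 3, padding="same", activation="relu",
                          kernel_regularizer=reg)(x)
        if dropout:
            x = layers.Dropout(dropout)(x)
        x = layers.Conv2D(f, 3, strides=2, padding="same", activation="relu",
                          kernel_regularizer=reg)(x)
    # Predicting at this downsampled resolution corresponds to the "reduced
    # output resolution" variant: fewer pixels to classify.
    outputs = layers.Conv2D(num_bins, 1, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    # Swapping the optimizer is a one-line change (e.g. Adam vs. Adadelta).
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy")
    return model
```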
The full results of the experiments can be read in the report; the change we found most helpful was reducing the output resolution, which left the model fewer pixels to predict and made the colors more accurate.
Fig: Results we obtained with different implementations of the network.