The model is a convolutional encoder-decoder network with skip connections, inspired by U-Net, SegNet and Fully Convolutional Networks. The encoder is a pre-trained VGG16 network with Batch Normalization; I designed the decoder myself. Apart from the pre-trained encoder, the model was trained from scratch with PyTorch.
I trained the model on the training split of the 2012 VOCSegmentation dataset (1464 samples). For validation (725 samples) and testing (725 samples), I randomly split the 2012 VOCSegmentation validation set into two parts.
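A random split like the one described above can be sketched with `torch.utils.data.random_split`. The helper name `split_in_two` and the fixed seed are assumptions; the original split procedure and seed are not documented here.

```python
import torch
from torch.utils.data import TensorDataset, random_split


def split_in_two(dataset, seed=0):
    """Randomly split a dataset into two (nearly) equal, disjoint subsets."""
    n_first = len(dataset) // 2
    generator = torch.Generator().manual_seed(seed)  # reproducible split
    return random_split(
        dataset, [n_first, len(dataset) - n_first], generator=generator
    )


# Toy stand-in for the VOC validation set; with torchvision one would pass
# e.g. VOCSegmentation(root, year="2012", image_set="val") instead.
toy = TensorDataset(torch.arange(11))
val_set, test_set = split_in_two(toy)
```

Fixing the generator seed keeps the validation and test subsets identical across runs, which matters when comparing experiments.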
For spatial augmentations I used horizontal flip, rotation, translation and scaling. For color augmentations I used Gaussian blur, ColorJitter, and brightness and contrast changes. Below are examples from the training and validation sets.
Training hyperparameters are presented below. Unfortunately, I didn't have enough computational power to search for the best set of hyperparameters or to train the model for many hours.
Hyperparameter | Value |
---|---|
Optimizer | Adam (lr=1e-4) |
Scheduler | One Cycle LR |
Epochs | 300 |
Patience | 30 |
L1 regularization coefficient | 1e-6 |
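
The settings in the table could be wired together roughly as follows. This is a sketch under assumptions, not the project's training script: the function names `l1_penalty` and `fit` are hypothetical, the table's learning rate is assumed to be the peak rate (`max_lr`) of the one-cycle schedule, and "patience" is assumed to mean early stopping on validation loss.

```python
import torch
import torch.nn as nn

LR = 1e-4        # from the table; assumed to be the one-cycle peak rate
EPOCHS = 300     # from the table
PATIENCE = 30    # from the table; assumed to be early-stopping patience
L1_COEF = 1e-6   # from the table


def l1_penalty(model):
    """Sum of absolute values of all parameters."""
    return sum(p.abs().sum() for p in model.parameters())


def fit(model, train_batches, val_batches, criterion,
        epochs=EPOCHS, patience=PATIENCE):
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=LR, epochs=epochs,
        steps_per_epoch=len(train_batches),
    )
    best_loss, best_state, stale = float("inf"), None, 0
    for _ in range(epochs):
        model.train()
        for x, y in train_batches:
            optimizer.zero_grad()
            loss = criterion(model(x), y) + L1_COEF * l1_penalty(model)
            loss.backward()
            optimizer.step()
            scheduler.step()  # OneCycleLR is stepped once per batch

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item()
                           for x, y in val_batches) / len(val_batches)

        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                break
    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best checkpoint
    return best_loss
```

With early stopping, the 300 epochs act as an upper bound; training ends sooner if the validation loss stops improving.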
Moreover, because of class imbalance, I used a weighted cross-entropy loss. The model was trained for 1 hour 49 minutes on an Nvidia GeForce RTX 3090 Ti.
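One common way to obtain class weights is from inverse pixel frequencies in the training masks. The weighting scheme actually used here is not stated, so the helper `balanced_class_weights`, the "balanced" formula, and treating label 255 as an ignored void label are assumptions.

```python
import torch
import torch.nn as nn


def balanced_class_weights(masks, num_classes=21, ignore_index=255):
    """weight_c = total_pixels / (num_classes * pixels_of_class_c)."""
    counts = torch.zeros(num_classes, dtype=torch.float64)
    for mask in masks:
        valid = mask[mask != ignore_index]  # 1-D tensor of labelled pixels
        counts += torch.bincount(valid, minlength=num_classes).double()
    weights = counts.sum() / (num_classes * counts.clamp(min=1))
    return weights.float()


# Toy example with two classes: class 0 has 5 pixels, class 1 has 2.
toy_masks = [torch.tensor([[0, 0, 0, 1], [0, 0, 255, 1]])]
weights = balanced_class_weights(toy_masks, num_classes=2)

# Rare classes get larger weights; void pixels do not contribute at all.
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=255)
```

With these weights, a misclassified pixel of a rare class contributes more to the loss than one from a dominant class such as background.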
The model achieved a pixel-level accuracy of 87.95% on my test set (half of the VOCSegmentation validation set), which is comparable to results reported in the literature.
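Pixel-level accuracy is the fraction of labelled pixels whose predicted class matches the ground truth. A minimal sketch follows; the function name `pixel_counts` is hypothetical, and excluding pixels labelled 255 is an assumption about how void regions were handled.

```python
import torch


def pixel_counts(logits, target, ignore_index=255):
    """Return (correct, total) pixel counts for one batch.

    logits: (N, C, H, W) class scores; target: (N, H, W) class indices.
    """
    pred = logits.argmax(dim=1)
    valid = target != ignore_index
    correct = ((pred == target) & valid).sum().item()
    return correct, valid.sum().item()


# Toy batch: predictions are [0, 1, 1, 0], targets are [0, 1, 0, void].
logits = torch.tensor([[[[2.0, 0.0, 0.0, 2.0]],
                        [[0.0, 2.0, 2.0, 0.0]]]])
target = torch.tensor([[[0, 1, 0, 255]]])
correct, total = pixel_counts(logits, target)
accuracy = correct / total  # 2 of 3 labelled pixels are correct
```

Returning counts rather than a ratio lets the numbers be summed over all batches before dividing, so every pixel in the test set carries equal weight.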