The aim of this project is to train DETR on a custom dataset of objects from the construction domain (around 48 classes) for object detection and panoptic segmentation.
Let us now understand how DETR works and try to answer a few questions.
First, the object detection model was trained for 200 epochs starting from pre-trained weights. A panoptic head was then added on top and trained for another 50 epochs; during this stage the object detection model was frozen and only the panoptic head was trained.
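The freeze-then-train scheme above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the `mask_head` prefix is a hypothetical name for the panoptic-head submodule, and the learning rate is only a placeholder.

```python
import torch
import torch.nn as nn

def freeze_detector(model: nn.Module, head_prefix: str = "mask_head"):
    """Freeze every parameter except those of the panoptic head.

    `head_prefix` is an assumed attribute name; the actual submodule
    name in the DETR repository may differ.
    """
    for name, param in model.named_parameters():
        # Only panoptic-head parameters keep gradients enabled.
        param.requires_grad = name.startswith(head_prefix)
    # Optimize only the trainable (panoptic-head) parameters.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)
```

With this setup, the detector's weights stay fixed while the optimizer updates the head alone, which is what lets the panoptic stage converge in relatively few epochs.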
We train DETR with AdamW, setting the learning rate to 1e-4 in the transformer and 1e-5 in the backbone. Horizontal flips, scales, and crops are used for augmentation. Images are rescaled to have a minimum size of 800 and a maximum size of 1333. The transformer is trained with a dropout of 0.1, and the whole model is trained with gradient clipping at a max norm of 0.1.
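The two learning rates above can be expressed with AdamW parameter groups; a minimal sketch, assuming backbone parameter names start with `backbone` (as in the DETR reference implementation, but an assumption here):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module):
    """AdamW with lr 1e-4 for the transformer and 1e-5 for the backbone."""
    backbone = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    rest = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
    return torch.optim.AdamW([
        {"params": rest, "lr": 1e-4},      # transformer and heads
        {"params": backbone, "lr": 1e-5},  # lower lr for the pre-trained backbone
    ])

# The gradient clipping mentioned above is applied each step,
# after loss.backward() and before optimizer.step():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
```

Keeping the backbone at a lower learning rate is the usual way to fine-tune without destroying its pre-trained features.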
- Fine-tuning of DETR on the construction dataset for Object Detection (click here)
- Panoptic segmentation training (click here)
Bounding box detection evaluation results on the construction dataset after training for 200 epochs:
```
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.753
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.864
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.801
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.387
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.609
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.782
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.716
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.857
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.871
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.505
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.728
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.899
```
Segmentation metrics (PQ: Panoptic Quality, SQ: Segmentation Quality, RQ: Recognition Quality) after training the panoptic head for 50 epochs:
|        |  PQ  |  SQ  |  RQ  |  N |
|--------|------|------|------|----|
| All    | 53.1 | 80.0 | 60.7 | 61 |
| Things | 61.6 | 82.9 | 69.6 | 46 |
| Stuff  | 27.0 | 71.2 | 33.5 | 15 |
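For reference, PQ factorizes as PQ = SQ × RQ: SQ is the mean IoU of matched segments and RQ is an F1-style recognition score. A minimal sketch of the computation from matched-segment IoUs (a prediction/ground-truth pair counts as a true positive when its IoU exceeds 0.5):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute (PQ, SQ, RQ) from the IoUs of matched (TP) segment pairs.

    matched_ious: IoUs of matched prediction/GT pairs (each IoU > 0.5).
    num_fp: unmatched predicted segments; num_fn: unmatched GT segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU over true positives
    rq = tp / denom if denom else 0.0           # F1-style recognition quality
    return sq * rq, sq, rq                      # PQ = SQ * RQ
```

In practice this is computed per class and averaged, which is why the Stuff row (fewer classes, fewer matches) can sit well below the Things row.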
Check out the YouTube link below to see predictions from the trained model.
The project shows that fine-tuning can reach a score of 53 PQ in about 50 epochs, which is satisfactory. Transformers are good at global reasoning but are computationally expensive with long inputs (high-resolution images), making it difficult to attain good results on small objects.
Further work includes:
- Explore new image augmentation techniques such as RICAP for better detection results.
- Reduce leakage of the original COCO classes while creating ground truth (e.g., the red areas around the wheel loader in the image).
- Add a few images from the COCO dataset so that the PQ for stuff classes can be increased.
- Implement Spatially Modulated Co-Attention (SMCA), a plug-and-play module that replaces DETR's co-attention mechanism and helps achieve faster convergence. Refer to this link.
- Explore and implement this paper from Google, which would allow skipping bounding-box detection and training directly for panoptic segmentation.