Our detection code is developed on top of MMDetection v2.13.0.
For details, see Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions and PVTv2: Improved Baselines with Pyramid Vision Transformer.
If you use this code for a paper, please cite:
PVTv1:

```bibtex
@misc{wang2021pyramid,
  title={Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions},
  author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},
  year={2021},
  eprint={2102.12122},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

PVTv2:

```bibtex
@misc{wang2021pvtv2,
  title={PVTv2: Improved Baselines with Pyramid Vision Transformer},
  author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},
  year={2021},
  eprint={2106.13797},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Install MMDetection v2.13.0 from source following the official instructions, or install it with pip:

```shell
pip install mmdet==2.13.0 --user
```
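A quick sanity check that the expected packages are importable (a minimal sketch; it only prints versions and does not exercise the CUDA ops):

```python
import torch
import mmcv
import mmdet

# The detection code in this repository targets MMDetection v2.13.0;
# mmcv-full and PyTorch must be versions compatible with that release.
print('torch:', torch.__version__)
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)
```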
Apex (optional):

```shell
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py install --cpp_ext --cuda_ext --user
```
If you would like to disable Apex, change the runner type to EpochBasedRunner and comment out the following block in the configuration files:

```python
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
```
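For reference, a minimal sketch of what the affected part of a config could look like with Apex disabled. The values below follow MMDetection's stock defaults and are assumptions, not lines taken from this repository's configs:

```python
# Sketch: fall back to MMDetection's standard training hooks (no Apex, no fp16).
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 12 epochs = the 1x schedule

# Standard optimizer hook; grad_clip=None leaves gradients unclipped.
optimizer_config = dict(grad_clip=None)

# The Apex-specific fp16 / DistOptimizerHook block above stays commented out.
```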
Prepare COCO according to the guidelines in MMDetection v2.13.0.
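By default, MMDetection's COCO configs expect the dataset under data/coco/. A minimal sketch of the expected layout, expressed as a dataset config using MMDetection's stock paths (an assumption, not copied from this repository's configs):

```python
# Sketch of MMDetection's default COCO dataset settings.
dataset_type = 'CocoDataset'
data_root = 'data/coco/'

data = dict(
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/'),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/'),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/'))
```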
- PVTv2 on COCO

Method | Backbone | Pretrain | Lr schd | Aug | box AP | mask AP | Config | Download |
---|---|---|---|---|---|---|---|---|
RetinaNet | PVTv2-b0 | ImageNet-1K | 1x | No | 37.2 | - | config | log & model |
RetinaNet | PVTv2-b1 | ImageNet-1K | 1x | No | 41.2 | - | config | log & model |
RetinaNet | PVTv2-b2-li | ImageNet-1K | 1x | No | 43.6 | - | config | log & model |
RetinaNet | PVTv2-b2 | ImageNet-1K | 1x | No | 44.6 | - | config | log & model |
RetinaNet | PVTv2-b3 | ImageNet-1K | 1x | No | 45.9 | - | config | log & model |
RetinaNet | PVTv2-b4 | ImageNet-1K | 1x | No | 46.1 | - | config | log & model |
RetinaNet | PVTv2-b5 | ImageNet-1K | 1x | No | 46.2 | - | config | log & model |
Mask R-CNN | PVTv2-b0 | ImageNet-1K | 1x | No | 38.2 | 36.2 | config | log & model |
Mask R-CNN | PVTv2-b1 | ImageNet-1K | 1x | No | 41.8 | 38.8 | config | log & model |
Mask R-CNN | PVTv2-b2-li | ImageNet-1K | 1x | No | 44.1 | 40.5 | config | log & model |
Mask R-CNN | PVTv2-b2 | ImageNet-1K | 1x | No | 45.3 | 41.2 | config | log & model |
Mask R-CNN | PVTv2-b3 | ImageNet-1K | 1x | No | 47.0 | 42.5 | config | log & model |
Mask R-CNN | PVTv2-b4 | ImageNet-1K | 1x | No | 47.5 | 42.7 | config | log & model |
Mask R-CNN | PVTv2-b5 | ImageNet-1K | 1x | No | 47.4 | 42.5 | config | log & model |

Method | Backbone | Pretrain | Lr schd | Aug | box AP | mask AP | Config | Download |
---|---|---|---|---|---|---|---|---|
Mask R-CNN | PVTv2-b0 | ImageNet-1K | 3x | Yes | 41.6 | 38.2 | config | log & model |
Mask R-CNN | PVTv2-b2 | ImageNet-1K | 3x | Yes | 47.8 | 43.1 | config | log & model |

Method | Backbone | Pretrain | Lr schd | Aug | box AP | mask AP | Config | Download |
---|---|---|---|---|---|---|---|---|
Cascade Mask R-CNN | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 50.9 | 44.0 | config | log & model |
Cascade Mask R-CNN | PVTv2-b2 | ImageNet-1K | 3x | Yes | 51.1 | 44.4 | config | log & model |
ATSS | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 48.9 | - | config | log & model |
ATSS | PVTv2-b2 | ImageNet-1K | 3x | Yes | 49.9 | - | config | log & model |
GFL | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 49.2 | - | config | log & model |
GFL | PVTv2-b2 | ImageNet-1K | 3x | Yes | 50.2 | - | config | log & model |
Sparse R-CNN | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 48.9 | - | config | log & model |
Sparse R-CNN | PVTv2-b2 | ImageNet-1K | 3x | Yes | 50.1 | - | config | log & model |

- PVTv1 on COCO

Method | Backbone | Pretrain | Lr schd | box AP | mask AP | Config | Download |
---|---|---|---|---|---|---|---|
RetinaNet | PVT-Tiny | ImageNet-1K | 1x | 36.7 | - | config | log & model |
RetinaNet (640x) | PVT-Small | ImageNet-1K | 1x | 38.7 | - | config | log & model |
RetinaNet (800x) | PVT-Small | ImageNet-1K | 1x | 40.4 | - | config | log & model |
RetinaNet | PVT-Medium | ImageNet-1K | 1x | 41.9 | - | config | log & model |
RetinaNet | PVT-Large | ImageNet-1K | 1x | 42.6 | - | config | log & model |
Mask R-CNN | PVT-Tiny | ImageNet-1K | 1x | 36.7 | 35.1 | config | log & model |
Mask R-CNN | PVT-Small | ImageNet-1K | 1x | 40.4 | 37.8 | config | log & model |
Mask R-CNN | PVT-Medium | ImageNet-1K | 1x | 42.0 | 39.0 | config | log & model |
Mask R-CNN | PVT-Large | ImageNet-1K | 1x | 42.9 | 39.5 | config | log & model |
DETR | PVT-Small | ImageNet-1K | 50ep | 34.7 | - | config | log & model |
To evaluate PVT-Small + RetinaNet (640x) on COCO val2017 on a single node with 8 GPUs, run:

```shell
dist_test.sh configs/retinanet_pvt_s_fpn_1x_coco_640.py /path/to/checkpoint_file 8 --out results.pkl --eval bbox
```
This should give:

```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.387
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.593
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.408
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.212
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.416
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.544
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.545
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.545
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.545
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.329
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.583
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.721
```
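If you want to inspect the saved detections afterwards, results.pkl can be loaded with mmcv. A minimal sketch; the exact nesting depends on the model, so treat the per-entry format described below as an assumption for a box-only detector such as RetinaNet:

```python
import mmcv

# results.pkl as written by dist_test.sh: one entry per image. For a box-only
# detector, each entry is a list of per-class arrays of shape (num_dets, 5)
# with rows [x1, y1, x2, y2, score].
results = mmcv.load('results.pkl')

print(f'{len(results)} images evaluated')
first = results[0]
num_dets = sum(len(per_class) for per_class in first)
print(f'image 0: {len(first)} classes, {num_dets} raw detections')
```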
To train PVT-Small + RetinaNet (640x) on COCO train2017 on a single node with 8 GPUs for 12 epochs, run:

```shell
dist_train.sh configs/retinanet_pvt_s_fpn_1x_coco_640.py 8
```
To run inference and visualization on a single image, run:

```shell
python demo.py demo.jpg /path/to/config_file /path/to/checkpoint_file
```
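The same thing can be done through MMDetection's Python API. A minimal sketch using the standard mmdet.apis calls; the config and checkpoint paths below are placeholders:

```python
from mmdet.apis import init_detector, inference_detector

# Build the detector from a config and trained weights (placeholder paths).
model = init_detector('configs/retinanet_pvt_s_fpn_1x_coco_640.py',
                      '/path/to/checkpoint_file', device='cuda:0')

# Run inference on one image and write a visualization to disk.
result = inference_detector(model, 'demo.jpg')
model.show_result('demo.jpg', result, score_thr=0.3, out_file='demo_result.jpg')
```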
To calculate the FLOPs and parameter count for a given config, run:

```shell
python get_flops.py configs/gfl_pvt_v2_b2_fpn_3x_mstrain_fp16.py
```
This should give:

```
Input shape: (3, 1280, 800)
Flops: 260.65 GFLOPs
Params: 33.11 M
```
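This kind of counting typically relies on mmcv's complexity counter. A minimal sketch, assuming the PVTv2 backbone is registered (i.e. the script is run from this repository); the building code here is an illustration, not the exact contents of get_flops.py:

```python
from mmcv import Config
from mmcv.cnn import get_model_complexity_info
from mmdet.models import build_detector

# Build the detector from the same config used in the command above.
cfg = Config.fromfile('configs/gfl_pvt_v2_b2_fpn_3x_mstrain_fp16.py')
cfg.model.pretrained = None  # no need for ImageNet weights just to count FLOPs
model = build_detector(cfg.model)
model.eval()

# Count FLOPs/params on a plain forward pass over one (3, 1280, 800) input.
if hasattr(model, 'forward_dummy'):
    model.forward = model.forward_dummy
flops, params = get_model_complexity_info(model, (3, 1280, 800))
print(f'Flops: {flops}\nParams: {params}')
```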
This repository is released under the Apache 2.0 license as found in the LICENSE file.