Model ZOO

This document is adapted from the configs documentation of MMYOLO. The Download links in the tables below point to the original PyTorch models. For convenience, you can instead use the ONNX versions of the YOLO series models that we have uploaded to HuggingFace.

You can download the ONNX model of your choice from the following link: https://huggingface.co/CtrlX/JetYOLO/tree/main
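As a minimal sketch, the snippet below builds a direct download URL for a file in that repository and fetches it with the Python standard library. It assumes HuggingFace's usual `resolve/<revision>/<path>` URL layout, and the file name `yolov5s.onnx` is a hypothetical placeholder, so check the repository listing for the actual file names.

```python
import shutil
import urllib.request
from pathlib import Path

REPO_ID = "CtrlX/JetYOLO"  # repository named in the link above


def build_download_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Return a direct-download URL (assumes the standard resolve/<revision>/ layout)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_model(filename: str, target_dir: str = "models") -> Path:
    """Download one file from the repository into target_dir and return its local path."""
    target = Path(target_dir) / Path(filename).name
    target.parent.mkdir(parents=True, exist_ok=True)
    url = build_download_url(filename)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    # "yolov5s.onnx" is a hypothetical file name used only for illustration.
    print(build_download_url("yolov5s.onnx"))
```

Calling `download_model(...)` with a real file name from the repository listing saves the file under `models/`.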

Note:

Model names preceded by (×) have not been exported to ONNX or uploaded to HuggingFace. To export these ONNX models yourself, refer to doc/model_convert.md.

For an explanation of why the ONNX models are divided into two categories (with EfficientNMS as the backend, and decode only), refer to doc/model_convert.md.

RTMDet

Object Detection

Model	size	Params(M)	FLOPs(G)	TRT-FP16-Latency(ms)	box AP	TTA box AP	Config	Download
RTMDet-tiny	640	4.8	8.1	0.98	41.0	42.7	config	model \| log
(×) RTMDet-tiny *	640	4.8	8.1	0.98	41.8 (+0.8)	43.2 (+0.5)	config	model \| log
RTMDet-s	640	8.89	14.8	1.22	44.6	45.8	config	model \| log
(×) RTMDet-s *	640	8.89	14.8	1.22	45.7 (+1.1)	47.3 (+1.5)	config	model \| log
RTMDet-m	640	24.71	39.27	1.62	49.3	50.9	config	model \| log
(×) RTMDet-m *	640	24.71	39.27	1.62	50.2 (+0.9)	51.9 (+1.0)	config	model \| log
RTMDet-l	640	52.3	80.23	2.44	51.4	53.1	config	model \| log
(×) RTMDet-l *	640	52.3	80.23	2.44	52.3 (+0.9)	53.7 (+0.6)	config	model \| log
RTMDet-x	640	94.86	141.67	3.10	52.8	54.2	config	model \| log

Note:

The inference speed of RTMDet is measured on an NVIDIA 3090 GPU with TensorRT 8.4.3, cuDNN 8.2.0, FP16, batch size=1, and without NMS.
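Because the latency is measured at batch size 1, it converts directly to single-stream throughput as 1000 / latency (ms). The sketch below applies this to the latency column of the table above. The helper name is illustrative, and the resulting figures exclude NMS, just as the latency does.

```python
# TRT-FP16 latency in milliseconds, copied from the RTMDet table above.
LATENCY_MS = {
    "RTMDet-tiny": 0.98,
    "RTMDet-s": 1.22,
    "RTMDet-m": 1.62,
    "RTMDet-l": 2.44,
    "RTMDet-x": 3.10,
}


def latency_to_fps(latency_ms: float) -> float:
    """Convert per-image latency (ms, batch size 1) to images per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


for name, ms in LATENCY_MS.items():
    print(f"{name}: {latency_to_fps(ms):.0f} img/s")
```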

For a fair comparison, the config of bbox postprocessing is changed to be consistent with YOLOv5/6/7 after PR#9494, bringing about 0.1~0.3% AP improvement.

TTA means Test Time Augmentation. It performs 3 multi-scale transformations on the image, followed by 2 flipping transformations (flipped and not flipped). To enable it, specify --tta when testing. See TTA for details.

* means checkpoints are trained with knowledge distillation. More details can be found in RTMDet distillation.
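The bracketed gains in the table are the distilled AP minus the baseline AP for the same model size. A small sketch that recomputes them from the table values (the data structure and function name are illustrative):

```python
from typing import Tuple

# (baseline box AP, distilled box AP, baseline TTA box AP, distilled TTA box AP),
# copied from the RTMDet table above.
RTMDET_AP = {
    "tiny": (41.0, 41.8, 42.7, 43.2),
    "s": (44.6, 45.7, 45.8, 47.3),
    "m": (49.3, 50.2, 50.9, 51.9),
    "l": (51.4, 52.3, 53.1, 53.7),
}


def distillation_gain(size: str) -> Tuple[float, float]:
    """Return (box AP gain, TTA box AP gain) for one model size, rounded to 0.1."""
    base, distilled, base_tta, distilled_tta = RTMDET_AP[size]
    return round(distilled - base, 1), round(distilled_tta - base_tta, 1)


for size in RTMDET_AP:
    gain, gain_tta = distillation_gain(size)
    print(f"RTMDet-{size}: +{gain} box AP, +{gain_tta} TTA box AP")
```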

YOLOv5

COCO

Backbone	Arch	size	Mask Refine	SyncBN	AMP	Mem (GB)	box AP	TTA box AP	Config	Download
YOLOv5-n	P5	640	No	Yes	Yes	1.5	28.0	30.7	config	model \| log
YOLOv5-n	P5	640	Yes	Yes	Yes	1.5	28.0		config	model \| log
YOLOv5u-n	P5	640	Yes	Yes	Yes				config	model \| log
YOLOv5-s	P5	640	No	Yes	Yes	2.7	37.7	40.2	config	model \| log
YOLOv5-s	P5	640	Yes	Yes	Yes	2.7	38.0 (+0.3)		config	model \| log
YOLOv5u-s	P5	640	Yes	Yes	Yes				config	model \| log
YOLOv5-m	P5	640	No	Yes	Yes	5.0	45.3	46.9	config	model \| log
YOLOv5-m	P5	640	Yes	Yes	Yes	5.0	45.3		config	model \| log
YOLOv5u-m	P5	640	Yes	Yes	Yes				config	model \| log
YOLOv5-l	P5	640	No	Yes	Yes	8.1	48.8	49.9	config	model \| log
YOLOv5-l	P5	640	Yes	Yes	Yes	8.1	49.3 (+0.5)		config	model \| log
YOLOv5u-l	P5	640	Yes	Yes	Yes				config	model \| log
YOLOv5-x	P5	640	No	Yes	Yes	12.2	50.2		config	model \| log
YOLOv5-x	P5	640	Yes	Yes	Yes	12.2	50.9 (+0.7)		config	model \| log
YOLOv5u-x	P5	640	Yes	Yes	Yes				config	model \| log
(×) YOLOv5-n	P6	1280	No	Yes	Yes	5.8	35.9		config	model \| log
(×) YOLOv5-s	P6	1280	No	Yes	Yes	10.5	44.4		config	model \| log
(×) YOLOv5-m	P6	1280	No	Yes	Yes	19.1	51.3		config	model \| log
(×) YOLOv5-l	P6	1280	No	Yes	Yes	30.5	53.7		config	model \| log

Note:

fast means that YOLOv5DetDataPreprocessor and yolov5_collate are used for data preprocessing, which is faster for training but less flexible for multitasking. We recommend the fast version of the config if you only care about object detection.

detect means that the network input is fixed to 640x640 and the post-processing thresholds are modified.

SyncBN means that SyncBN is used; AMP means training with mixed precision.

We use 8x A100 for training, and the single-GPU batch size is 16. This is different from the official code.

The performance is unstable and may fluctuate by about 0.4 mAP. In YOLOv5 COCO training, the best-performing weights may not come from the last epoch.

TTA means Test Time Augmentation. It performs 3 multi-scale transformations on the image, followed by 2 flipping transformations (flipped and not flipped). To enable it, specify --tta when testing. See TTA for details.

The Mask Refine results correspond to the performance of the weights officially released by YOLOv5. Mask Refine means refining the bbox by the mask while loading annotations and transforming after YOLOv5RandomAffine; Copy Paste means using YOLOv5CopyPaste.

YOLOv5u models use the same loss functions and split Detect head as YOLOv8 models for improved performance, but require only 300 epochs.

YOLOv6

COCO

Backbone	Arch	Size	Epoch	SyncBN	AMP	Mem (GB)	Box AP	Config	Download
YOLOv6-n	P5	640	400	Yes	Yes	6.04	36.2	config	model \| log
YOLOv6-t	P5	640	400	Yes	Yes	8.13	41.0	config	model \| log
YOLOv6-s	P5	640	400	Yes	Yes	8.88	44.0	config	model \| log
YOLOv6-m	P5	640	300	Yes	Yes	16.69	48.4	config	model \| log
YOLOv6-l	P5	640	300	Yes	Yes	20.86	51.0	config	model \| log

Note:

The official m and l models use knowledge distillation, which our version does not support yet; it will be implemented in MMRazor in the future.

The performance is unstable and may fluctuate by about 0.3 mAP.

If you need 300-epoch weights for the nano, tiny, and small models, you can train them with the 300-epoch configs we provide, or convert the official weights with the converter script.

We have noticed that a base model was recently released officially in v6. Its accuracy is lower, but it is more efficient. We will also provide a config for the base model in the future.

YOLOv7

COCO

Backbone	Arch	Size	SyncBN	AMP	Mem (GB)	Box AP	Config	Download
YOLOv7-tiny	P5	640	Yes	Yes	2.7	37.5	config	model \| log
YOLOv7-l	P5	640	Yes	Yes	10.3	50.9	config	model \| log
YOLOv7-x	P5	640	Yes	Yes	13.7	52.8	config	model \| log
YOLOv7-w	P6	1280	Yes	Yes	27.0	54.1	config	model \| log
YOLOv7-e	P6	1280	Yes	Yes	42.5	55.1	config	model \| log

Note:

In the official YOLOv7 code, the random_perspective data augmentation used when training the COCO object detection task relies on mask annotation information, which leads to higher performance. Object detection should not use mask annotations, so MMYOLO uses only box annotation information. We will use the mask annotation information in the instance segmentation task.

The performance is unstable and may fluctuate by about 0.3 mAP. The results shown above are from the best model.

If you need the YOLOv7-e2e weights, you can train them with the configs we provide, or convert the official weights with the converter script.

fast means that YOLOv5DetDataPreprocessor and yolov5_collate are used for data preprocessing, which is faster for training but less flexible for multitasking. We recommend the fast version of the config if you only care about object detection.

SyncBN means that SyncBN is used; AMP means training with mixed precision.

We use 8x A100 for training, and the single-GPU batch size is 16. This is different from the official code.
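For reference, the global batch size implied by this setup is the number of GPUs times the per-GPU batch size. A minimal sketch, assuming no gradient accumulation (the variable and function names are illustrative, not taken from the configs):

```python
NUM_GPUS = 8             # 8x A100, as stated above
BATCH_SIZE_PER_GPU = 16  # single-GPU batch size, as stated above


def global_batch_size(num_gpus: int, batch_size_per_gpu: int) -> int:
    """Total images per training step across all GPUs (no gradient accumulation)."""
    return num_gpus * batch_size_per_gpu


print(global_batch_size(NUM_GPUS, BATCH_SIZE_PER_GPU))
```

Training on fewer GPUs with the same per-GPU batch size shrinks the global batch size accordingly, which may require adjusting other hyperparameters.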

YOLOv8

COCO

Backbone	Arch	size	Mask Refine	SyncBN	AMP	Mem (GB)	box AP	TTA box AP	Config	Download
YOLOv8-n	P5	640	No	Yes	Yes	2.8	37.2		config	model \| log
YOLOv8-n	P5	640	Yes	Yes	Yes	2.5	37.4 (+0.2)	39.9	config	model \| log
YOLOv8-s	P5	640	No	Yes	Yes	4.0	44.2		config	model \| log
YOLOv8-s	P5	640	Yes	Yes	Yes	4.0	45.1 (+0.9)	46.8	config	model \| log
YOLOv8-m	P5	640	No	Yes	Yes	7.2	49.8		config	model \| log
YOLOv8-m	P5	640	Yes	Yes	Yes	7.0	50.6 (+0.8)	52.3	config	model \| log
YOLOv8-l	P5	640	No	Yes	Yes	9.8	52.1		config	model \| log
YOLOv8-l	P5	640	Yes	Yes	Yes	9.1	53.0 (+0.9)	54.4	config	model \| log
YOLOv8-x	P5	640	No	Yes	Yes	12.2	52.7		config	model \| log
YOLOv8-x	P5	640	Yes	Yes	Yes	12.4	54.0 (+1.3)	55.0	config	model \| log

Note:

We use 8x A100 for training, and the single-GPU batch size is 16. This is different from the official code, but has no effect on performance.

The performance is unstable and may fluctuate by about 0.3 mAP. In YOLOv8 COCO training, the best-performing weights may not come from the last epoch. The results shown above are from the best model.

We provide scripts to convert official weights to MMYOLO.

SyncBN means that SyncBN is used; AMP means training with mixed precision.

The Mask Refine results correspond to the performance of the weights officially released by YOLOv8. Mask Refine means refining the bbox by the mask while loading annotations and transforming after YOLOv5RandomAffine; the L and X models also use Copy Paste.

TTA means Test Time Augmentation. It performs 3 multi-scale transformations on the image, followed by 2 flipping transformations (flipped and not flipped). To enable it, specify --tta when testing. See TTA for details.
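TTA as described here combines 3 scales with 2 flip states, giving 6 augmented views per image. The sketch below only enumerates those combinations. The scale values are hypothetical placeholders and are not taken from the actual TTA pipeline.

```python
from itertools import product

# Hypothetical scales for illustration; the real values come from the TTA config.
SCALES = [(640, 640), (320, 320), (960, 960)]
FLIPS = [False, True]  # not flipped / flipped


def tta_variants(scales=SCALES, flips=FLIPS):
    """Enumerate every (scale, flip) combination applied to one image."""
    return [{"scale": scale, "flip": flip} for scale, flip in product(scales, flips)]


variants = tta_variants()
print(len(variants))
for variant in variants:
    print(variant)
```

In a real pipeline, the predictions from all views would be mapped back to the original image and merged before evaluation.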
