Benchmark

Backends

CPU: ncnn, ONNX Runtime, OpenVINO

GPU: ncnn, TensorRT, PPLNN

Latency benchmark

Platform

Ubuntu 18.04
ncnn 20211208
CUDA 11.3
TensorRT 7.2.3.4
Docker 20.10.8
NVIDIA Tesla T4 Tensor Core GPU for TensorRT

Other settings

Static graph
Batch size 1
Synchronize devices after each inference.
We report the average inference latency over 100 images from the dataset.
Warm up. For ncnn, we warm up for 30 iterations for all codebases. For the other backends, we warm up for 1010 iterations for classification and for 10 iterations for the other codebases.
Input resolution varies across datasets and codebases. All inputs are real images, except for mmagic, because its dataset is not large enough.
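
The settings above can be pictured as a simple timing loop. The following is a minimal sketch under stated assumptions, not MMDeploy's own profiler: `infer` and `sync` are hypothetical callables standing in for a backend's forward pass and its device synchronization, and the default warm-up count is just one of the values listed above.

```python
import time
from statistics import mean


def measure_latency_ms(infer, inputs, warmup=10, sync=None):
    """Return the mean per-image latency in milliseconds.

    infer  -- hypothetical callable that runs one batch-size-1 inference
    inputs -- iterable of preprocessed images (100 in the settings above)
    warmup -- number of untimed warm-up iterations
    sync   -- optional hypothetical callable that blocks until the device is idle
    """
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input")

    # Warm-up iterations are executed but not timed.
    for i in range(warmup):
        infer(inputs[i % len(inputs)])
    if sync is not None:
        sync()

    timings = []
    for x in inputs:
        start = time.perf_counter()
        infer(x)
        if sync is not None:
            sync()  # synchronize after each inference so queued work is counted
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)
```

For a GPU backend, `sync` would be whatever call waits for the device to finish; without it, asynchronous launches can make the measured time look shorter than the real latency.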

Users can test the speed directly through model profiling. The benchmark results in our environment are listed below.

mmpretrain		TensorRT(ms)					PPLNN(ms)		ncnn(ms)		Ascend(ms)
model	spatial	T4			JetsonNano2GB		Jetson TX2	T4	SnapDragon888	Adreno660	Ascend310
model	spatial	fp32	fp16	int8	fp32	fp16	fp32	fp16	fp32	fp32	fp32
ResNet	224x224	2.97	1.26	1.21	59.32	30.54	24.13	1.30	33.91	25.93	2.49
ResNeXt	224x224	4.31	1.42	1.37	88.10	49.18	37.45	1.36	133.44	69.38	-
SE-ResNet	224x224	3.41	1.66	1.51	74.59	48.78	29.62	1.91	107.84	80.85	-
ShuffleNetV2	224x224	1.37	1.19	1.13	15.26	10.23	7.37	4.69	9.55	10.66	-
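
As a worked reading of the table above, a latency can be converted into a speedup relative to fp32 and, since the batch size is 1, into an implied number of images per second. The sketch below only reuses the ResNet TensorRT numbers on T4 from the table; the helper names are illustrative, and the throughput figure assumes strictly sequential inference.

```python
# ResNet, 224x224, TensorRT on T4 (values copied from the table above, in ms).
resnet_t4_ms = {"fp32": 2.97, "fp16": 1.26, "int8": 1.21}


def speedup_over_fp32(latency_ms):
    """Speedup of each precision relative to the fp32 latency."""
    base = latency_ms["fp32"]
    return {prec: base / ms for prec, ms in latency_ms.items()}


def images_per_second(latency_ms):
    """Throughput implied by a batch-size-1 latency given in milliseconds."""
    return {prec: 1000.0 / ms for prec, ms in latency_ms.items()}


speedups = speedup_over_fp32(resnet_t4_ms)
throughput = images_per_second(resnet_t4_ms)
for prec in resnet_t4_ms:
    print(f"{prec}: {speedups[prec]:.2f}x vs fp32, {throughput[prec]:.0f} img/s")
```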

mmdet part1		TensorRT(ms)				PPLNN(ms)
model	spatial	T4			Jetson TX2	T4
model	spatial	fp32	fp16	int8	fp32	fp16
YOLOv3	320x320	14.76	24.92	24.92	-	18.07
SSD-Lite	320x320	8.84	9.21	8.04	1.28	19.72
RetinaNet	800x1344	97.09	25.79	16.88	780.48	38.34
FCOS	800x1344	84.06	23.15	17.68	-	-
FSAF	800x1344	82.96	21.02	13.50	-	30.41
Faster R-CNN	800x1344	88.08	26.52	19.14	733.81	65.40
Mask R-CNN	800x1344	104.83	58.27	-	-	86.80

mmdet part2		ncnn(ms)
model	spatial	SnapDragon888	Adreno660
model	spatial	fp32	fp32
MobileNetv2-YOLOv3	320x320	48.57	66.55
SSD-Lite	320x320	44.91	66.19
YOLOX	416x416	111.60	134.50

mmagic		TensorRT(ms)				PPLNN(ms)
model	spatial	T4			Jetson TX2	T4
model	spatial	fp32	fp16	int8	fp32	fp16
ESRGAN	32x32	12.64	12.42	12.45	-	7.67
SRCNN	32x32	0.70	0.35	0.26	58.86	0.56

mmocr		TensorRT(ms)			PPLNN(ms)	ncnn(ms)
model	spatial	T4			T4	SnapDragon888	Adreno660
model	spatial	fp32	fp16	int8	fp16	fp32	fp32
DBNet	640x640	10.70	5.62	5.00	34.84	-	-
CRNN	32x32	1.93	1.40	1.36	-	10.57	20.00

mmseg		TensorRT(ms)				PPLNN(ms)
model	spatial	T4			Jetson TX2	T4
model	spatial	fp32	fp16	int8	fp32	fp16
FCN	512x1024	128.42	23.97	18.13	1682.54	27.00
PSPNet	512x1024	119.77	24.10	16.33	1586.19	27.26
DeepLabV3	512x1024	226.75	31.80	19.85	-	36.01
DeepLabV3+	512x1024	151.25	47.03	50.38	2534.96	34.80

Performance benchmark

Users can test the performance directly by following how_to_evaluate_a_model.md. The benchmark results in our environment are listed below.

mmpretrain		PyTorch	TorchScript	ONNX Runtime	TensorRT			PPLNN	Ascend
model	metric	fp32	fp32	fp32	fp32	fp16	int8	fp16	fp32
ResNet-18	top-1	69.90	69.90	69.88	69.88	69.86	69.86	69.86	69.91
ResNet-18	top-5	89.43	89.43	89.34	89.34	89.33	89.38	89.34	89.43
ResNeXt-50	top-1	77.90	77.90	77.90	77.90	-	77.78	77.89	-
ResNeXt-50	top-5	93.66	93.66	93.66	93.66	-	93.64	93.65	-
SE-ResNet-50	top-1	77.74	77.74	77.74	77.74	77.75	77.63	77.73	-
SE-ResNet-50	top-5	93.84	93.84	93.84	93.84	93.83	93.72	93.84	-
ShuffleNetV1 1.0x	top-1	68.13	68.13	68.13	68.13	68.13	67.71	68.11	-
ShuffleNetV1 1.0x	top-5	87.81	87.81	87.81	87.81	87.81	87.58	87.80	-
ShuffleNetV2 1.0x	top-1	69.55	69.55	69.55	69.55	69.54	69.10	69.54	-
ShuffleNetV2 1.0x	top-5	88.92	88.92	88.92	88.92	88.91	88.58	88.92	-
MobileNet V2	top-1	71.86	71.86	71.86	71.86	71.87	70.91	71.84	71.87
MobileNet V2	top-5	90.42	90.42	90.42	90.42	90.40	89.85	90.41	90.42
Vision Transformer	top-1	85.43	85.43	-	85.43	85.42	-	-	85.43
Vision Transformer	top-5	97.77	97.77	-	97.77	97.76	-	-	97.77
Swin Transformer	top-1	81.18	81.18	81.18	81.18	81.18	-	-	-
Swin Transformer	top-5	95.61	95.61	95.61	95.61	95.61	-	-	-
EfficientFormer	top-1	80.46	80.45	80.46	80.46	-	-	-	-
EfficientFormer	top-5	94.99	94.98	94.99	94.99	-	-	-	-

mmdet				PyTorch	TorchScript	ONNX Runtime	TensorRT			PPLNN	Ascend	OpenVINO
model	task	dataset	metric	fp32	fp32	fp32	fp32	fp16	int8	fp16	fp32	fp32
YOLOv3	Object Detection	COCO2017	box AP	33.7	33.7	-	33.5	33.5	33.5	-	-	-
SSD	Object Detection	COCO2017	box AP	25.5	25.5	-	25.5	25.5	-	-	-	-
RetinaNet	Object Detection	COCO2017	box AP	36.5	36.4	-	36.4	36.4	36.3	36.5	36.4	-
FCOS	Object Detection	COCO2017	box AP	36.6	-	-	36.6	36.5	-	-	-	-
FSAF	Object Detection	COCO2017	box AP	37.4	37.4	-	37.4	37.4	37.2	37.4	-	-
CenterNet	Object Detection	COCO2017	box AP	25.9	26.0	26.0	26.0	25.8	-	-	-	-
YOLOX	Object Detection	COCO2017	box AP	40.5	40.3	-	40.3	40.3	29.3	-	-	-
Faster R-CNN	Object Detection	COCO2017	box AP	37.4	37.3	-	37.3	37.3	37.1	37.3	37.2	-
ATSS	Object Detection	COCO2017	box AP	39.4	-	-	39.4	39.4	-	-	-	-
Cascade R-CNN	Object Detection	COCO2017	box AP	40.4	-	-	40.4	40.4	-	40.4	-	-
GFL	Object Detection	COCO2017	box AP	40.2	-	40.2	40.2	40.0	-	-	-	-
RepPoints	Object Detection	COCO2017	box AP	37.0	-	-	36.9	-	-	-	-	-
DETR	Object Detection	COCO2017	box AP	40.1	40.1	-	40.1	40.1	-	-	-	-
Mask R-CNN	Instance Segmentation	COCO2017	box AP	38.2	38.1	-	38.1	38.1	-	38.0	-	-
Mask R-CNN	Instance Segmentation	COCO2017	mask AP	34.7	34.7	-	33.7	33.7	-	-	-	-
Swin-Transformer	Instance Segmentation	COCO2017	box AP	42.7	-	42.7	42.5	37.7	-	-	-	-
Swin-Transformer	Instance Segmentation	COCO2017	mask AP	39.3	-	39.3	39.3	35.4	-	-	-	-
SOLO	Instance Segmentation	COCO2017	mask AP	33.1	-	32.7	-	-	-	-	-	32.7
SOLOv2	Instance Segmentation	COCO2017	mask AP	34.8	-	34.5	-	-	-	-	-	34.5
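
To spot conversions that lose accuracy, each backend column can be compared with the PyTorch baseline. The snippet below is an illustrative sketch over three rows copied from the table above; the 0.5 AP threshold is an arbitrary assumption for the example, not an MMDeploy criterion.

```python
# box AP on COCO2017, copied from the table above; None marks a "-" entry.
rows = {
    "RetinaNet": {"PyTorch": 36.5, "TensorRT fp16": 36.4, "TensorRT int8": 36.3},
    "YOLOX": {"PyTorch": 40.5, "TensorRT fp16": 40.3, "TensorRT int8": 29.3},
    "FCOS": {"PyTorch": 36.6, "TensorRT fp16": 36.5, "TensorRT int8": None},
}


def accuracy_drops(rows, threshold=0.5):
    """List (model, backend, drop) where a backend falls more than `threshold` below PyTorch."""
    flagged = []
    for model, scores in rows.items():
        baseline = scores["PyTorch"]
        for backend, value in scores.items():
            if backend == "PyTorch" or value is None:
                continue  # skip the baseline itself and missing entries
            drop = round(baseline - value, 2)
            if drop > threshold:
                flagged.append((model, backend, drop))
    return flagged


for model, backend, drop in accuracy_drops(rows):
    print(f"{model} on {backend}: {drop} AP below PyTorch")
```

With these three rows, only the YOLOX int8 entry exceeds the threshold; the other differences stay within a few tenths of an AP point.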

mmagic				PyTorch	TorchScript	ONNX Runtime	TensorRT			PPLNN
model	task	dataset	metric	fp32	fp32	fp32	fp32	fp16	int8	fp16
SRCNN	Super Resolution	Set5	PSNR	28.4316	28.4120	28.4323	28.4323	28.4286	28.1995	28.4311
SRCNN	Super Resolution	Set5	SSIM	0.8099	0.8106	0.8097	0.8097	0.8096	0.7934	0.8096
ESRGAN	Super Resolution	Set5	PSNR	28.2700	28.2619	28.2592	28.2592	-	-	28.2624
ESRGAN	Super Resolution	Set5	SSIM	0.7778	0.7784	0.7764	0.7774	-	-	0.7765
ESRGAN-PSNR	Super Resolution	Set5	PSNR	30.6428	30.6306	30.6444	30.6430	-	-	27.0426
ESRGAN-PSNR	Super Resolution	Set5	SSIM	0.8559	0.8565	0.8558	0.8558	-	-	0.8557
SRGAN	Super Resolution	Set5	PSNR	27.9499	27.9252	27.9408	27.9408	-	-	27.9388
SRGAN	Super Resolution	Set5	SSIM	0.7846	0.7851	0.7839	0.7839	-	-	0.7839
SRResNet	Super Resolution	Set5	PSNR	30.2252	30.2069	30.2300	30.2300	-	-	30.2294
SRResNet	Super Resolution	Set5	SSIM	0.8491	0.8497	0.8488	0.8488	-	-	0.8488
Real-ESRNet	Super Resolution	Set5	PSNR	28.0297	-	27.7016	27.7016	-	-	27.7049
Real-ESRNet	Super Resolution	Set5	SSIM	0.8236	-	0.8122	0.8122	-	-	0.8123
EDSR	Super Resolution	Set5	PSNR	30.2223	30.2192	30.2214	30.2214	30.2211	30.1383	-
EDSR	Super Resolution	Set5	SSIM	0.8500	0.8507	0.8497	0.8497	0.8497	0.8469	-

mmocr				PyTorch	TorchScript	ONNX Runtime	TensorRT			PPLNN	OpenVINO
model	task	dataset	metric	fp32	fp32	fp32	fp32	fp16	int8	fp16	fp32
DBNet*	TextDetection	ICDAR2015	recall	0.7310	0.7308	0.7304	0.7198	0.7179	0.7111	0.7304	0.7309
			precision	0.8714	0.8718	0.8714	0.8677	0.8674	0.8688	0.8718	0.8714
			hmean	0.7950	0.7949	0.7950	0.7868	0.7856	0.7821	0.7949	0.7950
DBNetpp	TextDetection	ICDAR2015	recall	0.8209	0.8209	0.8209	0.8199	0.8204	0.8204	-	0.8209
			precision	0.9079	0.9079	0.9079	0.9117	0.9117	0.9142	-	0.9079
			hmean	0.8622	0.8622	0.8622	0.8634	0.8637	0.8648	-	0.8622
PSENet	TextDetection	ICDAR2015	recall	0.7526	0.7526	0.7526	0.7526	0.7520	0.7496	-	0.7526
			precision	0.8669	0.8669	0.8669	0.8669	0.8668	0.8550	-	0.8669
			hmean	0.8057	0.8057	0.8057	0.8057	0.8054	0.7989	-	0.8057
PANet	TextDetection	ICDAR2015	recall	0.7401	0.7401	0.7401	0.7357	0.7366	-	-	0.7401
			precision	0.8601	0.8601	0.8601	0.8570	0.8586	-	-	0.8601
			hmean	0.7955	0.7955	0.7955	0.7917	0.7930	-	-	0.7955
TextSnake	TextDetection	CTW1500	recall	0.8052	0.8052	0.8052	0.8055	-	-	-	-
			precision	0.8535	0.8535	0.8535	0.8538	-	-	-	-
			hmean	0.8286	0.8286	0.8286	0.8290	-	-	-	-
MaskRCNN	TextDetection	ICDAR2015	recall	0.7766	0.7766	0.7766	0.7766	0.7761	0.7670	-	-
			precision	0.8644	0.8644	0.8644	0.8644	0.8630	0.8705	-	-
			hmean	0.8182	0.8182	0.8182	0.8182	0.8172	0.8155	-	-
CRNN	TextRecognition	IIIT5K	acc	0.8067	0.8067	0.8067	0.8067	0.8063	0.8067	0.8067	-
SAR	TextRecognition	IIIT5K	acc	0.9517	-	0.9287	-	-	-	-	-
SATRN	TextRecognition	IIIT5K	acc	0.9470	0.9487	0.9487	0.9487	0.9483	0.9483	-	-
ABINet	TextRecognition	IIIT5K	acc	0.9603	0.9563	0.9563	0.9573	0.9507	0.9510	-	-
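
The hmean rows in the table above are the harmonic mean of precision and recall. As a small check, the sketch below recomputes it from the rounded DBNet* PyTorch entries; because the reported values may have been computed from unrounded precision and recall, the last digit will not necessarily agree for every row.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DBNet* PyTorch fp32 column: recall 0.7310, precision 0.8714.
print(f"{hmean(0.8714, 0.7310):.4f}")
```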

mmseg			PyTorch	TorchScript	ONNX Runtime	TensorRT			PPLNN	Ascend
model	dataset	metric	fp32	fp32	fp32	fp32	fp16	int8	fp16	fp32
FCN	Cityscapes	mIoU	72.25	72.36	-	72.36	72.35	74.19	72.35	72.35
PSPNet	Cityscapes	mIoU	78.55	78.66	-	78.26	78.24	77.97	78.09	78.67
DeepLabV3	Cityscapes	mIoU	79.09	79.12	-	79.12	79.12	78.96	79.12	79.06
DeepLabV3+	Cityscapes	mIoU	79.61	79.60	-	79.60	79.60	79.43	79.60	79.51
Fast-SCNN	Cityscapes	mIoU	70.96	70.96	-	70.93	70.92	66.00	70.92	-
UNet	Cityscapes	mIoU	69.10	-	-	69.10	69.10	68.95	-	-
ANN	Cityscapes	mIoU	77.40	-	-	77.32	77.32	-	-	-
APCNet	Cityscapes	mIoU	77.40	-	-	77.32	77.32	-	-	-
BiSeNetV1	Cityscapes	mIoU	74.44	-	-	74.44	74.43	-	-	-
BiSeNetV2	Cityscapes	mIoU	73.21	-	-	73.21	73.21	-	-	-
CGNet	Cityscapes	mIoU	68.25	-	-	68.27	68.27	-	-	-
EMANet	Cityscapes	mIoU	77.59	-	-	77.59	77.60	-	-	-
EncNet	Cityscapes	mIoU	75.67	-	-	75.66	75.66	-	-	-
ERFNet	Cityscapes	mIoU	71.08	-	-	71.08	71.07	-	-	-
FastFCN	Cityscapes	mIoU	79.12	-	-	79.12	79.12	-	-	-
GCNet	Cityscapes	mIoU	77.69	-	-	77.69	77.69	-	-	-
ICNet	Cityscapes	mIoU	76.29	-	-	76.36	76.36	-	-	-
ISANet	Cityscapes	mIoU	78.49	-	-	78.49	78.49	-	-	-
OCRNet	Cityscapes	mIoU	74.30	-	-	73.66	73.67	-	-	-
PointRend	Cityscapes	mIoU	76.47	76.47	-	76.41	76.42	-	-	-
Semantic FPN	Cityscapes	mIoU	74.52	-	-	74.52	74.52	-	-	-
STDC	Cityscapes	mIoU	75.10	-	-	75.10	75.10	-	-	-
STDC	Cityscapes	mIoU	77.17	-	-	77.17	77.17	-	-	-
UPerNet	Cityscapes	mIoU	77.10	-	-	77.19	77.18	-	-	-
Segmenter	ADE20K	mIoU	44.32	44.29	44.29	44.29	43.34	43.35	-	-

mmpose				PyTorch	ONNX Runtime	TensorRT		PPLNN	OpenVINO
model	task	dataset	metric	fp32	fp32	fp32	fp16	fp16	fp32
HRNet	Pose Detection	COCO	AP	0.748	0.748	0.748	0.748	-	0.748
HRNet	Pose Detection	COCO	AR	0.802	0.802	0.802	0.802	-	0.802
LiteHRNet	Pose Detection	COCO	AP	0.663	0.663	0.663	-	-	0.663
LiteHRNet	Pose Detection	COCO	AR	0.728	0.728	0.728	-	-	0.728
MSPN	Pose Detection	COCO	AP	0.762	0.762	0.762	0.762	-	0.762
MSPN	Pose Detection	COCO	AR	0.825	0.825	0.825	0.825	-	0.825
Hourglass	Pose Detection	COCO	AP	0.717	0.717	0.717	0.717	-	0.717
Hourglass	Pose Detection	COCO	AR	0.774	0.774	0.774	0.774	-	0.774
SimCC	Pose Detection	COCO	AP	0.607	-	0.608	-	-	-
SimCC	Pose Detection	COCO	AR	0.668	-	0.672	-	-	-

mmrotate				PyTorch	ONNX Runtime	TensorRT		PPLNN	OpenVINO
model	task	dataset	metric	fp32	fp32	fp32	fp16	fp16	fp32
RotatedRetinaNet	Rotated Detection	DOTA-v1.0	mAP	0.698	0.698	0.698	0.697	-	-
Oriented RCNN	Rotated Detection	DOTA-v1.0	mAP	0.756	0.756	0.758	0.730	-	-
GlidingVertex	Rotated Detection	DOTA-v1.0	mAP	0.732	-	0.733	0.731	-	-
RoI Transformer	Rotated Detection	DOTA-v1.0	mAP	0.761	-	0.758	-	-	-

mmaction2				PyTorch	ONNX Runtime	TensorRT		PPLNN	OpenVINO
model	task	dataset	metric	fp32	fp32	fp32	fp16	fp16	fp32
TSN	Recognition	Kinetics-400	top-1	69.71	-	69.71	-	-	-
TSN	Recognition	Kinetics-400	top-5	88.75	-	88.75	-	-	-
SlowFast	Recognition	Kinetics-400	top-1	74.45	-	75.62	-	-	-
SlowFast	Recognition	Kinetics-400	top-5	91.55	-	92.10	-	-	-

Notes

Because some datasets contain images of varying resolutions in codebases such as MMDet, the speed benchmark is obtained with static configs in MMDeploy, while the performance benchmark is obtained with dynamic ones.
Some TensorRT int8 performance benchmarks require NVIDIA cards with Tensor Cores; without them, performance drops heavily.
DBNet uses the interpolate mode nearest in the neck of the model, for which TensorRT-7 applies a quite different strategy from PyTorch. To make the repository compatible with TensorRT-7, we rewrite the neck to use the interpolate mode bilinear, which improves final detection performance. To match the PyTorch performance, TensorRT-8+ is recommended, as its interpolation methods are all the same as PyTorch's.
Mask AP of Mask R-CNN drops by 1% on the backend. The main reason is that the predicted masks are interpolated directly to the original image in PyTorch, whereas in other backends they are first interpolated to the preprocessed input image of the model and then to the original image.
MMPose models are tested with flip_test explicitly set to False in model configs.
Some models may lose accuracy in fp16 mode. In that case, adjust the model to avoid value overflow.
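
The fp16 note above comes down to the limited range of half precision, whose largest finite value is 65504. The sketch below illustrates that limit with the standard library's `struct` module as a stand-in for a real backend; the helper names are hypothetical. Note that `struct` raises an error on overflow, whereas fp16 arithmetic on hardware typically produces `inf`. How to adjust a particular model is model-specific and not shown here.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(value):
    """Round-trip a Python float through half precision using the stdlib."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_in_fp16(value):
    """Return True if `value` can be stored as a finite fp16 number."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


print(to_fp16(FP16_MAX))      # the maximum is still representable
print(fits_in_fp16(70000.0))  # an activation of this size would overflow
print(to_fp16(0.1))           # small values survive, with reduced precision
```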