Benchmark

Backends

CPU: ncnn, ONNXRuntime, OpenVINO

GPU: ncnn, TensorRT, PPLNN

Latency benchmark

Platform

  • Ubuntu 18.04

  • ncnn 20211208

  • CUDA 11.3

  • TensorRT 7.2.3.4

  • Docker 20.10.8

  • NVIDIA Tesla T4 Tensor Core GPU for TensorRT

Other settings

  • Static graph

  • Batch size 1

  • Synchronize devices after each inference.

  • Latency is reported as the average inference time over 100 images from the dataset.

  • Warm-up. For ncnn, we warm up for 30 iterations for all codebases. For the other backends, we warm up for 1010 iterations for classification and 10 iterations for the other codebases.

  • Input resolution varies across datasets and codebases. All inputs are real images, except for mmediting, whose dataset is not large enough.
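The settings above (warm-up, device synchronization after each inference, averaging over the test images) can be sketched as a small timing loop. This is a minimal illustration and not the profiler MMDeploy ships: `measure_latency_ms`, `infer`, and `synchronize` are hypothetical names, and the stand-in workload replaces a real deployed model.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Optional


def measure_latency_ms(
    infer: Callable[[object], object],
    inputs: Iterable[object],
    warmup_iters: int = 10,
    synchronize: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean per-input latency in milliseconds, excluding warm-up."""
    samples = list(inputs)
    if not samples:
        raise ValueError("need at least one input")
    sync = synchronize or (lambda: None)  # no-op for CPU-only backends

    # Warm-up runs are executed but not timed.
    for i in range(warmup_iters):
        infer(samples[i % len(samples)])
    sync()

    timings = []
    for sample in samples:
        start = time.perf_counter()
        infer(sample)
        sync()  # wait for the device before stopping the clock
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload (assumption); a real run would call the backend model.
    def fake_model(_sample: object) -> int:
        return sum(range(1000))

    latency = measure_latency_ms(fake_model, range(100), warmup_iters=10)
    print(f"mean latency: {latency:.4f} ms")
```

On a CUDA device, one would typically pass a callable such as `torch.cuda.synchronize` as `synchronize`, since GPU kernels are launched asynchronously and the clock would otherwise stop too early.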

Users can measure speed directly through model profiling. The results below were obtained in our environment.

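mmcls (latency in ms)

| model | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT JetsonNano2GB fp32 | TensorRT JetsonNano2GB fp16 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 | ncnn SnapDragon888 fp32 | ncnn Adreno660 fp32 | Ascend Ascend310 fp32 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet | 224x224 | 2.97 | 1.26 | 1.21 | 59.32 | 30.54 | 24.13 | 1.30 | 33.91 | 25.93 | 2.49 |
| ResNeXt | 224x224 | 4.31 | 1.42 | 1.37 | 88.10 | 49.18 | 37.45 | 1.36 | 133.44 | 69.38 | - |
| SE-ResNet | 224x224 | 3.41 | 1.66 | 1.51 | 74.59 | 48.78 | 29.62 | 1.91 | 107.84 | 80.85 | - |
| ShuffleNetV2 | 224x224 | 1.37 | 1.19 | 1.13 | 15.26 | 10.23 | 7.37 | 4.69 | 9.55 | 10.66 | - |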
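
mmdet part1 (latency in ms)

| model | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 |
|---|---|---|---|---|---|---|
| YOLOv3 | 320x320 | 14.76 | 24.92 | 24.92 | - | 18.07 |
| SSD-Lite | 320x320 | 8.84 | 9.21 | 8.04 | 1.28 | 19.72 |
| RetinaNet | 800x1344 | 97.09 | 25.79 | 16.88 | 780.48 | 38.34 |
| FCOS | 800x1344 | 84.06 | 23.15 | 17.68 | - | - |
| FSAF | 800x1344 | 82.96 | 21.02 | 13.50 | - | 30.41 |
| Faster R-CNN | 800x1344 | 88.08 | 26.52 | 19.14 | 733.81 | 65.40 |
| Mask R-CNN | 800x1344 | 104.83 | 58.27 | - | - | 86.80 |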

mmdet part2 (latency in ms)

| model | spatial | ncnn SnapDragon888 fp32 | ncnn Adreno660 fp32 |
|---|---|---|---|
| MobileNetv2-YOLOv3 | 320x320 | 48.57 | 66.55 |
| SSD-Lite | 320x320 | 44.91 | 66.19 |
| YOLOX | 416x416 | 111.60 | 134.50 |
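
mmedit (latency in ms)

| model | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 |
|---|---|---|---|---|---|---|
| ESRGAN | 32x32 | 12.64 | 12.42 | 12.45 | - | 7.67 |
| SRCNN | 32x32 | 0.70 | 0.35 | 0.26 | 58.86 | 0.56 |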
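
mmocr (latency in ms)

| model | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | PPLNN T4 fp16 | ncnn SnapDragon888 fp32 | ncnn Adreno660 fp32 |
|---|---|---|---|---|---|---|---|
| DBNet | 640x640 | 10.70 | 5.62 | 5.00 | 34.84 | - | - |
| CRNN | 32x32 | 1.93 | 1.40 | 1.36 | - | 10.57 | 20.00 |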
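
mmseg (latency in ms)

| model | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 |
|---|---|---|---|---|---|---|
| FCN | 512x1024 | 128.42 | 23.97 | 18.13 | 1682.54 | 27.00 |
| PSPNet | 1x3x512x1024 | 119.77 | 24.10 | 16.33 | 1586.19 | 27.26 |
| DeepLabV3 | 512x1024 | 226.75 | 31.80 | 19.85 | - | 36.01 |
| DeepLabV3+ | 512x1024 | 151.25 | 47.03 | 50.38 | 2534.96 | 34.80 |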
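
When comparing entries in the latency tables, it can help to convert them into a speedup ratio or an approximate throughput. The sketch below uses the ResNet 224x224 TensorRT T4 figures from the mmcls table (2.97 ms at fp32, 1.26 ms at fp16). The helper names are hypothetical, and the throughput figure assumes inferences run back to back at batch size 1 with no other overhead, which a real pipeline will not fully achieve.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized latency is than the baseline."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


def throughput_per_second(latency_ms: float, batch_size: int = 1) -> float:
    """Estimate images per second from a per-batch latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return batch_size * 1000.0 / latency_ms


if __name__ == "__main__":
    fp32_ms, fp16_ms = 2.97, 1.26  # ResNet, TensorRT on T4, from the mmcls table
    print(f"fp16 speedup over fp32: {speedup(fp32_ms, fp16_ms):.2f}x")
    print(f"fp32 throughput estimate: {throughput_per_second(fp32_ms):.1f} img/s")
```

For these two values the ratio is roughly 2.36x, and the fp32 estimate is roughly 337 images per second.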

Performance benchmark

Users can evaluate model performance directly by following how_to_evaluate_a_model.md. The results below were obtained in our environment.

mmcls

| model | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | Ascend fp32 |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | top-1 | 69.90 | 69.90 | 69.88 | 69.88 | 69.86 | 69.86 | 69.86 | 69.91 |
| | top-5 | 89.43 | 89.43 | 89.34 | 89.34 | 89.33 | 89.38 | 89.34 | 89.43 |
| ResNeXt-50 | top-1 | 77.90 | 77.90 | 77.90 | 77.90 | - | 77.78 | 77.89 | - |
| | top-5 | 93.66 | 93.66 | 93.66 | 93.66 | - | 93.64 | 93.65 | - |
| SE-ResNet-50 | top-1 | 77.74 | 77.74 | 77.74 | 77.74 | 77.75 | 77.63 | 77.73 | - |
| | top-5 | 93.84 | 93.84 | 93.84 | 93.84 | 93.83 | 93.72 | 93.84 | - |
| ShuffleNetV1 1.0x | top-1 | 68.13 | 68.13 | 68.13 | 68.13 | 68.13 | 67.71 | 68.11 | - |
| | top-5 | 87.81 | 87.81 | 87.81 | 87.81 | 87.81 | 87.58 | 87.80 | - |
| ShuffleNetV2 1.0x | top-1 | 69.55 | 69.55 | 69.55 | 69.55 | 69.54 | 69.10 | 69.54 | - |
| | top-5 | 88.92 | 88.92 | 88.92 | 88.92 | 88.91 | 88.58 | 88.92 | - |
| MobileNet V2 | top-1 | 71.86 | 71.86 | 71.86 | 71.86 | 71.87 | 70.91 | 71.84 | 71.87 |
| | top-5 | 90.42 | 90.42 | 90.42 | 90.42 | 90.40 | 89.85 | 90.41 | 90.42 |
| Vision Transformer | top-1 | 85.43 | 85.43 | - | 85.43 | 85.42 | - | - | 85.43 |
| | top-5 | 97.77 | 97.77 | - | 97.77 | 97.76 | - | - | 97.77 |
| Swin Transformer | top-1 | 81.18 | 81.18 | 81.18 | 81.18 | 81.18 | - | - | |
| | top-5 | 95.61 | 95.61 | 95.61 | 95.61 | 95.61 | - | - | |

mmdet

| model | task | dataset | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | Ascend fp32 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOV3 | Object Detection | COCO2017 | box AP | 33.7 | 33.7 | - | 33.5 | 33.5 | 33.5 | - | - |
| SSD | Object Detection | COCO2017 | box AP | 25.5 | 25.5 | - | 25.5 | 25.5 | - | - | - |
| RetinaNet | Object Detection | COCO2017 | box AP | 36.5 | 36.4 | - | 36.4 | 36.4 | 36.3 | 36.5 | 36.4 |
| FCOS | Object Detection | COCO2017 | box AP | 36.6 | - | - | 36.6 | 36.5 | - | - | - |
| FSAF | Object Detection | COCO2017 | box AP | 37.4 | 37.4 | - | 37.4 | 37.4 | 37.2 | 37.4 | - |
| YOLOX | Object Detection | COCO2017 | box AP | 40.5 | 40.3 | - | 40.3 | 40.3 | 29.3 | - | - |
| Faster R-CNN | Object Detection | COCO2017 | box AP | 37.4 | 37.3 | - | 37.3 | 37.3 | 37.1 | 37.3 | 37.2 |
| ATSS | Object Detection | COCO2017 | box AP | 39.4 | - | - | 39.4 | 39.4 | - | - | - |
| Cascade R-CNN | Object Detection | COCO2017 | box AP | 40.4 | - | - | 40.4 | 40.4 | - | 40.4 | - |
| GFL | Object Detection | COCO2017 | box AP | 40.2 | - | 40.2 | 40.2 | 40.0 | - | - | - |
| RepPoints | Object Detection | COCO2017 | box AP | 37.0 | - | - | 36.9 | - | - | - | - |
| DETR | Object Detection | COCO2017 | box AP | 40.1 | 40.1 | - | 40.1 | 40.1 | - | - | |
| Mask R-CNN | Instance Segmentation | COCO2017 | box AP | 38.2 | 38.1 | - | 38.1 | 38.1 | - | 38.0 | - |
| | | | mask AP | 34.7 | 34.7 | - | 33.7 | 33.7 | - | - | - |
| Swin-Transformer | Instance Segmentation | COCO2017 | box AP | 42.7 | - | 42.7 | 42.5 | 37.7 | - | - | - |
| | | | mask AP | 39.3 | - | 39.3 | 39.3 | 35.4 | - | - | - |

mmedit

| model | task | dataset | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | ncnn fp32 | ncnn int8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRCNN | Super Resolution | Set5 | PSNR | 28.4316 | 28.4120 | 28.4323 | 28.4323 | 28.4286 | 28.1995 | 28.4311 | - | - |
| | | | SSIM | 0.8099 | 0.8106 | 0.8097 | 0.8097 | 0.8096 | 0.7934 | 0.8096 | - | - |
| ESRGAN | Super Resolution | Set5 | PSNR | 28.2700 | 28.2619 | 28.2592 | 28.2592 | - | - | 28.2624 | - | - |
| | | | SSIM | 0.7778 | 0.7784 | 0.7764 | 0.7774 | - | - | 0.7765 | - | - |
| ESRGAN-PSNR | Super Resolution | Set5 | PSNR | 30.6428 | 30.6306 | 30.6444 | 30.6430 | - | - | 27.0426 | - | - |
| | | | SSIM | 0.8559 | 0.8565 | 0.8558 | 0.8558 | - | - | 0.8557 | - | - |
| SRGAN | Super Resolution | Set5 | PSNR | 27.9499 | 27.9252 | 27.9408 | 27.9408 | - | - | 27.9388 | - | - |
| | | | SSIM | 0.7846 | 0.7851 | 0.7839 | 0.7839 | - | - | 0.7839 | - | - |
| SRResNet | Super Resolution | Set5 | PSNR | 30.2252 | 30.2069 | 30.2300 | 30.2300 | - | - | 30.2294 | - | - |
| | | | SSIM | 0.8491 | 0.8497 | 0.8488 | 0.8488 | - | - | 0.8488 | - | - |
| Real-ESRNet | Super Resolution | Set5 | PSNR | 28.0297 | - | 27.7016 | 27.7016 | - | - | 27.7049 | - | - |
| | | | SSIM | 0.8236 | - | 0.8122 | 0.8122 | - | - | 0.8123 | - | - |
| EDSRx4 | Super Resolution | Set5 | PSNR | 30.2223 | 30.2192 | 30.2214 | 30.2214 | 30.2211 | 30.1383 | - | 30.2194 | 29.9340 |
| | | | SSIM | 0.8500 | 0.8507 | 0.8497 | 0.8497 | 0.8497 | 0.8469 | - | 0.8498 | 0.8409 |
| EDSRx2 | Super Resolution | Set5 | PSNR | 35.7592 | - | - | - | - | - | - | 35.7733 | 35.4266 |
| | | | SSIM | 0.9372 | - | - | - | - | - | - | 0.9365 | 0.9334 |

mmocr

| model | task | dataset | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | OpenVINO fp32 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DBNet* | Text Detection | ICDAR2015 | recall | 0.7310 | 0.7308 | 0.7304 | 0.7198 | 0.7179 | 0.7111 | 0.7304 | 0.7309 |
| | | | precision | 0.8714 | 0.8718 | 0.8714 | 0.8677 | 0.8674 | 0.8688 | 0.8718 | 0.8714 |
| | | | hmean | 0.7950 | 0.7949 | 0.7950 | 0.7868 | 0.7856 | 0.7821 | 0.7949 | 0.7950 |
| PSENet | Text Detection | ICDAR2015 | recall | 0.7526 | 0.7526 | 0.7526 | 0.7526 | 0.7520 | 0.7496 | - | 0.7526 |
| | | | precision | 0.8669 | 0.8669 | 0.8669 | 0.8669 | 0.8668 | 0.8550 | - | 0.8669 |
| | | | hmean | 0.8057 | 0.8057 | 0.8057 | 0.8057 | 0.8054 | 0.7989 | - | 0.8057 |
| PANet | Text Detection | ICDAR2015 | recall | 0.7401 | 0.7401 | 0.7401 | 0.7357 | 0.7366 | - | - | 0.7401 |
| | | | precision | 0.8601 | 0.8601 | 0.8601 | 0.8570 | 0.8586 | - | - | 0.8601 |
| | | | hmean | 0.7955 | 0.7955 | 0.7955 | 0.7917 | 0.7930 | - | - | 0.7955 |
| CRNN | Text Recognition | IIIT5K | acc | 0.8067 | 0.8067 | 0.8067 | 0.8067 | 0.8063 | 0.8067 | 0.8067 | - |
| SAR | Text Recognition | IIIT5K | acc | 0.9517 | - | 0.9287 | - | - | - | - | - |
| SATRN | Text Recognition | IIIT5K | acc | 0.9470 | 0.9487 | 0.9487 | 0.9487 | 0.9483 | 0.9483 | - | - |

mmseg

| model | dataset | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | Ascend fp32 |
|---|---|---|---|---|---|---|---|---|---|---|
| FCN | Cityscapes | mIoU | 72.25 | 72.36 | - | 72.36 | 72.35 | 74.19 | 72.35 | 72.35 |
| PSPNet | Cityscapes | mIoU | 78.55 | 78.66 | - | 78.26 | 78.24 | 77.97 | 78.09 | 78.67 |
| DeepLabV3 | Cityscapes | mIoU | 79.09 | 79.12 | - | 79.12 | 79.12 | 78.96 | 79.12 | 79.06 |
| DeepLabV3+ | Cityscapes | mIoU | 79.61 | 79.60 | - | 79.60 | 79.60 | 79.43 | 79.60 | 79.51 |
| Fast-SCNN | Cityscapes | mIoU | 70.96 | 70.96 | - | 70.93 | 70.92 | 66.00 | 70.92 | - |
| UNet | Cityscapes | mIoU | 69.10 | - | - | 69.10 | 69.10 | 68.95 | - | - |
| ANN | Cityscapes | mIoU | 77.40 | - | - | 77.32 | 77.32 | - | - | - |
| APCNet | Cityscapes | mIoU | 77.40 | - | - | 77.32 | 77.32 | - | - | - |
| BiSeNetV1 | Cityscapes | mIoU | 74.44 | - | - | 74.44 | 74.43 | - | - | - |
| BiSeNetV2 | Cityscapes | mIoU | 73.21 | - | - | 73.21 | 73.21 | - | - | - |
| CGNet | Cityscapes | mIoU | 68.25 | - | - | 68.27 | 68.27 | - | - | - |
| EMANet | Cityscapes | mIoU | 77.59 | - | - | 77.59 | 77.6 | - | - | - |
| EncNet | Cityscapes | mIoU | 75.67 | - | - | 75.66 | 75.66 | - | - | - |
| ERFNet | Cityscapes | mIoU | 71.08 | - | - | 71.08 | 71.07 | - | - | - |
| FastFCN | Cityscapes | mIoU | 79.12 | - | - | 79.12 | 79.12 | - | - | - |
| GCNet | Cityscapes | mIoU | 77.69 | - | - | 77.69 | 77.69 | - | - | - |
| ICNet | Cityscapes | mIoU | 76.29 | - | - | 76.36 | 76.36 | - | - | - |
| ISANet | Cityscapes | mIoU | 78.49 | - | - | 78.49 | 78.49 | - | - | - |
| OCRNet | Cityscapes | mIoU | 74.30 | - | - | 73.66 | 73.67 | - | - | - |
| PointRend | Cityscapes | mIoU | 76.47 | 76.47 | - | 76.41 | 76.42 | - | - | - |
| Semantic FPN | Cityscapes | mIoU | 74.52 | - | - | 74.52 | 74.52 | - | - | - |
| STDC | Cityscapes | mIoU | 75.10 | - | - | 75.10 | 75.10 | - | - | - |
| STDC | Cityscapes | mIoU | 77.17 | - | - | 77.17 | 77.17 | - | - | - |
| UPerNet | Cityscapes | mIoU | 77.10 | - | - | 77.19 | 77.18 | - | - | - |
| Segmenter | ADE20K | mIoU | 44.32 | 44.29 | 44.29 | 44.29 | 43.34 | 43.35 | - | - |
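
mmpose

| model | task | dataset | metric | PyTorch fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | PPLNN fp16 | OpenVINO fp32 |
|---|---|---|---|---|---|---|---|---|---|
| HRNet | Pose Detection | COCO | AP | 0.748 | 0.748 | 0.748 | 0.748 | - | 0.748 |
| | | | AR | 0.802 | 0.802 | 0.802 | 0.802 | - | 0.802 |
| LiteHRNet | Pose Detection | COCO | AP | 0.663 | 0.663 | 0.663 | - | - | 0.663 |
| | | | AR | 0.728 | 0.728 | 0.728 | - | - | 0.728 |
| MSPN | Pose Detection | COCO | AP | 0.762 | 0.762 | 0.762 | 0.762 | - | 0.762 |
| | | | AR | 0.825 | 0.825 | 0.825 | 0.825 | - | 0.825 |
| Hourglass | Pose Detection | COCO | AP | 0.717 | 0.717 | 0.717 | 0.717 | - | 0.717 |
| | | | AR | 0.774 | 0.774 | 0.774 | 0.774 | - | 0.774 |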

mmrotate

| model | task | dataset | metric | PyTorch fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | PPLNN fp16 | OpenVINO fp32 |
|---|---|---|---|---|---|---|---|---|---|
| RotatedRetinaNet | Rotated Detection | DOTA-v1.0 | mAP | 0.698 | 0.698 | 0.698 | 0.697 | - | - |
| Oriented RCNN | Rotated Detection | DOTA-v1.0 | mAP | 0.756 | 0.756 | 0.758 | 0.730 | - | - |
| GlidingVertex | Rotated Detection | DOTA-v1.0 | mAP | 0.732 | - | 0.733 | 0.731 | - | - |
| RoI Transformer | Rotated Detection | DOTA-v1.0 | mAP | 0.761 | - | 0.758 | - | - | - |
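
mmaction2

| model | task | dataset | metric | PyTorch fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | PPLNN fp16 | OpenVINO fp32 |
|---|---|---|---|---|---|---|---|---|---|
| TSN | Recognition | Kinetics-400 | top-1 | 69.71 | - | 69.71 | - | - | - |
| | | | top-5 | 88.75 | - | 88.75 | - | - | - |
| SlowFast | Recognition | Kinetics-400 | top-1 | 74.45 | - | 75.62 | - | - | - |
| | | | top-5 | 91.55 | - | 92.10 | - | - | - |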

Notes

  • Because some datasets in codebases such as MMDet contain images of varying resolution, the speed benchmark is obtained with static configs in MMDeploy, while the performance benchmark is obtained with dynamic ones.

  • Some TensorRT int8 performance benchmarks require NVIDIA cards with Tensor Cores; otherwise, performance drops heavily.

  • DBNet uses the interpolate mode nearest in the neck of the model, for which TensorRT-7 applies a strategy quite different from PyTorch's. To make the repository compatible with TensorRT-7, we rewrite the neck to use the interpolate mode bilinear, which improves final detection performance. To match PyTorch performance, TensorRT-8+ is recommended, since its interpolate methods all behave the same as PyTorch's.

  • Mask AP of Mask R-CNN drops by about 1% on the backend. The main reason is that PyTorch interpolates the predicted masks directly to the original image, while other backends first interpolate them to the preprocessed model input and then to the original image.

  • MMPose models are tested with flip_test explicitly set to False in model configs.

  • Some models might get low accuracy in fp16 mode. Please adjust the model to avoid value overflow.
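
To illustrate the overflow mentioned in the last note: the largest finite fp16 value is 65504, so any activation or intermediate result beyond that magnitude cannot be represented in half precision. The following is a minimal sketch using only Python's standard library. The helper names and the sample values are hypothetical, and a real check would inspect the tensors of the deployed model rather than a plain list of floats.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value: float) -> bool:
    """Return True if a finite `value` can be packed as fp16 without overflow."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except OverflowError:
        return False
    return True


def overflowing(values):
    """Return the values that would overflow when stored as fp16."""
    return [v for v in values if not fits_in_fp16(v)]


if __name__ == "__main__":
    # Hypothetical activation values, for illustration only.
    activations = [1.0, 300.0, FP16_MAX, 1e5, -1e5]
    print("values that overflow fp16:", overflowing(activations))
```

If such values appear, common remedies include rescaling the offending layer or keeping it in fp32 while the rest of the model runs in fp16.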
