GPU-level YOLOv3 Performance on CPUs

Our sparsified YOLOv3 object detection model and recipe are ready for use on edge devices or cost-effective servers. Read our blog to see the results and how much sparsification (pruning plus quantization) helps achieve GPU-class performance on commodity CPUs.

We show that by combining the robust YOLO training framework from Ultralytics with SparseML’s sparsification recipes, it is straightforward to create highly pruned, INT8-quantized YOLO models that deliver more than a 6x performance increase over state-of-the-art PyTorch and ONNX Runtime CPU implementations.
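
To illustrate the recipe-driven workflow, here is a minimal sketch of applying a SparseML recipe inside a standard PyTorch training loop. The model, data, and recipe path (recipe.yaml) are placeholders, not the actual YOLOv3 setup; in practice the module and dataloader would come from the Ultralytics training framework.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sparseml.pytorch.optim import ScheduledModifierManager

# Stand-in model and data; placeholders for the YOLOv3 module and dataloader.
model = torch.nn.Linear(128, 10)
data = TensorDataset(torch.randn(64, 128), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# Load the sparsification recipe (hypothetical path) and let SparseML wrap
# the optimizer so pruning and quantization modifiers fire on schedule.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(loader))

# Train as usual; the recipe drives sparsity and quantization-aware training.
for epoch in range(int(manager.max_epochs)):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

manager.finalize(model)  # remove modifier hooks once training completes
```

The key design point is that the training loop itself is untouched: all pruning schedules and quantization settings live in the recipe file, which is why the same loop can produce dense, pruned, or sparse-quantized models.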

In addition, we show that model sparsification (pruning and quantization) doesn’t have to be a daunting task when using Neural Magic’s open-source tools and recipe-driven approach. We’d love for you to try replicating our benchmarks and share any feedback by responding to this email.

[Chart: sparse-quantized YOLOv3 throughput performance on an AWS c5.12xlarge CPU instance]
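
To replicate throughput measurements like these, here is a minimal sketch using the DeepSparse engine’s Python API on a sparse-quantized ONNX export. The model path, batch size, and 640x640 input shape are assumptions for illustration, not our exact benchmark configuration.

```python
import time
import numpy as np
from deepsparse import compile_model

# Hypothetical path to a sparse-quantized YOLOv3 ONNX export.
batch_size = 64
engine = compile_model("yolov3-pruned-quant.onnx", batch_size=batch_size)

# Random input assuming the Ultralytics default 640x640 image size.
inputs = [np.random.rand(batch_size, 3, 640, 640).astype(np.float32)]

# Warm up, then time repeated runs to estimate images per second.
engine.run(inputs)
iters = 20
start = time.time()
for _ in range(iters):
    engine.run(inputs)
elapsed = time.time() - start
print(f"throughput: {batch_size * iters / elapsed:.1f} images/sec")
```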

Earlier this year, we released sparse-quantized support for ResNet-50, and BERT and other popular models are coming in the next few months. As we broaden our sparse and sparse-quantized model offerings, we encourage you to try currently unsupported models and report back to us through the GitHub issue queue.

-Team Neural Magic