Two months ago, here at Neural Magic we used compound sparsification to enable faster and smaller Hugging Face BERT base models.
We’ve since taken it a step further and have exciting news to share. By applying additional state-of-the-art sparsification research to the Hugging Face BERT base model, we’ve achieved even faster inference on CPUs and even smaller file sizes:
- 12-layer BERT: 6.4x faster and 8.9x smaller model
- 6-layer BERT (DistilBERT): 11.4x faster and 13.2x smaller model
- 3-layer BERT: 21.2x faster and 17.5x smaller model (recovers to 90% of the original accuracy)
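To give a flavor of what sparsification means in practice, here is a minimal sketch of unstructured magnitude pruning, one common ingredient of the compound sparsification recipes above. This is an illustrative toy, not Neural Magic’s actual implementation; the function name and the 90% sparsity target are our own choices for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of a weight tensor.

    `sparsity` is the fraction of weights to remove (e.g. 0.9 keeps
    only the largest 10% of weights by absolute value).
    """
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

# Toy example: prune a random 64x64 layer to ~90% sparsity
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
p = magnitude_prune(w, 0.9)
print(f"sparsity: {np.mean(p == 0):.2f}")
```

Zeroed weights compress extremely well on disk and let sparsity-aware runtimes skip work at inference time, which is where the speedups and size reductions above come from.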
Check out our 2-minute video blog to see how we did it, why it matters, and how you can do the same with your data. All the assets and tools you’ll need are outlined on the same page, below the video. Feedback is welcome, whether something isn’t clear or you simply want to tell us “it’s just right.”