Pruning Hugging Face BERT: Using "Compound Sparsification" to Increase BERT CPU Performance up to 14x

:zap: New Blog Post :zap:
Pruning Hugging Face BERT: Apply pruning and layer dropping sparsification methods to increase BERT performance on CPUs anywhere from 3.3x to 14x, depending on accuracy constraints.
In this post, we go into detail on pruning Hugging Face BERT and describe how sparsification combined with the DeepSparse Engine improves BERT model performance on CPUs. We’ll show:

  1. The current state of pruning BERT models for better inference performance;
  2. How compound sparsification (pruning and layer dropping) enables faster and smaller models;
  3. How to leverage Neural Magic recipes and open-source tools to create faster and smaller BERT models in your own pipelines (a minimal example is sketched after this list);
  4. The short-term roadmap for even more performant BERT models.

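As a taste of what the post walks through, here is a minimal sketch of running a sparsified BERT ONNX export with the DeepSparse Engine's Python API. The `model.onnx` path is a placeholder for a pruned BERT model exported via SparseML (or pulled from the SparseZoo), and exact API details may vary between deepsparse versions.

```python
from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

# Placeholder path: a pruned (compound-sparsified) BERT model exported to ONNX
# via SparseML, or downloaded from the SparseZoo.
onnx_filepath = "model.onnx"
batch_size = 1

# Compile the sparse model for fast CPU inference with the DeepSparse Engine.
engine = compile_model(onnx_filepath, batch_size=batch_size)

# Generate random inputs matching the model's expected shapes,
# just to sanity-check latency and outputs.
inputs = generate_random_inputs(onnx_filepath, batch_size)
outputs = engine.run(inputs)
print(outputs)
```

For real workloads, you would swap the random inputs for tokenized text from a Hugging Face tokenizer; the full post covers that end-to-end flow.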
Take it for a spin and share any feedback you have.