Journal article

A Developer’s Guide to Compressing Pre-Trained Transformer Neural Networks Across Different Domains

Abstract

Transformer neural networks have become the benchmark across a variety of domains, in part thanks to their ability to capture the importance of, and dependencies across, long input segments. Pre-trained transformer models typically contain millions of parameters trained on comprehensive datasets, which improves their generalization and allows them to be fine-tuned faster and with less data. Starting with a pre-trained model therefore provides a substantial advantage over developing a network from the ground up. However, the race to exceed the accuracy of previous model generations drives pre-trained models to grow in size, along with the computational resources required for inference and training. Compression techniques are therefore vital for maximizing the impact and performance of these models while minimizing deployment cost. This work covers six of the most successful compression techniques: pruning, knowledge distillation, quantization, adaptive inference (early exit), low-rank factorization, and mixture of experts. The study presents a comprehensive overview of their methods, strengths, and weaknesses, intended to serve as a guide for machine learning practitioners. The key areas explored are model speedup, model size, accuracy retention, and training requirements.
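To give a flavor of the techniques surveyed, the sketch below shows post-training dynamic quantization of a pre-trained transformer in PyTorch. This is a minimal illustration, not code from the paper; the checkpoint name and size-measurement helper are illustrative assumptions.

```python
import io
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical example: load a pre-trained transformer (checkpoint name is an
# illustrative assumption, not taken from the paper).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Post-training dynamic quantization: linear-layer weights are stored as int8,
# and activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: torch.nn.Module) -> float:
    """Approximate serialized size of a model's state dict in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {serialized_mb(model):.1f} MB")
print(f"int8 model: {serialized_mb(quantized):.1f} MB")
```

Dynamic quantization requires no retraining or calibration data, which is why it is often the first compression step a practitioner tries; the trade-offs against pruning, distillation, and the other techniques are the subject of the article.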

Authors

Barrios FA; Emadi A

Journal

IEEE Access, Vol. 13, pp. 166873–166891

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publication Date

January 1, 2025

DOI

10.1109/access.2025.3611390

ISSN

2169-3536
