Model compression
Techniques for lossy compression of neural networks

Model compression is a machine learning technique that reduces the size of trained models to lower resource requirements without significantly impacting performance. Smaller models use less storage, memory, and compute during inference, enabling deployment on resource-constrained devices like smartphones, embedded systems, edge computing devices, and consumer electronics. This efficiency benefits not only mobile devices but also large corporations that provide model inference via APIs, reducing computational costs and improving response times. It is distinct from knowledge distillation, where a separate smaller "student" model is trained to mimic a larger "teacher" model’s behavior.


Techniques

Several techniques are employed for model compression.

Pruning

Pruning sparsifies a large model by setting some parameters to exactly zero, which effectively reduces the number of parameters. The resulting sparse weights allow the use of sparse matrix operations, which can be faster than dense matrix operations.

Pruning criteria can be based on the magnitudes of parameters, the statistical pattern of neural activations, Hessian values, etc.[1][2]
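
A minimal sketch of magnitude-based pruning in PyTorch follows; the function name and sparsity level are illustrative (PyTorch also provides a comparable utility, torch.nn.utils.prune.l1_unstructured).

    import torch

    def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        # Zero out the smallest-magnitude entries so that roughly a `sparsity`
        # fraction of the weights becomes exactly zero.
        k = int(sparsity * weight.numel())
        if k == 0:
            return weight.clone()
        threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
        return weight * (weight.abs() > threshold)

    w = torch.randn(256, 512)
    w_pruned = magnitude_prune(w, sparsity=0.9)
    print((w_pruned == 0).float().mean())  # roughly 0.9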

Quantization

Quantization reduces the numerical precision of weights and activations. For example, instead of storing weights as 32-bit floating-point numbers, they can be represented using 8-bit integers. Low-precision parameters take up less space and require less compute for arithmetic.
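
The following is a minimal sketch of symmetric 8-bit post-training quantization of a single weight tensor; the function names and scaling scheme are illustrative, not a specific library's API.

    import torch

    def quantize_int8(w: torch.Tensor):
        # Symmetric per-tensor quantization: map the largest magnitude to 127.
        scale = w.abs().max() / 127.0
        q = torch.clamp(torch.round(w / scale), min=-128, max=127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Recover an approximate float32 tensor from the int8 codes.
        return q.to(torch.float32) * scale

    w = torch.randn(64, 64)
    q, scale = quantize_int8(w)
    print((w - dequantize(q, scale)).abs().max())  # at most about scale / 2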

It is also possible to quantize some parameters more aggressively than others: for example, a less important parameter can be stored at 8-bit precision while a more important one is kept at 16-bit precision. Inference with such models requires mixed-precision arithmetic.[3][4]

Quantization can also be applied during training rather than only after it. PyTorch implements automatic mixed precision (AMP), which performs autocasting, gradient scaling, and loss scaling.[5][6]
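
A minimal training-loop sketch using PyTorch's AMP API (torch.cuda.amp) follows; the model, optimizer, and `loader` are placeholders.

    import torch

    model = torch.nn.Linear(512, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 gradient underflow

    for x, y in loader:  # `loader` is a placeholder yielding CUDA tensors
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():  # ops run in float16 where safe, float32 otherwise
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()  # backward pass on the scaled loss
        scaler.step(optimizer)         # unscales gradients; skips the step if they overflowed
        scaler.update()                # adjusts the scale factor for the next iteration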

Low-rank factorization

Main article: Low-rank approximation

Weight matrices can be approximated by low-rank matrices. Let W be a weight matrix of shape m × n. A low-rank approximation is W ≈ UV^T, where U and V are matrices of shapes m × k and n × k, respectively. When k is small, this both reduces the number of parameters needed to represent W approximately and accelerates matrix multiplication by W.

Low-rank approximations can be found by singular value decomposition (SVD). The choice of rank for each weight matrix is a hyperparameter, which can be jointly optimized as a mixed discrete-continuous optimization problem.[7] The rank of weight matrices may also be pruned after training, taking into account the effect of activation functions like ReLU on the implicit rank of the weight matrices.[8]
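
A minimal sketch of SVD-based low-rank factorization of a single weight matrix, using NumPy (the function name and rank are illustrative):

    import numpy as np

    def low_rank_factorize(W: np.ndarray, k: int):
        # Truncated SVD: keep only the k largest singular values.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        U_k = U[:, :k] * s[:k]  # fold the singular values into U; shape (m, k)
        V_k = Vt[:k, :].T       # shape (n, k)
        return U_k, V_k         # W is approximated by U_k @ V_k.T

    W = np.random.randn(1024, 512)
    U, V = low_rank_factorize(W, k=32)
    # Storage drops from m*n to k*(m+n), and W @ x can be computed as U @ (V.T @ x).
    print(np.linalg.norm(W - U @ V.T) / np.linalg.norm(W))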

Training

Model compression may be decoupled from training: a model is first trained without regard for how it might be compressed, and only then compressed. However, compression may also be combined with training.

The "train big, then compress" method trains a large model for a small number of training steps (less than it would be if it were trained to convergence), then heavily compress the model. It is found that at the same compute budget, this method results in a better model than lightly compressed, small models.9

In Deep Compression,[10] compression proceeds in three steps.

  • First loop (pruning): prune all weights whose magnitude is below a threshold, then fine-tune the network, then prune again, and so on.
  • Second loop (quantization): cluster the weights, enforce weight sharing among all weights in each cluster, then fine-tune the network, then cluster again, and so on (a sketch follows this list).
  • Third step: use Huffman coding to losslessly compress the model.
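
A minimal sketch of the weight-sharing (second) loop follows, assuming scikit-learn's KMeans for clustering; the fine-tuning between iterations is omitted and the cluster count is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def share_weights(w: np.ndarray, n_clusters: int = 16):
        # Cluster the nonzero weights and replace each by its cluster centroid,
        # so only small integer indices plus the codebook need to be stored.
        nonzero = w[w != 0].reshape(-1, 1)
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(nonzero)
        shared = w.copy()
        shared[w != 0] = km.cluster_centers_[km.labels_].ravel()
        return shared, km.cluster_centers_.ravel()

    w = np.random.randn(128, 128)
    w[np.abs(w) < 0.5] = 0.0               # stand-in for the pruning loop
    w_shared, codebook = share_weights(w)  # 16 clusters = 4-bit indices per weight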

The SqueezeNet paper reported that Deep Compression achieved a compression ratio of 35× on AlexNet, and a ratio of ~10× on SqueezeNets.[11]

References

  1. Reed, R. (September 1993). "Pruning algorithms-a survey". IEEE Transactions on Neural Networks. 4 (5): 740–747. doi:10.1109/72.248452. PMID 18276504. https://ieeexplore.ieee.org/document/248452

  2. Blalock, Davis; Gonzalez Ortiz, Jose Javier; Frankle, Jonathan; Guttag, John (2020-03-15). "What is the State of Neural Network Pruning?". Proceedings of Machine Learning and Systems. 2: 129–146. https://proceedings.mlsys.org/paper_files/paper/2020/hash/6c44dc73014d66ba49b28d483a8f8b0d-Abstract.html

  3. Abdelfattah, Ahmad; Anzt, Hartwig; Boman, Erik G.; Carson, Erin; Cojean, Terry; Dongarra, Jack; Gates, Mark; Grützmacher, Thomas; Higham, Nicholas J.; Li, Sherry; Lindquist, Neil; Liu, Yang; Loe, Jennifer; Luszczek, Piotr; Nayak, Pratik; Pranesh, Sri; Rajamanickam, Siva; Ribizel, Tobias; Smith, Barry; Swirydowicz, Kasia; Thomas, Stephen; Tomov, Stanimire; Tsai, Yaohung M.; Yamazaki, Ichitaro; Yang, Ulrike Meier (2020). "A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic". arXiv:2007.06674 [cs.MS].

  4. Micikevicius, Paulius; Narang, Sharan; Alben, Jonah; Diamos, Gregory; Elsen, Erich; Garcia, David; Ginsburg, Boris; Houston, Michael; Kuchaiev, Oleksii (2018-02-15). "Mixed Precision Training". arXiv:1710.03740 [cs.AI].

  5. "Mixed Precision — PyTorch Training Performance Guide". residentmario.github.io. Retrieved 2024-09-10. https://residentmario.github.io/pytorch-training-performance-guide/mixed-precision.html

  6. "What Every User Should Know About Mixed Precision Training in PyTorch". PyTorch. Retrieved 2024-09-10. https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/

  7. Idelbayev, Yerlan; Carreira-Perpiñán, Miguel Á. (2020). "Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer". 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020. Computer Vision Foundation / IEEE. pp. 8046–8056. doi:10.1109/CVPR42600.2020.00807. ISBN 978-1-7281-7168-5.

  8. Dittmer, Sören; King, Emily J.; Maass, Peter (2020). "Singular Values for ReLU Layers". IEEE Transactions on Neural Networks and Learning Systems. Vol. 31. IEEE. pp. 3594–3605. arXiv:1812.02566. doi:10.1109/TNNLS.2019.2945113. https://ieeexplore.ieee.org/document/8891761

  9. Li, Zhuohan; Wallace, Eric; Shen, Sheng; Lin, Kevin; Keutzer, Kurt; Klein, Dan; Gonzalez, Joey (2020-11-21). "Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers". Proceedings of the 37th International Conference on Machine Learning. PMLR: 5958–5968. https://proceedings.mlr.press/v119/li20m.html

  10. Han, Song; Mao, Huizi; Dally, William J. (2016-02-15). "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding". arXiv:1510.00149 [cs.CV].

  11. Iandola, Forrest N; Han, Song; Moskewicz, Matthew W; Ashraf, Khalid; Dally, William J; Keutzer, Kurt (2016). "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size". arXiv:1602.07360 [cs.CV].