Aussie AI
45. Knowledge Distillation
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“We're twins. We're basically the same.”
— Twins, 1988.
What is Knowledge Distillation?
Knowledge Distillation (KD) is an inference speedup technique where a larger pre-trained model is used to train a smaller, more efficient model. The “teacher” model trains the “student” model. When used successfully, the result is a smaller model with faster inference that closely matches the accuracy of the larger model. Hence, it is a type of “model compression,” because the large model is effectively “compressed” into a smaller model. The method is basically as follows (a minimal code sketch of the standard distillation loss appears after the list):
- Start with a big model.
- Repeatedly query the big model.
- Transfer results to train a small model.
- Use only the small model for inference.
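To make the teacher-student transfer concrete, here is a minimal C++ sketch of the classic distillation loss from Hinton et al. (2015): the student is trained against a temperature-softened copy of the teacher’s output distribution, blended with the ordinary hard-label cross-entropy. The function and variable names are illustrative assumptions for this sketch, not part of any real library.

    #include <algorithm>  // std::max
    #include <cmath>      // std::exp, std::log
    #include <cstddef>    // size_t
    #include <cstdio>     // printf
    #include <vector>

    // Softmax with temperature T: higher T gives a "softer" distribution that
    // exposes more of the teacher's knowledge about near-miss classes.
    std::vector<float> softmax_temperature(const std::vector<float>& logits, float T) {
        std::vector<float> probs(logits.size());
        float max_logit = logits[0];
        for (float x : logits) max_logit = std::max(max_logit, x);  // numerical stability
        float sum = 0.0f;
        for (size_t i = 0; i < logits.size(); ++i) {
            probs[i] = std::exp((logits[i] - max_logit) / T);
            sum += probs[i];
        }
        for (float& p : probs) p /= sum;
        return probs;
    }

    // Distillation loss = alpha * CE(hard label, student)
    //                   + (1 - alpha) * T^2 * KL(teacher_soft || student_soft)
    float distillation_loss(const std::vector<float>& teacher_logits,
                            const std::vector<float>& student_logits,
                            int hard_label, float T, float alpha) {
        std::vector<float> p_teacher = softmax_temperature(teacher_logits, T);
        std::vector<float> p_student = softmax_temperature(student_logits, T);
        std::vector<float> p_hard    = softmax_temperature(student_logits, 1.0f);

        float kl = 0.0f;  // KL divergence between the softened distributions
        for (size_t i = 0; i < p_teacher.size(); ++i) {
            if (p_teacher[i] > 0.0f)
                kl += p_teacher[i] * std::log(p_teacher[i] / p_student[i]);
        }
        float ce = -std::log(p_hard[hard_label]);  // ordinary cross-entropy on the true label
        return alpha * ce + (1.0f - alpha) * T * T * kl;  // T^2 keeps gradient scales comparable
    }

    int main() {
        std::vector<float> teacher = {2.0f, 1.0f, 0.1f};  // raw logits from the big model
        std::vector<float> student = {1.5f, 1.2f, 0.3f};  // raw logits from the small model
        printf("loss = %f\n",
               distillation_loss(teacher, student, /*hard_label=*/0, /*T=*/2.0f, /*alpha=*/0.5f));
        return 0;
    }

The temperature T controls how much of the teacher’s information about near-miss classes is exposed to the student, and the alpha parameter trades off the soft teacher target against the hard training label.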
The use of distillation is widespread in both industry and research settings. It is a well-known and effective technique for generating a more efficient version of a large model.
Distillation is not technically an ensemble method, because the larger model is not used during inference. Hence, it is not the same as “big-small” dual inference architectures.
Distillation differs from “fine-tuning” or “re-training,” which apply extra training to the existing (large) model; knowledge distillation instead trains a new, smaller model from scratch. Distillation is not a training speedup, because it still requires training the larger model first and then the smaller one. It increases overall training cost in order to reduce future inference cost.
Distillation is more technically involved than the commonly used method of training a new model on the outputs of another large model (sometimes called “dataset distillation” or “synthetic data” generation). That technique is not distillation in its proper sense. Rather, knowledge distillation algorithms involve a more complex transfer of learning from the internals of the large model into the small model.
Recent advances in Knowledge Distillation include (a) novel ways to directly transfer the learning, with weighting approaches rather than exact probability transfer, and (b) multi-model distillation approaches whereby the smaller student model can gain information from multiple teachers.
Research on Knowledge Distillation
KD is a longstanding method of optimizing model inference and remains one of the most popular techniques. There are many research papers on distillation:
Survey Papers on Knowledge Distillation:
- Y Tian, S Pei, X Zhang, C Zhang, 2023, Knowledge Distillation on Graphs: A Survey, arXiv preprint, https://arxiv.org/abs/2302.00219
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017, A survey of model compression and acceleration for deep neural networks, CoRR, abs/1710.09282, https://arxiv.org/abs/1710.09282 (A survey paper from 2017 that includes KD.)
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, 2021, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper on multiple areas, including a section on Knowledge Distillation.)
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, 2023, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including knowledge distillation.)
- Wang L, Yoon KJ. 2021, Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks, 2021;44:3048-3068 https://arxiv.org/abs/2004.05937 (Distillation in vision context.)
Specific research papers on distillation include:
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, 2015. https://arxiv.org/abs/1503.02531 (The early paper that seems to have coined the name.)
- Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, 2019, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Oct 2019 (revised March 2020), arXiv preprint arXiv:1910.01108 (2019), https://arxiv.org/abs/1910.01108
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, 2019, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv preprint arXiv:1909.10351, Sep 2019 (updated Oct 2020), https://arxiv.org/abs/1909.10351, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, arXiv preprint arXiv:2004.02984 (2020), https://arxiv.org/abs/2004.02984
- Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu, 2019, Patient Knowledge Distillation for BERT Model Compression, arXiv preprint arXiv:1908.09355 (Aug 2019), https://arxiv.org/abs/1908.09355
- Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2021, MINILMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers, arXiv preprint arXiv:2012.15828, 2020 (revised June 2021), https://arxiv.org/abs/2012.15828
- Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin, 2019, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, arXiv preprint arXiv:1903.12136 (Mar 2019), https://arxiv.org/abs/1903.12136
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. 2020, FastBERT: a self-distilling BERT with adaptive inference time, arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
- Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT 2018. https://www.aclweb.org/anthology/W18-2715
- Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation, arXiv preprint arXiv:1908.08962, https://arxiv.org/abs/1908.08962v1
- Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling, arXiv preprint arXiv:1909.00100. https://arxiv.org/abs/1909.00100
- Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. July 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding, arXiv preprint arXiv:1907.04829. https://arxiv.org/abs/1907.04829
- Yoon Kim and Alexander M Rush. Sep 2016. Sequence-level knowledge distillation, arXiv preprint arXiv:1606.07947, https://arxiv.org/abs/1606.07947
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task, In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation, In Advances in Neural Information Processing Systems, pages 742–751. https://dl.acm.org/doi/10.5555/3294771.3294842
- Mao, Y.; Wang, Y.; Wu, C.; Zhang, C.; Wang, Y.; Zhang, Q.; Yang, Y.; Tong, Y.; and Bai, J. 2020. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression, In COLING, 3225–3234. International Committee on Computational Linguistics. https://arxiv.org/abs/2004.04124 (A combination of weight pruning, matrix factorization and knowledge distillation.)
- Lin, S.C.; Yang, J.H.; Lin, J., 2020, Distilling dense representations for ranking using tightly-coupled teachers, arXiv preprint arXiv:2010.11386 2020. https://arxiv.org/abs/2010.11386
- Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices, In AAAI’18, https://arxiv.org/abs/1708.04728 (Includes a distinct type of distillation.)
- Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006, Model compression, In Tina Eliassi-Rad, Lyle H. Ungar, Mark Craven, and Dimitrios Gunopulos (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 535–541. ACM, 2006. https://www.semanticscholar.org/paper/Model-compression-Bucila-Caruana/30c9bb327b7f2b9f1d1e5b69b9d0c97b410948d9, PDF: http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf (Early 2006 paper on model-to-model teaching, before the technique was named “distillation” in 2015.)
- Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. 2016, Distilling word embeddings: An encoding approach, In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
- Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian J. McAuley, and Furu Wei. 2021, Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression, CoRR, abs/2109.03228, 2021, https://arxiv.org/abs/2109.03228 (Evaluation of the efficiency of distillation.)
- Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew Gordon Wilson. 2021, Does knowledge distillation really work?, CoRR, abs/2106.05945, 2021, https://arxiv.org/abs/2106.05945 (Evaluation of the efficacy of distillation.)
- Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Daesin Kim, and Bohyung Han. 2021, Learning student-friendly teacher networks for knowledge distillation, CoRR, abs/2102.07650, 2021. https://arxiv.org/abs/2102.07650
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y Code: https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge (See Chapter 13: Distilling Selective/Scaffolded Knowledge)
- Zhang, C.; Yang, Y.; Liu, J.; Wang, J.; Xian, Y.; Wang, B.; and Song, D. 2023. Lifting the Curse of Capacity Gap in Distilling Language Models, arXiv:2305.12129. https://arxiv.org/abs/2305.12129
- Chen X, He B, Hui K, Sun L, Sun Y. 2020, Simplified TinyBERT: Knowledge Distillation for Document Retrieval, 2020. arXiv preprint, https://arxiv.org/abs/2009.07531
- Tian Y, Krishnan D, Isola P. 2019, Contrastive Representation Distillation, 2019. arXiv preprint: https://arxiv.org/pdf/1910.10699.pdf
- Do T, Tran H, Do T, Tjiputra E, Tran Q. 2019, Compact Trilinear Interaction for Visual Question Answering, In: Proceedings of the IEEE International Conference on Computer Vision. 2019:392-401. https://arxiv.org/abs/1909.11874
- T Chen, S Liu, Z Chen, W Hu, D Chen, Y Wang, Q Lyu, 2023, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, https://www.oajaiml.com/uploads/archivepdf/27841181.pdf
- B. Heo, M. Lee, S. Yun and J. Y. Choi, 2019, Knowledge transfer via distillation of activation boundaries formed by hidden neurons, Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 33, no. 1, pp. 3779-3787, 2019, https://arxiv.org/abs/1811.03233
- Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, Kun Gai, 2018, Rocket launching: A universal and efficient framework for training well-performing light net, Proc. AAAI Conf. Artif. Intell., pp. 1-8, 2018. https://arxiv.org/abs/1708.04106 (Combined training of teacher and student models.)
- A. Chaulwar et al., 2022, Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices, arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
- T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
- Z Zhao, Q Liu, H Gui, B An, L Hong, H Chi, Oct 2023, Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication, arXiv preprint arXiv:2310.03188, https://arxiv.org/pdf/2310.03188.pdf
- T Udagawa, A Trivedi, M Merler, B Bhattacharjee, Oct 2023, A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models, arXiv preprint arXiv:2310.08797, https://arxiv.org/abs/2310.08797
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
For more research papers on Knowledge Distillation see: https://www.aussieai.com/research/knowledge-distillation.
Multi-Teacher Knowledge Distillation
Ensemble knowledge distillation generalizes the basic distillation algorithm to have more than one teacher model, rather than a single teacher-student pair of models. There is research to suggest that distillation can be even more effective with multiple teacher models. Ensemble distillation is a type of “ensemble learning.”
Various strategies have been employed, such as sequential versus parallel teaching, and model monitoring, where a teacher checks the student’s results for correctness. Although basic knowledge distillation is mainstream, ensemble distillation remains mostly a research technique. One simple combination strategy, averaging the teachers’ output distributions, is sketched below.
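As a hedged illustration (hypothetical C++ code, not drawn from any of the papers below), one simple way to combine several teachers is to average their output probability distributions into a single soft target for the student; the papers cited below explore more sophisticated weighting and scheduling schemes.

    #include <cstddef>  // size_t
    #include <cstdio>   // printf
    #include <vector>

    // Average the (already softmax-normalized) probability distributions of
    // several teachers into one soft target distribution for the student.
    std::vector<float> average_teacher_distributions(
            const std::vector<std::vector<float>>& teacher_probs) {
        size_t vocab = teacher_probs[0].size();
        std::vector<float> target(vocab, 0.0f);
        for (const auto& probs : teacher_probs) {   // one distribution per teacher
            for (size_t i = 0; i < vocab; ++i) target[i] += probs[i];
        }
        for (float& x : target) x /= static_cast<float>(teacher_probs.size());
        return target;  // averaged distribution still sums to 1
    }

    int main() {
        // Two hypothetical teachers over a 3-symbol vocabulary.
        std::vector<std::vector<float>> teachers = {
            {0.7f, 0.2f, 0.1f},
            {0.5f, 0.4f, 0.1f},
        };
        std::vector<float> target = average_teacher_distributions(teachers);
        for (float p : target) printf("%f ", p);   // prints 0.6 0.3 0.1
        printf("\n");
        return 0;
    }

The averaged distribution can then be used as the soft target in a standard distillation loss, such as the one sketched earlier in this chapter.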
Research papers on ensemble distillation:
- Wenxian Shi, Yuxuan Song, Hao Zhou, Bohan Li, and Lei Li. 2021, Learning from deep model via exploring local targets, 2021. https://openreview.net/forum?id=5slGDu_bVc6 (Distillation with multiple teachers)
- Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020, Improved knowledge distillation via teacher assistant, In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5191– 5198. AAAI Press, 2020. https://arxiv.org/abs/1902.03393 (multiple teachers)
- Jangho Kim, Minsung Hyun, Inseop Chung, and Nojun Kwak. 2020, Feature fusion for online mutual knowledge distillation, In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, pp. 4619–4625. IEEE, 2020. https://arxiv.org/abs/1904.09058 (Ensemble methods for distillation.)
- Inseop Chung, Seonguk Park, Jangho Kim, and Nojun Kwak. 2020, Feature-map-level online adversarial knowledge distillation, In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 2006–2015. PMLR, 2020. https://arxiv.org/abs/2002.01775 (Multiple teacher models.)
- Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. 2020, Online knowledge distillation with diverse peers, In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3430–3437. AAAI Press, 2020a https://arxiv.org/abs/1912.00350 (Ensemble distillation with multiple “peer” teachers.)
- Rohan Anil, Gabriel Pereyra, Alexandre Passos, Róbert Ormándi, George E. Dahl, and Geoffrey E. Hinton. 2018, Large scale distributed neural network training through online distillation, In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. https://arxiv.org/abs/1804.03235
- Mehdi Rezagholizadeh, Aref Jafari, Puneeth Salad, Pranav Sharma, Ali Saheb Pasand, Ali Ghodsi, 2021, Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher, arXiv preprint arXiv:2110.08532, 2021. https://arxiv.org/abs/2110.08532
- Y. Zhang, T. Xiang, T. M. Hospedales and H. Lu, 2018, Deep mutual learning, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4320-4328, Jun. 2018. https://arxiv.org/abs/1706.00384
- L. Yuan, F. E. Tay, G. Li, T. Wang and J. Feng, 2020, Revisiting knowledge distillation via label smoothing regularization, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3903-3911, Jun. 2020. https://arxiv.org/abs/1909.11723 (Improved learning, and also looks at reverse student-to-teacher learning.)
For more research papers on ensemble KD, see https://www.aussieai.com/research/knowledge-distillation#ensemble.
Dataset Distillation
The technique of “dataset distillation” borrows the same terminology, but is a different technique from knowledge distillation. The term refers to methods that reduce a training dataset to a smaller derived set of training data, for example to (theoretically) sidestep privacy or copyright concerns. The smaller dataset can theoretically be used to train a similarly capable model.
Research papers on dataset distillation:
- T. Wang, J.-Y. Zhu, A. Torralba and A. A. Efros, 2018, Dataset distillation, arXiv:1811.10959, 2018. https://arxiv.org/abs/1811.10959
- Yu R, Liu S, Wang X, 2023, Dataset Distillation: A Comprehensive Review, https://arxiv.org/abs/2301.07014
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, Code: https://github.com/orhonovich/unnatural-instructions (Using a model to automatically create a training data set, including automatically creating both instructions and responses.)
For additional research papers on dataset distillation, see https://www.aussieai.com/research/knowledge-distillation#dataset-distillation.