Aussie AI
45. Knowledge Distillation
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“We're twins. We're basically the same.”
— Twins, 1988.
What is Knowledge Distillation?
Knowledge Distillation (KD) is an inference speedup technique where a larger pre-trained model is used to train a smaller, more efficient model. The “teacher” model trains the “student” model. When used successfully, the result is a smaller model with faster inference that closely matches the accuracy of the larger model. Hence, it is a type of “model compression,” because the large model is effectively “compressed” into a smaller model. The method is basically as follows (a minimal code sketch of the standard distillation loss appears after the list):
- Start with a big model.
- Repeatedly query the big model.
- Transfer results to train a small model.
- Use only the small model for inference.
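To make the teacher-student transfer concrete, here is a minimal C++ sketch of the classic distillation loss from Hinton et al. (2015): the student is trained against a temperature-softened copy of the teacher’s output distribution, blended with the ordinary hard-label cross-entropy. The function and variable names are illustrative assumptions for this sketch, not part of any real library.

    #include <algorithm>  // std::max
    #include <cmath>      // std::exp, std::log
    #include <cstddef>    // size_t
    #include <cstdio>     // printf
    #include <vector>

    // Softmax with temperature T: higher T gives a "softer" distribution that
    // exposes more of the teacher's knowledge about near-miss classes.
    std::vector<float> softmax_temperature(const std::vector<float>& logits, float T) {
        std::vector<float> probs(logits.size());
        float max_logit = logits[0];
        for (float x : logits) max_logit = std::max(max_logit, x);  // numerical stability
        float sum = 0.0f;
        for (size_t i = 0; i < logits.size(); ++i) {
            probs[i] = std::exp((logits[i] - max_logit) / T);
            sum += probs[i];
        }
        for (float& p : probs) p /= sum;
        return probs;
    }

    // Distillation loss = alpha * CE(hard label, student)
    //                   + (1 - alpha) * T^2 * KL(teacher_soft || student_soft)
    float distillation_loss(const std::vector<float>& teacher_logits,
                            const std::vector<float>& student_logits,
                            int hard_label, float T, float alpha) {
        std::vector<float> p_teacher = softmax_temperature(teacher_logits, T);
        std::vector<float> p_student = softmax_temperature(student_logits, T);
        std::vector<float> p_hard    = softmax_temperature(student_logits, 1.0f);

        float kl = 0.0f;  // KL divergence between the softened distributions
        for (size_t i = 0; i < p_teacher.size(); ++i) {
            if (p_teacher[i] > 0.0f)
                kl += p_teacher[i] * std::log(p_teacher[i] / p_student[i]);
        }
        float ce = -std::log(p_hard[hard_label]);  // ordinary cross-entropy on the true label
        return alpha * ce + (1.0f - alpha) * T * T * kl;  // T^2 keeps gradient scales comparable
    }

    int main() {
        std::vector<float> teacher = {2.0f, 1.0f, 0.1f};  // raw logits from the big model
        std::vector<float> student = {1.5f, 1.2f, 0.3f};  // raw logits from the small model
        printf("loss = %f\n",
               distillation_loss(teacher, student, /*hard_label=*/0, /*T=*/2.0f, /*alpha=*/0.5f));
        return 0;
    }

The temperature T controls how much of the teacher’s information about near-miss classes is exposed to the student, and the alpha parameter trades off the soft teacher target against the hard training label.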
The use of distillation is widespread in both industry and research settings. It is a well-known and effective technique for generating a more efficient version of a large model.
Distillation is not technically an ensemble method, because the larger model is not used during inference. Hence, it is not the same as “big-small” dual inference architectures.
Distillation differs from “fine-tuning” or “re-training,” which apply extra training to the existing (large) model; knowledge distillation instead trains a new, smaller model from scratch. Distillation is not a training speedup, because it still requires training the larger model first and then the smaller one. It increases overall training cost in order to reduce future inference cost.
Distillation is more technically involved than the commonly used method of training a new model on the outputs of another large model (sometimes called “dataset distillation” or “synthetic data” generation). That technique is not distillation in its proper sense. Rather, knowledge distillation algorithms involve a more complex transfer of learning from the internals of the large model into the small model.
Recent advances in Knowledge Distillation include (a) novel ways to directly transfer the learning, with weighting approaches rather than exact probability transfer, and (b) multi-model distillation approaches whereby the smaller student model can gain information from multiple teachers.
Research on Knowledge Distillation
KD is a longstanding method of optimizing model inference and remains one of the most popular techniques. There are many research papers on distillation:
Survey Papers on Knowledge Distillation:
- Y Tian, S Pei, X Zhang, C Zhang, 2023, Knowledge Distillation on Graphs: A Survey, arXiv preprint, https://arxiv.org/abs/2302.00219
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017, A survey of model compression and acceleration for deep neural networks, CoRR, abs/1710.09282, https://arxiv.org/abs/1710.09282 (A survey paper from 2017 that includes KD.)
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, 2021, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper on multiple areas, including a section on Knowledge Distillation.)
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, 2023, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including knowledge distillation.)
- Wang L, Yoon KJ. 2021, Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks, 2021;44:3048-3068 https://arxiv.org/abs/2004.05937 (Distillation in vision context.)
Specific research papers on distillation include:
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, 2015. https://arxiv.org/abs/1503.02531 (The early paper that seems to have coined the name.)
- Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, 2019, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Oct 2019 (revised March 2020), arXiv preprint arXiv:1910.01108 (2019), https://arxiv.org/abs/1910.01108
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, 2019, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv preprint arXiv:1909.10351, Sep 2019 (updated Oct 2020), https://arxiv.org/abs/1909.10351, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, arXiv preprint arXiv:2004.02984 (2020), https://arxiv.org/abs/2004.02984
- Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu, 2019, Patient Knowledge Distillation for BERT Model Compression, arXiv preprint arXiv:1908.09355 (Aug 2019), https://arxiv.org/abs/1908.09355
- Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2021, MINILMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers, arXiv preprint arXiv:2012.15828, 2020 (revised June 2021), https://arxiv.org/abs/2012.15828
- Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin, 2019, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, arXiv preprint arXiv:1903.12136 (Mar 2019), https://arxiv.org/abs/1903.12136
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. 2020, FastBERT: a self-distilling BERT with adaptive inference time, arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
- Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT 2018. https://www.aclweb.org/anthology/W18-2715
- Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation, arXiv preprint arXiv:1908.08962, https://arxiv.org/abs/1908.08962v1
- Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling, arXiv preprint arXiv:1909.00100. https://arxiv.org/abs/1909.00100
- Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. July 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding, arXiv preprint arXiv:1907.04829. https://arxiv.org/abs/1907.04829
- Yoon Kim and Alexander M Rush. Sep 2016. Sequence-level knowledge distillation, arXiv preprint arXiv:1606.07947, https://arxiv.org/abs/1606.07947
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task, In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation, In Advances in Neural Information Processing Systems, pages 742–751. https://dl.acm.org/doi/10.5555/3294771.3294842
- Mao, Y.; Wang, Y.; Wu, C.; Zhang, C.; Wang, Y.; Zhang, Q.; Yang, Y.; Tong, Y.; and Bai, J. 2020. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression, In COLING, 3225–3234. International Committee on Computational Linguistics. https://arxiv.org/abs/2004.04124 (A combination of weight pruning, matrix factorization and knowledge distillation.)
- Lin, S.C.; Yang, J.H.; Lin, J., 2020, Distilling dense representations for ranking using tightly-coupled teachers, arXiv preprint arXiv:2010.11386 2020. https://arxiv.org/abs/2010.11386
- Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices, In AAAI’18, https://arxiv.org/abs/1708.04728 (Includes a distinct type of distillation.)
- Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006, Model compression, In Tina Eliassi-Rad, Lyle H. Ungar, Mark Craven, and Dimitrios Gunopulos (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 535–541. ACM, 2006. https://www.semanticscholar.org/paper/Model-compression-Bucila-Caruana/30c9bb327b7f2b9f1d1e5b69b9d0c97b410948d9, PDF: http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf (Early 2006 paper on model-to-model teaching, before the technique was named “distillation” in 2015.)
- Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. 2016, Distilling word embeddings: An encoding approach, In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
- Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian J. McAuley, and Furu Wei. 2021, Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression, CoRR, abs/2109.03228, 2021, https://arxiv.org/abs/2109.03228 (Evaluation of the efficiency of distillation.)
- Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew Gordon Wilson. 2021, Does knowledge distillation really work?, CoRR, abs/2106.05945, 2021, https://arxiv.org/abs/2106.05945 (Evaluation of the efficacy of distillation.)
- Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Daesin Kim, and Bohyung Han. 2021, Learning student-friendly teacher networks for knowledge distillation, CoRR, abs/2102.07650, 2021. https://arxiv.org/abs/2102.07650
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y Code: https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge (See Chapter 13: Distilling Selective/Scaffolded Knowledge)
- Zhang, C.; Yang, Y.; Liu, J.; Wang, J.; Xian, Y.; Wang, B.; and Song, D. 2023. Lifting the Curse of Capacity Gap in Distilling Language Models, arXiv:2305.12129. https://arxiv.org/abs/2305.12129
- Chen X, He B, Hui K, Sun L, Sun Y. 2020, Simplified TinyBERT: Knowledge Distillation for Document Retrieval, 2020. arXiv preprint, https://arxiv.org/abs/2009.07531
- Tian Y, Krishnan D, Isola P. 2019, Contrastive Representation Distillation, 2019. arXiv preprint: https://arxiv.org/pdf/1910.10699.pdf
- Do T, Tran H, Do T, Tjiputra E, Tran Q. 2019, Compact Trilinear Interaction for Visual Question Answering, In: Proceedings of the IEEE International Conference on Computer Vision. 2019:392-401. https://arxiv.org/abs/1909.11874
- T Chen, S Liu, Z Chen, W Hu, D Chen, Y Wang, Q Lyu, 2023, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, https://www.oajaiml.com/uploads/archivepdf/27841181.pdf
- B. Heo, M. Lee, S. Yun and J. Y. Choi, 2019, Knowledge transfer via distillation of activation boundaries formed by hidden neurons, Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 33, no. 1, pp. 3779-3787, 2019, https://arxiv.org/abs/1811.03233
- Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, Kun Gai, 2018, Rocket launching: A universal and efficient framework for training well-performing light net, Proc. AAAI Conf. Artif. Intell., pp. 1-8, 2018. https://arxiv.org/abs/1708.04106 (Combined training of teacher and student models.)
- A. Chaulwar et al., 2022, Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices, arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
- T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
- Z Zhao, Q Liu, H Gui, B An, L Hong, H Chi, Oct 2023, Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication, arXiv preprint arXiv:2310.03188, https://arxiv.org/pdf/2310.03188.pdf
- T Udagawa, A Trivedi, M Merler, B Bhattacharjee, Oct 2023, A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models, arXiv preprint arXiv:2310.08797, https://arxiv.org/abs/2310.08797
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
For more research papers on Knowledge Distillation see: https://www.aussieai.com/research/knowledge-distillation.
Multi-Teacher Knowledge Distillation
Ensemble knowledge distillation generalizes the basic distillation algorithm to have more than one teacher model, rather than a single teacher-student pair of models. There is research to suggest that distillation can be even more effective with multiple teacher models. Ensemble distillation is a type of “ensemble learning.”
Various strategies have been employed, such as sequential versus parallel teaching, and model monitoring, where a teacher checks the student’s results for correctness. Although basic knowledge distillation is mainstream, ensemble distillation remains mostly a research technique. One simple combination strategy, averaging the teachers’ output distributions, is sketched below.
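As a hedged illustration (hypothetical C++ code, not drawn from any of the papers below), one simple way to combine several teachers is to average their output probability distributions into a single soft target for the student; the papers cited below explore more sophisticated weighting and scheduling schemes.

    #include <cstddef>  // size_t
    #include <cstdio>   // printf
    #include <vector>

    // Average the (already softmax-normalized) probability distributions of
    // several teachers into one soft target distribution for the student.
    std::vector<float> average_teacher_distributions(
            const std::vector<std::vector<float>>& teacher_probs) {
        size_t vocab = teacher_probs[0].size();
        std::vector<float> target(vocab, 0.0f);
        for (const auto& probs : teacher_probs) {   // one distribution per teacher
            for (size_t i = 0; i < vocab; ++i) target[i] += probs[i];
        }
        for (float& x : target) x /= static_cast<float>(teacher_probs.size());
        return target;  // averaged distribution still sums to 1
    }

    int main() {
        // Two hypothetical teachers over a 3-symbol vocabulary.
        std::vector<std::vector<float>> teachers = {
            {0.7f, 0.2f, 0.1f},
            {0.5f, 0.4f, 0.1f},
        };
        std::vector<float> target = average_teacher_distributions(teachers);
        for (float p : target) printf("%f ", p);   // prints 0.6 0.3 0.1
        printf("\n");
        return 0;
    }

The averaged distribution can then be used as the soft target in a standard distillation loss, such as the one sketched earlier in this chapter.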
Research papers on ensemble distillation:
- Wenxian Shi, Yuxuan Song, Hao Zhou, Bohan Li, and Lei Li. 2021, Learning from deep model via exploring local targets, 2021. https://openreview.net/forum?id=5slGDu_bVc6 (Distillation with multiple teachers)
- Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020, Improved knowledge distillation via teacher assistant, In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5191– 5198. AAAI Press, 2020. https://arxiv.org/abs/1902.03393 (multiple teachers)
- Jangho Kim, Minsung Hyun, Inseop Chung, and Nojun Kwak. 2020, Feature fusion for online mutual knowledge distillation, In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, pp. 4619–4625. IEEE, 2020. https://arxiv.org/abs/1904.09058 (Ensemble methods for distillation.)
- Inseop Chung, Seonguk Park, Jangho Kim, and Nojun Kwak. 2020, Feature-map-level online adversarial knowledge distillation, In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 2006–2015. PMLR, 2020. https://arxiv.org/abs/2002.01775 (Multiple teacher models.)
- Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. 2020, Online knowledge distillation with diverse peers, In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3430–3437. AAAI Press, 2020a https://arxiv.org/abs/1912.00350 (Ensemble distillation with multiple “peer” teachers.)
- Rohan Anil, Gabriel Pereyra, Alexandre Passos, Róbert Ormándi, George E. Dahl, and Geoffrey E. Hinton. 2018, Large scale distributed neural network training through online distillation, In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. https://arxiv.org/abs/1804.03235
- Mehdi Rezagholizadeh, Aref Jafari, Puneeth Salad, Pranav Sharma, Ali Saheb Pasand, Ali Ghodsi, 2021, Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher, arXiv preprint arXiv:2110.08532, 2021. https://arxiv.org/abs/2110.08532
- Y. Zhang, T. Xiang, T. M. Hospedales and H. Lu, 2018, Deep mutual learning, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4320-4328, Jun. 2018. https://arxiv.org/abs/1706.00384
- L. Yuan, F. E. Tay, G. Li, T. Wang and J. Feng, 2020, Revisiting knowledge distillation via label smoothing regularization, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3903-3911, Jun. 2020. https://arxiv.org/abs/1909.11723 (Improved learning, and also looks at reverse student-to-teacher learning.)
For more research papers on ensemble KD, see https://www.aussieai.com/research/knowledge-distillation#ensemble.
Dataset Distillation
The technique of “dataset distillation” borrows the same terminology, but is a different technique from knowledge distillation. The term refers to methods that reduce a training dataset to a smaller derived set of training data, for example to (theoretically) sidestep privacy or copyright concerns. The smaller dataset can theoretically be used to train a similarly capable model.
Research papers on dataset distillation:
- T. Wang, J.-Y. Zhu, A. Torralba and A. A. Efros, 2018, Dataset distillation, arXiv:1811.10959, 2018. https://arxiv.org/abs/1811.10959
- Yu R, Liu S, Wang X, 2023, Dataset Distillation: A Comprehensive Review, https://arxiv.org/abs/2301.07014
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, Code: https://github.com/orhonovich/unnatural-instructions (Using a model to automatically create a training data set, including automatically creating both instructions and responses.)
For additional research papers on dataset distillation, see https://www.aussieai.com/research/knowledge-distillation#dataset-distillation.