Aussie AI
Research on Knowledge Distillation
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Research on Knowledge Distillation
Knowledge distillation (KD) is a longstanding model optimization technique and remains one of the most popular ways to speed up inference: a smaller "student" model is trained to reproduce the outputs of a larger "teacher" model, and only the student is deployed. The many research papers on distillation are listed below, after a brief code sketch of the basic idea.
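As a rough illustration only (not code from any of the papers below), here is a minimal C++ sketch of the classic soft-target distillation loss in the style of Hinton et al. (2015): the teacher's and student's logits are softened with a temperature, compared via KL divergence, and blended with the ordinary hard-label cross-entropy. The function names, example logits, and the temperature and alpha values are illustrative assumptions, not any particular library's API.

// Classic soft-target knowledge distillation loss (a sketch, not a library API).
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax with temperature T: larger T gives a softer distribution (Hinton's "dark knowledge").
static std::vector<double> softmax_t(const std::vector<double>& logits, double T) {
    std::vector<double> probs(logits.size());
    double max_logit = logits[0];
    for (double x : logits) if (x > max_logit) max_logit = x;
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / T);  // subtract max for numerical stability
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Combined loss: alpha * hard-label cross-entropy + (1 - alpha) * T^2 * KL(teacher || student).
// The T^2 factor keeps the soft-target gradients on the same scale as the hard-label term.
double distillation_loss(const std::vector<double>& teacher_logits,
                         const std::vector<double>& student_logits,
                         int hard_label, double T, double alpha) {
    assert(teacher_logits.size() == student_logits.size());
    std::vector<double> p_teacher = softmax_t(teacher_logits, T);
    std::vector<double> p_student_soft = softmax_t(student_logits, T);
    std::vector<double> p_student_hard = softmax_t(student_logits, 1.0);

    double kl = 0.0;  // KL divergence between softened teacher and student distributions
    for (size_t i = 0; i < p_teacher.size(); ++i)
        kl += p_teacher[i] * std::log(p_teacher[i] / p_student_soft[i]);

    double ce = -std::log(p_student_hard[hard_label]);  // cross-entropy on the true label
    return alpha * ce + (1.0 - alpha) * T * T * kl;
}

int main() {
    // Toy example: 3-class logits from a hypothetical teacher and student.
    std::vector<double> teacher = {4.0, 1.0, 0.2};
    std::vector<double> student = {3.0, 1.5, 0.5};
    double loss = distillation_loss(teacher, student, /*hard_label=*/0, /*T=*/2.0, /*alpha=*/0.5);
    printf("distillation loss = %f\n", loss);
    return 0;
}

In a real training loop, this loss would be computed over each mini-batch and back-propagated through the student only, with the teacher's weights kept frozen; the papers below vary mainly in what is matched (logits, hidden states, attention maps) and how.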
Survey Papers on Knowledge Distillation:
- Y Tian, S Pei, X Zhang, C Zhang, 2023, Knowledge Distillation on Graphs: A Survey, arXiv preprint, https://arxiv.org/abs/2302.00219
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017, A survey of model compression and acceleration for deep neural networks, CoRR, abs/1710.09282, https://arxiv.org/abs/1710.09282 (A survey paper from 2017 that includes KD.)
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, 2021, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper on multiple areas, including a section on Knowledge Distillation.)
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, 2023, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including knowledge distillation.)
- Lin Wang, Kuk-Jin Yoon, 2021, Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, pp. 3048-3068, https://arxiv.org/abs/2004.05937 (Distillation in the vision context.)
Specific research papers on distillation include:
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, 2015. https://arxiv.org/abs/1503.02531 (The influential 2015 paper that introduced the term "knowledge distillation".)
- Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, 2019, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Oct 2019 (revised March 2020), arXiv preprint arXiv:1910.01108 (2019), https://arxiv.org/abs/1910.01108
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, 2019, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv preprint arXiv:1909.10351, Sep 2019 (updated Oct 2020), https://arxiv.org/abs/1909.10351, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, arXiv preprint arXiv:2004.02984 (2020), https://arxiv.org/abs/2004.02984
- Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu, 2019, Patient Knowledge Distillation for BERT Model Compression, arXiv preprint arXiv:1908.09355 (Aug 2019), https://arxiv.org/abs/1908.09355
- Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, Furu Wei. 2021, MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers, arXiv preprint arXiv:2012.15828, Dec 2020 (revised June 2021), https://arxiv.org/abs/2012.15828
- Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin, 2019, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, arXiv preprint arXiv:1903.12136 (Mar 2019), https://arxiv.org/abs/1903.12136
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. 2020, FastBERT: a self-distilling BERT with adaptive inference time, arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
- Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT, 2018. https://www.aclweb.org/anthology/W18-2715
- Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation, arXiv preprint arXiv:1908.08962, https://arxiv.org/abs/1908.08962v1
- Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling, arXiv preprint arXiv:1909.00100. https://arxiv.org/abs/1909.00100
- Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. July 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding, arXiv preprint arXiv:1907.04829. https://arxiv.org/abs/1907.04829
- Yoon Kim and Alexander M Rush. Sep 2016. Sequence-level knowledge distillation, arXiv preprint arXiv:1606.07947, https://arxiv.org/abs/1606.07947
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task, In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation, In Advances in Neural Information Processing Systems, pages 742–751. https://dl.acm.org/doi/10.5555/3294771.3294842
- Mao, Y.; Wang, Y.; Wu, C.; Zhang, C.; Wang, Y.; Zhang, Q.; Yang, Y.; Tong, Y.; and Bai, J. 2020. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression, In COLING, 3225–3234. International Committee on Computational Linguistics. https://arxiv.org/abs/2004.04124 (A combination of weight pruning, matrix factorization and knowledge distillation.)
- Lin, S.C.; Yang, J.H.; Lin, J., 2020, Distilling dense representations for ranking using tightly-coupled teachers, arXiv preprint arXiv:2010.11386 2020. https://arxiv.org/abs/2010.11386
- Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices, In AAAI’18, https://arxiv.org/abs/1708.04728 (Includes a distinct type of distillation.)
- Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006, Model compression, In Tina Eliassi-Rad, Lyle H. Ungar, Mark Craven, and Dimitrios Gunopulos (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 535–541. ACM, 2006. https://www.semanticscholar.org/paper/Model-compression-Bucila-Caruana/30c9bb327b7f2b9f1d1e5b69b9d0c97b410948d9, PDF: http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf (Early 2006 paper on teacher-student training, before the technique became known as "distillation" in 2015.)
- Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. 2016, Distilling word embeddings: An encoding approach, In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
- Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian J. McAuley, and Furu Wei. 2021, Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression, CoRR, abs/2109.03228, 2021, https://arxiv.org/abs/2109.03228 (Evaluation of the efficiency of distillation.)
- Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew Gordon Wilson. 2021, Does knowledge distillation really work?, CoRR, abs/2106.05945, 2021, https://arxiv.org/abs/2106.05945 (Evaluation of the efficacy of distillation.)
- Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Daesin Kim, and Bohyung Han. 2021, Learning student-friendly teacher networks for knowledge distillation, CoRR, abs/2102.07650, 2021. https://arxiv.org/abs/2102.07650
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y Code: https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge (See Chapter 13: Distilling Selective/Scaffolded Knowledge)
- Zhang, C.; Yang, Y.; Liu, J.; Wang, J.; Xian, Y.; Wang, B.; and Song, D. 2023. Lifting the Curse of Capacity Gap in Distilling Language Models, arXiv:2305.12129. https://arxiv.org/abs/2305.12129
- Chen X, He B, Hui K, Sun L, Sun Y. 2020, Simplified TinyBERT: Knowledge Distillation for Document Retrieval, arXiv preprint, https://arxiv.org/abs/2009.07531
- Tian Y, Krishnan D, Isola P. 2019, Contrastive Representation Distillation, arXiv preprint arXiv:1910.10699, https://arxiv.org/pdf/1910.10699.pdf
- Do T, Tran H, Do T, Tjiputra E, Tran Q. 2019, Compact Trilinear Interaction for Visual Question Answering, In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 392-401. https://arxiv.org/abs/1909.11874
- T Chen, S Liu, Z Chen, W Hu, D Chen, Y Wang, Q Lyu, 2023, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, https://www.oajaiml.com/uploads/archivepdf/27841181.pdf
- B. Heo, M. Lee, S. Yun and J. Y. Choi, 2019, Knowledge transfer via distillation of activation boundaries formed by hidden neurons, Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 33, no. 1, pp. 3779-3787, 2019, https://arxiv.org/abs/1811.03233
- Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, Kun Gai, 2018, Rocket launching: A universal and efficient framework for training well-performing light net, Proc. AAAI Conf. Artif. Intell., pp. 1-8, 2018. https://arxiv.org/abs/1708.04106 (Combined training of teacher and student models.)
- A. Chaulwar et al., 2022, Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices, arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
- T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
- Z Zhao, Q Liu, H Gui, B An, L Hong, H Chi, Oct 2023, Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication, arXiv preprint arXiv:2310.03188, https://arxiv.org/pdf/2310.03188.pdf
- T Udagawa, A Trivedi, M Merler, B Bhattacharjee, Oct 2023, A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models, arXiv preprint arXiv:2310.08797, https://arxiv.org/abs/2310.08797
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
For more research papers on Knowledge Distillation, see: https://www.aussieai.com/research/knowledge-distillation.