Aussie AI

Embeddings Matrix Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Pruning of embeddings doesn't receive much research attention, because the embedding layer isn't a bottleneck in most Transformers. Most research on pruning embeddings aims to shrink the memory footprint of the embedding matrices so that models fit on smaller devices, rather than to speed up inference. Converting a token into an embedding vector is a lookup into a single embedding matrix, which can be large if the model's vocabulary is large. Various pruning approaches compress this matrix using techniques such as sparsity or hashing.
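To make the idea concrete, here is a minimal C++ sketch (not taken from any of the papers below; all names and sizes are illustrative) contrasting a standard dense embedding lookup with a hashed lookup, where many token IDs share rows of a much smaller table, trading some collisions for a large reduction in memory.

    // Dense versus hashed embedding lookup (illustrative sketch only).
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Dense lookup: one row per vocabulary token (vocab_size x embed_dim floats).
    std::vector<float> embed_dense(const std::vector<float>& table,
                                   std::size_t embed_dim,
                                   std::size_t token_id)
    {
        const float* row = &table[token_id * embed_dim];
        return std::vector<float>(row, row + embed_dim);
    }

    // Hashed lookup: the table has num_rows << vocab_size rows, and token IDs
    // are mapped onto rows with a cheap hash, so distinct tokens may collide
    // and share an embedding (the "hashing trick" style of compression).
    std::vector<float> embed_hashed(const std::vector<float>& table,
                                    std::size_t num_rows,
                                    std::size_t embed_dim,
                                    std::size_t token_id)
    {
        std::size_t row_idx = (token_id * 2654435761ULL) % num_rows;  // multiplicative hash
        const float* row = &table[row_idx * embed_dim];
        return std::vector<float>(row, row + embed_dim);
    }

    int main()
    {
        const std::size_t vocab_size = 50000, embed_dim = 8, hashed_rows = 4096;
        std::vector<float> dense_table(vocab_size * embed_dim, 0.01f);    // 50,000 rows
        std::vector<float> hashed_table(hashed_rows * embed_dim, 0.02f);  // only 4,096 rows
        std::vector<float> a = embed_dense(dense_table, embed_dim, 12345);
        std::vector<float> b = embed_hashed(hashed_table, hashed_rows, embed_dim, 12345);
        std::printf("dense[0]=%g hashed[0]=%g\n", a[0], b[0]);
        return 0;
    }

In this toy configuration the hashed table stores 4,096 rows instead of 50,000, roughly a 12x memory saving. Real systems, such as the ROBE and compositional-embedding papers cited below, use more carefully designed mappings than a single modular hash to control the error introduced by collisions.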

Research papers on embeddings matrix pruning:

  1. Daochen Zha, Louis Feng, Bhargav Bhushanam, Dhruv Choudhary, Jade Nie, Yuandong Tian, Jay Chae, Yinbin Ma, Arun Kejariwal, Xia Hu, 2022, AutoShard: Automated Embedding Table Sharding for Recommender Systems, https://dl.acm.org/doi/abs/10.1145/3534678.3539034, https://arxiv.org/abs/2208.06399
  2. Aditya Desai, Li Chou, Anshumali Shrivastava, 2022, Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems, Conference on Machine Learning and Systems, https://arxiv.org/abs/2108.02191
  3. Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. 2020. Memory-efficient embedding for recommendations, arXiv preprint arXiv:2006.14827 (2020), https://arxiv.org/abs/2006.14827
  4. Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2019. Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems, arXiv preprint arXiv:1909.11810 (2019, revised Feb 2021), https://arxiv.org/abs/1909.11810
  5. Nicola Tonellotto, Craig Macdonald, 2021, Query Embedding Pruning for Dense Retrieval, CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia, https://arxiv.org/abs/2108.10341
  6. IamAdiSri, 2021, Pruning a model embedding matrix for memory efficiency, April 2021, Hugging Face discussion board, https://discuss.huggingface.co/t/pruning-a-model-embedding-matrix-for-memory-efficiency/5502/7
  7. Raphael Shu and Hideki Nakayama. 2017, Compressing word embeddings via deep compositional code learning, In ICLR (Poster). OpenReview.net, 2018, https://arxiv.org/abs/1711.01068
  8. Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. 2020, Compositional embeddings using complementary partitions for memory-efficient recommendation systems, In KDD, pp. 165-175. ACM, 2020, https://arxiv.org/abs/1909.02107
  9. Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized Embedding Layers for Efficient Model Compression, arXiv preprint arXiv:1901.10787 (2019), updated Feb 2020, https://arxiv.org/abs/1901.10787v1
  10. Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015, Sparse overcomplete word vector representations, In ACL (1), pp. 1491-1500. The Association for Computer Linguistics, 2015, https://arxiv.org/abs/1506.02004 (Binary quantization in relation to word vector embeddings.)
  11. Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. 2016, Compressing neural language models by sparse word representations, In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1610.03950 (Sparse matrix via common and rare word embeddings)
  12. Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, 2021, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper with section on “Compact Embeddings”.)
  13. Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
  14. Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing, In IJCAI. 2046–2052, https://dl.acm.org/doi/10.5555/3060832.3060907
  15. Aliakbar Panahi, Seyran Saeedi, and Tom Arodz. 2019. word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement, In ICLR. https://arxiv.org/abs/1911.04975
  16. Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019, AutoInt: Automatic feature interaction learning via self-attentive neural networks, In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019, https://arxiv.org/abs/1810.11921, Code: https://github.com/DeepGraphLearning/RecommenderSystems
  17. Xiaorui Wu, Hong Xu, Honglin Zhang, Huaming Chen, and Jian Wang. 2019, Saec: Similarity-aware embedding compression in recommendation systems, CoRR, abs/1903.00103, 2019, https://arxiv.org/abs/1903.00103
  18. Martin Andrews. 2016, Compressing word embeddings, CoRR, abs/1511.06397, 2015 (revised May 2016), https://arxiv.org/abs/1511.06397v2
  19. Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. 2016, Distilling word embeddings: An encoding approach, In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
  20. Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018, GroupReduce: Block-wise low-rank approximation for neural language model shrinking, In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
  21. Maximilian Lam. 2018, Word2bits - quantized word vectors, CoRR, abs/1803.05651, 2018, https://arxiv.org/abs/1803.05651 (Quantization ideas lead to compression of word vectors and embeddings.)
  22. Alexei Baevski and Michael Auli. 2019, Adaptive input representations for neural language modeling, In ICLR, 2019, https://arxiv.org/abs/1809.10853 (Faster training with adaptive embeddings size.)

For more research papers on embeddings matrix pruning and optimizations, see https://www.aussieai.com/research/embeddings.

 
