Aussie AI

Embeddings Matrix Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Pruning of embeddings doesn't receive much research attention, because the embedding layer isn't a bottleneck in most Transformers. Most research on pruning embeddings aims to shrink the memory footprint of the embedding matrices so that models fit on smaller devices, rather than to speed up inference. Converting a token into an embedding vector is a lookup into a single embedding matrix, which can be large if the model's vocabulary is large. Various pruning approaches compress this matrix using techniques such as sparsity or hashing.
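To make the idea concrete, here is a minimal C++ sketch (not taken from any of the papers below; all names and sizes are illustrative) contrasting a standard dense embedding lookup with a hashed lookup, where many token IDs share rows of a much smaller table, trading some collisions for a large reduction in memory.

    // Dense versus hashed embedding lookup (illustrative sketch only).
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Dense lookup: one row per vocabulary token (vocab_size x embed_dim floats).
    std::vector<float> embed_dense(const std::vector<float>& table,
                                   std::size_t embed_dim,
                                   std::size_t token_id)
    {
        const float* row = &table[token_id * embed_dim];
        return std::vector<float>(row, row + embed_dim);
    }

    // Hashed lookup: the table has num_rows << vocab_size rows, and token IDs
    // are mapped onto rows with a cheap hash, so distinct tokens may collide
    // and share an embedding (the "hashing trick" style of compression).
    std::vector<float> embed_hashed(const std::vector<float>& table,
                                    std::size_t num_rows,
                                    std::size_t embed_dim,
                                    std::size_t token_id)
    {
        std::size_t row_idx = (token_id * 2654435761ULL) % num_rows;  // multiplicative hash
        const float* row = &table[row_idx * embed_dim];
        return std::vector<float>(row, row + embed_dim);
    }

    int main()
    {
        const std::size_t vocab_size = 50000, embed_dim = 8, hashed_rows = 4096;
        std::vector<float> dense_table(vocab_size * embed_dim, 0.01f);    // 50,000 rows
        std::vector<float> hashed_table(hashed_rows * embed_dim, 0.02f);  // only 4,096 rows
        std::vector<float> a = embed_dense(dense_table, embed_dim, 12345);
        std::vector<float> b = embed_hashed(hashed_table, hashed_rows, embed_dim, 12345);
        std::printf("dense[0]=%g hashed[0]=%g\n", a[0], b[0]);
        return 0;
    }

In this toy configuration the hashed table stores 4,096 rows instead of 50,000, roughly a 12x memory saving. Real systems, such as the ROBE and compositional-embedding papers cited below, use more carefully designed mappings than a single modular hash to control the error introduced by collisions.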

Research papers on embeddings matrix pruning:

  1. Daochen Zha, Louis Feng, Bhargav Bhushanam, Dhruv Choudhary, Jade Nie, Yuandong Tian, Jay Chae, Yinbin Ma, Arun Kejariwal, Xia Hu, 2022, AutoShard: Automated Embedding Table Sharding for Recommender Systems, https://dl.acm.org/doi/abs/10.1145/3534678.3539034, https://arxiv.org/abs/2208.06399
  2. Aditya Desai, Li Chou, Anshumali Shrivastava, 2022, Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems, Conference on Machine Learning and Systems, https://arxiv.org/abs/2108.02191
  3. Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. 2020. Memory-efficient embedding for recommendations, arXiv preprint arXiv:2006.14827 (2020), https://arxiv.org/abs/2006.14827
  4. Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2019. Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems, arXiv preprint arXiv:1909.11810 (2019, revised Feb 2021), https://arxiv.org/abs/1909.11810
  5. Nicola Tonellotto, Craig Macdonald, 2021, Query Embedding Pruning for Dense Retrieval, CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia, https://arxiv.org/abs/2108.10341
  6. IamAdiSri, 2021, Pruning a model embedding matrix for memory efficiency, April 2021, Hugging Face discussion board, https://discuss.huggingface.co/t/pruning-a-model-embedding-matrix-for-memory-efficiency/5502/7
  7. Raphael Shu and Hideki Nakayama. 2017, Compressing word embeddings via deep compositional code learning, In ICLR (Poster). OpenReview.net, 2018, https://arxiv.org/abs/1711.01068
  8. Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. 2020, Compositional embeddings using complementary partitions for memory-efficient recommendation systems, In KDD, pp. 165-175. ACM, 2020, https://arxiv.org/abs/1909.02107
  9. Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized Embedding Layers for Efficient Model Compression, arXiv preprint arXiv:1901.10787 (2019), updated Feb 2020, https://arxiv.org/abs/1901.10787v1
  10. Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015, Sparse overcomplete word vector representations, In ACL (1), pp. 1491-1500. The Association for Computer Linguistics, 2015, https://arxiv.org/abs/1506.02004 (Binary quantization in relation to word vector embeddings.)
  11. Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. 2016, Compressing neural language models by sparse word representations, In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1610.03950 (Sparse matrix via common and rare word embeddings)
  12. Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, 2021, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper with section on “Compact Embeddings”.)
  13. Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
  14. Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing, In IJCAI. 2046–2052, https://dl.acm.org/doi/10.5555/3060832.3060907
  15. Aliakbar Panahi, Seyran Saeedi, and Tom Arodz. 2019. word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement, In ICLR. https://arxiv.org/abs/1911.04975
  16. Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019, AutoInt: Automatic feature interaction learning via self-attentive neural networks, In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019, https://arxiv.org/abs/1810.11921, Code: https://github.com/DeepGraphLearning/RecommenderSystems
  17. Xiaorui Wu, Hong Xu, Honglin Zhang, Huaming Chen, and Jian Wang. 2019, Saec: Similarity-aware embedding compression in recommendation systems, CoRR, abs/1903.00103, 2019, https://arxiv.org/abs/1903.00103
  18. Martin Andrews. 2016, Compressing word embeddings, CoRR, abs/1511.06397, 2015 (revised May 2016), https://arxiv.org/abs/1511.06397v2
  19. Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. 2016, Distilling word embeddings: An encoding approach, In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
  20. Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018, GroupReduce: Block-wise low-rank approximation for neural language model shrinking, In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
  21. Maximilian Lam. 2018, Word2bits - quantized word vectors, CoRR, abs/1803.05651, 2018, https://arxiv.org/abs/1803.05651 (Quantization ideas lead to compression of word vectors and embeddings.)
  22. Alexei Baevski and Michael Auli. 2019, Adaptive input representations for neural language modeling, In ICLR, 2019, https://arxiv.org/abs/1809.10853 (Faster training with adaptive embeddings size.)

For more research papers on embeddings matrix pruning and optimizations, see https://www.aussieai.com/research/embeddings.

 
