Aussie AI
Memory Optimization
Last Updated 12 December, 2024
by David Spuler, Ph.D.
Memory optimization involves using less memory during model inference. This means that inference requires fewer resources, and it can also reduce CPU usage because less data is swapped in and out of memory. Memory optimization can refer to either CPU memory or GPU memory.
Some research reports that model inference is memory-bound rather than CPU-bound. In such cases, memory management is key to improving latency and throughput. On the other hand, researchers have also examined increasing memory usage to save time via caching and computation reuse.
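As a concrete example of the space-for-time direction, an inference kernel can precompute an expensive function into a lookup table so that it never needs to be recomputed. The C++ sketch below is a minimal illustration of this idea; the table size, clamp range, and GELU approximation are illustrative assumptions, not settings from any particular engine.

```cpp
// Trading memory for time: precompute an activation function into a lookup
// table once, then replace every tanh-based evaluation with one memory read.
#include <array>
#include <cmath>
#include <cstdio>

constexpr int kTableSize = 1 << 16;           // 65,536 entries (256 KB of floats)
constexpr float kMinX = -8.0f, kMaxX = 8.0f;  // clamp range for the input

static std::array<float, kTableSize> g_gelu_table;

static float gelu(float x) {                  // the "expensive" computation
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

void build_gelu_table() {                     // pay the memory cost once, up front
    for (int i = 0; i < kTableSize; ++i) {
        float x = kMinX + (kMaxX - kMinX) * i / (kTableSize - 1);
        g_gelu_table[i] = gelu(x);
    }
}

inline float gelu_cached(float x) {           // fast path: one table read, no tanh
    if (x <= kMinX) return 0.0f;              // GELU is ~0 for large negative x
    if (x >= kMaxX) return x;                 // GELU is ~x for large positive x
    int i = static_cast<int>((x - kMinX) / (kMaxX - kMinX) * (kTableSize - 1));
    return g_gelu_table[i];
}

int main() {
    build_gelu_table();
    std::printf("gelu(1.0) = %f (exact %f)\n", gelu_cached(1.0f), gelu(1.0f));
    return 0;
}
```

Recomputation, covered below, is the reverse trade-off: the table (or a cached activation) is discarded, and the compute cost is paid again whenever the value is needed.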
Model Compression Techniques
The main class of optimizations that reduce memory requirements by making the model smaller is called "model compression". Model compression includes sub-strategies such as:
- Quantization
- Pruning
- Knowledge distillation
Recomputation: Trading Time for Space
On memory-constrained devices, it is possible to reduce space requirements at the cost of extra processor time. This is called "recomputation", or sometimes in research papers "rematerialization" or "checkpointing". When this technique is used to optimize training of a model that is too large to fit inside GPU memory, it is called "gradient checkpointing." The portion of this algorithm that swaps tensors off the GPU back to the CPU is often called "offloading."
The recomputation optimization involves not storing the results of a computation that will be needed later, but instead waiting and recomputing them from scratch when they are required. Hence, recomputation trades time for space, and is effectively the opposite of caching and data reuse optimizations, which trade space for time.
Recomputation means doing calculations a second time, which is redundant computation. This is not something you want to do often, since it costs a lot more CPU or GPU time, but it is a technique worth considering when memory is at a premium, and it is sometimes used as a GPU optimization.
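As a minimal sketch of the trade-off, consider a single intermediate activation in one layer: caching keeps it in memory for later reuse, whereas recomputation discards it and reruns the layer when the value is needed again (much as gradient checkpointing does for the backward pass). The layer function and class names below are illustrative placeholders, not taken from any real framework.

```cpp
// Recomputation (rematerialization) versus caching for one layer's activation.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<float>;

// Placeholder for a real layer's computation (e.g., a Transformer sub-layer).
Vec layer_forward(const Vec& input) {
    Vec out(input.size());
    for (size_t i = 0; i < input.size(); ++i)
        out[i] = std::tanh(input[i]);   // stand-in for the real math
    return out;
}

// Option 1: cache the activation -- O(n) extra memory, no extra compute later.
struct CachedActivation {
    Vec saved;
    const Vec& forward(const Vec& in) { saved = layer_forward(in); return saved; }
    const Vec& use_later() const { return saved; }   // free, but memory stays allocated
};

// Option 2: recompute the activation -- no extra memory, extra compute later.
// (The caller must keep the layer input alive until use_later() is called.)
struct RecomputedActivation {
    const Vec* input = nullptr;
    Vec forward(const Vec& in) {
        input = &in;                    // remember only the (already stored) input
        return layer_forward(in);       // output flows onward but is not retained
    }
    Vec use_later() const { return layer_forward(*input); }  // pay the time cost again
};

int main() {
    Vec x = {0.1f, 0.2f, 0.3f};
    RecomputedActivation r;
    Vec out = r.forward(x);             // used by the "next layer"
    (void)out;
    Vec again = r.use_later();          // recomputed on demand, e.g. for a backward pass
    std::printf("recomputed %zu values, first = %.4f\n", again.size(), again[0]);
    return 0;
}
```

In practice, gradient checkpointing applies this choice per layer or per group of layers, keeping a small set of "checkpoint" activations and recomputing the rest from them during the backward pass.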
Research on Recomputation: Research papers on the recomputation memory optimization technique include:
- Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo Qiao, Zhiquan Lai, Dongsheng Li, 2022, Delta: Dynamically optimizing gpu memory beyond tensor recomputation, https://arxiv.org/abs/2203.15980
- Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1341–1355. https://dl.acm.org/doi/10.1145/3373376.3378530, PDF: https://news.cs.nyu.edu/~jinyang/pub/swapadvisor-asplos20.pdf
- Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 891–905. https://doi.org/10.1145/3373376.3378505, https://dl.acm.org/doi/10.1145/3373376.3378505
- O. Beaumont, L. Eyraud-Dubois, and A. Shilova, 2021, Efficient combination of rematerialization and offloading for training dnns, Advances in Neural Information Processing Systems, vol. 34, PDF: https://proceedings.nips.cc/paper/2021/file/c8461bf13fca8a2b9912ab2eb1668e4b-Paper.pdf
- Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock, 2020, Dynamic tensor rematerialization, arXiv preprint arXiv:2006.09616, https://arxiv.org/abs/2006.09616
- Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, and Joshua Wang. Efficient rematerialization for deep networks. Advances in Neural Information Processing Systems, 32, 2019. https://dl.acm.org/doi/10.5555/3454287.3455646, PDF: https://proceedings.neurips.cc/paper/2019/file/ffe10334251de1dc98339d99ae4743ba-Paper.pdf
- Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica, 2020, Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems, 2:497–511, https://arxiv.org/abs/1910.02653 Code: https://github.com/parasj/checkmate
- Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, https://arxiv.org/abs/1604.06174
- Audrūnas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-Efficient Backpropagation through Time. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., 4132–4140. https://arxiv.org/abs/1606.03401
- James Martens and Ilya Sutskever. 2012. Training deep and recurrent networks with hessian-free optimization. In Neural Networks: Tricks of the Trade. Springer. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_27, PDF: https://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf
- M Schuler, R Membarth, P Slusallek, 2022, Xengine: Optimal tensor rematerialization for neural networks in heterogeneous environments, ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 1, Article No. 17, pp 1–25, https://dl.acm.org/doi/10.1145/3568956, PDF: https://dl.acm.org/doi/pdf/10.1145/3568956, Code: https://github.com/dfki-asr/xengine
- Hugging Face, Performance and Scalability: How To Fit a Bigger Model and Train It Faster, https://huggingface.co/docs/transformers/v4.18.0/en/performance (Gradient checkpointing to optimize training of large models.)
- Yaroslav Bulatov, Jan 14, 2018, Fitting larger networks into memory, Medium, https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9 (Gradient checkpointing for training large models.)
- Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, and Alena Shilova. Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. Research Report RR-9302, Inria Bordeaux Sud-Ouest, November 2019, https://arxiv.org/abs/1911.13214
- Navjot Kukreja, Jan Hückelheim, and Gerard J Gorman. Backpropagation for long sequences: beyond memory constraints with constant overheads. arXiv preprint arXiv:1806.01117, 2018, https://arxiv.org/abs/1806.01117
- L Waeijen, S Sioutas, M Peemen, M Lindwer, 2021, ConvFusion: A model for layer fusion in convolutional neural networks, IEEE Access (Volume: 9), https://ieeexplore.ieee.org/abstract/document/9646923/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09646923.pdf
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, May 2022, Reducing Activation Recomputation in Large Transformer Models, https://arxiv.org/abs/2205.05198
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
- Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, July 2024, Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism, Kuaishou Technology, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024, Santa Clara, CA, USA, https://www.usenix.org/conference/atc24/presentation/yuan https://www.usenix.org/system/files/atc24-yuan.pdf
- Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen, 27 Jun 2024 (v2), Optimizing Large Model Training through Overlapped Activation Recomputation, https://arxiv.org/abs/2406.08756
- Xunyi Zhao, Lionel Eyraud-Dubois, Théotime Le Hellard, Julia Gusak, Olivier Beaumont, 24 July, 2024, OFFMATE: full fine-tuning of LLMs on a single GPU by re-materialization and offloading, https://hal.science/hal-04660745/document
- Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
Research on Memory Optimization
For model compression and its popular subtypes, see research paper lists on the individual pages (e.g. quantization, pruning). Other research that is specifically on memory management and reducing memory includes:
- Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016). https://arxiv.org/abs/1604.06174
- Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN symposium on principles and practice of parallel programming. 41–53. https://arxiv.org/abs/1801.04380
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention arXiv preprint, https://arxiv.org/abs/2309.06180
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023, High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023). https://arxiv.org/abs/2303.06865 (FlexGen model optimizes speed and memory.)
- Shishir G Patil, Paras Jain, Prabal Dutta, Ion Stoica, and Joseph Gonzalez. 2022. POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573–17583. https://arxiv.org/abs/2207.07697
- Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497–511. https://arxiv.org/abs/1910.02653
- Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou, Feb 2021, TurboTransformers: An Efficient GPU Serving System For Transformer Models, Proceedings of the 26th ACM SIGPLAN, 2021, https://dl.acm.org/doi/pdf/10.1145/3437801.3441578, https://arxiv.org/abs/2010.05680
- Nimit S. Sohoni, Christopher R. Aberger, Megan Leszczynski, Jian Zhang, Christopher Ré, Apr 2022, Low-Memory Neural Network Training: A Technical Report, arXiv preprint, https://arxiv.org/abs/1904.10631
- Tung D. Le, Haruki Imai, Yasushi Negishi, Kiyokuni Kawachiya, 2019, Automatic GPU memory management for large neural models in TensorFlow, ISMM 2019: Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, June 2019, Pages 1–13, 2019, https://dl.acm.org/doi/10.1145/3315573.3329984
- SB Shriram, A Garg, P Kulkarni, 2019, Dynamic Memory Management for GPU-Based Training of Deep Neural Networks, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), https://ieeexplore.ieee.org/document/8820980
- Y Pisarchyk, J Lee, 2020, Efficient memory management for deep neural net inference, arXiv preprint arXiv:2001.03288, https://arxiv.org/abs/2001.03288
- Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1341–1355. https://dl.acm.org/doi/10.1145/3373376.3378530
- Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, and James Hegarty. 2022. OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. https://arxiv.org/abs/2210.12924
- Mahdi Nazemi, Ghasem Pasandi, Massoud Pedram, Aug 2018, NullaNet: Training Deep Neural Networks for Reduced-Memory-Access Inference, https://arxiv.org/abs/1807.08716
- Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. USENIX Annual Technical Conference. 551–564. https://arxiv.org/abs/2101.06840 (Offloading strategy for memory optimizations.)
- Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou, Dec 2019, Large memory layers with product keys, NeurIPS, https://arxiv.org/abs/1907.05242, https://proceedings.neurips.cc/paper/2019/file/9d8df73a3cfbf3c5b47bc9b50f214aff-Paper.pdf
- Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359. https://arxiv.org/abs/2205.14135
- Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Memory optimizations including token-based data sharding for allocation to different memory banks.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention (One of the main optimizations used by Flash Attention was its memory management.)
- Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Oct 2022. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. Advances in Neural Information Processing Systems, 35:12991–13005. https://arxiv.org/abs/2206.06522 (Reduces memory requirements of training.)
- M Capra, B Bussolino, A Marchisio, M Shafique, 2020, An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, https://www.mdpi.com/1999-5903/12/7/113/pdf (Survey paper with sections on memory optimization.)
- Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov, Feb 2021, Memory Transformer, https://arxiv.org/abs/2006.11527
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” ISCA, 2016. https://ieeexplore.ieee.org/document/7551407, PDF: http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf, PDF Slides: https://eems.mit.edu/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf, Project: http://eyeriss.mit.edu/
- Z Guo, Z He, Y Zhang, 2023, Mira: A Program-Behavior-Guided Far Memory System, PDF: https://cseweb.ucsd.edu/~yiying/Mira-SOSP23.pdf (Although "far memory" is probably not desirable for fast AI inference, this paper has interesting coverage of automatic memory policy management and cache optimization using static analysis and performance profiling.)
- Nabavinejad, S.M.; Baharloo, M.; Chen, K.C.; Palesi, M.; Kogel, T.; Ebrahimi, M., An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 268–282. http://dx.doi.org/10.1109/JETCAS.2020.3022920, https://ieeexplore.ieee.org/abstract/document/9189825 (Hardware paper about on-chip interconnection optimizations, but examines near-memory optimizations.)
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf (Extensive analysis of ML compiler optimizations, including a long section on memory optimizations for ML compilers.)
- N Penkov, K Balaskas, M Rapp, J Henkel, 2023, Differentiable Slimming for Memory-Efficient Transformers, IEEE Embedded Systems Letters (Early Access), DOI: 10.1109/LES.2023.3299638, https://ieeexplore.ieee.org/abstract/document/10261943
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with memory-compression attention mechanisms.)
- João Gabriel Lopes Jan 29, 2022, Optimizing TensorFlow Models for Inference, https://medium.com/tinyclues-vision/optimizing-tensorflow-models-for-inference-d3636cf34034 (Discussion of memory optimization on TensorFlow.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Memory optimizations from a kernel fusion and compiler-level perspective.)
- E Yvinec, A Dapogny, K Bailly, Sep 2023, Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings, arXiv preprint arXiv:2309.17361, https://arxiv.org/abs/2309.17361 (Uses "codebooks", i.e. look-up tables, to reduce memory usage.)
- Song Han, Jeff Pool, John Tran, and William Dally, 2015, Learning both weights and connections for efficient neural network, Advances in neural information processing systems, 28, 2015, https://arxiv.org/abs/1506.02626
- Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 891–905. https://doi.org/10.1145/3373376.3378505, https://dl.acm.org/doi/10.1145/3373376.3378505
- Xia, C., Zhao, J., Sun, Q., Wang, Z., Wen, Y., Feng, X., Cui, H., 2023, Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 27 Apr-01 May 2023, San Diego, USA. https://eprints.whiterose.ac.uk/203681/, PDF: https://eprints.whiterose.ac.uk/203681/1/asplos24.pdf (Analyzes memory-intensive versus compute-intensive kernel operators and reducing GPU memory data transfers.)
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Memory optimization of training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- Nicolai M. Josuttis, 2012, The C++ Standard Library: A Tutorial and Reference, Second Edition, Supplementary Chapter, https://www.amazon.com/Standard-Library-Tutorial-Reference-2nd/dp/0321623215, PDF (extra chapter): http://www.cppstdlib.com/cppstdlib_supplementary.pdf (C++ optimizations such as bit sets and user-defined memory allocators.)
- Zhen Zheng, Xuanda Yang, et al. 2022. AStitch: enabling a new multidimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373. https://dl.acm.org/doi/abs/10.1145/3503222.3507723
- Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu, June 2023, Full parameter fine-tuning for large language models with limited resources, arXiv preprint arXiv:2306.09782, https://arxiv.org/abs/2306.09782 (Fused gradient computation and parameter update saves memory in training kernel by not saving the gradient tensor in memory.)
- S Agrawal, P Ghosh, G Kumar, T Radhika, 2023, Memory Footprint Optimization for Neural Network Inference in Mobile SoCs, 2023 IEEE Women in Technology Conference (WINTECHCON) https://ieeexplore.ieee.org/abstract/document/10277304 (Improved management of memory buffers.)
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Pietro Farina, Subrata Biswas, Eren Yıldız, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kasım Sinan Yıldırım, 16 May 2024, Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems, https://arxiv.org/abs/2405.10426
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- T Senoo, R Kayanoma, A Jinguji, H Nakahara, 2023, A Light-Weight Vision Transformer Toward Near Memory Computation on an FPGA ARC 2023: Applied Reconfigurable Computing. Architectures, Tools, and Applications, pp 338–353, https://link.springer.com/chapter/10.1007/978-3-031-42921-7_23 (Vision transformer optimized for near-memory computation.)
- Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Make LLM Fine-tuning 2x faster with Unsloth and HF TRL, January 10, 2023, Daniel Han-Chen, https://huggingface.co/blog/unsloth-trl Code: https://github.com/huggingface/blog/blob/main/unsloth-trl.md (Optimizes some PyTorch kernels for back-propagation and reduces memory usage in fine-tuning; currently works with Llama and Mistral architectures.)
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, Dec 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory Apple Research, https://arxiv.org/abs/2312.11514
- Noam Shazeer, Mitchell Stern, Apr 2018, Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, https://arxiv.org/abs/1804.04235
- Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, Gennady Pekhimenko, 2018, Gist: Efficient Data Encoding for Deep Neural Network Training, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/document/8416872 PDF: https://www.microsoft.com/en-us/research/uploads/prod/2018/04/fiddle-gist-isca18.pdf
- Manuel Pöter, Jesper Larsson Träff, Mar 2018, Memory Models for C/C++ Programmers, https://arxiv.org/pdf/1803.04432.pdf
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
- Abhiroop Bhattacharjee, Yeshwanth Venkatesha, Abhishek Moitra, Priyadarshini Panda, MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning. In: DAC (2022) https://arxiv.org/abs/2204.05274
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054 Code: part of: https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- Mark Hildebrand, Jason Lowe-Power, Venkatesh Akella, 2024, CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems, IEEE, DOI 10.1109/IPDPS57955.2024.00055, https://arch.cs.ucdavis.edu/assets/papers/ipdps24-cachedarrays.pdf (Caching.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Wei Niu, Gagan Agrawal, Bin Ren, 29 Feb 2024, SoD2: Statically Optimizing Dynamic Deep Neural Network, https://arxiv.org/abs/2403.00176 (Analysis of operator computation shapes and pathways with kernel fusion and memory planning.)
- Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti, 14 Mar 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, https://arxiv.org/abs/2403.09636 (Reducing the memory size of the KV cache.)
- Bahareh Khabbazan, Marc Riera, Antonio González, Oct 2023, An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses, https://arxiv.org/abs/2310.18181
- Alireza Amirshahi, Giovanni Ansaloni, David Atienza, 20 Dec 2023, Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures, https://arxiv.org/abs/2312.13000
- Chen Ding, Christopher Kanan, Dylan McKellips, Toranosuke Ozawa, Arian Shahmirza, Wesley Smith, 22 Dec 2023, DMC4ML: Data Movement Complexity for Machine Learning, https://arxiv.org/abs/2312.14441
- Tanvi Sharma, Mustafa Ali, Indranil Chakraborty, Kaushik Roy, 26 Dec 2023, WWW: What, When, Where to Compute-in-Memory, https://arxiv.org/abs/2312.15896
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- Arnav Chavan, Nahush Lele, Deepak Gupta, Dec 2023, Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models https://arxiv.org/abs/2312.07046 Code: https://github.com/transmuteAI/trailmet/tree/main/trailmet/algorithms/llm-rom
- Robert A. van de Geijn, Enrique S. Quintana-Ort´ı, 2007, The Science of Programming Matrix Computations, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf Code: http://www.cs.utexas.edu/users/flame/
- Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 30 May 2024 (v3), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 (Reduces the number of memory accesses for attention computations and the KV cache.)
- Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Tranformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
- MWU Rahman, MM Abrar, HG Copening, S Hariri, Oct 2023, Quantized Transformer Language Model Implementations on Edge Devices, https://arxiv.org/pdf/2310.03971.pdf (Uses a "FlatBuffer" format on TensorFlow-Lite.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
- J Chen, S Kao, H He, W Zhuo, S Wen, 2023, Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks, https://openaccess.thecvf.com/content/CVPR2023/papers/Chen_Run_Dont_Walk_Chasing_Higher_FLOPS_for_Faster_Neural_Networks_CVPR_2023_paper.pdf
- Minkyu Kim and Jae Sun Seo. 2021. An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access. IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- David Spuler, March 2024, Chapter 14. Memory Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- B Wu, Y Zhong, Z Zhang, G Huang, X Liu, 2023, Fast Distributed Inference Serving for Large Language Models, https://arxiv.org/abs/2305.05920
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. https://openai.com/blog/sparse-transformers, 2019, https://arxiv.org/abs/1904.10509
- Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children’s books with explicit memory representations. CoRR, abs/1511.02301, 2015. URL http://arxiv.org/abs/1511.02301.
- Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014. http://arxiv.org/abs/1410.3916
- Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. Oneshot learning with memory-augmented neural networks. CoRR, abs/1605.06065, 2016. URL http://arxiv.org/abs/1605.06065.
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- By Ben Dickson, December 27, 2023, Apple research paper hints at LLMs on iPhones and Macs, https://bdtechtalks.com/2023/12/27/apple-llm-flash-research/
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia, 9 Jul 2024, Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach, https://arxiv.org/abs/2407.06964
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Gavin Li, August 3rd, 2024, Crazy Challenge: Run Llama 405B on a 8GB VRAM GPU, https://ai.gopubby.com/crazy-challenge-run-llama-405b-on-a-8gb-vram-gpu-ab5a280a3889 (Run Llama's 405B model on a low-end GPU via 4-bit quantization and layer-by-layer inference, both to save memory.)
- Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
- Jaewook Lee, Yoel Park, Seulki Lee, 7 Aug 2024, Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks, https://arxiv.org/abs/2408.03663
- B. Kim et al., 2024, The Breakthrough Memory Solutions for Improved Performance on LLM Inference, IEEE Micro, vol. 44, no. 3, pp. 40-48, May-June 2024, doi: 10.1109/MM.2024.3375352, https://ieeexplore.ieee.org/abstract/document/10477465
- Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez, 12 Feb 2024 (v2), MemGPT: Towards LLMs as Operating Systems, https://arxiv.org/abs/2310.08560 https://research.memgpt.ai/
- Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu, 23 Aug 2024, Memory-Efficient LLM Training with Online Subspace Descent, https://arxiv.org/abs/2408.12857 https://github.com/kyleliang919/Online-Subspace-Descent
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
- Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya, 9 Sep 2024 (v2), Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices, https://arxiv.org/abs/2409.04249 (Pipelining of model layer-wise loading and inference for memory-efficient inference.)
- James Wang, August 27, 2024, Introducing Cerebras Inference: AI at Instant Speed, https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed
- Muhammad Saad Uddin, Sep 2024, Stop Guessing! Here’s How Much GPU Memory You REALLY Need for LLMs! Techniques to Calculate and Reduce Memory Footprint in LLM Serving, https://ai.gopubby.com/stop-guessing-heres-how-much-gpu-memory-you-really-need-for-llms-8e9b02bcdb62
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
- Justine, Apr 2023, Edge AI Just Got Faster, https://justine.lol/mmap/ (Loading models using mmap.)
- Z. Zhang, D. Yang, X. Zhou and D. Cheng, "MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators," in 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis SC, Atlanta, GA, United States, 2024, pp. 528-542, doi: 10.1109/SC41406.2024.00040. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a528/21HUVuG3S8M
- Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 14 Nov 2024, Pie: Pooling CPU Memory for LLM Inference, https://arxiv.org/abs/2411.09317
- Jinjie Liu, Hang Qiu, 14 Nov 2024, FluidML: Fast and Memory Efficient Inference Optimization, https://arxiv.org/abs/2411.09242
- Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl, 13 Nov 2024, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009 https://github.com/apple/ml-cross-entropy (Memory-efficient computation of cross-entropy in training.)
- Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- Chenghao Hu and Baochun Li. 2024. Menos: Split Fine-Tuning Large Language Models with Efficient GPU Memory Sharing. In Proceedings of the 25th International Middleware Conference (MIDDLEWARE '24). Association for Computing Machinery, New York, NY, USA, 185–198. https://doi.org/10.1145/3652892.3700758 https://dlnext.acm.org/doi/10.1145/3652892.3700758 https://iqua.ece.toronto.edu/papers/chenghao-middleware24.pdf
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Memory-Bound versus CPU-Bound
Surprisingly, researchers discovered that LLM inference was not CPU-bound (or GPU-bound), but was memory-bound, with the cost of accessing all those tensors full of weights (and activations) being the main efficiency bottleneck.
Subsequently, it was found to be more nuanced in decoder-only transformer architectures (e.g., GPT), where:
- Prefill phase — CPU-bound
- Decoding phase — memory-bound
The prefill phase is the initial "prompt processing" phase, where every token in the prompt is processed (in parallel) to generate the KV caches. This workload saturates the CPU, or rather, the GPU. Prefill is a busy time, but it also takes a long time, and it is the cause of the initial delay before an LLM starts answering your question.
The decoding phase comes next: the autoregressive algorithm emits one token at a time. Because it cannot be fully parallelized, decoding tends not to fill the GPU pipeline, yet it continually accesses the entire model, one layer at a time. Hence, it's memory-bound.
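A back-of-the-envelope calculation shows why. The sketch below uses illustrative numbers only (a hypothetical 7B-parameter FP16 model, plus assumed GPU figures of 1 TB/s memory bandwidth and 300 TFLOPs/s of FP16 compute); it is not a benchmark, but it shows that simply streaming the weights from memory dominates the per-token cost of decoding by roughly two orders of magnitude.

```cpp
// Why decoding is memory-bound: per generated token, every weight is read once,
// so the bytes moved dwarf the arithmetic that can be done per byte.
// All model and hardware numbers below are illustrative assumptions.
#include <cstdio>

int main() {
    const double params          = 7e9;    // assumed 7B-parameter model
    const double bytes_per_param = 2.0;    // FP16 weights
    const double flops_per_param = 2.0;    // ~one multiply-add per weight per token

    const double bytes_per_token = params * bytes_per_param;   // ~14 GB read per token
    const double flops_per_token = params * flops_per_param;   // ~14 GFLOPs per token

    const double mem_bw     = 1000e9;      // assumed 1 TB/s memory bandwidth
    const double peak_flops = 300e12;      // assumed 300 TFLOPs/s FP16 compute

    const double time_memory  = bytes_per_token / mem_bw;      // time to stream weights
    const double time_compute = flops_per_token / peak_flops;  // time to do the math

    std::printf("Per decoded token: %.1f ms for memory vs %.3f ms for compute\n",
                time_memory * 1e3, time_compute * 1e3);
    // Memory time dominates by ~300x, so decoding speed is set by bandwidth,
    // not by FLOPs. Prefill processes many tokens per weight read, which is
    // why it lands on the compute-bound side instead.
    return 0;
}
```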
Research papers on the memory-bound versus CPU-bound nature of transformers include:
- Amir Gholami; Zhewei Yao; Sehoon Kim; Coleman Hooper, 25 March 2024, AI and Memory Wall, IEEE Micro ( Early Access ), pp 1-5, https://ieeexplore.ieee.org/abstract/document/10477550
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
More AI Research
Read more about:
- Partitioning
- Model Compression
- Inference Optimizations
- Loop Optimizations
- Code Optimizations