Aussie AI

LLM Memory Optimization

  • Last Updated 17 March, 2025
  • by David Spuler, Ph.D.

Memory optimization means using less memory during model inference. Inference then requires fewer resources, and reducing memory can also cut CPU usage, because less data is swapped in and out of memory. Memory optimization can refer to either CPU memory or GPU memory.

Various research reports show that model inference is memory-bound rather than CPU-bound. In such cases, memory management is key to improving latency and throughput. On the other hand, researchers have also examined increasing memory usage to save time by caching and computation reuse.

Memory-Bound vs Compute-Bound

The situation with memory versus compute is more nuanced in LLM inference. Inference has two distinct phases with opposite characteristics:

  • Prefill phase (prompt processing) — compute-bound.
  • Decoding phase (autoregressive token generation) — memory-bound.

Hence, there is considerable research on prefill optimization, including "phase splitting", which disaggregates the prefill and decoding phases so that they can run on machines with different memory/GPU configurations.
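
To see why the two phases differ, consider a rough back-of-the-envelope estimate of arithmetic intensity (FLOPs per byte of memory traffic) for a single dense layer. The Python sketch below is illustrative only; the layer size and fp16 weight format are assumptions, not measurements of any particular model.

    # Back-of-the-envelope arithmetic intensity (FLOPs per byte of weights loaded)
    # for one dense d x d layer in fp16 (2 bytes per weight). Illustrative only:
    # real kernels also move activations and KV cache data.

    def arithmetic_intensity(num_tokens: int, d: int = 4096, bytes_per_weight: int = 2) -> float:
        flops = 2.0 * num_tokens * d * d          # one multiply-accumulate per token per weight
        weight_bytes = d * d * bytes_per_weight   # the weight matrix is loaded once per pass
        return flops / weight_bytes

    print("Prefill (512-token prompt):", arithmetic_intensity(512))  # ~512 FLOPs/byte -> compute-bound
    print("Decode (1 token at a time):", arithmetic_intensity(1))    # ~1 FLOP/byte    -> memory-bound

Comparing these ratios against a GPU's FLOPs-to-memory-bandwidth ratio (roughly in the low hundreds for fp16) suggests why prefill keeps the compute units busy while the decode step is starved for memory bandwidth.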

Going further, the decoding phase is memory-bound overall, but this arises mainly from the attention module, which must load KV cache data that changes with each generated token. Hence, the memory characteristics of the decoding phase are more nuanced:

  • Attention module (KV cache) — memory-bound.
  • FFN/MLP modules — compute-bound.

The FFN always operates with the same weight matrices, so its weights can be fully pre-loaded and reused, making it compute-bound. The attention module also uses the same model parameters for every token, but it must additionally load a different set of KV cache data for each token, making it memory-bound overall. Hence, one high-level memory optimization is not only to split prefill and decoding (phase splitting), but also to do "sublayer splitting", running the attention and FFN computations on different platforms.
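
To illustrate why the KV cache pushes attention into the memory-bound regime even when decoding is batched, the sketch below compares the fixed FFN weight bytes (shared across all requests in a batch) against the per-step KV cache bytes (unique to each request). The layer count, dimensions, context length, and batch size are illustrative assumptions, and it assumes standard multi-head attention with no grouped-query sharing or KV cache quantization.

    # Rough memory-traffic sketch for batched decoding in fp16 (2 bytes per value).
    # All sizes below are illustrative assumptions for a mid-sized decoder-only model.

    layers, d_model, d_ffn = 32, 4096, 11008
    context_len = 4096         # tokens already held in each request's KV cache
    batch = 16                 # concurrent requests decoded together
    bytes_per_value = 2        # fp16

    # FFN weights are identical for every request and token, so one load from
    # memory is amortized across the whole batch.
    ffn_weight_bytes = layers * (2 * d_model * d_ffn) * bytes_per_value

    # KV cache data is unique per request and grows every token, so it cannot be
    # shared: attention must stream batch * context_len worth of keys and values.
    kv_bytes = batch * layers * context_len * 2 * d_model * bytes_per_value

    print(f"FFN weights (shared across batch): {ffn_weight_bytes / 1e9:.1f} GB")
    print(f"KV cache streamed per decode step: {kv_bytes / 1e9:.1f} GB")

Batching amortizes the weight loads in the FFN, pushing it toward the compute-bound regime, but each request's KV cache still has to be streamed separately, which is why the attention module remains memory-bound.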

Model Compression Techniques

The main class of optimizations that reduce memory requirements by making the model itself smaller is called "model compression". Model compression includes sub-strategies such as:

  • Quantization — using lower-precision weights (e.g., 8-bit or 4-bit integers instead of 32-bit floats).
  • Pruning — removing unimportant weights, neurons, or entire layers.
  • Knowledge distillation — training a smaller "student" model to mimic a larger "teacher" model.
  • Weight sharing and low-rank factorization — representing weight matrices more compactly.
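
As a simple illustration of the memory savings from one of these sub-strategies, below is a minimal sketch of symmetric per-tensor 8-bit weight quantization; it is a generic example, not the method of any particular paper or library.

    import numpy as np

    # Minimal sketch of symmetric per-tensor 8-bit quantization of a weight matrix.
    # fp32 weights (4 bytes each) become int8 (1 byte) plus a single fp32 scale,
    # roughly a 4x reduction in stored model size.

    weights = np.random.randn(4096, 4096).astype(np.float32)

    scale = np.abs(weights).max() / 127.0               # one scale for the whole tensor
    q_weights = np.round(weights / scale).astype(np.int8)
    dequantized = q_weights.astype(np.float32) * scale  # reconstructed at inference time

    print(f"fp32 size: {weights.nbytes / 1e6:.1f} MB")
    print(f"int8 size: {q_weights.nbytes / 1e6:.1f} MB")
    print(f"max abs error: {np.abs(weights - dequantized).max():.4f}")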

Recomputation

Recomputation is a method of trading time for space in LLM algorithms by re-computing data rather than storing the results in memory. On memory-constrained devices, it is possible to reduce space requirements at the cost of extra processor time. This is called "recomputation", or sometimes "rematerialization" in research papers, and when used during LLM training it is closely related to "checkpointing". When recomputation is used to train a model that is too large to fit inside GPU memory, it is called "gradient checkpointing". The portion of this algorithm that swaps tensors off the GPU back to the CPU is often called "offloading".

Recomputation means not storing the results of a computation that may be needed later, but instead recomputing them from scratch when they are actually needed. Hence, recomputation trades time for space, and is effectively the opposite of caching and data reuse optimizations, which trade space for time.

Recomputation performs the same calculations a second time, which is redundant computation. This is not something to do often, since it costs extra CPU or GPU time, but it is a technique worth considering when memory is at a premium, and it is sometimes used as a GPU memory optimization.
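
Below is a minimal sketch of recomputation during training, using PyTorch's gradient checkpointing utility (torch.utils.checkpoint); the block structure and sizes are illustrative assumptions, not a complete training loop.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    # Gradient checkpointing: activations inside each checkpointed block are NOT
    # stored during the forward pass; they are recomputed during the backward
    # pass, trading extra compute time for reduced activation memory.

    class Block(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            return x + self.ff(x)

    blocks = nn.ModuleList([Block(1024) for _ in range(12)])
    x = torch.randn(8, 128, 1024, requires_grad=True)

    h = x
    for block in blocks:
        h = checkpoint(block, h, use_reentrant=False)  # forward runs, activations discarded

    loss = h.sum()
    loss.backward()  # each block's forward is re-run here to rebuild its activations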

Research on Recomputation: Research papers on the recomputation memory optimization technique include:

  • Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo Qiao, Zhiquan Lai, Dongsheng Li, 2022, Delta: Dynamically optimizing gpu memory beyond tensor recomputation, https://arxiv.org/abs/2203.15980
  • Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1341–1355. https://dl.acm.org/doi/10.1145/3373376.3378530, PDF: https://news.cs.nyu.edu/~jinyang/pub/swapadvisor-asplos20.pdf
  • Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 891–905. https://doi.org/10.1145/3373376.3378505, https://dl.acm.org/doi/10.1145/3373376.3378505
  • O. Beaumont, L. Eyraud-Dubois, and A. Shilova, 2021, Efficient combination of rematerialization and offloading for training dnns, Advances in Neural Information Processing Systems, vol. 34, PDF: https://proceedings.nips.cc/paper/2021/file/c8461bf13fca8a2b9912ab2eb1668e4b-Paper.pdf
  • Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock, 2020, Dynamic tensor rematerialization, arXiv preprint arXiv:2006.09616, https://arxiv.org/abs/2006.09616
  • Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, and Joshua Wang. Efficient rematerialization for deep networks. Advances in Neural Information Processing Systems, 32, 2019. https://dl.acm.org/doi/10.5555/3454287.3455646, PDF: https://proceedings.neurips.cc/paper/2019/file/ffe10334251de1dc98339d99ae4743ba-Paper.pdf
  • Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica, 2020, Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems, 2:497–511, https://arxiv.org/abs/1910.02653 Code: https://github.com/parasj/checkmate
  • Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, https://arxiv.org/abs/1604.06174
  • Audrūnas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-Efficient Backpropagation through Time. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., 4132–4140. https://arxiv.org/abs/1606.03401
  • James Martens and Ilya Sutskever. 2012. Training deep and recurrent networks with hessian-free optimization. In Neural Networks: Tricks of the Trade. Springer. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_27, PDF: https://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf
  • M Schuler, R Membarth, P Slusallek, 2022, Xengine: Optimal tensor rematerialization for neural networks in heterogeneous environments, ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 1, Article No. 17, pp 1–25, https://dl.acm.org/doi/10.1145/3568956, PDF: https://dl.acm.org/doi/pdf/10.1145/3568956, Code: https://github.com/dfki-asr/xengine
  • Hugging Face, Performance and Scalability: How To Fit a Bigger Model and Train It Faster, https://huggingface.co/docs/transformers/v4.18.0/en/performance (Gradient checkpointing to optimize training of large models.)
  • Yaroslav Bulatov, Jan 14, 2018, Fitting larger networks into memory, Medium, https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9 (Gradient checkpointing for training large models.)
  • Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, and Alena Shilova. Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. Research Report RR-9302, Inria Bordeaux Sud-Ouest, November 2019, https://arxiv.org/abs/1911.13214
  • Navjot Kukreja, Jan Hückelheim, and Gerard J Gorman. Backpropagation for long sequences: beyond memory constraints with constant overheads. arXiv preprint arXiv:1806.01117, 2018, https://arxiv.org/abs/1806.01117
  • L Waeijen, S Sioutas, M Peemen, M Lindwer, 2021, ConvFusion: A model for layer fusion in convolutional neural networks, IEEE Access (Volume: 9), https://ieeexplore.ieee.org/abstract/document/9646923/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09646923.pdf
  • Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
  • Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
  • Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
  • Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, May 2022, Reducing Activation Recomputation in Large Transformer Models, https://arxiv.org/abs/2205.05198
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
  • Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
  • Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
  • Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, July 2024, Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism, Kuaishou Technology, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024, Santa Clara, CA, USA, https://www.usenix.org/conference/atc24/presentation/yuan https://www.usenix.org/system/files/atc24-yuan.pdf
  • Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen, 27 Jun 2024 (v2), Optimizing Large Model Training through Overlapped Activation Recomputation, https://arxiv.org/abs/2406.08756
  • Xunyi Zhao, Lionel Eyraud-Dubois, Théotime Le Hellard, Julia Gusak, Olivier Beaumont, 24 July, 2024, OFFMATE: full fine-tuning of LLMs on a single GPU by re-materialization and offloading, https://hal.science/hal-04660745/document
  • Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
  • Sanghyeon Lee, Hongbeen Kim, Soojin Hwang, Guseul Heo, Minwoo Noh, Jaehyuk Huh. 3 Jan 2025, Efficient LLM Inference with Activation Checkpointing and Hybrid Caching, https://arxiv.org/abs/2501.01792 (Recomputation of the KV cache from stored activations.)
  • Xunyi Zhao, 2024, Optimizing Memory Usage when Training Deep Neural Networks, PhD thesis, Computer Science [cs], Université de Bordeaux, France, NNT: 2024BORD0411, tel-04890912, https://theses.hal.science/tel-04890912/file/ZHAO_XUNYI_2024.pdf
  • Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, and Bin Cui. 2025. MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training. Proc. ACM Manag. Data 3, 1, Article 53 (February 2025), 28 pages. https://doi.org/10.1145/3709703 https://dl.acm.org/doi/abs/10.1145/3709703
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)

Data Locality

Data locality is a method of speeding up LLM algorithms by operating on data that is stored close together in memory. The simplest approach is to store all data in contiguous memory, which is standard practice for model matrices and tensors. Accessing data in "nearby" memory regions helps with optimizations such as caching, prefetching, tiling, coalescing, and other memory access pattern optimizations.
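
As a small, generic illustration (not LLM-specific), the sketch below shows the effect of memory layout: a transposed matrix is a strided view whose rows are scattered through memory, and copying it into a contiguous layout before repeated use is often faster. Exact timings depend on the array size and the NumPy build.

    import time
    import numpy as np

    # Data locality sketch: a transposed matrix is a non-contiguous (strided) view,
    # so row-wise traversal jumps around memory; a contiguous copy makes the same
    # traversal sequential and cache-friendly.

    a = np.random.rand(8192, 8192).astype(np.float32)
    b_strided = a.T                       # non-contiguous view
    b_contig = np.ascontiguousarray(a.T)  # explicit contiguous copy

    for name, b in [("strided view   ", b_strided), ("contiguous copy", b_contig)]:
        start = time.perf_counter()
        checksum = b.sum(axis=1).sum()    # row-wise reduction favors contiguous rows
        print(f"{name}: {time.perf_counter() - start:.3f}s (checksum {checksum:.1f})")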

Research papers on data locality in LLM computations:

  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
  • Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng, 1996, Improving data locality with loop transformations, ACM Transactions on Programming Languages and Systems, Volume 18, Issue 4, pp 424–453, https://dl.acm.org/doi/10.1145/233561.233564
  • Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
  • Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos, 8 Nov 2024, AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality, https://arxiv.org/abs/2411.05555
  • Jordi Wolfson-Pou, Jan Laukemann, Fabrizio Petrini, 13 Jan 2025, Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs, https://arxiv.org/abs/2501.07056

Prefetching

Prefetching is the optimization technique of requesting data from memory before it is needed, so that its later use does not stall computations. Any type of memory access may benefit from prefetching; there is "instruction prefetching" for CPU execution and "data prefetching" for computations.
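
Below is a minimal sketch of data prefetching in PyTorch, assuming a CUDA GPU is available: the host-to-GPU copy of the next batch runs on a separate CUDA stream so that it overlaps with computation on the current batch. The batch sizes and matrix dimensions are illustrative.

    import torch

    # Data prefetching sketch: copy the next batch to the GPU on a side stream
    # while the current batch is being processed on the default stream.

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    batches = [torch.randn(64, 4096).pin_memory() for _ in range(8)]  # pinned host memory
    weight = torch.randn(4096, 4096, device=device)

    def to_gpu_async(batch):
        with torch.cuda.stream(copy_stream):
            return batch.to(device, non_blocking=True)

    next_batch = to_gpu_async(batches[0])
    for i in range(len(batches)):
        torch.cuda.current_stream().wait_stream(copy_stream)  # ensure the copy has finished
        current = next_batch
        if i + 1 < len(batches):
            next_batch = to_gpu_async(batches[i + 1])  # prefetch while we compute
        out = current @ weight                         # compute overlaps the next copy
    torch.cuda.synchronize()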

Research papers on prefetching optimizations:

  • Ulrich Drepper, October 23, 2007, Memory part 5: What programmers can do, https://lwn.net/Articles/255364/
  • Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  • Andrew Chan, Dec 12, 2024, Fast LLM Inference From Scratch: Pushing single-GPU inference throughput to the edge without libraries, https://andrewkchan.dev/posts/yalm.html
  • Sarah Butcher & Alex McMurray, Jan 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
  • Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli, 14 Jan 2025, PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving, https://arxiv.org/abs/2501.08192
  • Hongchao Du, Shangyu Wu, Arina Kharlamova, Nan Guan, Chun Jason Xue, 4 Mar 2025, FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference, https://arxiv.org/abs/2503.03777

SSD Storage

The use of SSDs is common for large-scale storage of model weights and their associated data, and SSDs can also serve as an offloading tier when a model does not fit in CPU or GPU memory; a small sketch of memory-mapping weights from SSD appears after the list below. Research papers on SSD issues include:

  • Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
  • Lucas Mearian, 24 Oct 2024, 2025: The year of the AI PC, Computer World, https://www.computerworld.com/article/3583355/2025-the-year-of-the-ai-pc.html
  • Tuowei Wang, Ruwen Fan, Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren, 29 Oct 2024 (v2), Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management, https://arxiv.org/abs/2410.19274
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • S. Wang, Q. Cao, K. Zhou, J. Xu, Z. Guo and J. Guo, "ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 183-190, doi: 10.1109/ICCD63220.2024.00036. https://ieeexplore.ieee.org/abstract/document/10818161/ (Generalizing in-memory checkpoints by storing data in shards across multiple storage areas including CPU memory and SSDs.)
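
Following up on the forward reference above, here is a minimal sketch of memory-mapping model weights stored on SSD, so that only the pages actually touched are read into RAM on demand. The file name and tensor shape are hypothetical placeholders; memory mapping is a generic operating-system facility, not the method of any particular paper.

    import numpy as np

    # Minimal sketch of memory-mapping model weights stored on SSD.
    # Only the pages that are actually accessed get read into RAM on demand,
    # which lets weights larger than memory be used lazily.
    # "weights.bin" and the shape below are hypothetical placeholders.

    shape = (32, 1024, 1024)   # e.g., 32 layers of 1024x1024 fp16 weights

    # Create a dummy weight file once (in practice this is the exported model).
    np.memmap("weights.bin", dtype=np.float16, mode="w+", shape=shape).flush()

    # Map the file read-only; nothing is loaded until a slice is accessed.
    weights = np.memmap("weights.bin", dtype=np.float16, mode="r", shape=shape)

    layer_5 = np.asarray(weights[5])   # only this layer's pages are paged in from SSD
    print(layer_5.shape, layer_5.dtype)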

Compute-in-Memory (CIM)

Compute-in-Memory (CIM) or Processing-in-Memory (PIM) refers to hardware architectures that perform computations inside, or very close to, the memory arrays themselves, rather than shuttling data back and forth between memory and the processor. Since loading weights and KV cache data dominates the cost of LLM inference, reducing this data movement is attractive; more generally, keeping LLM computations entirely within GPU memory is already one of the main optimizations in LLM inference.

Research papers on CIM/PIM include:

  • Vaclav Snasel, Tran Khanh Dang, Josef Kueng, Lingping Kong, 22 December 2023, A review of in-memory computing for machine learning: architectures, options, International Journal of Web Information Systems, https://www.emerald.com/insight/content/doi/10.1108/IJWIS-08-2023-0131/full/html
  • Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • H. Diao et al., 2024, A Multiply-Less Approximate SRAM Compute-In-Memory Macro for Neural-Network Inference, IEEE Journal of Solid-State Circuits, doi: 10.1109/JSSC.2024.3433417, https://ieeexplore.ieee.org/abstract/document/10622078
  • B. Kim et al., 2024, The Breakthrough Memory Solutions for Improved Performance on LLM Inference, IEEE Micro, vol. 44, no. 3, pp. 40-48, May-June 2024, doi: 10.1109/MM.2024.3375352, https://ieeexplore.ieee.org/abstract/document/10477465
  • Sharada Yeluri, 20 Feb 2024, LLM Inference: HW/SW Optimizations, Juniper Networks community blog, https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
  • Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
  • Md Tawsif Rahman Chowdhury, Huynh Quang Nguyen Vo, Paritosh Ramanan, Murat Yildirim, Gozde Tutuncuoglu, 10 Sep 2024, The Lynchpin of In-Memory Computing: A Benchmarking Framework for Vector-Matrix Multiplication in RRAMs, https://arxiv.org/abs/2409.06140
  • Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi, 28 Dec 2024, LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System, https://arxiv.org/abs/2412.20166
  • Dong Eun Kim, Tanvi Sharma, Kaushik Roy, 17 Feb 2025, Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory, https://arxiv.org/abs/2502.12344
  • Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang, Tianyu Jia, 1 Mar 2025, Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs, https://arxiv.org/abs/2503.00461

Memory-Bound versus CPU-Bound

Surprisingly, researchers discovered that LLM inference was not CPU-bound (or GPU-bound), but memory-bound, with the cost of accessing all those tensors full of weights (and activations) being the main efficiency bottleneck.

Subsequently, the picture was found to be more nuanced in decoder-only transformer architectures (e.g., GPT):

  • Prefill phase — compute-bound
  • Decoding phase — memory-bound

The prefill phase is the initial "prompt processing" phase, in which every token of the prompt is processed (in parallel) to build the KV cache. This work has been found to thrash not the CPU but the GPU. Prefill keeps the hardware busy, but it also takes time, and it is the cause of the initial delay before an LLM starts answering your question.

The decoding phase follows, in which the autoregressive algorithm emits one token at a time. Because it cannot be fully parallelized across tokens, decoding tends not to fill the GPU pipeline, yet it continually accesses the entire model, one layer at a time. Hence, it is memory-bound.
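
A small empirical sketch of this effect is shown below, assuming PyTorch and a CUDA GPU: a single-token matrix-vector multiply achieves far lower effective throughput than a large batched matrix multiply over the same weights, because it is limited by memory bandwidth rather than compute. Exact numbers depend on the hardware.

    import time
    import torch

    # Sketch: measure effective throughput of the same weight matrix used in
    # decode-like (1 token) versus prefill-like (512 tokens) matrix multiplies.
    # The gap illustrates memory-bound versus compute-bound behavior.

    device = torch.device("cuda")
    w = torch.randn(4096, 4096, device=device, dtype=torch.float16)

    def measure(tokens: int, iters: int = 100) -> float:
        x = torch.randn(tokens, 4096, device=device, dtype=torch.float16)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            _ = x @ w
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        flops = 2.0 * tokens * 4096 * 4096 * iters
        return flops / elapsed / 1e12      # effective TFLOP/s

    print(f"decode-like  (1 token):    {measure(1):.2f} TFLOP/s")
    print(f"prefill-like (512 tokens): {measure(512):.2f} TFLOP/s")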

Research papers on memory-bound versus CPU-bound nature of transformers:

  • Amir Gholami; Zhewei Yao; Sehoon Kim; Coleman Hooper, 25 March 2024, AI and Memory Wall, IEEE Micro ( Early Access ), pp 1-5, https://ieeexplore.ieee.org/abstract/document/10477550
  • Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111

Research on Memory Optimization

For model compression and its popular subtypes, see research paper lists on the individual pages (e.g. quantization, pruning). Other research that is specifically on memory management and reducing memory includes:
