Aussie AI
Memory Optimization
Last Updated 12 December, 2024
by David Spuler, Ph.D.
Memory optimization involves using less memory during model inference. This means that inference requires fewer resources, and it can also reduce CPU usage because less data is swapped in and out of memory. Memory optimization can refer to either CPU memory or GPU memory.
Some research reports that model inference is memory-bound rather than CPU-bound. In such cases, memory management is key to improving latency and throughput. On the other hand, researchers have also examined increasing memory usage to save time via caching and computation reuse.
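As a concrete example of the space-for-time direction, an inference kernel can precompute an expensive function into a lookup table so that it never needs to be recomputed. The C++ sketch below is a minimal illustration of this idea; the table size, clamp range, and GELU approximation are illustrative assumptions, not settings from any particular engine.

```cpp
// Trading memory for time: precompute an activation function into a lookup
// table once, then replace every tanh-based evaluation with one memory read.
#include <array>
#include <cmath>
#include <cstdio>

constexpr int kTableSize = 1 << 16;           // 65,536 entries (256 KB of floats)
constexpr float kMinX = -8.0f, kMaxX = 8.0f;  // clamp range for the input

static std::array<float, kTableSize> g_gelu_table;

static float gelu(float x) {                  // the "expensive" computation
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

void build_gelu_table() {                     // pay the memory cost once, up front
    for (int i = 0; i < kTableSize; ++i) {
        float x = kMinX + (kMaxX - kMinX) * i / (kTableSize - 1);
        g_gelu_table[i] = gelu(x);
    }
}

inline float gelu_cached(float x) {           // fast path: one table read, no tanh
    if (x <= kMinX) return 0.0f;              // GELU is ~0 for large negative x
    if (x >= kMaxX) return x;                 // GELU is ~x for large positive x
    int i = static_cast<int>((x - kMinX) / (kMaxX - kMinX) * (kTableSize - 1));
    return g_gelu_table[i];
}

int main() {
    build_gelu_table();
    std::printf("gelu(1.0) = %f (exact %f)\n", gelu_cached(1.0f), gelu(1.0f));
    return 0;
}
```

Recomputation, covered below, is the reverse trade-off: the table (or a cached activation) is discarded, and the compute cost is paid again whenever the value is needed.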
Model Compression Techniques
The main class of optimizations that reduce memory requirements by making the model smaller is called "model compression". Model compression includes sub-strategies such as:
- Quantization
- Pruning
- Knowledge distillation
Recomputation: Trading Time for Space
On memory-constrained devices, it is possible to reduce space requirements at the cost of extra processor time. This is called "recomputation", or sometimes in research papers "rematerialization" or "checkpointing". When this technique is used to optimize training of a model that is too large to fit inside GPU memory, it is called "gradient checkpointing." The portion of this algorithm that swaps tensors off the GPU back to the CPU is often called "offloading."
The recomputation optimization involves not storing the results of a computation that will be needed later, but instead waiting and recomputing them from scratch when they are required. Hence, recomputation trades time for space, and is effectively the opposite of caching and data reuse optimizations, which trade space for time.
Recomputation means doing calculations a second time, which is redundant computation. This is not something you want to do often, since it costs a lot more CPU or GPU time, but it is a technique worth considering when memory is at a premium, and it is sometimes used as a GPU optimization.
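As a minimal sketch of the trade-off, consider a single intermediate activation in one layer: caching keeps it in memory for later reuse, whereas recomputation discards it and reruns the layer when the value is needed again (much as gradient checkpointing does for the backward pass). The layer function and class names below are illustrative placeholders, not taken from any real framework.

```cpp
// Recomputation (rematerialization) versus caching for one layer's activation.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<float>;

// Placeholder for a real layer's computation (e.g., a Transformer sub-layer).
Vec layer_forward(const Vec& input) {
    Vec out(input.size());
    for (size_t i = 0; i < input.size(); ++i)
        out[i] = std::tanh(input[i]);   // stand-in for the real math
    return out;
}

// Option 1: cache the activation -- O(n) extra memory, no extra compute later.
struct CachedActivation {
    Vec saved;
    const Vec& forward(const Vec& in) { saved = layer_forward(in); return saved; }
    const Vec& use_later() const { return saved; }   // free, but memory stays allocated
};

// Option 2: recompute the activation -- no extra memory, extra compute later.
// (The caller must keep the layer input alive until use_later() is called.)
struct RecomputedActivation {
    const Vec* input = nullptr;
    Vec forward(const Vec& in) {
        input = &in;                    // remember only the (already stored) input
        return layer_forward(in);       // output flows onward but is not retained
    }
    Vec use_later() const { return layer_forward(*input); }  // pay the time cost again
};

int main() {
    Vec x = {0.1f, 0.2f, 0.3f};
    RecomputedActivation r;
    Vec out = r.forward(x);             // used by the "next layer"
    (void)out;
    Vec again = r.use_later();          // recomputed on demand, e.g. for a backward pass
    std::printf("recomputed %zu values, first = %.4f\n", again.size(), again[0]);
    return 0;
}
```

In practice, gradient checkpointing applies this choice per layer or per group of layers, keeping a small set of "checkpoint" activations and recomputing the rest from them during the backward pass.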
Research on Recomputation: Research papers on the recomputation memory optimization technique include:
- Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo Qiao, Zhiquan Lai, Dongsheng Li, 2022, Delta: Dynamically optimizing gpu memory beyond tensor recomputation, https://arxiv.org/abs/2203.15980
- Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1341–1355. https://dl.acm.org/doi/10.1145/3373376.3378530, PDF: https://news.cs.nyu.edu/~jinyang/pub/swapadvisor-asplos20.pdf
- Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 891–905. https://doi.org/10.1145/3373376.3378505, https://dl.acm.org/doi/10.1145/3373376.3378505
- O. Beaumont, L. Eyraud-Dubois, and A. Shilova, 2021, Efficient combination of rematerialization and offloading for training dnns, Advances in Neural Information Processing Systems, vol. 34, PDF: https://proceedings.nips.cc/paper/2021/file/c8461bf13fca8a2b9912ab2eb1668e4b-Paper.pdf
- Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock, 2020, Dynamic tensor rematerialization, arXiv preprint arXiv:2006.09616, https://arxiv.org/abs/2006.09616
- Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, and Joshua Wang. Efficient rematerialization for deep networks. Advances in Neural Information Processing Systems, 32, 2019. https://dl.acm.org/doi/10.5555/3454287.3455646, PDF: https://proceedings.neurips.cc/paper/2019/file/ffe10334251de1dc98339d99ae4743ba-Paper.pdf
- Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica, 2020, Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems, 2:497–511, https://arxiv.org/abs/1910.02653 Code: https://github.com/parasj/checkmate
- Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, https://arxiv.org/abs/1604.06174
- Audrūnas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-Efficient Backpropagation through Time. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., 4132–4140. https://arxiv.org/abs/1606.03401
- James Martens and Ilya Sutskever. 2012. Training deep and recurrent networks with hessian-free optimization. In Neural Networks: Tricks of the Trade. Springer. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_27, PDF: https://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf
- M Schuler, R Membarth, P Slusallek, 2022, Xengine: Optimal tensor rematerialization for neural networks in heterogeneous environments, ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 1, Article No. 17, pp 1–25, https://dl.acm.org/doi/10.1145/3568956, PDF: https://dl.acm.org/doi/pdf/10.1145/3568956, Code: https://github.com/dfki-asr/xengine
- Hugging Face, Performance and Scalability: How To Fit a Bigger Model and Train It Faster, https://huggingface.co/docs/transformers/v4.18.0/en/performance (Gradient checkpointing to optimize training of large models.)
- Yaroslav Bulatov, Jan 14, 2018, Fitting larger networks into memory, Medium, https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9 (Gradient checkpointing for training large models.)
- Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, and Alena Shilova. Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. Research Report RR-9302, Inria Bordeaux Sud-Ouest, November 2019, https://arxiv.org/abs/1911.13214
- Navjot Kukreja, Jan Hückelheim, and Gerard J Gorman. Backpropagation for long sequences: beyond memory constraints with constant overheads. arXiv preprint arXiv:1806.01117, 2018, https://arxiv.org/abs/1806.01117
- L Waeijen, S Sioutas, M Peemen, M Lindwer, 2021, ConvFusion: A model for layer fusion in convolutional neural networks, IEEE Access (Volume: 9), https://ieeexplore.ieee.org/abstract/document/9646923/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09646923.pdf
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, May 2022, Reducing Activation Recomputation in Large Transformer Models, https://arxiv.org/abs/2205.05198
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
- Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, July 2024, Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism, Kuaishou Technology, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024, Santa Clara, CA, USA, https://www.usenix.org/conference/atc24/presentation/yuan https://www.usenix.org/system/files/atc24-yuan.pdf
- Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen, 27 Jun 2024 (v2), Optimizing Large Model Training through Overlapped Activation Recomputation, https://arxiv.org/abs/2406.08756
- Xunyi Zhao, Lionel Eyraud-Dubois, Théotime Le Hellard, Julia Gusak, Olivier Beaumont, 24 July, 2024, OFFMATE: full fine-tuning of LLMs on a single GPU by re-materialization and offloading, https://hal.science/hal-04660745/document
- Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
Research on Memory Optimization
For model compression and its popular subtypes, see research paper lists on the individual pages (e.g. quantization, pruning). Other research that is specifically on memory management and reducing memory includes:
- Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016). https://arxiv.org/abs/1604.06174
- Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN symposium on principles and practice of parallel programming. 41–53. https://arxiv.org/abs/1801.04380
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention arXiv preprint, https://arxiv.org/abs/2309.06180
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023, High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023). https://arxiv.org/abs/2303.06865 (FlexGen model optimizes speed and memory.)
- Shishir G Patil, Paras Jain, Prabal Dutta, Ion Stoica, and Joseph Gonzalez. 2022. POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573–17583. https://arxiv.org/abs/2207.07697
- Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497–511. https://arxiv.org/abs/1910.02653
- Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou, Feb 2021, TurboTransformers: An Efficient GPU Serving System For Transformer Models, Proceedings of the 26th ACM SIGPLAN, 2021, https://dl.acm.org/doi/pdf/10.1145/3437801.3441578, https://arxiv.org/abs/2010.05680
- Nimit S. Sohoni, Christopher R. Aberger, Megan Leszczynski, Jian Zhang, Christopher Ré, Apr 2022, Low-Memory Neural Network Training: A Technical Report, arXiv preprint, https://arxiv.org/abs/1904.10631
- Tung D. Le, Haruki Imai, Yasushi Negishi, Kiyokuni Kawachiya, 2019, Automatic GPU memory management for large neural models in TensorFlow, ISMM 2019: Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, June 2019, Pages 1–13, 2019, https://dl.acm.org/doi/10.1145/3315573.3329984
- SB Shriram, A Garg, P Kulkarni, 2019, Dynamic Memory Management for GPU-Based Training of Deep Neural Networks, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), https://ieeexplore.ieee.org/document/8820980
- Y Pisarchyk, J Lee, 2020, Efficient memory management for deep neural net inference, arXiv preprint arXiv:2001.03288, https://arxiv.org/abs/2001.03288
- Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1341–1355. https://dl.acm.org/doi/10.1145/3373376.3378530
- Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, and James Hegarty. 2022. OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. https://arxiv.org/abs/2210.12924
- Mahdi Nazemi, Ghasem Pasandi, Massoud Pedram, Aug 2018, NullaNet: Training Deep Neural Networks for Reduced-Memory-Access Inference, https://arxiv.org/abs/1807.08716
- Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. USENIX Annual Technical Conference. 551–564. https://arxiv.org/abs/2101.06840 (Offloading strategy for memory optimizations.)
- Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou, Dec 2019, Large memory layers with product keys, NeurIPS, https://arxiv.org/abs/1907.05242, https://proceedings.neurips.cc/paper/2019/file/9d8df73a3cfbf3c5b47bc9b50f214aff-Paper.pdf
- Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359. https://arxiv.org/abs/2205.14135
- Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Memory optimizations including token-based data sharding for allocation to different memory banks.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention (One of the main optimizations used by Flash Attention was its memory management.)
- Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Oct 2022. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. Advances in Neural Information Processing Systems, 35:12991–13005. https://arxiv.org/abs/2206.06522 (Reduces memory requirements of training.)
- M Capra, B Bussolino, A Marchisio, M Shafique, 2020, An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, https://www.mdpi.com/1999-5903/12/7/113/pdf (Survey paper with sections on memory optimization.)
- Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov, Feb 2021, Memory Transformer, https://arxiv.org/abs/2006.11527
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” ISCA, 2016. https://ieeexplore.ieee.org/document/7551407, PDF: http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf, PDF Slides: https://eems.mit.edu/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf, Project: http://eyeriss.mit.edu/
- Z Guo, Z He, Y Zhang, 2023, Mira: A Program-Behavior-Guided Far Memory System, PDF: https://cseweb.ucsd.edu/~yiying/Mira-SOSP23.pdf (Although "far memory" is probably not desirable for fast AI inference, this paper has interesting coverage of automatic memory policy management and cache optimization using static analysis and performance profiling.)
- Nabavinejad, S.M.; Baharloo, M.; Chen, K.C.; Palesi, M.; Kogel, T.; Ebrahimi, M., An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 268–282. http://dx.doi.org/10.1109/JETCAS.2020.3022920, https://ieeexplore.ieee.org/abstract/document/9189825 (Hardware paper about on-chip interconnection optimizations, but examines near-memory optimizations.)
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf (Extensive analysis of ML compiler optimizations, including a long section on memory optimizations for ML compilers.)
- N Penkov, K Balaskas, M Rapp, J Henkel, 2023, Differentiable Slimming for Memory-Efficient Transformers, IEEE Embedded Systems Letters (Early Access), DOI: 10.1109/LES.2023.3299638, https://ieeexplore.ieee.org/abstract/document/10261943
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with memory-compression attention mechanisms.)
- João Gabriel Lopes Jan 29, 2022, Optimizing TensorFlow Models for Inference, https://medium.com/tinyclues-vision/optimizing-tensorflow-models-for-inference-d3636cf34034 (Discussion of memory optimization on TensorFlow.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Memory optimizations from a kernel fusion and compiler-level perspective.)
- E Yvinec, A Dapogny, K Bailly, Sep 2023, Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings, arXiv preprint arXiv:2309.17361, https://arxiv.org/abs/2309.17361 (Uses "codebooks", i.e. look-up tables, to reduce memory usage.)
- Song Han, Jeff Pool, John Tran, and William Dally, 2015, Learning both weights and connections for efficient neural network, Advances in neural information processing systems, 28, 2015, https://arxiv.org/abs/1506.02626
- Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20), James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 891–905. https://doi.org/10.1145/3373376.3378505, https://dl.acm.org/doi/10.1145/3373376.3378505
- Xia, C., Zhao, J., Sun, Q., Wang, Z., Wen, Y., Feng, X., Cui, H., 2023, Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 27 Apr-01 May 2023, San Diego, USA. https://eprints.whiterose.ac.uk/203681/, PDF: https://eprints.whiterose.ac.uk/203681/1/asplos24.pdf (Analyzes memory-intensive versus compute-intensive kernel operators and reducing GPU memory data transfers.)
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Memory optimization of training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- Nicolai M. Josuttis, 2012, The C++ Standard Library: A Tutorial and Reference, Second Edition, Supplementary Chapter, https://www.amazon.com/Standard-Library-Tutorial-Reference-2nd/dp/0321623215, PDF (extra chapter): http://www.cppstdlib.com/cppstdlib_supplementary.pdf (C++ optimizations such as bit sets and user-defined memory allocators.)
- Zhen Zheng, Xuanda Yang, et al. 2022. AStitch: enabling a new multidimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373. https://dl.acm.org/doi/abs/10.1145/3503222.3507723
- Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu, June 2023, Full parameter fine-tuning for large language models with limited resources, arXiv preprint arXiv:2306.09782, https://arxiv.org/abs/2306.09782 (Fused gradient computation and parameter update saves memory in training kernel by not saving the gradient tensor in memory.)
- S Agrawal, P Ghosh, G Kumar, T Radhika, 2023, Memory Footprint Optimization for Neural Network Inference in Mobile SoCs, 2023 IEEE Women in Technology Conference (WINTECHCON) https://ieeexplore.ieee.org/abstract/document/10277304 (Improved management of memory buffers.)
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Pietro Farina, Subrata Biswas, Eren Yıldız, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kasım Sinan Yıldırım, 16 May 2024, Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems, https://arxiv.org/abs/2405.10426
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- T Senoo, R Kayanoma, A Jinguji, H Nakahara, 2023, A Light-Weight Vision Transformer Toward Near Memory Computation on an FPGA ARC 2023: Applied Reconfigurable Computing. Architectures, Tools, and Applications, pp 338–353, https://link.springer.com/chapter/10.1007/978-3-031-42921-7_23 (Vision transformer optimized for near-memory computation.)
- Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Make LLM Fine-tuning 2x faster with Unsloth and HF TRL, January 10, 2023, Daniel Han-Chen, https://huggingface.co/blog/unsloth-trl Code: https://github.com/huggingface/blog/blob/main/unsloth-trl.md (Optimizes some PyTorch kernels for back-propagation and reduces memory usage in fine-tuning; currently works with Llama and Mistral architectures.)
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, Dec 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory Apple Research, https://arxiv.org/abs/2312.11514
- Noam Shazeer, Mitchell Stern, Apr 2018, Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, https://arxiv.org/abs/1804.04235
- Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, Gennady Pekhimenko, 2018, Gist: Efficient Data Encoding for Deep Neural Network Training, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/document/8416872 PDF: https://www.microsoft.com/en-us/research/uploads/prod/2018/04/fiddle-gist-isca18.pdf
- Manuel Pöter, Jesper Larsson Träff, Mar 2018, Memory Models for C/C++ Programmers, https://arxiv.org/pdf/1803.04432.pdf
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
- Abhiroop Bhattacharjee, Yeshwanth Venkatesha, Abhishek Moitra, Priyadarshini Panda, MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning. In: DAC (2022) https://arxiv.org/abs/2204.05274
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054 Code: part of: https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- Mark Hildebrand, Jason Lowe-Power, Venkatesh Akella, 2024, CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems, IEEE, DOI 10.1109/IPDPS57955.2024.00055, https://arch.cs.ucdavis.edu/assets/papers/ipdps24-cachedarrays.pdf (Caching.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Wei Niu, Gagan Agrawal, Bin Ren, 29 Feb 2024, SoD2: Statically Optimizing Dynamic Deep Neural Network, https://arxiv.org/abs/2403.00176 (Analysis of operator computation shapes and pathways with kernel fusion and memory planning.)
- Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti, 14 Mar 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, https://arxiv.org/abs/2403.09636 (Reducing the memory size of the KV cache.)
- Bahareh Khabbazan, Marc Riera, Antonio González, Oct 2023, An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses, https://arxiv.org/abs/2310.18181
- Alireza Amirshahi, Giovanni Ansaloni, David Atienza, 20 Dec 2023, Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures, https://arxiv.org/abs/2312.13000
- Chen Ding, Christopher Kanan, Dylan McKellips, Toranosuke Ozawa, Arian Shahmirza, Wesley Smith, 22 Dec 2023, DMC4ML: Data Movement Complexity for Machine Learning, https://arxiv.org/abs/2312.14441
- Tanvi Sharma, Mustafa Ali, Indranil Chakraborty, Kaushik Roy, 26 Dec 2023, WWW: What, When, Where to Compute-in-Memory, https://arxiv.org/abs/2312.15896
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- Arnav Chavan, Nahush Lele, Deepak Gupta, Dec 2023, Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models https://arxiv.org/abs/2312.07046 Code: https://github.com/transmuteAI/trailmet/tree/main/trailmet/algorithms/llm-rom
- Robert A. van de Geijn, Enrique S. Quintana-Ort´ı, 2007, The Science of Programming Matrix Computations, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf Code: http://www.cs.utexas.edu/users/flame/
- Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 30 May 2024 (v3), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 (Reduces the number of memory accesses for attention computations and the KV cache.)
- Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Tranformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
- MWU Rahman, MM Abrar, HG Copening, S Hariri, Oct 2023, Quantized Transformer Language Model Implementations on Edge Devices, https://arxiv.org/pdf/2310.03971.pdf (Uses a "FlatBuffer" format on TensorFlow-Lite.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
- J Chen, S Kao, H He, W Zhuo, S Wen, 2023, Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks, https://openaccess.thecvf.com/content/CVPR2023/papers/Chen_Run_Dont_Walk_Chasing_Higher_FLOPS_for_Faster_Neural_Networks_CVPR_2023_paper.pdf
- Minkyu Kim and Jae Sun Seo. 2021. An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access. IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- David Spuler, March 2024, Chapter 14. Memory Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- B Wu, Y Zhong, Z Zhang, G Huang, X Liu, 2023, Fast Distributed Inference Serving for Large Language Models, https://arxiv.org/abs/2305.05920
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. https://openai.com/blog/sparse-transformers, 2019, https://arxiv.org/abs/1904.10509
- Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children’s books with explicit memory representations. CoRR, abs/1511.02301, 2015. URL http://arxiv.org/abs/1511.02301.
- Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014. http://arxiv.org/abs/1410.3916
- Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. Oneshot learning with memory-augmented neural networks. CoRR, abs/1605.06065, 2016. URL http://arxiv.org/abs/1605.06065.
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- By Ben Dickson, December 27, 2023, Apple research paper hints at LLMs on iPhones and Macs, https://bdtechtalks.com/2023/12/27/apple-llm-flash-research/
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia, 9 Jul 2024, Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach, https://arxiv.org/abs/2407.06964
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Gavin Li, August 3rd, 2024, Crazy Challenge: Run Llama 405B on a 8GB VRAM GPU, https://ai.gopubby.com/crazy-challenge-run-llama-405b-on-a-8gb-vram-gpu-ab5a280a3889 (Run Llama's 405B model on a low-end GPU via 4-bit quantization and layer-by-layer inference, both to save memory.)
- Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
- Jaewook Lee, Yoel Park, Seulki Lee, 7 Aug 2024, Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks, https://arxiv.org/abs/2408.03663
- B. Kim et al., 2024, The Breakthrough Memory Solutions for Improved Performance on LLM Inference, IEEE Micro, vol. 44, no. 3, pp. 40-48, May-June 2024, doi: 10.1109/MM.2024.3375352, https://ieeexplore.ieee.org/abstract/document/10477465
- Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez, 12 Feb 2024 (v2), MemGPT: Towards LLMs as Operating Systems, https://arxiv.org/abs/2310.08560 https://research.memgpt.ai/
- Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu, 23 Aug 2024, Memory-Efficient LLM Training with Online Subspace Descent, https://arxiv.org/abs/2408.12857 https://github.com/kyleliang919/Online-Subspace-Descent
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
- Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya, 9 Sep 2024 (v2), Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices, https://arxiv.org/abs/2409.04249 (Pipelining of model layer-wise loading and inference for memory-efficient inference.)
- James Wang, August 27, 2024, Introducing Cerebras Inference: AI at Instant Speed, https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed
- Muhammad Saad Uddin, Sep 2024, Stop Guessing! Here’s How Much GPU Memory You REALLY Need for LLMs! Techniques to Calculate and Reduce Memory Footprint in LLM Serving, https://ai.gopubby.com/stop-guessing-heres-how-much-gpu-memory-you-really-need-for-llms-8e9b02bcdb62
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
- Justine, Apr 2023, Edge AI Just Got Faster, https://justine.lol/mmap/ (Loading models using mmap.)
- Z. Zhang, D. Yang, X. Zhou and D. Cheng, "MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators," in 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis SC, Atlanta, GA, United States, 2024, pp. 528-542, doi: 10.1109/SC41406.2024.00040. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a528/21HUVuG3S8M
- Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 14 Nov 2024, Pie: Pooling CPU Memory for LLM Inference, https://arxiv.org/abs/2411.09317
- Jinjie Liu, Hang Qiu, 14 Nov 2024, FluidML: Fast and Memory Efficient Inference Optimization, https://arxiv.org/abs/2411.09242
- Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl, 13 Nov 2024, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009 https://github.com/apple/ml-cross-entropy (Memory-efficient computation of cross-entropy in training.)
- Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- Chenghao Hu and Baochun Li. 2024. Menos: Split Fine-Tuning Large Language Models with Efficient GPU Memory Sharing. In Proceedings of the 25th International Middleware Conference (MIDDLEWARE '24). Association for Computing Machinery, New York, NY, USA, 185–198. https://doi.org/10.1145/3652892.3700758 https://dlnext.acm.org/doi/10.1145/3652892.3700758 https://iqua.ece.toronto.edu/papers/chenghao-middleware24.pdf
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Memory-Bound versus CPU-Bound
Surprisingly, researchers discovered that LLM inference was not CPU-bound (or GPU-bound), but was memory-bound, with the cost of accessing all those tensors full of weights (and activations) being the main efficiency bottleneck.
Subsequently, it was found to be more nuanced in decoder-only transformer architectures (e.g., GPT), where:
- Prefill phase — CPU-bound
- Decoding phase — memory-bound
The prefill phase is the initial "prompt processing" phase, where every token in the prompt is processed (in parallel) to generate the KV caches. This workload saturates the CPU, or rather, the GPU. Prefill is a busy time, but it also takes a long time, and it is the cause of the initial delay before an LLM starts answering your question.
The decoding phase comes next: the autoregressive algorithm emits one token at a time. Because it cannot be fully parallelized, decoding tends not to fill the GPU pipeline, yet it continually accesses the entire model, one layer at a time. Hence, it's memory-bound.
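A back-of-the-envelope calculation shows why. The sketch below uses illustrative numbers only (a hypothetical 7B-parameter FP16 model, plus assumed GPU figures of 1 TB/s memory bandwidth and 300 TFLOPs/s of FP16 compute); it is not a benchmark, but it shows that simply streaming the weights from memory dominates the per-token cost of decoding by roughly two orders of magnitude.

```cpp
// Why decoding is memory-bound: per generated token, every weight is read once,
// so the bytes moved dwarf the arithmetic that can be done per byte.
// All model and hardware numbers below are illustrative assumptions.
#include <cstdio>

int main() {
    const double params          = 7e9;    // assumed 7B-parameter model
    const double bytes_per_param = 2.0;    // FP16 weights
    const double flops_per_param = 2.0;    // ~one multiply-add per weight per token

    const double bytes_per_token = params * bytes_per_param;   // ~14 GB read per token
    const double flops_per_token = params * flops_per_param;   // ~14 GFLOPs per token

    const double mem_bw     = 1000e9;      // assumed 1 TB/s memory bandwidth
    const double peak_flops = 300e12;      // assumed 300 TFLOPs/s FP16 compute

    const double time_memory  = bytes_per_token / mem_bw;      // time to stream weights
    const double time_compute = flops_per_token / peak_flops;  // time to do the math

    std::printf("Per decoded token: %.1f ms for memory vs %.3f ms for compute\n",
                time_memory * 1e3, time_compute * 1e3);
    // Memory time dominates by ~300x, so decoding speed is set by bandwidth,
    // not by FLOPs. Prefill processes many tokens per weight read, which is
    // why it lands on the compute-bound side instead.
    return 0;
}
```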
Research papers on the memory-bound versus CPU-bound nature of transformers include:
- Amir Gholami; Zhewei Yao; Sehoon Kim; Coleman Hooper, 25 March 2024, AI and Memory Wall, IEEE Micro ( Early Access ), pp 1-5, https://ieeexplore.ieee.org/abstract/document/10477550
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
More AI Research
Read more about:
- Partitioning
- Model Compression
- Inference Optimizations
- Loop Optimizations
- Code Optimizations