Aussie AI

Serving and Deployment

  • Last Updated 21 March, 2025
  • by David Spuler, Ph.D.

Serving

Serving is the practical matter of how to architect the full production application around the LLM. Other components may include a web server, application server, RAG datastore, retriever, load balancer, and more. In addition, several serving-level techniques directly affect inference speed (a simple batching sketch follows this list):

  • Batching
  • Prefill versus decoding phase
  • Scheduling
  • Load balancing
  • Frameworks (backend)
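
To make the batching and scheduling ideas above concrete, here is a minimal, hypothetical sketch of continuous (iteration-level) batching in Python. The Request fields, the model_step() placeholder, and the EOS value are illustrative assumptions only, not the API of any serving framework discussed on this page.

    # Minimal sketch of continuous (iteration-level) batching, for illustration only.
    # Request fields, model_step(), and EOS are hypothetical placeholders, not the
    # API of any particular serving framework.
    from collections import deque
    from dataclasses import dataclass, field

    EOS = -1  # placeholder end-of-sequence token id

    @dataclass
    class Request:
        prompt_tokens: list                      # tokens consumed in the prefill phase
        generated: list = field(default_factory=list)
        max_new_tokens: int = 16

    def model_step(batch):
        """Placeholder for one decode iteration: returns one new token per request."""
        return [0 for _ in batch]                # dummy token ids

    def serve_loop(waiting: deque, max_batch_size: int = 8):
        running = []
        while waiting or running:
            # Continuous batching: admit new requests whenever a slot frees up,
            # rather than waiting for the whole batch to finish (static batching).
            while waiting and len(running) < max_batch_size:
                running.append(waiting.popleft())
            new_tokens = model_step(running)     # one decode step for the whole batch
            still_running = []
            for req, tok in zip(running, new_tokens):
                req.generated.append(tok)
                if tok != EOS and len(req.generated) < req.max_new_tokens:
                    still_running.append(req)    # unfinished: stays in the batch
            running = still_running

    if __name__ == "__main__":
        request_queue = deque(Request(prompt_tokens=[1, 2, 3]) for _ in range(3))
        serve_loop(request_queue)

The key difference from static batching is that admission happens inside the decode loop, so short requests do not hold slots open while long requests finish.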

Research on LLM Serving

Recently, there has been an explosion of papers on the practical aspects of deployment, orchestration, and serving of LLM inference. Here are some of those papers:

Deployment

Research on LLM deployment:

Batching

Research papers on batching:

Continuous Batching

Research papers on continuous batching:

Frameworks

Research on inference frameworks as part of serving:

Serverless

Research papers on serverless LLM inference:

Scheduling

Research papers on scheduling of LLM inference requests; a brief sketch of the prefill/decode disaggregation idea that recurs in many of these papers follows the list:

  • Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
  • Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
  • Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
  • Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
  • Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
  • Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
  • Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP ’24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
  • Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
  • Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
  • Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
  • Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
  • vLLM, 2024, Performance and Tuning, https://docs.vllm.ai/en/latest/models/performance.html
  • Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
  • Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan, 24 Aug 2024, Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling, https://arxiv.org/abs/2408.13510
  • Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181
  • Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang, 28 Aug 2024, Efficient LLM Scheduling by Learning to Rank, https://arxiv.org/abs/2408.15792 https://github.com/hao-ai-lab/vllm-ltr.git
  • Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
  • DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
  • T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
  • Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
  • Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt, 1 Sep 2023, FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference, https://arxiv.org/abs/2309.00558
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
  • Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
  • Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher, 1 Oct 2024, Don't Stop Me Now: Embedding Based Scheduling for LLMs, https://arxiv.org/abs/2410.01035
  • Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
  • Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
  • S Durvasula, A Zhao, R Kiguru, Y Guan, Z Chen, Oct 2024, ACE: Efficient GPU Kernel Concurrency for Input-Dependent Irregular Computational Graphs, PACT ’24, October 14–16, 2024, Southern California, CA, USA, https://www.embarclab.com/static/media/ace.1c73b44bc2ad143f7b9f.pdf (Identify parallel kernels at runtime.)
  • Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden, 23 Oct 2024, Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs, https://arxiv.org/abs/2410.17840
  • Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers, https://arxiv.org/abs/2410.17957
  • Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher, 23 Oct 2024, Efficient Inference for Augmented Large Language Models, https://arxiv.org/abs/2410.18248
  • Youpeng Zhao, Jun Wang, 31 Oct 2024, ALISE: Accelerating Large Language Model Serving with Speculative Scheduling, https://arxiv.org/abs/2410.23537
  • R Mendoza, I Cruz, P Singh, A Martinez, N Kim, S Patel, Nov 2024, Dynamic Resource Management for Efficient Fast Device Placement https://www.researchgate.net/profile/Priya-Singh-103/publication/385528236_Dynamic_Resource_Management_for_Efficient_Fast_Device_Placement/links/672983c3ecbbde716b584acc/Dynamic-Resource-Management-for-Efficient-Fast-Device-Placement.pdf
  • H Zhang, Z Chen, XLY Liu, J Wu, L Wang, Nov 2024, Dynamic Fast Device Placement Strategies for Real-Time Resource Allocation, https://www.researchgate.net/profile/Haoran-Zhang-111/publication/385589353_Dynamic_Fast_Device_Placement_Strategies_for_Real-Time_Resource_Allocation/links/672b9ca977f274616d60a5e6/Dynamic-Fast-Device-Placement-Strategies-for-Real-Time-Resource-Allocation.pdf
  • Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis, 5 Nov 2024, AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution, https://arxiv.org/abs/2411.03519 (Scheduling multiple agents.)
  • Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 18–35. https://doi.org/10.1145/3698038.3698523 https://dl.acm.org/doi/abs/10.1145/3698038.3698523
  • Yuka Ikarashi, Kevin Qian, Samir Droubi, Alex Reinking, Gilbert Bernstein, Jonathan Ragan-Kelley, 14 Nov 2024 (v2), Exo 2: Growing a Scheduling Language, https://arxiv.org/abs/2411.07211
  • M. Gil et al., "TLP Balancer: Predictive Thread Allocation for Multi-Tenant Inference in Embedded GPUs," in IEEE Embedded Systems Letters, doi: 10.1109/LES.2024.3497587. https://ieeexplore.ieee.org/abstract/document/10753458/
  • Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki, 12 Nov 2024, The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving, https://arxiv.org/abs/2411.07447
  • Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
  • Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu, 24 Nov 2024, Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems, https://arxiv.org/abs/2411.15715
  • He, Y., Xu, M., Wu, J., Zheng, W., Ye, K., Xu, C. (2025). UELLM: A Unified and Efficient Approach for Large Language Model Inference Serving. In: Gaaloul, W., Sheng, M., Yu, Q., Yangui, S. (eds) Service-Oriented Computing. ICSOC 2024. Lecture Notes in Computer Science, vol 15404. Springer, Singapore. https://doi.org/10.1007/978-981-96-0805-8_16 https://link.springer.com/chapter/10.1007/978-981-96-0805-8_16
  • Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen, 17 Dec 2024, A System for Microserving of LLMs, https://arxiv.org/abs/2412.12488 (Disaggregated prefill and decoding combined with context cache migration for sending the KV cache over the network.)
  • Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze, 2 Jan 2025, FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving, https://arxiv.org/abs/2501.01005
  • Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini, 5 Jan 2025, TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms, https://arxiv.org/abs/2501.02600
  • Liu Qianli, Hong Zicong, Chen Fahao, Li Peng, Guo Song, 12 Jan 2025, Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management, https://arxiv.org/abs/2501.06709
  • Archit Patke, Dhemath Reddy, Saurabh Jha, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer, 14 Jan 2025, Hierarchical Autoscaling for Large Language Model Serving with Chiron, https://arxiv.org/abs/2501.08090 (Scheduling for interactive versus batch queries.)
  • Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica, 24 Jan 2025, Locality-aware Fair Scheduling in LLM Serving, https://arxiv.org/abs/2501.14312 (Scheduling taking into account prefix cache availability based on locality.)
  • Gohar Irfan Chaudhry, Esha Choukse, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, Ricardo Bianchini, 29 Jan 2025 (v2), Towards Resource-Efficient Compound AI Systems, https://arxiv.org/abs/2501.16634
  • Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou, 13 Feb 2025 (v2), Online Scheduling for LLM Inference with KV Cache Constraints, https://arxiv.org/abs/2502.07115
  • Youhe Jiang, Ran Yan, Binhang Yuan, 11 Feb 2025, HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment, https://arxiv.org/abs/2502.07903
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Bowen Pang, Kai Li, Ruifeng She, Feifan Wang, 14 Feb 2025, Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization, https://arxiv.org/abs/2502.15763
  • Kai Mei, Wujiang Xu, Shuhang Lin, Yongfeng Zhang, 27 Feb 2025, ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving, https://arxiv.org/abs/2502.20576 https://github.com/agiresearch/ECCOS
  • Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu, 28 Feb 2025, ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs, https://arxiv.org/abs/2502.21231 (Addressing training inefficiencies when training data ranges from short to very long queries, including via hybrid data parallelism and communications optimizations.)
  • Amr Elmeleegy, Harry Kim, David Zier, Kyle Kranen, Neelay Shah, Ryan Olson and Omri Kahalon, Mar 18, 2025, Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models, https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
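
Several of the papers above (e.g., MuxServe, Mooncake, MemServe, P/D-Serve) disaggregate the prefill and decoding phases and hand the KV cache from one worker pool to the other. As a rough, hypothetical illustration of that idea only, and not the design of any of those systems, the hand-off can be sketched as two workers connected by a queue. All names here (KVCache, prefill_worker, decode_worker) are assumptions for illustration.

    # Illustrative sketch of disaggregated prefill/decode serving.
    # All names here (KVCache, prefill_worker, decode_worker) are hypothetical
    # placeholders, not the API of any system cited above.
    import queue
    import threading

    class KVCache:
        """Stands in for the key/value tensors produced during prefill."""
        def __init__(self, request_id, data):
            self.request_id = request_id
            self.data = data

    def prefill_worker(prompts, handoff):
        # Compute-bound phase: process the full prompt once, then emit its KV cache.
        while True:
            request_id, prompt = prompts.get()
            if request_id is None:                   # shutdown sentinel
                handoff.put((None, None))
                return
            kv = KVCache(request_id, data=f"kv[{len(prompt)} tokens]")
            handoff.put((request_id, kv))            # transfer KV cache to the decode pool

    def decode_worker(handoff, results):
        # Memory-bound phase: generate tokens one at a time from the received KV cache.
        while True:
            request_id, kv = handoff.get()
            if request_id is None:
                return
            results[request_id] = f"decoded from {kv.data}"

    if __name__ == "__main__":
        prompts, handoff, results = queue.Queue(), queue.Queue(), {}
        threading.Thread(target=prefill_worker, args=(prompts, handoff)).start()
        decoder = threading.Thread(target=decode_worker, args=(handoff, results))
        decoder.start()
        prompts.put((1, ["What", "is", "disaggregation", "?"]))
        prompts.put((None, None))                    # signal shutdown
        decoder.join()
        print(results)                               # {1: 'decoded from kv[4 tokens]'}

In real systems the hand-off is a KV cache transfer over NVLink, RDMA, or the network rather than an in-process queue, and the scheduler decides how many workers to devote to each phase.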

Load Balancing

Research papers on AI load balancing:

Networking

Research papers on networking optimizations for LLMs:

AI Tech Stack

Research on AI tech stacks:

More AI Research

Read more about: