Aussie AI

Serving and Deployment

  • Last Updated 17 November, 2025
  • by David Spuler, Ph.D.

Serving

Serving is the practical matter of how to architect the full production application around the LLM. Other components may include a web server, application server, RAG datastore, retriever, load balancer, and more. Several serving techniques also affect the speed of inference (a minimal serving-loop sketch follows the list):

  • Batching
  • Prefill versus decoding phase
  • Scheduling
  • Load balancing
  • Frameworks (backend)
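
To make these concrete, the following is a minimal sketch of a batched serving loop that separates the prefill and decode phases. It is illustrative only: the model interface (prefill, decode_step, eos_token) and the request dictionary layout are hypothetical placeholders, not the API of any particular serving framework.

    # Minimal sketch of a batched serving loop with separate prefill and decode
    # phases. The model interface (prefill/decode_step/eos_token) and the request
    # dict layout are hypothetical placeholders, not a real framework's API.
    import queue

    def serve_batches(model, request_queue: "queue.Queue[dict]", max_batch: int = 8):
        while True:
            # Batching: block for one request, then admit up to max_batch in total.
            batch = [request_queue.get()]
            while len(batch) < max_batch and not request_queue.empty():
                batch.append(request_queue.get())

            # Prefill phase: process each prompt once to build its KV cache.
            for req in batch:
                req["kv_cache"] = model.prefill(req["prompt"])
                req["output"] = []

            # Decode phase: one new token per request per iteration until done.
            while batch:
                for req in list(batch):
                    tok = model.decode_step(req["kv_cache"])
                    req["output"].append(tok)
                    if tok == model.eos_token or len(req["output"]) >= req["max_new"]:
                        batch.remove(req)  # request finished; slot is released

A production engine would also stream tokens back to callers, bound the waiting time for the first request in a batch, and typically admit new requests mid-batch (continuous batching, covered below).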

Research on LLM Serving

Recently, there has been an explosion of papers about the practical aspects of deployment, orchestration, and serving of LLM inference. Here are some of the papers (a toy sketch of prefill/decode disaggregation, a recurring theme in this work, appears after the list):

  • Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
  • Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman, 2024, Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling, https://guanh01.github.io/files/2024hpdc-loki.pdf
  • Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
  • Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 18 May 2024, The CAP Principle for LLM Serving, https://arxiv.org/abs/2405.11299
  • Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
  • Paula Rooney, 14 May 2024, Private cloud makes its comeback, thanks to AI, CIO, https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html
  • Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
  • Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
  • Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li, 9 May 2024, From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks, https://arxiv.org/abs/2405.06038
  • Vinod Vijay Nigade, Latency-Critical Inference Serving for Deep Learning, Ph.D. Thesis, VRIJE UNIVERSITEIT, Netherlands, https://research.vu.nl/ws/portalfiles/portal/258499994/phdthesis-vinodvufinal+4+-+65043c3f62dc9.pdf
  • Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan, 2024, HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment. https://openreview.net/pdf?id=9ANyvRtFGa Code: https://github.com/Relaxed-System-Lab/HexGen
  • Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
  • Grant Wilkins, 3 June 202, Online Workload Allocation and Energy Optimization in Large Language Model Inference Systems, Master of Philosophy in Advanced Computer Science, Churchill College, University of Cambridge, https://grantwilkins.github.io/gfw27_project.pdf
  • David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang, 7 Jun 2024, Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction, https://arxiv.org/abs/2406.04785
  • Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 5 Jun 2024, Llumnix: Dynamic Scheduling for Large Language Model Serving, https://arxiv.org/abs/2406.03243 Code: https://github.com/AlibabaPAI/llumnix
  • Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
  • Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Shengkun Cui, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer, 2024, One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving, https://haoran-qiu.com/pdf/qlm-preprint.pdf
  • Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
  • Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
  • Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
  • Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
  • Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
  • Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
  • Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
  • Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
  • Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP ’24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
  • Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
  • Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
  • Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
  • Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang, 8 Aug 2024, Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning, https://arxiv.org/abs/2408.04323
  • Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris, 5 Aug 2024, SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving, https://arxiv.org/abs/2408.05235
  • Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park, 10 Aug 2024, LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale, https://arxiv.org/abs/2408.05499
  • Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
  • Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
  • Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, July 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11905-11917, 2024, https://proceedings.mlr.press/v235/duan24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/duan24a/duan24a.pdf Code: https://github.com/hao-ai-lab/MuxServe
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
  • The SGLang Team, Jul 25, 2024, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), https://lmsys.org/blog/2024-07-25-sglang-llama3/
  • Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
  • Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
  • Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
  • Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang, 28 Aug 2024, Efficient LLM Scheduling by Learning to Rank, https://arxiv.org/abs/2408.15792 https://github.com/hao-ai-lab/vllm-ltr.git
  • Lightning AI, 2024, Serve LLMs, https://lightning.ai/docs/litserve/features/serve-llms
  • Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
  • Aparna Dhinakaran, Sep 2024, Choosing Between LLM Agent Frameworks. The tradeoffs between building bespoke code-based agents and the major agent frameworks. https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
  • Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
  • Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
  • Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
  • Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou, 30 Sep 2024, The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems, https://arxiv.org/abs/2409.20002
  • Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
  • Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
  • OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
  • Baolin Li, April 2024, Making Machine Learning on HPC Systems Cost-Effective and Carbon-Friendly, Ph.D. Thesis, The Department of Electrical and Computer Engineering, Computer Engineering, Northeastern University, Boston, Massachusetts, https://repository.library.northeastern.edu/files/neu:4f248m902/fulltext.pdf
  • Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408
  • Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer, 15 Jul 2024 (v2), Learned Best-Effort LLM Serving, https://arxiv.org/abs/2401.07886
  • Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
  • Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
  • Mastering LLM, Aug 17, 2024, How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? https://masteringllm.medium.com/how-much-gpu-memory-is-needed-to-serve-a-large-languagemodel-llm-b1899bb2ab5d
  • Youpeng Zhao, Jun Wang, 31 Oct 2024, ALISE: Accelerating Large Language Model Serving with Speculative Scheduling, https://arxiv.org/abs/2410.23537
  • Yan Zhuang, Zhenzhe Zheng, Fan Wu, and Guihai Chen. 2024. LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 521–534. https://doi.org/10.1145/3666025.3699355 https://dl.acm.org/doi/abs/10.1145/3666025.3699355
  • Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, Ion Stoica, 3 Nov 2024, SkyServe: Serving AI Models across Regions and Clouds with Spot Instances, https://arxiv.org/abs/2411.01438
  • R Mendoza, I Cruz, P Singh, A Martinez, N Kim, S Patel, Nov 2024, Dynamic Resource Management for Efficient Fast Device Placement https://www.researchgate.net/profile/Priya-Singh-103/publication/385528236_Dynamic_Resource_Management_for_Efficient_Fast_Device_Placement/links/672983c3ecbbde716b584acc/Dynamic-Resource-Management-for-Efficient-Fast-Device-Placement.pdf
  • H Zhang, Z Chen, XLY Liu, J Wu, L Wang, Nov 2024, Dynamic Fast Device Placement Strategies for Real-Time Resource Allocation, https://www.researchgate.net/profile/Haoran-Zhang-111/publication/385589353_Dynamic_Fast_Device_Placement_Strategies_for_Real-Time_Resource_Allocation/links/672b9ca977f274616d60a5e6/Dynamic-Fast-Device-Placement-Strategies-for-Real-Time-Resource-Allocation.pdf
  • OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
  • Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 18–35. https://doi.org/10.1145/3698038.3698523 https://dl.acm.org/doi/abs/10.1145/3698038.3698523
  • Haiying Shen, Tanmoy Sen, 10 Nov 2024, EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving, https://arxiv.org/abs/2411.06364
  • Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki, 12 Nov 2024, The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving, https://arxiv.org/abs/2411.07447
  • Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan, 24 Nov 2024, Ensuring Fair LLM Serving Amid Diverse Applications, https://arxiv.org/abs/2411.15997
  • Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
  • Ao Shen, Zhiyao Li, Mingyu Gao, 27 Nov 2024, FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving, https://arxiv.org/abs/2411.18424
  • Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
  • He, Y., Xu, M., Wu, J., Zheng, W., Ye, K., Xu, C. (2025). UELLM: A Unified and Efficient Approach for Large Language Model Inference Serving. In: Gaaloul, W., Sheng, M., Yu, Q., Yangui, S. (eds) Service-Oriented Computing. ICSOC 2024. Lecture Notes in Computer Science, vol 15404. Springer, Singapore. https://doi.org/10.1007/978-981-96-0805-8_16 https://link.springer.com/chapter/10.1007/978-981-96-0805-8_16
  • Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen, 17 Dec 2024, A System for Microserving of LLMs, https://arxiv.org/abs/2412.12488 (Disaggregated prefill and decoding combined with context cache migration for sending the KV cache over the network.)
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long, 24 Dec 2024, Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels, https://arxiv.org/abs/2412.18106
  • Y Xiao, Dec 2024, Optimizing the Serving System for Large Language Model Inference, https://charlie-xiao.github.io/assets/pdf/projects/fluidinfer.pdf (Concatenates or splits batches for higher throughput.)
  • Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli, 14 Jan 2025, PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving, https://arxiv.org/abs/2501.08192
  • Desen Sun, Zepeng Zhao, Yuke Wang, 16 Jan 2025, PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving, https://arxiv.org/abs/2501.09253
  • Can Wang, Dianbo Sui, Bolin Zhang, Xiaoyu Liu, Jiabao Kang, Zhidong Qiao, Zhiying Tu, Jan 2025, A Framework for Effective Invocation Methods of Various LLM Services, Proceedings of the 31st International Conference on Computational Linguistics, pages 6953–6965, January 19–24, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.coling-main.464.pdf
  • Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, Neeraja J. Yadwadkar, 8 Jan 2025, iServe: An Intent-based Serving System for LLMs, https://arxiv.org/abs/2501.13111 (Flexible LLM serving based on prioritizing latency versus cost.)
  • Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Jie Meng, Chao He, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan, 27 Jan 2025 (v2), DeepFlow: Serverless Large Language Model Serving at Scale, https://arxiv.org/abs/2501.14417
  • Ting Sun, Penghan Wang, Fan Lai, 15 Jan 2025, HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location, https://arxiv.org/abs/2501.14808
  • Xiaozhe Yao, Qinghao Hu, Ana Klimovic, 1 Nov 2024 (v2), DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs, https://arxiv.org/abs/2312.05215 (Serve multiple fine-tuned models with full parameters by using deltas/diffs, rather than PEFT or multi-LoRA.)
  • Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, Rodrigo Fonseca, 2 Feb 2025, Towards Efficient Large Multimodal Model Serving, https://arxiv.org/abs/2502.00937 (Disaggregating or "decoupling" the different stages of multimodal LLM inference, not only prefill and decoding, but also the multimodal-specific bottlenecks in cross-attention and image encoding.)
  • Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 586–602. https://doi.org/10.1145/3669940.3707215 https://dl.acm.org/doi/abs/10.1145/3669940.3707215
  • Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, Aman Gupta, 7 Feb 2025. LLM Query Scheduling with Prefix Reuse and Latency Constraints, https://arxiv.org/abs/2502.04677
  • Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki, 13 Feb 2025, ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments, https://arxiv.org/abs/2502.09334
  • Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St. Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan, 20 Feb 2025, Serving Models, Fast and Slow:Optimizing Heterogeneous LLM Inferencing Workloads at Scale, https://arxiv.org/abs/2502.14617
  • Alex Fazio, Feb 2025, How to Build an LLM Chat App: The New Litmus Test for Junior Devs, https://x.com/alxfazio/status/1893242657331101976 (How to build a wrapper chat app that scales by taking care of message queueing, API rate limits, history database management, caching, and other real-world deployment issues.)
  • Junsoo Kim, Hunjong Lee, Geonwoo Ko, Gyubin Choi, Seri Ham, Seongmin Hong, Joo-Young Kim, 6 Mar 2025, ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput, https://arxiv.org/abs/2503.04253
  • Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, Ion Stoica, 24 Mar 2025, Jenga: Effective Memory Management for Serving LLM with Heterogeneity, https://arxiv.org/abs/2503.18292
  • AK Kakolyris, D Masouros, P Vavaroutsos, S Xydis, April 2025, throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving, https://microlab.ntua.gr/wp-content/uploads/2025/03/throttLLeM_HPCA25.pdf https://github.com/WilliamBlaskowicz/throttLL-eM
  • Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, Xin Jin, 15 May 2025, ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production, https://arxiv.org/abs/2505.09999
  • Shaoyu Wang, Guangrong He, Geon-Woo Kim, Yanqi Zhou, Seo Jin Park, 13 May 2025, Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony, https://arxiv.org/abs/2505.08944
  • Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo, 19 Apr 2025, Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management, https://arxiv.org/abs/2505.03756
  • Azam Ikram, Xiang Li, Sameh Elnikety, Saurabh Bagchi, 30 Apr 2025 (v2), Ascendra: Dynamic Request Prioritization for Efficient LLM Serving, https://arxiv.org/abs/2504.20828
  • Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai, 24 Apr 2025, Tempo: Application-aware LLM Serving with Mixed SLO Requirements, https://arxiv.org/abs/2504.20068
  • Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Nicholas D. Lane, Binhang Yuan, 4 Jun 2025, Cascadia: A Cascade Serving System for Large Language Models, https://arxiv.org/abs/2506.04203
  • Xiannan Hu, Tianyou Zeng, Xiaoming Yuan, Liwei Song, Guangyuan Zhang, Bangzheng He, 6 Jun 2025, BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures, https://arxiv.org/abs/2506.05871
  • Jingfeng Wu, Yiyuan He, Minxian Xu, Xitong Gao, Kejiang Ye, Chengzhong Xu, 24 Jul 2025, Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling, https://arxiv.org/abs/2507.18006
  • Minxian Xu, Junhan Liao, Jingfeng Wu, Yiyuan He, Kejiang Ye, Chengzhong Xu, 24 Jul 2025, Cloud Native System for LLM Inference Serving, https://arxiv.org/abs/2507.18007
  • Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye, 23 Jul 2025, BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving, https://arxiv.org/abs/2507.17120
  • Jianmin Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu, 23 Jul 2025, BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs, https://arxiv.org/abs/2507.17133 (MoE serving optimization.)
  • Bodun Hu, Shuozhe Li, Saurabh Agarwal, Myungjin Lee, Akshay Jajoo, Jiamin Li, Le Xu, Geon-Woo Kim, Donghyun Kim, Hong Xu, Amy Zhang, Aditya Akella, Aug 2025, StitchLLM: Serving LLMs, One Block at a Time, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26887–26903, July 27–August 1, 2025, https://aclanthology.org/2025.acl-long.1305.pdf
  • Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu, July 2025, Weaver: Efficient Multi-LLM Serving with Attention Offloading, 2025 USENIX Annual Technical Conference. July 7–9, 2025, Boston, MA, USA, https://www.usenix.org/conference/atc25/presentation/gao https://www.usenix.org/system/files/atc25-gao.pdf
  • Wenxin Zhang, Yueying Li, Tianyi Peng, Ciamac C. Moallemi, July 2025, Tail-Optimized Caching for LLM Inference, https://openreview.net/pdf?id=R3DICTGOkJ
  • Xiaoxiang Shi, Colin Cai, Junjia Du, 16 Jul 2025 (v4), Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving, https://arxiv.org/abs/2507.06608
  • Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li, 2 Jul 2025, EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices, https://arxiv.org/abs/2507.01438
  • Jiangsu Du, Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Kaiyi Wu, Zhiguang Chen, Yutong Lu, 25 Apr 2025, EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration, https://arxiv.org/abs/2504.18154
  • Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)
  • Wei Da, Evangelia Kalyvianaki, 5 Aug 2025, Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling, https://arxiv.org/abs/2508.03611
  • Yicheng Feng, Xin Tan, Kin Hang Sew, Yimin Jiang, Yibo Zhu, Hong Xu, 5 Aug 2025, Frontier: Simulating the Next Generation of LLM Inference Systems, https://arxiv.org/abs/2508.03148
  • Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić, 21 Jul 2025, TorchAO: PyTorch-Native Training-to-Serving Model Optimization, https://arxiv.org/abs/2507.16099
  • Juntao Zhao, Jiuru Li, Chuan Wu, 19 May 2025, Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving, https://arxiv.org/abs/2507.18454
  • Kan Zhu, Haiyang Shi, Le Xu, Jiaxin Shan, Arvind Krishnamurthy, Baris Kasikci, Liguang Xie, 17 Jul 2025, PolyServe: Efficient Multi-SLO Serving at Scale, https://arxiv.org/abs/2507.17769
  • Chang Xiao, Brenda Yang, 23 Jul 2025, Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving, https://arxiv.org/abs/2504.17999
  • Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, Carlee Joe-Wong, 11 Aug 2025, Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation, https://arxiv.org/abs/2508.07675
  • Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 27 Jul 2025, TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput, https://arxiv.org/abs/2406.14066
  • Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu, 26 Jul 2025, MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism, https://arxiv.org/abs/2504.02263
  • Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández, 30 Jul 2025, Insights into resource utilization of code small language models serving with runtime engines and execution providers, https://arxiv.org/abs/2412.15441
  • Lingyu Jiang, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, Kazunori D Yamada, 3 Aug 2025, KANMixer: Can KAN Serve as a New Modeling Core for Long-term Time Series Forecasting?, https://arxiv.org/abs/2508.01575
  • Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park, 4 Aug 2025, Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving, https://arxiv.org/abs/2507.10178
  • Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He, 6 Aug 2025, FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design, https://arxiv.org/abs/2508.04405
  • Meixuan Wang, Yinyu Ye, Zijie Zhou, 8 Aug 2025, LLM Serving Optimization with Variable Prefill and Decode Lengths, https://arxiv.org/abs/2508.06133
  • Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral, 11 Aug 2025, Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving, https://arxiv.org/abs/2508.08343
  • Mohammed Saqr, Kamila Misiejuk, Sonsoles López-Pernas, 3 Aug 2025, Human-AI collaboration or obedient and often clueless AI in instruct, serve, repeat dynamics?, https://arxiv.org/abs/2508.10919
  • Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, and Dingwen Tao, 15 Aug 2025, ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism, https://arxiv.org/abs/2507.10069
  • Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan, 21 Aug 2025, HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling, https://arxiv.org/abs/2508.15919
  • Zhixiang Wei, James Yen, Jingyi Chen, Ziyang Zhang, Zhibai Huang, Chen Chen, Xingzi Yu, Yicheng Gu, Chenggang Wu, Yun Wang, Mingyuan Xia, Jie Wu, Hao Wang, Zhengwei Qi, 19 Aug 2025, Equinox: Holistic Fair Scheduling in Serving Large Language Models, https://arxiv.org/abs/2508.16646
  • Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin, 24 Aug 2025, TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving, https://arxiv.org/abs/2508.17219
  • Wenbo Sun, Qiming Guo, Wenlu Wang, Rihan Hai, 25 Aug 2025, TranSQL+: Serving Large Language Models with SQL on Low-Resource Hardware, https://arxiv.org/abs/2502.02818
  • Yifan Yu, Yu Gan, Nikhil Sarda, Lillian Tsai, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler, 4 Sep 2025, IC-Cache: Efficient Large Language Model Serving via In-context Caching, https://arxiv.org/abs/2501.12689
  • Fangzhou Wu, Sandeep Silwal, 2 Sep 2025, Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving, https://arxiv.org/abs/2509.02718
  • Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee, 26 Aug 2025, Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics, https://arxiv.org/abs/2508.18736
  • Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, and Eunjoo Joen, 1 Sep 2025, DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving, https://arxiv.org/abs/2509.01083
  • Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo, 1 Sep 2025, LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving, https://arxiv.org/abs/2509.01229
  • Xiaoniu Song, Zihang Zhong, Rong Chen, Haibo Chen, 1 Sep 2025, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
  • Fei Fang, Yifan Hua, Shengze Wang, Ruilin Zhou, Yi Liu, Chen Qian, Xiaoxue Zhang, 30 Aug 2025, GenTorrent: Scaling Large Language Model Serving with An Overlay Network, https://arxiv.org/abs/2504.20101
  • Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee, 8 Sep 2025, FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving, https://arxiv.org/abs/2509.06261
  • Aleksa Gordić, August 29, 2025, Inside vLLM: Anatomy of a High-Throughput LLM Inference System: From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale https://www.aleksagordic.com/blog/vllm
  • Hamid Ahmad, Heiko Paulheim, Rita T. Sousa, 9 Sep 2025, Bio-KGvec2go: Serving up-to-date Dynamic Biomedical Knowledge Graph Embeddings, https://arxiv.org/abs/2509.07905
  • Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang (University of Illinois Urbana-Champaign; Tsinghua University), 5 Sep 2025, VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving, https://arxiv.org/abs/2509.04827
  • Ira Ceka, Feitong Qiao, Anik Dey, Aastha Valecha, Gail Kaiser, Baishakhi Ray, 11 Sep 2025, Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection, https://arxiv.org/abs/2412.12039
  • Dong Liu, Yanxuan Yu, 28 Aug 2025, TinyServe: Query-Aware Cache Selection for Efficient LLM Serving, https://arxiv.org/abs/2509.12211
  • Runyu Lu, Shiqi He, Wenxuan Tan, Shenggui Li, Ruofan Wu, Jeff J. Ma, Ang Chen, Mosharaf Chowdhury, 2 Oct 2025, TetriServe: Efficient DiT Serving for Heterogeneous Image Generation, https://arxiv.org/abs/2510.01565
  • Kevin Kuo, Chhavi Yadav, Virginia Smith, 14 Oct 2025, Research in Collaborative Learning Does Not Serve Cross-Silo Federated Learning in Practice, https://arxiv.org/abs/2510.12595
  • Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh, 27 Oct 2025, REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving, https://arxiv.org/abs/2506.01374
  • Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein, 22 Oct 2025, Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs, https://arxiv.org/abs/2507.00418
  • Tianhua Xia, Sai Qian Zhang, 16 Oct 2025, Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing, https://arxiv.org/abs/2510.16040
  • Xingyu Fan, Feifei Li, Wenhui Que, Hailong Li, 22 Sep 2025, One Agent to Serve All: a Lite-Adaptive Stylized AI Assistant for Millions of Multi-Style Official Accounts, https://arxiv.org/abs/2509.17788
  • Shiju Zhao and Junhao Hu and Rongxiao Huang and Jiaqi Zheng and Guihai Chen, 20 Sep 2025, MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving, https://arxiv.org/abs/2502.01960
  • Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner, 27 Oct 2025, Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving, https://arxiv.org/abs/2510.23346
  • Kayhan Behdin, Qingquan Song, Sriram Vasudevan, Jian Sheng, Xiaojing Ma, Z Zhou, Chuanrui Zhu, Guoyao Li, Chanh Nguyen, Sayan Ghosh, Hejian Sang, Ata Fatahi Baarzi, Sundara Raman Ramachandran, Xiaoqing Wang, Qing Lan, Vinay Y S, Qi Guo, Caleb Johnson, Zhipeng Wang, Fedor Borisyuk, 25 Oct 2025, Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search, https://arxiv.org/abs/2510.22101
  • Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder, 26 Oct 2025, Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems, https://arxiv.org/abs/2502.14305
  • Rongxin Cheng and Yuxin Lai and Xingda Wei and Rong Chen and Haibo Chen, 8 Oct 2025, KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving, https://arxiv.org/abs/2412.18169
  • Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen, 3 Oct 2025, TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling, https://arxiv.org/abs/2510.02758
  • Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu, 21 Oct 2025, Reasoning Language Model Inference Serving Unveiled: An Empirical Study, https://arxiv.org/abs/2510.18672
  • Yue Duan, Lei Qi, Yinghuan Shi, Yang Gao, 25 Sep 2025, An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering, https://arxiv.org/abs/2509.20976
  • Yuanyuan Yang, Ruimin Zhang, Jamie Morgenstern, Haifeng Xu, 26 Sep 2025, T-TAMER: Provably Taming Trade-offs in ML Serving, https://arxiv.org/abs/2509.22992
  • Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan, 25 Sep 2025, PARS: Low-Latency LLM Serving via Pairwise Learning-to-Rank, https://arxiv.org/abs/2510.03243
  • Yufei Li, Yu Fu, Yue Dong, Cong Liu, 28 Sep 2025, MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment, https://arxiv.org/abs/2510.03283
  • Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang, 4 Oct 2025, Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading, https://arxiv.org/abs/2502.05370
  • Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia, 23 Oct 2025, FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees, https://arxiv.org/abs/2402.18789
  • Sayan Mandal and Hua Jiang, 11 Oct 2025, Grounded AI for Code Review: Resource-Efficient Large-Model Serving in Enterprise Pipelines, https://arxiv.org/abs/2510.10290
  • Gunjun Lee and Jiwon Kim and Jaiyoung Park and Younjoo Lee and Jung Ho Ahn, 9 Oct 2025, From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill, https://arxiv.org/abs/2510.08055
  • Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang, 28 Aug 2025, AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving, https://arxiv.org/abs/2509.00105
  • Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai, 7 Oct 2025, Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting, https://arxiv.org/abs/2510.05497
  • Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang, 6 Oct 2025, Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving, https://arxiv.org/abs/2510.05245
  • Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia, 7 Oct 2025, From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs, https://arxiv.org/abs/2510.05632
  • Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim, 16 Oct 2025, MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving, https://arxiv.org/abs/2510.14557
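
Several of the systems above (e.g., MuxServe, MemServe, Mooncake, and P/D-Serve) disaggregate the prefill and decode phases onto separate worker pools. The toy sketch below shows the basic control flow only, using hypothetical prefill_model/decode_model interfaces and in-process queues standing in for the network transfer of the KV cache; it is not the implementation of any of the cited systems.

    # Toy sketch of prefill/decode disaggregation: prefill workers build the KV
    # cache and hand it to decode workers, so the two phases can be scheduled and
    # scaled independently. All interfaces here are hypothetical placeholders.
    import queue

    prefill_queue = queue.Queue()   # incoming requests (prompts)
    decode_queue = queue.Queue()    # (request, kv_cache) pairs ready to decode

    def prefill_worker(prefill_model):
        while True:
            req = prefill_queue.get()
            kv_cache = prefill_model.prefill(req["prompt"])   # compute-bound phase
            decode_queue.put((req, kv_cache))  # in practice: ship the KV cache over the network

    def decode_worker(decode_model):
        while True:
            req, kv_cache = decode_queue.get()
            req["output"] = []
            while len(req["output"]) < req["max_new"]:        # memory-bound phase
                tok = decode_model.decode_step(kv_cache)
                req["output"].append(tok)
                if tok == decode_model.eos_token:
                    break

    # Wiring (illustrative): run separate pools of each worker type, e.g.:
    #   threading.Thread(target=prefill_worker, args=(prefill_model,), daemon=True).start()
    #   threading.Thread(target=decode_worker, args=(decode_model,), daemon=True).start()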

Deployment

Research on LLM deployment:

Batching

Research papers on batching:

Continuous Batching

Research papers on continuous batching:
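
For background, continuous (iteration-level) batching admits new requests into the running batch between decode steps, instead of waiting for the whole batch to finish, which keeps batch slots full when output lengths vary. The toy sketch below shows only the scheduling idea, using the same hypothetical model and request interfaces as the serving-loop sketch earlier on this page.

    # Illustrative sketch of continuous (iteration-level) batching. The model
    # interface (prefill/decode_step/eos_token) and request dict layout are
    # hypothetical placeholders, not any specific framework's API.
    def continuous_batching_loop(model, request_queue, max_batch: int = 16):
        running = []  # requests currently in the decode phase
        while True:
            # Admit new requests into free slots; block only if nothing is running.
            while len(running) < max_batch:
                if running and request_queue.empty():
                    break
                req = request_queue.get()  # blocks when the batch is empty
                req["kv_cache"] = model.prefill(req["prompt"])
                req["output"] = []
                running.append(req)

            # One decode iteration across the whole running batch.
            for req in list(running):
                tok = model.decode_step(req["kv_cache"])
                req["output"].append(tok)
                if tok == model.eos_token or len(req["output"]) >= req["max_new"]:
                    running.remove(req)  # slot freed immediately for a waiting request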

Frameworks

Research on inference frameworks as part of serving:

Serverless

Scheduling

Load Balancing

Research papers on AI load balancing:
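
As a simple illustration, one common policy routes each incoming request to the least-loaded model replica. The toy example below uses a queued-token count as a stand-in load metric; real balancers may also weigh KV cache memory pressure, prefix-cache locality, and SLO deadlines.

    # Toy example of least-loaded routing across model replicas. The queued-token
    # count is a stand-in load metric for illustration only.
    def pick_replica(replicas):
        """Route the next request to the replica with the fewest queued tokens."""
        return min(replicas, key=lambda r: r["queued_tokens"])

    replicas = [
        {"name": "gpu-0", "queued_tokens": 4096},
        {"name": "gpu-1", "queued_tokens": 1024},
        {"name": "gpu-2", "queued_tokens": 2048},
    ]
    target = pick_replica(replicas)  # selects "gpu-1", the least-loaded replica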

Networking

Research papers on networking optimizations for LLMs:

AI Tech Stack

Research on AI tech stacks:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: