Aussie AI
Serving and Deployment
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Serving
Serving is the practical matter of how to architect the full production application around the LLM. Other components may include a web server, application server, RAG datastore, retriever, load balancer, and more. Several serving-level techniques also affect the speed of inference (a minimal sketch of how these pieces fit together follows the list below):
- Batching
- Prefill versus decoding phase
- Scheduling
- Load balancing
- Frameworks (backend)
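To make these ideas concrete, here is a minimal sketch of a batched serving loop showing where batching and the prefill/decode split sit. The model.prefill() and model.decode_step() calls are hypothetical placeholders rather than any framework's real API; production engines such as vLLM or TensorRT-LLM implement these steps with far more care around scheduling, memory management, and fairness.

```python
# Minimal sketch of a batched serving loop (illustrative only).
import queue
import time
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int = 64
    output: list = field(default_factory=list)

def serve_loop(model, request_queue: "queue.Queue[Request]",
               max_batch_size: int = 8, batch_window_s: float = 0.01):
    """Toy loop: batch waiting requests, run prefill once per request,
    then generate tokens step by step for the whole batch."""
    while True:
        # 1. Batching: gather up to max_batch_size requests within a short window.
        batch = []
        deadline = time.time() + batch_window_s
        while len(batch) < max_batch_size and time.time() < deadline:
            try:
                batch.append(request_queue.get(timeout=batch_window_s))
            except queue.Empty:
                break
        if not batch:
            continue
        # 2. Prefill phase: process all prompt tokens in one pass (compute-bound).
        states = [model.prefill(r.prompt) for r in batch]         # hypothetical call
        # 3. Decoding phase: emit one token per request per step (memory-bound).
        for _ in range(max(r.max_new_tokens for r in batch)):
            for r, state in zip(batch, states):
                if len(r.output) < r.max_new_tokens:
                    r.output.append(model.decode_step(state))     # hypothetical call
```

In a full deployment, a load balancer and scheduler would sit in front of many such loops, one per GPU or model replica, which is where the serving research below comes in.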
Research on LLM Serving
Recently, there has been an explosion of papers about the practical aspects of deployment, orchestration, and serving of LLM inference. Here are some of the papers:
- Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
- Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman, 2024, Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling, https://guanh01.github.io/files/2024hpdc-loki.pdf
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 18 May 2024, The CAP Principle for LLM Serving, https://arxiv.org/abs/2405.11299
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Paula Rooney, 14 May 2024, Private cloud makes its comeback, thanks to AI, CIO, https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html
- Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li, 9 May 2024, From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks, https://arxiv.org/abs/2405.06038
- Vinod Vijay Nigade, Latency-Critical Inference Serving for Deep Learning, Ph.D. Thesis, VRIJE UNIVERSITEIT, Netherlands, https://research.vu.nl/ws/portalfiles/portal/258499994/phdthesis-vinodvufinal+4+-+65043c3f62dc9.pdf
- Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan, 2024, HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment. https://openreview.net/pdf?id=9ANyvRtFGa Code: https://github.com/Relaxed-System-Lab/HexGen
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Grant Wilkins, 3 June 2024, Online Workload Allocation and Energy Optimization in Large Language Model Inference Systems, Master of Philosophy in Advanced Computer Science, Churchill College, University of Cambridge, https://grantwilkins.github.io/gfw27_project.pdf
- David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang, 7 Jun 2024, Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction, https://arxiv.org/abs/2406.04785
- Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 5 Jun 2024, Llumnix: Dynamic Scheduling for Large Language Model Serving, https://arxiv.org/abs/2406.03243 Code: https://github.com/AlibabaPAI/llumnix
- Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Shengkun Cui, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer, 2024, One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving, https://haoran-qiu.com/pdf/qlm-preprint.pdf
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
- Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
- Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP ’24: August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
- Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
- Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang, 8 Aug 2024, Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning, https://arxiv.org/abs/2408.04323
- Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris, 5 Aug 2024, SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving, https://arxiv.org/abs/2408.05235
- Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park, 10 Aug 2024, LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale, https://arxiv.org/abs/2408.05499
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, July 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR 235:11905-11917, https://proceedings.mlr.press/v235/duan24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/duan24a/duan24a.pdf Code: https://github.com/hao-ai-lab/MuxServe
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- The SGLang Team, Jul 25, 2024, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), https://lmsys.org/blog/2024-07-25-sglang-llama3/
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang, 28 Aug 2024, Efficient LLM Scheduling by Learning to Rank, https://arxiv.org/abs/2408.15792 https://github.com/hao-ai-lab/vllm-ltr.git
- Lightning AI, 2024, Serve LLMs, https://lightning.ai/docs/litserve/features/serve-llms
- Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
- Aparna Dhinakaran, Sep 2024, Choosing Between LLM Agent Frameworks. The tradeoffs between building bespoke code-based agents and the major agent frameworks. https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
- Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
- Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou, 30 Sep 2024, The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems, https://arxiv.org/abs/2409.20002
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Baolin Li, April 2024, Making Machine Learning on HPC Systems Cost-Effective and Carbon-Friendly, Ph.D. Thesis, The Department of Electrical and Computer Engineering, Computer Engineering, Northeastern University, Boston, Massachusetts, https://repository.library.northeastern.edu/files/neu:4f248m902/fulltext.pdf
- Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408
- Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer, 15 Jul 2024 (v2), Learned Best-Effort LLM Serving, https://arxiv.org/abs/2401.07886
- Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
- Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
- Mastering LLM, Aug 17, 2024, How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? https://masteringllm.medium.com/how-much-gpu-memory-is-needed-to-serve-a-large-languagemodel-llm-b1899bb2ab5d
- Youpeng Zhao, Jun Wang, 31 Oct 2024, ALISE: Accelerating Large Language Model Serving with Speculative Scheduling, https://arxiv.org/abs/2410.23537
- Yan Zhuang, Zhenzhe Zheng, Fan Wu, and Guihai Chen. 2024. LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 521–534. https://doi.org/10.1145/3666025.3699355 https://dl.acm.org/doi/abs/10.1145/3666025.3699355
- Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, Ion Stoica, 3 Nov 2024, SkyServe: Serving AI Models across Regions and Clouds with Spot Instances, https://arxiv.org/abs/2411.01438
- R Mendoza, I Cruz, P Singh, A Martinez, N Kim, S Patel, Nov 2024, Dynamic Resource Management for Efficient Fast Device Placement https://www.researchgate.net/profile/Priya-Singh-103/publication/385528236_Dynamic_Resource_Management_for_Efficient_Fast_Device_Placement/links/672983c3ecbbde716b584acc/Dynamic-Resource-Management-for-Efficient-Fast-Device-Placement.pdf
- H Zhang, Z Chen, XLY Liu, J Wu, L Wang, Nov 2024, Dynamic Fast Device Placement Strategies for Real-Time Resource Allocation, https://www.researchgate.net/profile/Haoran-Zhang-111/publication/385589353_Dynamic_Fast_Device_Placement_Strategies_for_Real-Time_Resource_Allocation/links/672b9ca977f274616d60a5e6/Dynamic-Fast-Device-Placement-Strategies-for-Real-Time-Resource-Allocation.pdf
- OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
- Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 18–35. https://doi.org/10.1145/3698038.3698523 https://dl.acm.org/doi/abs/10.1145/3698038.3698523
- Haiying Shen, Tanmoy Sen, 10 Nov 2024, EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving, https://arxiv.org/abs/2411.06364
- Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki, 12 Nov 2024, The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving, https://arxiv.org/abs/2411.07447
- Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan, 24 Nov 2024, Ensuring Fair LLM Serving Amid Diverse Applications, https://arxiv.org/abs/2411.15997
- Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
- Ao Shen, Zhiyao Li, Mingyu Gao, 27 Nov 2024, FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving, https://arxiv.org/abs/2411.18424
- Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Deployment
Research on LLM deployment:
- Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman, 2024, Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling, https://guanh01.github.io/files/2024hpdc-loki.pdf
- Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944
- Paula Rooney, 14 May 2024, Private cloud makes its comeback, thanks to AI, CIO, https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html
- Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
- JH Jones, May 2024, A Quantitative Comparison of Pre-Trained Model Registries to Traditional Software Package Registries, Masters Thesis, Electrical and Computer Engineering, Purdue University, https://hammer.purdue.edu/articles/thesis/A_Quantitative_Comparison_of_Pre-Trained_Model_Registries_to_Traditional_Software_Package_Registries/25686447/1 PDF: https://hammer.purdue.edu/ndownloader/files/46096152
- Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Cohere Toolkit, https://github.com/cohere-ai/cohere-toolkit (A set of open source components for RAG architectures.)
- Ahmed Menshawy, Zeeshan Nawaz, Mahmoud Fahmy, April 2024, Navigating Challenges and Technical Debt in Large Language Models Deployment, EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, Pages 192–199, https://doi.org/10.1145/3642970.3655840 https://dl.acm.org/doi/abs/10.1145/3642970.3655840 PDF Slides: https://www.cl.cam.ac.uk/research/srg/netos/euromlsys2024/slides/P_5_27.pdf
- Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury, 25 Apr 2024, Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services, https://arxiv.org/abs/2404.16283 (Scheduling GPU activity for multiple queries to ensure good UI experience for text-streaming outputs like chatbots.)
- Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li, 9 May 2024, From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks, https://arxiv.org/abs/2405.06038
- Josef Pichlmeier, Philipp Ross, Andre Luckow, 22 Apr 2024, Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification, https://arxiv.org/abs/2404.15153
- Konstantinos Papaioannou, Thaleia Dimitra Doudali, April 2024, The Importance of Workload Choice in Evaluating LLM Inference Systems, EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, April 2024, Pages 39–46, https://doi.org/10.1145/3642970.3655823 https://dl.acm.org/doi/abs/10.1145/3642970.3655823
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer-skipping scheme that drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
- Vinod Vijay Nigade, Latency-Critical Inference Serving for Deep Learning, Ph.D. Thesis, VRIJE UNIVERSITEIT, Netherlands, https://research.vu.nl/ws/portalfiles/portal/258499994/phdthesis-vinodvufinal+4+-+65043c3f62dc9.pdf
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
- Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
- Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo, T Zhang, 2023, Deep Learning Workload Scheduling in GPU Datacenters: A Survey, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3638757
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Meenu Mary John; Helena Holmström Olsson; Jan Bosch, 2020, AI Deployment Architecture: Multi-Case Study for Key Factor Identification, 2020 27th Asia-Pacific Software Engineering Conference (APSEC), https://ieeexplore.ieee.org/abstract/document/9359253
- Meenu Mary John, Helena Holmström Olsson, Jan Bosch, 2020, Architecting AI Deployment: A Systematic Review of State-of-the-Art and State-of-Practice Literature, ICSOB 2020: Software Business, pp 14–29, https://link.springer.com/chapter/10.1007/978-3-030-67292-8_2
- Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
- Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan, 2024, HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment. https://openreview.net/pdf?id=9ANyvRtFGa Code: https://github.com/Relaxed-System-Lab/HexGen
- Ali Rahmanian, Doctoral Thesis, April 2024, Edge Orchestration for Latency-Sensitive Applications, Department of Computing Science, Umea University, Sweden, https://www.diva-portal.org/smash/get/diva2:1849510/FULLTEXT02.pdf
- Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 19 Mar 2024 (v2), DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670 (Optimizing LLMs differently in the prefill and decoding phases.)
- Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136 (Deployment of LLMs on heterogeneous GPUs and also differences between the two phases of decoder-only Transformers: prefill and decoding computations.)
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, 2 Apr 2024, MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving, https://arxiv.org/abs/2404.02015
- Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 12 Mar 2024, Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648 (Analysis of deployment and LLMOps issues in a 6-month production deployment.)
- Apple, June 2022, Deploying Transformers on the Apple Neural Engine, https://machinelearning.apple.com/research/neural-engine-transformers Code: https://github.com/apple/ml-ane-transformers
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- Chang, Xiangyu; Miraj Ahmed, Sk; Krishnamurthy, Srikanth V.; Guler, Basak; Swami, Ananthram; Oymak, Samet; Roy-Chowdhury, Amit K., Jan 2024, Plug-and-Play Transformer Modules for Test-Time Adaptation, https://arxiv.org/abs/2401.04130 https://ui.adsabs.harvard.edu/abs/2024arXiv240104130C/abstract
- Tal Peretz, 15 NOV 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
- Andrew Starc, Feb 22 2024, Mantel Group survey reveals AI challenges of large Australian businesses, CRN, https://www.crn.com.au/news/mantel-group-survey-reveals-ai-challenges-of-large-australian-businesses-605376
- Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, Austin Z. Henley, 21 Dec 2023, Building Your Own Product Copilot: Challenges, Opportunities, and Needs, https://arxiv.org/abs/2312.14231
- Jacob Robbins, January 4, 2024, Why generative AI orchestration startups are poised for growth in 2024, Pitch Book, https://pitchbook.com/news/articles/generative-ai-orchestration-startups-venture-capital-unicorns
- Eberhard Hechler, Martin Oberhofer, Thomas Schaeck, 2020, Deploying AI in the Enterprise, Book, https://link.springer.com/book/10.1007/978-1-4842-6206-1
- Teresa Tung, June 2023, 7 architecture considerations for generative AI, Accenture, https://www.accenture.com/us-en/blogs/cloud-computing/7-generative-ai-architecture-considerations
- Hayden Wolff, Jun 02, 2024, A Simple Guide to Deploying Generative AI with NVIDIA NIM, NVIDIA Technical Blog, https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/
- Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
- David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Kirill Kolodiazhnyi, May 15, 2020, Hands-On Machine Learning with C++: Build, train, and deploy end-to-end machine learning and deep learning pipelines, https://www.amazon.com/Hands-Machine-Learning-end-end/dp/1789955335/
- Deci Engineering Team, September 28, 2021, 5 Factors that Impact the Inference Pipeline in Production + Hardware Usage Metrics, https://deci.ai/blog/optimize-inference-pipeline-production/
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
- Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 5 Jun 2024, Llumnix: Dynamic Scheduling for Large Language Model Serving, https://arxiv.org/abs/2406.03243 Code: https://github.com/AlibabaPAI/llumnix
- Fabian Both, June 2024, Why we no longer use LangChain for building our AI agents, https://www.octomind.dev/blog/why-we-no-longer-use-langchain-for-building-our-ai-agents (Replaces LangChain with their own more-focused internal tool sets.)
- Waleed Kadous, August 23, 2023, Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper, https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper Code: https://github.com/anyscale/factuality-eval
- Louis-François Bouchard, Louie Peters, May 2024, Chapter 11: Deployment, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
- Aarushi Kansal, Chapter 7: Monitoring, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
- Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee, 21 Jun 2024 (v4), Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs, https://arxiv.org/abs/2402.10517 Code: https://github.com/SNU-ARC/any-precision-llm
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Intel, Jul 24, 2024, Generative AI Fundamentals: Deploying LLMs with OpenVINO™, OpenVINO™ toolkit, https://medium.com/openvino-toolkit/generative-ai-fundamentals-deploying-llms-with-openvino-3057861f6feb
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Abhinand, Aug 20, 2024, Self-Hosting LLaMA 3.1 70B (or any ~70B LLM) Affordably, https://abhinand05.medium.com/self-hosting-llama-3-1-70b-or-any-70b-llm-affordably-2bd323d72f8d
- Dom Couldwell, Sep 03, 2024 Dealing with ‘day two’ issues in generative AI deployments, https://www.infoworld.com/article/3493255/dealing-with-day-two-issues-in-generative-ai-deployments.html
- Lightning AI, 2024, Serve LLMs, https://lightning.ai/docs/litserve/features/serve-llms
- Evan Schuman, 01 May 2024, LLM deployment flaws that catch IT by surprise, https://www.computerworld.com/article/2095216/llm-deployment-flaws-that-catch-it-by-surprise.html
- Michael Nuñez, September 10, 2024, Is Anthropic’s new ‘Workspaces’ feature the future of enterprise AI management? https://venturebeat.com/ai/is-anthropics-new-workspaces-feature-the-future-of-enterprise-ai-management/
- Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, 19 May 2022 (v3), Challenges in Deploying Machine Learning: a Survey of Case Studies, ACM Comput. Surv., Vol. 55, No. 6, Article 114, December 2022. https://doi.org/10.1145/3533378 https://arxiv.org/abs/2011.09926 https://dl.acm.org/doi/fullHtml/10.1145/3533378#Bib0005
- Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
- Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
- Michael J. Zellinger, Matt Thomson, 3 Oct 2024, Efficiently Deploying LLMs with Controlled Risk, https://arxiv.org/abs/2410.02173
- Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
- Mastering LLM, Aug 17, 2024, How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? https://masteringllm.medium.com/how-much-gpu-memory-is-needed-to-serve-a-large-languagemodel-llm-b1899bb2ab5d
- Fan Yang, Zehao Wang, Haoyu Zhang, Zhenhua Zhu, Xinhao Yang, Guohao Dai, Yu Wang, Oct 2024, Efficient Deployment of Large Language Model across Cloud-Device Systems, https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/f06a14c1-4d6d-441d-b4e4-82545ac5781b.pdf
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- Alina Mailach, Sebastian Simon, Johannes Dorn, Norbert Siegmund, 13 Nov 2024, Practitioners' Discussions on Building LLM-based Applications for Production, https://arxiv.org/abs/2411.08574
- Sonal Prabhune, Donald J. Berndt, 7 Nov 2024, Deploying Large Language Models With Retrieval Augmented Generation, https://arxiv.org/abs/2411.11895
- Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
- Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Batching
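Batching amortizes weight loading and kernel-launch overhead by serving several requests in one forward pass. As a toy illustration (assuming token IDs are already available; the pad_id and masking scheme here are simplified assumptions, not any particular library's convention), variable-length prompts can be padded into a single tensor so one batched pass serves every request:

```python
import torch

def pad_batch(token_id_lists, pad_id=0):
    """Right-pad variable-length prompts into one [batch, seq_len] tensor plus an
    attention mask, so a single forward pass can serve every request in the batch."""
    max_len = max(len(toks) for toks in token_id_lists)
    ids = torch.full((len(token_id_lists), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros_like(ids)
    for i, toks in enumerate(token_id_lists):
        ids[i, :len(toks)] = torch.tensor(toks, dtype=torch.long)
        mask[i, :len(toks)] = 1
    return ids, mask

# Three requests of different lengths become one (3, 4) batch:
ids, mask = pad_batch([[5, 8, 2], [7, 1], [9, 4, 6, 3]])
# logits = model(input_ids=ids, attention_mask=mask)   # one batched forward pass
```

Real serving stacks go well beyond this with dynamic, selective, and in-flight batching, which is what many of the papers below study.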
Research papers on batching:
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
- D Shin, May 8, 2024, Multi-User Language Model Resource Allocation Using Contextual Pause Token Aware Transformers, Technical Disclosure Commons, https://www.tdcommons.org/dpubs_series/6981/ PDF: https://www.tdcommons.org/cgi/viewcontent.cgi?article=8121&context=dpubs_series (Interesting idea of training a model how and when to pause during inference, so it can be pre-empted if needed, and thus the overall system can schedule batching of multiple queries more optimally.)
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Shashank Verma and Neal Vaidya, Nov 17, 2023, Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Qidong Su, Christina Giannoula, Gennady Pekhimenko, Oct 2023, The Synergy of Speculative Decoding and Batching in Serving Large Language Models, https://arxiv.org/abs/2310.18813 (Optimizing by adapting dynamically the length of the speculated sequence in batches.)
- Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn, April 2024, AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Pages 103–119, https://doi.org/10.1145/3620665.3640422
- Gyeongin Yu, Geon-Woo Kim, Joo Seong Jeong, Soo Jeong Kim, Byung-Gon Chun, 2022, Selective Batching for Inference System for Transformer-Based Generation Tasks, U.S. Patent, US20230177401A1 https://patents.google.com/patent/US20230177401A1/en
- Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang, 7 Jun 2024, Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction, https://arxiv.org/abs/2406.04785
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao, 23 Jun 2024 (v2), Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Disaggregates the KV cache between prefill and decoding tokens, since the KV cache size is known for prefill, thereby reducing memory fragmentation, and also applies kernel fusion to several modules including the scaled dot product attention.)
- Kartik Talamadupula, March 4, 2024, A Guide to LLM Inference Performance Monitoring, https://symbl.ai/developers/blog/a-guide-to-llm-inference-performance-monitoring/
- Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
- Myung Beom Her, Jisu Jeong, Hojoon Song, Ji-Hyeong Han, 5 Jul 2024, Batch Transformer: Look for Attention in Batch, https://arxiv.org/abs/2407.04218
- Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang, 25 Jul 2024, An Efficient Inference Framework for Early-exit Large Language Models, https://arxiv.org/abs/2407.20272 (Faster early exit using batching and KV cache resolution.)
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- S. Selvam, A. Nagarajan, A. Raghunathan, 16 August 2024, Efficient Batched Inference in Conditional Neural Networks, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3445263, https://ieeexplore.ieee.org/abstract/document/10638141 Code: https://github.com/surya00060/BatchCond
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- S. Selvam, A. Nagarajan and A. Raghunathan, 2024, Efficient Batched Inference in Conditional Neural Networks, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3445263, https://ieeexplore.ieee.org/document/10638141
- O. Khan, J. Yu, Y. Kim and E. Seo, 2024, Efficient Adaptive Batching of DNN Inference Services for Improved Latency, 2024 International Conference on Information Networking (ICOIN), Ho Chi Minh City, Vietnam, 2024, pp. 197-200, doi: 10.1109/ICOIN59985.2024.10572152, https://ieeexplore.ieee.org/document/10572152
- Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, 2018, Accelerating Deep Learning Frameworks with Micro-Batches, 2018 IEEE International Conference on Cluster Computing (CLUSTER), Pages: 402-412, DOI Bookmark: 10.1109/CLUSTER.2018.00058, https://www.computer.org/csdl/proceedings-article/cluster/2018/831900a402/17D45WHONl5
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Cade Daniel, Chen Shen, Eric Liang and Richard Liaw , June 22, 2023, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, https://www.anyscale.com/blog/continuous-batching-llm-inference
- Zihao Ye,, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, Luis Ceze, Feb 2, 2024 , Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, https://flashinfer.ai/2024/02/02/cascade-inference.html
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- Jiayi Liu, Tinghan Yang, Jennifer Neville, 17 Feb 2024, CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness, https://arxiv.org/abs/2402.14833
- Lightning AI, 2024, Batching, https://lightning.ai/docs/litserve/features/batching
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release Latest, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
- Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
- Michael Nuñez, October 8, 2024, Anthropic challenges OpenAI with affordable batch processing, https://venturebeat.com/ai/anthropic-challenges-openai-with-affordable-batch-processing/
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
- Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
- Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang, 24 Oct 2024, BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching, https://arxiv.org/abs/2410.18701
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- Don Moon, Aug 28, 2024, Chunked Prefill and Decode-Maximal Batching https://medium.com/byte-sized-ai/llm-inference-optimizations-2-chunked-prefill-764407b3a67a
- Ming Yin, Minshuo Chen, Kaixuan Huang, Mengdi Wang, 30 Oct 2024, A Theoretical Perspective for Speculative Decoding Algorithm, https://arxiv.org/abs/2411.00841
- Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
- Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng, 29 Nov 2024, BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching, https://arxiv.org/abs/2412.03594
- Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
- Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh, 7 Dec 2024, Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression, https://arxiv.org/abs/2412.05693 (KV cache compression in prefill or prompt processing phase.)
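The batching papers above trade per-request latency against overall throughput. As a rough illustration of why batching raises throughput, here is a minimal Python sketch with purely illustrative cost numbers (not taken from any of the papers above): a fixed per-step cost of streaming the model weights is amortized across the whole batch, plus a small per-sequence cost.

# Toy model of batched decoding throughput (illustrative numbers only).
FIXED_STEP_COST_MS = 20.0  # assumed cost to stream weights once per decode step
PER_SEQ_COST_MS = 0.5      # assumed extra cost per sequence in the batch

def step_latency_ms(batch_size: int) -> float:
    return FIXED_STEP_COST_MS + PER_SEQ_COST_MS * batch_size

def throughput_tok_per_sec(batch_size: int) -> float:
    return batch_size * 1000.0 / step_latency_ms(batch_size)

for b in (1, 4, 16, 64):
    print(f"batch={b:3d}  step latency={step_latency_ms(b):5.1f} ms  "
          f"throughput={throughput_tok_per_sec(b):7.1f} tokens/sec")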
Continuous Batching
Research papers on continuous batching (a toy scheduling sketch follows the list):
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- Sharada Yeluri, 20 Feb 2024, LLM Inference - HW/SW Optimizations, https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
- Cade Daniel, Chen Shen, Eric Liang and Richard Liaw, June 22, 2023, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, https://www.anyscale.com/blog/continuous-batching-llm-inference
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn, 2 Sep 2024, Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching, https://arxiv.org/abs/2409.01141
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
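Continuous (in-flight) batching admits new requests into the running batch as soon as earlier sequences finish, rather than waiting for the whole batch to drain. The sketch below is a toy version of that scheduling loop; the Seq and ToyModel classes are stand-ins invented for illustration, not the API of any framework listed above.

import random
from collections import deque

MAX_BATCH = 8

class Seq:
    """A toy request that finishes after a random number of decode steps."""
    def __init__(self, rid):
        self.rid = rid
        self.remaining = random.randint(1, 10)
    @property
    def finished(self):
        return self.remaining <= 0

class ToyModel:
    def decode_step(self, batch):
        # Stand-in for one forward pass that emits one token per active sequence.
        for seq in batch:
            seq.remaining -= 1

def continuous_batching_loop(model, queue):
    active = []
    steps = 0
    while queue or active:
        # Admit waiting requests into free batch slots as soon as slots open up.
        while queue and len(active) < MAX_BATCH:
            active.append(queue.popleft())
        model.decode_step(active)
        steps += 1
        # Retire finished sequences immediately instead of draining the whole batch.
        active = [s for s in active if not s.finished]
    return steps

queue = deque(Seq(i) for i in range(20))
print("decode steps:", continuous_batching_loop(ToyModel(), queue))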
Frameworks
Research on inference frameworks as part of serving (a framework-agnostic client sketch follows the list):
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Martin Thissen, April 20, 2024, Llama 3 on Your Local Computer | Free GPT-4 Alternative, https://medium.com/@martin-thissen/llama-3-on-your-local-computer-free-gpt-4-alternative-1f533e9abff7 (Llama3-70B with 4-bit quantization using vLLM for inference on NVIDIA RTX 6000 Ada GPU.)
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Pierrick Pochelu, 9 Oct 2022, Deep Learning Inference Frameworks Benchmark, https://arxiv.org/abs/2210.04323 (Benchmarking study in 2022 of various frameworks.)
- Max A. Cherney, March 26, 2024, Exclusive: Behind the plot to break Nvidia's grip on AI by targeting software, https://www.reuters.com/technology/behind-plot-break-nvidias-grip-ai-by-targeting-software-2024-03-25/
- Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang, Sep 2023, Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations, https://arxiv.org/pdf/2309.08978.pdf
- Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
- Suresh G, Sep 25, 2023, 7 Frameworks for Serving LLMs, Medium, https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88
- Doug Eadline, October 5, 2023, How AMD May Get Across the CUDA Moat, HPC Wire, https://www.hpcwire.com/2023/10/05/how-amd-may-get-across-the-cuda-moat/
- Hayden Wolff, Jun 02, 2024, A Simple Guide to Deploying Generative AI with NVIDIA NIM, NVIDIA Technical Blog, https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/
- K Dinghofer, F Hartung, 2020, Analysis of criteria for the selection of machine learning frameworks, 2020 International Conference on Computing, Networking and Communications (ICNC), https://ieeexplore.ieee.org/document/9049650
- H Dai, X Peng, X Shi, L He, Q Xiong, H Jin, 2022, Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment, Science China Information Sciences volume 65, Article number: 112103 (2022), https://link.springer.com/article/10.1007/s11432-020-3182-1 http://scis.scichina.com/en/2022/112103.pdf
- C Luo, X He, J Zhan, L Wang, W Gao, J Dai, 2020, Comparison and benchmarking of AI models and frameworks on mobile devices, https://arxiv.org/abs/2005.05085
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- R. Sanchez-Iborra and A. F. Skarmeta, Tinyml-enabled frugal smart objects: Challenges and opportunities, IEEE Circuits and Systems Magazine, vol. 20, no. 3, pp. 4–18, 2020. https://ieeexplore.ieee.org/document/9166461 PDF: https://sci-hub.se/10.1109/MCAS.2020.3005467
- R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/
- M. Giordano, L. Piccinelli, and M. Magno, Survey and comparison of milliwatts micro controllers for tiny machine learning at the edge, in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2022, pp. 94–97. https://ieeexplore.ieee.org/document/9870017
- Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Fabian Both, June 2024, Why we no longer use LangChain for building our AI agents, https://www.octomind.dev/blog/why-we-no-longer-use-langchain-for-building-our-ai-agents (Replaces LangChain with their own more-focused internal tool sets.)
- LiLMod, Aug 27, 2024, Haystack: the new LLM framework that is shaking its competitors, https://ai.plainenglish.io/haystack-the-new-llm-framework-that-is-shaking-its-competitors-1a083a153fd9
- Aparna Dhinakaran, Sep 2024, Choosing Between LLM Agent Frameworks. The tradeoffs between building bespoke code-based agents and the major agent frameworks. https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
- Nicola Sessions, Oct 15, 2024, DataStax Announces New AI Development Platform, Built with NVIDIA AI, https://developer.nvidia.com/blog/datastax-announces-new-ai-development-platform-built-with-nvidia-ai/
- Anurag Guda and Shruthii Sathyanarayanan, Oct 16, 2024, Simplify AI Application Development with NVIDIA Cloud Native Stack, https://developer.nvidia.com/blog/simplify-ai-application-development-with-nvidia-cloud-native-stack/
- Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo, 17 Oct 2024, Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, https://arxiv.org/abs/2410.13848 https://github.com/deepseek-ai/Janus?tab=readme-ov-file
- Robert Corwin Nov 2024, Running Large Language Models Privately: A comparison of frameworks, models, and costs, https://towardsdatascience.com/running-large-language-models-privately-a-comparison-of-frameworks-models-and-costs-ac33cfe3a462
- Kristian McCann, November 13, 2024, Top 10 AI Frameworks, https://aimagazine.com/articles/top-10-ai-frameworks
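Many of the serving frameworks covered above can expose an OpenAI-compatible HTTP endpoint, which keeps application code framework-agnostic. Below is a minimal client sketch; the URL, port, and model name are placeholder assumptions, and the exact route should be checked against the chosen framework's documentation.

import json
import urllib.request

# Placeholder endpoint: many serving frameworks can expose an OpenAI-compatible
# /v1/completions route, but check the chosen framework's documentation.
URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "my-model",  # placeholder model name
    "prompt": "Explain continuous batching in one sentence.",
    "max_tokens": 64,
}
request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
print(result["choices"][0]["text"])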
Serverless
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
- David Linthicum, July 2, 2024, Serverless cloud technology fades away, InfoWorld, https://www.infoworld.com/article/3715605/serverless-cloud-technology-fades-away.html
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Joe Oakley, Hakan Ferhatosmanoglu, 22 Mar 2024, FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication, https://arxiv.org/abs/2403.15195
- Hao Wu, Yue Yu, Junxiao Deng, Shadi Ibrahim, Song Wu, Hao Fan, Ziyue Cheng, Hai Jin, 2024, StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow, USENIX ATC 2024, https://www.usenix.org/conference/atc24/presentation/wu-hao PDF: https://www.usenix.org/system/files/atc24-wu-hao.pdf
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, ServerlessLLM: Low-Latency Serverless Inference for Large Language Models, 2024, OSDI 2024, https://www.usenix.org/conference/osdi24/presentation/fu
- Google, 2024, L’Oréal: Launching Gen AI as a Service in 3 months with Cloud Run and LangChain, https://services.google.com/fh/files/misc/google_loreal_with_langchain_case_study.pdf
- Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt, 1 Sep 2023, FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference, https://arxiv.org/abs/2309.00558
- C. Lu et al., "SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing," in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Atlanta, GA, United States, 2024, pp. 590-606, doi: 10.1109/SC41406.2024.00044. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a590/21HUVxvcnoA
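A recurring theme in the serverless papers above is cold-start latency: the model must be loaded before the first request can be answered. A minimal sketch of the usual mitigation, caching a lazily loaded model inside an otherwise stateless handler, is shown below; the handler signature and loader are illustrative assumptions, not any particular cloud provider's API.

import time

_MODEL = None  # cached across warm invocations of the same instance

def load_model():
    # Stand-in for an expensive checkpoint load (the serverless cold start).
    time.sleep(2.0)
    return lambda prompt: f"echo: {prompt}"

def handler(event):
    """Stateless request handler: only the first call on a warm instance
    pays the model-loading cost."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL(event["prompt"])

print(handler({"prompt": "hello"}))   # cold: pays the load cost
print(handler({"prompt": "again"}))   # warm: reuses the cached model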
Scheduling
- Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
- Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
- Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP ’24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- vLLM, 2024, Performance and Tuning, https://docs.vllm.ai/en/latest/models/performance.html
- Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
- Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan, 24 Aug 2024, Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling, https://arxiv.org/abs/2408.13510
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181
- Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang, 28 Aug 2024, Efficient LLM Scheduling by Learning to Rank, https://arxiv.org/abs/2408.15792 https://github.com/hao-ai-lab/vllm-ltr.git
- Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
- Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
- Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt, 1 Sep 2023, FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference, https://arxiv.org/abs/2309.00558
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
- Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
- Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher, 1 Oct 2024, Don't Stop Me Now: Embedding Based Scheduling for LLMs, https://arxiv.org/abs/2410.01035
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
- S Durvasula, A Zhao, R Kiguru, Y Guan, Z Chen, Oct 2024, ACE: Efficient GPU Kernel Concurrency for Input-Dependent Irregular Computational Graphs, PACT ’24, October 14–16, 2024, Southern California, CA, USA, https://www.embarclab.com/static/media/ace.1c73b44bc2ad143f7b9f.pdf (Identify parallel kernels at runtime.)
- Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden, 23 Oct 2024, Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs, https://arxiv.org/abs/2410.17840
- Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers https://arxiv.org/abs/2410.17957
- Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher, 23 Oct 2024, Efficient Inference for Augmented Large Language Models, https://arxiv.org/abs/2410.18248
- Youpeng Zhao, Jun Wang, 31 Oct 2024, ALISE: Accelerating Large Language Model Serving with Speculative Scheduling, https://arxiv.org/abs/2410.23537
- R Mendoza, I Cruz, P Singh, A Martinez, N Kim, S Patel, Nov 2024, Dynamic Resource Management for Efficient Fast Device Placement https://www.researchgate.net/profile/Priya-Singh-103/publication/385528236_Dynamic_Resource_Management_for_Efficient_Fast_Device_Placement/links/672983c3ecbbde716b584acc/Dynamic-Resource-Management-for-Efficient-Fast-Device-Placement.pdf
- H Zhang, Z Chen, XLY Liu, J Wu, L Wang, Nov 2024, Dynamic Fast Device Placement Strategies for Real-Time Resource Allocation, https://www.researchgate.net/profile/Haoran-Zhang-111/publication/385589353_Dynamic_Fast_Device_Placement_Strategies_for_Real-Time_Resource_Allocation/links/672b9ca977f274616d60a5e6/Dynamic-Fast-Device-Placement-Strategies-for-Real-Time-Resource-Allocation.pdf
- Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis, 5 Nov 2024, AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution https://arxiv.org/abs/2411.03519 (Scheduling multiple agents.)
- Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 18–35. https://doi.org/10.1145/3698038.3698523 https://dl.acm.org/doi/abs/10.1145/3698038.3698523
- Yuka Ikarashi, Kevin Qian, Samir Droubi, Alex Reinking, Gilbert Bernstein, Jonathan Ragan-Kelley, 14 Nov 2024 (v2), Exo 2: Growing a Scheduling Language, https://arxiv.org/abs/2411.07211
- M. Gil et al., "TLP Balancer: Predictive Thread Allocation for Multi-Tenant Inference in Embedded GPUs," in IEEE Embedded Systems Letters, doi: 10.1109/LES.2024.3497587. https://ieeexplore.ieee.org/abstract/document/10753458/
- Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki, 12 Nov 2024, The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving, https://arxiv.org/abs/2411.07447
- Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
- Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu, 24 Nov 2024, Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems, https://arxiv.org/abs/2411.15715
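Several of the scheduling papers above (for example, Prophet and the learning-to-rank scheduler) reduce head-of-line blocking by prioritizing requests that are expected to finish quickly. The sketch below is a toy shortest-job-first scheduler keyed on a predicted output length; how that prediction is obtained is out of scope, and the class is illustrative only, not any paper's implementation.

import heapq
import itertools

class ShortestJobFirstScheduler:
    """Toy scheduler that serves requests with the smallest predicted
    output length first (how the prediction is made is out of scope)."""
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker for equal predictions

    def submit(self, request_id, predicted_output_tokens):
        heapq.heappush(self._heap,
                       (predicted_output_tokens, next(self._tie), request_id))

    def next_request(self):
        if not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

sched = ShortestJobFirstScheduler()
sched.submit("chat-long", predicted_output_tokens=900)
sched.submit("autocomplete", predicted_output_tokens=12)
sched.submit("summary", predicted_output_tokens=200)
print(sched.next_request())  # "autocomplete" is served first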
Load Balancing
Research papers on AI load balancing (a minimal least-loaded balancer sketch follows the list):
- Grant Wilkins, 3 June 2024, Online Workload Allocation and Energy Optimization in Large Language Model Inference Systems, Master of Philosophy in Advanced Computer Science, Churchill College, University of Cambridge, https://grantwilkins.github.io/gfw27_project.pdf
- David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- J Liu, 2024, Data-driven Performance Optimization for Data-intensive Applications, Ph.D. Thesis, Electrical Engineering and Computer Science, University of California, Merced, https://escholarship.org/content/qt6gn2p8mn/qt6gn2p8mn.pdf (Optimization of data movement intensive algorithms, mostly non-AI applications.)
- Muhammad Shahir Abdurrahman, Stanford University, Stanford, California, USA, An Efficient Network Orchestrator for Distributed Compound Language Model Systems, https://www.scs.stanford.edu/24sp-cs244b/projects/An_Efficient_Network_Orchestrator_for_Distributed_Compound_Language_Model_Systems.pdf
- David Spuler, March 2024, Load Balancing, in Generative AI in C++, https://www.aussieai.com/book/ch7-load-balancer
- N Kim, P Singh, S Patel, A Martinez, I Cruz, R Mendoza, Oct 2024, Dynamic Load Balancing Techniques for Efficient Fast Device Placement, https://www.researchgate.net/profile/Priya-Singh-103/publication/385103680_Dynamic_Load_Balancing_Techniques_for_Efficient_Fast_Device_Placement/links/6716aa9a68ac304149aa2fa6/Dynamic-Load-Balancing-Techniques-for-Efficient-Fast-Device-Placement.pdf
- Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
- Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos, 8 Nov 2024, AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality, https://arxiv.org/abs/2411.05555
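As a minimal illustration of the load-balancing idea in the papers above, the sketch below routes each request to the model replica with the fewest in-flight requests; the replica names are placeholders, and real balancers would also account for queue depth, KV cache locality, and heterogeneous hardware.

class LeastLoadedBalancer:
    """Route each request to the replica with the fewest in-flight requests."""
    def __init__(self, replicas):
        self.in_flight = {r: 0 for r in replicas}

    def acquire(self):
        replica = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[replica] += 1
        return replica

    def release(self, replica):
        self.in_flight[replica] -= 1

lb = LeastLoadedBalancer(["gpu-0", "gpu-1", "gpu-2"])
first = lb.acquire()    # "gpu-0"
second = lb.acquire()   # "gpu-1"
lb.release(first)
print(lb.acquire())     # "gpu-0" again: fewest in-flight requests (ties break by insertion order)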
Networking
Research papers on networking optimizations for LLMs (a back-of-the-envelope KV cache transfer calculation follows the list):
- Ari Lotter, Jeffrey Quesnelle, Umer H. Adil, Dillon Rolnick, Esteban La Rocca, 2024, A Preliminary Report on DisTrO, https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf https://venturebeat.com/wp-content/uploads/2024/08/A_Preliminary_Report_on_DisTrO.pdf (Reducing the inter-GPU networking bandwidth cost during training.)
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
- Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training. A review of the challenges in synchronous distributed training and best solutions for stragglers and high latency. https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng Li, 24 Nov 2024, Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution, https://arxiv.org/abs/2411.15871
- Greg Gutmann, Sep 2020, Peer-to-peer Memory Copy with NVLink: CUDA Feature Testing, https://codingbyexample.com/2020/09/14/p2p-memcpy-with-nvlink/
- Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (Extension to ADAM optimizer that greatly reduces network communication in training.)
- Markus Rabe, Carl Case, November 14, 2024, Rethinking LLM Inference: Why Developer AI Needs a Different Approach, https://www.augmentcode.com/blog/rethinking-llm-inference-why-developer-ai-needs-a-different-approach
- Y Tang, R Cheng, P Zhou, T Liu, F Liu, W Tang, K Bae, Nov 2024, Exploring CXL-based KV Cache Storage for LLM Serving, https://mlforsystems.org/assets/papers/neurips2024/paper17.pdf
- Carl Franzen, August 27, 2024, ‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency, https://venturebeat.com/ai/this-could-change-everything-nous-research-unveils-new-tool-to-train-powerful-ai-models-with-10000x-efficiency/
- Carl Franzen, December 2, 2024, Nous Research is training an AI model using machines distributed across the internet, https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet/
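Several of the entries above (for example, the CXL-based KV cache storage and I/O-aware KV recomputation work) involve moving KV caches or activations across a link, so effective bandwidth matters. Below is a back-of-the-envelope sketch of that transfer cost using the standard KV cache size formula; the model dimensions and link bandwidth are illustrative assumptions, not taken from any of the papers.

# Back-of-the-envelope KV cache transfer time (illustrative numbers only).
layers = 32          # assumed number of transformer layers
kv_heads = 8         # assumed KV heads (grouped-query attention)
head_dim = 128       # assumed head dimension
seq_len = 4096       # prompt length in tokens
bytes_per_value = 2  # FP16

# Keys and values (the factor of 2), per layer, per head, per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
kv_gb = kv_bytes / 1e9

link_gb_per_s = 25.0  # assumed effective inter-node bandwidth in GB/s
print(f"KV cache: {kv_gb:.2f} GB, transfer ~{1000 * kv_gb / link_gb_per_s:.1f} ms")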
AI Tech Stack
Research on AI tech stacks:
- Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
- Matt Murphy, Tim Tully, Grace Ge, Derek Xiao, Katie Keller, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/?tpcc=NL_Marketing
- David Spuler, March 2024, Chapter 5. Design Choices & Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- MongoDB, Jun 20, 2024, Understanding the AI Stack In the Era of Generative AI: Exploring the Layers and Components of Today’s AI Applications https://medium.com/mongodb/understanding-the-ai-stack-in-the-era-of-generative-ai-f1fcd66e1393
- Akash Bajwa and Chia Jeng Yang, May 27, 2024, The RAG Stack: Featuring Knowledge Graphs: Reducing Hallucinations To Make LLMs Production-Grade With Complex RAG, https://akashbajwa.substack.com/p/the-rag-stack-featuring-knowledge
- Melissa Malec, June 5, 2024, AI Orchestration Explained: The What, Why & How for 2024, https://hatchworks.com/blog/gen-ai/ai-orchestration/
- Artem Shelamanov, Jun 30, 2024. Tech Stack For Production-Ready LLM Applications In 2024, https://python.plainenglish.io/tech-stack-for-production-ready-llm-applications-in-2024-5eb14105d1b4
- David Spuler, March 2024, AI Tech Stack, in Generative AI in C++, https://www.aussieai.com/book/ch5-ai-tech-stack
- Cobus Greyling, Sep 2024, An AI Agent Architecture & Framework Is Emerging, https://cobusgreyling.medium.com/an-ai-agent-architecture-framework-is-emerging-addae3804f23
- Brandon Royal, Sam Stoelinga, 2024, Scaling and Optimizing Your LLM Pipeline for End-to-End Efficiency, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62006/
- Michael Nuñez, September 25, 2024, AI for all: Meta’s ‘Llama Stack’ promises to simplify enterprise adoption, https://venturebeat.com/ai/ai-for-all-meta-llama-stack-promises-to-simplify-enterprise-ai-adoption/
- Matt Marshall, October 24, 2024, The enterprise verdict on AI models: Why open source will win, https://venturebeat.com/ai/the-enterprise-verdict-on-ai-models-why-open-source-will-win/
- Letta, November 14, 2024, The AI agents stack, https://www.letta.com/blog/ai-agents-stack
- Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
- Meta, July 2024, RFC-0001 - Llama Stack #6, https://github.com/meta-llama/llama-toolchain/issues/6 (Meta's request for comment on its "Llama stack" for AI.)
- Tiernan Ray, Dec. 3, 2024, Enterprises are struggling with what to do with Gen AI, say venture capitalists. Despite some uncertainty, enterprise investments in applications soared eight-fold in 2024, with spending on AI-generated code leading the way. https://www.zdnet.com/article/enterprises-are-struggling-with-what-to-do-with-gen-ai-say-venture-capitalists/ (Growing usage but some confusion. Dominant use cases are coding, support chatbots, enterprise search, and meeting summaries.)
- Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
More AI Research
Read more about: