Aussie AI
Open Source Inference Engines for LLMs
-
Last Updated 28 November, 2024
-
by David Spuler, Ph.D.
AI inference for an LLM requires an engine. There are many open-source LLMs available online, notably from Meta and Mistral, but they need an engine to run. Fortunately, there are multiple full open-source implementations of Transformer inference engines that can run an LLM.
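As a concrete illustration, here is a minimal inference sketch using the Hugging Face Transformers engine with a PyTorch backend; the model name is just an example, and any compatible open-source checkpoint could be substituted.

    # Minimal LLM inference sketch: Hugging Face Transformers on a PyTorch backend.
    # The model name is an example; any compatible open checkpoint works.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("The capital of Australia is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))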
List of Open Source Inference Frameworks
Many examples are listed below, and it's quite an overwhelming group. The most famous are PyTorch and TensorFlow, but there are many others offering a full stack. There are also several newer inference-specific engines, fully coded but with little or no training capability. Some of these frameworks are ML compilers (e.g., XLA and MLIR). Several others have gained a reputation for running RAG architectures, such as LangChain and Ollama (see the sketch below).
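For instance, Ollama runs models behind a local REST server. Here is a minimal query sketch, assuming the server is running on its default port and a model (here "llama3", as an example) has already been pulled:

    # Minimal sketch: query a local Ollama server (default port 11434).
    # Assumes the "llama3" model has already been pulled; names are examples.
    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3", "prompt": "What is an LLM?",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])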
Most of these frameworks are offered under permissive, non-copyleft licenses that allow commercial usage (review each package for its license details).
Here's the list so far (a short usage sketch of one of these engines follows the list):
- PyTorch
- TensorFlow
- LangChain
- TensorRT (NVIDIA)
- ROCm (AMD)
- GGML
- Llama.cpp
- MLIR (LLVM)
- Ollama
- LLMFarm
- Llama2.c
- OpenVINO (Intel)
- Transformers (Hugging Face)
- FasterTransformer (NVIDIA)
- vLLM
- TGI (Text Generation Inference) (Hugging Face)
- MXNet
- CTranslate2
- DeepSpeed/DeepSpeed-MII
- OpenLLM
- RayServe
- tinygrad
- MLX (Apple)
- TinyChatEngine
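As an example of the inference-specific engines above, here is a minimal sketch of offline batch inference with vLLM; the model name is illustrative and any vLLM-compatible checkpoint could be substituted.

    # Minimal offline batch inference sketch with vLLM.
    # The model name is illustrative; substitute any vLLM-compatible checkpoint.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")  # example model name
    params = SamplingParams(temperature=0.8, max_tokens=50)
    outputs = llm.generate(["Explain KV caching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)

Engines like vLLM focus on serving throughput (e.g., request batching and KV cache management) rather than training, which is why they appear here alongside the general-purpose frameworks.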
ML compilers (graph compilers) that are open source (see the export sketch after this list):
- ONNX (Industry coalition)
- TVM (Apache)
- MLC LLM
- XLA (TensorFlow)
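For the graph compilers, a common entry point is exporting a PyTorch model to the ONNX format, which an ONNX-compatible runtime or compiler can then optimize. A minimal sketch, with a stand-in layer and illustrative shapes:

    # Minimal sketch: export a PyTorch module to ONNX for a graph compiler/runtime.
    # The Linear layer is a stand-in for a real model; shapes are illustrative.
    import torch

    model = torch.nn.Linear(768, 768)
    dummy_input = torch.randn(1, 768)
    torch.onnx.export(model, dummy_input, "layer.onnx",
                      input_names=["hidden"], output_names=["out"])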
And some non-open source inference platforms:
- CUDA (NVIDIA)
- AICore (Google)
- MediaPipe (Google)
Research on Inference Frameworks
Industry articles. Online blog articles and industry press releases on inference frameworks:
- Max A. Cherney, March 26, 2024, Exclusive: Behind the plot to break Nvidia's grip on AI by targeting software, https://www.reuters.com/technology/behind-plot-break-nvidias-grip-ai-by-targeting-software-2024-03-25/ (About UXL, a new group including Intel's OneAPI and others.)
- Doug Eadline, October 5, 2023, How AMD May Get Across the CUDA Moat, HPC Wire, https://www.hpcwire.com/2023/10/05/how-amd-may-get-across-the-cuda-moat/
Research papers. Academic papers about inference frameworks, with evaluations or theoretical aspects:
- M. M. H. Shuvo, S. K. Islam, J. Cheng, Efficient acceleration of deep learning inference on resource-constrained edge devices: A review, Proceedings of the IEEE, Volume 111, Issue 1, January 2023, pp. 42-91, https://ieeexplore.ieee.org/abstract/document/9985008 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008
- H Dai, X Peng, X Shi, L He, Q Xiong, H Jin, 2022, Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment, Science China Information Sciences volume 65, Article number: 112103 (2022), https://link.springer.com/article/10.1007/s11432-020-3182-1, http://scis.scichina.com/en/2022/112103.pdf
- Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang, Sep 2023, Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations, https://arxiv.org/pdf/2309.08978.pdf
- Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- MLC team. 2023. MLC-LLM. https://github.com/mlc-ai/mlc-llm
- tinygrad. 2023. Tinygrad. https://github.com/tinygrad/tinygrad
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Myeonghwa Lee, Seonho An, Min-Soo Kim, 18 Jun 2024, PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers, https://arxiv.org/abs/2406.12430 Code: https://github.com/myeon9h/PlanRAG
- Mark Zuckerberg, July 23, 2024 Open Source AI Is the Path Forward https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/
- Vince Lam, Mar 12, 2024, 50+ Open-Source Options for Running LLMs Locally, https://medium.com/thedeephub/50-open-source-options-for-running-llms-locally-db1ec6f5a54f
- Jason Perlow, Aug. 6, 2024, How to run dozens of AI models on your Mac or PC - no third-party cloud needed, https://www.zdnet.com/article/how-to-run-dozens-of-ai-models-on-your-mac-or-pc-no-third-party-cloud-needed/
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- The SGLang Team, Jul 25, 2024, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), https://lmsys.org/blog/2024-07-25-sglang-llama3/
- Anna Popovych, Sofiya Merenych, February 16, 2024, Top AI Frameworks in 2024: Comparison of Artificial Intelligence Frameworks, https://clockwise.software/blog/artificial-intelligence-framework/
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Shrestha, Y.R., von Krogh, G. & Feuerriegel, S., 2023, Building open-source AI. Nat Comput Sci 3, 908–911 (2023). https://doi.org/10.1038/s43588-023-00540-0 https://www.nature.com/articles/s43588-023-00540-0
- Dennis Rall, Bernhard Bauer, Thomas Fraunholz, 8 Nov 2023, Towards Democratizing AI: A Comparative Analysis of AI as a Service Platforms and the Open Space for Machine Learning Approach, https://arxiv.org/abs/2311.04518
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- ZML, Sep 2024, ZML: High performance AI inference stack. Built for production. https://docs.zml.ai/ https://github.com/zml/zml?tab=readme-ov-file
- Mistral, Sep 2024, AI in abundance. Introducing a free API, improved pricing across the board, a new enterprise-grade Mistral Small, and free vision capabilities on le Chat. https://mistral.ai/news/september-24-release/
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Sebastian Petrus, Sep 4, 2024, Top 10 RAG Frameworks Github Repos 2024, https://sebastian-petrus.medium.com/top-10-rag-frameworks-github-repos-2024-12b2a81f4a49
- Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath, 31 Oct 2024, LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators, https://arxiv.org/abs/2411.00136
- Rick Zhou, Larme Zhao, Bo Jiang, and Sean Sheng, June 5, 2024, Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI, https://www.bentoml.com/blog/benchmarking-llm-inference-backends
- Sahar Mor, Nov 28, 2024, The Open-Source Toolkit for Building AI Agents. Curated frameworks, tools, and libraries every developer needs to build functional and efficient AI agents, https://www.aitidbits.ai/p/open-source-agents
Benchmarking papers. Various research papers with performance measurement and benchmarking of inference frameworks:
- Suresh G, Sep 25, 2023, 7 Frameworks for Serving LLMs, Medium, https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88 (Review of inference frameworks: vLLM, TGI, CTranslate2, DeepSpeed-MII, OpenLLM, RayServe, and MLC LLM.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- C Luo, X He, J Zhan, L Wang, W Gao, J Dai, 2020, Comparison and benchmarking of AI models and frameworks on mobile devices, https://arxiv.org/abs/2005.05085 (A somewhat dated evaluation from 2020.)
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey from January 2024 with many optimizations including this topic.)
- Pierrick Pochelu, 9 Oct 2022, Deep Learning Inference Frameworks Benchmark, https://arxiv.org/abs/2210.04323 (Benchmarking study in 2022 of various frameworks.)