Aussie AI
Open Source Inference Engines for LLMs
-
Last Updated 28 November, 2024
-
by David Spuler, Ph.D.
AI inference for an LLM requires an engine. There are many open-source LLMs available online, notably from Meta and Mistral, but they need an engine to run. Fortunately, there are multiple full open-source implementations of Transformer inference engines that can run an LLM.
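As a concrete illustration, here is a minimal inference sketch using the Hugging Face Transformers engine with a PyTorch backend; the model name is just an example, and any compatible open-source checkpoint could be substituted.

    # Minimal LLM inference sketch: Hugging Face Transformers on a PyTorch backend.
    # The model name is an example; any compatible open checkpoint works.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("The capital of Australia is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))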
List of Open Source Inference Frameworks
Many examples are listed below, and it's quite an overwhelming group. The most famous are PyTorch and TensorFlow, but there are many others offering a full stack. There are also several newer inference-specific engines, fully coded but with little or no training capability. Some of these frameworks are ML compilers (e.g., XLA and MLIR). Several others have gained a reputation for running RAG architectures, such as LangChain and Ollama (see the sketch below).
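For instance, Ollama runs models behind a local REST server. Here is a minimal query sketch, assuming the server is running on its default port and a model (here "llama3", as an example) has already been pulled:

    # Minimal sketch: query a local Ollama server (default port 11434).
    # Assumes the "llama3" model has already been pulled; names are examples.
    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3", "prompt": "What is an LLM?",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])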
Most of these frameworks are offered under permissive, non-copyleft licenses that allow commercial usage (review each package for its license details).
Here's the list so far (a short usage sketch of one of these engines follows the list):
- PyTorch
- TensorFlow
- LangChain
- TensorRT (NVIDIA)
- ROCm (AMD)
- GGML
- Llama.cpp
- MLIR (LLVM)
- Ollama
- LLMFarm
- Llama2.c
- OpenVINO (Intel)
- Transformers (Hugging Face)
- FasterTransformer (NVIDIA)
- vLLM
- TGI (Text Generation Inference) (Hugging Face)
- MXNet
- CTranslate2
- DeepSpeed/DeepSpeed-MII
- OpenLLM
- RayServe
- tinygrad
- MLX (Apple)
- TinyChatEngine
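As an example of the inference-specific engines above, here is a minimal sketch of offline batch inference with vLLM; the model name is illustrative and any vLLM-compatible checkpoint could be substituted.

    # Minimal offline batch inference sketch with vLLM.
    # The model name is illustrative; substitute any vLLM-compatible checkpoint.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")  # example model name
    params = SamplingParams(temperature=0.8, max_tokens=50)
    outputs = llm.generate(["Explain KV caching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)

Engines like vLLM focus on serving throughput (e.g., request batching and KV cache management) rather than training, which is why they appear here alongside the general-purpose frameworks.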
ML compilers (graph compilers) that are open source (see the export sketch after this list):
- ONNX (Industry coalition)
- TVM (Apache)
- MLC LLM
- XLA (TensorFlow)
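For the graph compilers, a common entry point is exporting a PyTorch model to the ONNX format, which an ONNX-compatible runtime or compiler can then optimize. A minimal sketch, with a stand-in layer and illustrative shapes:

    # Minimal sketch: export a PyTorch module to ONNX for a graph compiler/runtime.
    # The Linear layer is a stand-in for a real model; shapes are illustrative.
    import torch

    model = torch.nn.Linear(768, 768)
    dummy_input = torch.randn(1, 768)
    torch.onnx.export(model, dummy_input, "layer.onnx",
                      input_names=["hidden"], output_names=["out"])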
And some non-open source inference platforms:
- CUDA (NVIDIA)
- AICore (Google)
- MediaPipe (Google)
Research on Inference Frameworks
Industry articles. Online blog articles and industry press releases on inference frameworks:
- Max A. Cherney, March 26, 2024, Exclusive: Behind the plot to break Nvidia's grip on AI by targeting software, https://www.reuters.com/technology/behind-plot-break-nvidias-grip-ai-by-targeting-software-2024-03-25/ (About UXL, a new group including Intel's OneAPI and others.)
- Doug Eadline, October 5, 2023, How AMD May Get Across the CUDA Moat, HPC Wire, https://www.hpcwire.com/2023/10/05/how-amd-may-get-across-the-cuda-moat/
Research papers. Academic papers about inference frameworks, with evaluations or theoretical aspects:
- M. M. H. Shuvo, S. K. Islam, J. Cheng, Efficient acceleration of deep learning inference on resource-constrained edge devices: A review, Proceedings of the IEEE, Volume 111, Issue 1, January 2023, pp. 42-91, https://ieeexplore.ieee.org/abstract/document/9985008 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008
- H Dai, X Peng, X Shi, L He, Q Xiong, H Jin, 2022, Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment, Science China Information Sciences volume 65, Article number: 112103 (2022), https://link.springer.com/article/10.1007/s11432-020-3182-1, http://scis.scichina.com/en/2022/112103.pdf
- Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang, Sep 2023, Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations, https://arxiv.org/pdf/2309.08978.pdf
- Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- MLC team. 2023. MLC-LLM. https://github.com/mlc-ai/mlc-llm
- tinygrad. 2023. Tinygrad. https://github.com/tinygrad/tinygrad
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Myeonghwa Lee, Seonho An, Min-Soo Kim, 18 Jun 2024, PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers, https://arxiv.org/abs/2406.12430 Code: https://github.com/myeon9h/PlanRAG
- Mark Zuckerberg, July 23, 2024 Open Source AI Is the Path Forward https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/
- Vince Lam, Mar 12, 2024, 50+ Open-Source Options for Running LLMs Locally, https://medium.com/thedeephub/50-open-source-options-for-running-llms-locally-db1ec6f5a54f
- Jason Perlow, Aug. 6, 2024, How to run dozens of AI models on your Mac or PC - no third-party cloud needed, https://www.zdnet.com/article/how-to-run-dozens-of-ai-models-on-your-mac-or-pc-no-third-party-cloud-needed/
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- The SGLang Team, Jul 25, 2024, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), https://lmsys.org/blog/2024-07-25-sglang-llama3/
- Anna Popovych, Sofiya Merenych, February 16, 2024, Top AI Frameworks in 2024: Comparison of Artificial Intelligence Frameworks, https://clockwise.software/blog/artificial-intelligence-framework/
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Shrestha, Y.R., von Krogh, G. & Feuerriegel, S., 2023, Building open-source AI. Nat Comput Sci 3, 908–911 (2023). https://doi.org/10.1038/s43588-023-00540-0 https://www.nature.com/articles/s43588-023-00540-0
- Dennis Rall, Bernhard Bauer, Thomas Fraunholz, 8 Nov 2023, Towards Democratizing AI: A Comparative Analysis of AI as a Service Platforms and the Open Space for Machine Learning Approach, https://arxiv.org/abs/2311.04518
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- ZML, Sep 2024, ZML: High performance AI inference stack. Built for production. https://docs.zml.ai/ https://github.com/zml/zml?tab=readme-ov-file
- Mistral, Sep 2024, AI in abundance. Introducing a free API, improved pricing across the board, a new enterprise-grade Mistral Small, and free vision capabilities on le Chat. https://mistral.ai/news/september-24-release/
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Sebastian Petrus, Sep 4, 2024, Top 10 RAG Frameworks Github Repos 2024, https://sebastian-petrus.medium.com/top-10-rag-frameworks-github-repos-2024-12b2a81f4a49
- Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath, 31 Oct 2024, LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators, https://arxiv.org/abs/2411.00136
- Rick Zhou, Larme Zhao, Bo Jiang, and Sean Sheng, June 5, 2024, Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI, https://www.bentoml.com/blog/benchmarking-llm-inference-backends
- Sahar Mor, Nov 28, 2024, The Open-Source Toolkit for Building AI Agents. Curated frameworks, tools, and libraries every developer needs to build functional and efficient AI agents, https://www.aitidbits.ai/p/open-source-agents
Benchmarking papers. Various research papers with performance measurement and benchmarking of inference frameworks:
- Suresh G, Sep 25, 2023, 7 Frameworks for Serving LLMs, Medium, https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88 (Review of inference frameworks: vLLM, TGI, CTranslate2, DeepSpeed-MII, OpenLLM, RayServe, and MLC LLM.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- C Luo, X He, J Zhan, L Wang, W Gao, J Dai, 2020, Comparison and benchmarking of AI models and frameworks on mobile devices, https://arxiv.org/abs/2005.05085 (A somewhat dated evaluation from 2020.)
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey from January 2024 with many optimizations including this topic.)
- Pierrick Pochelu, 9 Oct 2022, Deep Learning Inference Frameworks Benchmark, https://arxiv.org/abs/2210.04323 (Benchmarking study in 2022 of various frameworks.)