Aussie AI
Inference Frameworks
-
Last Updated 18 November, 2024
-
by David Spuler, Ph.D.
Inference frameworks are software platforms that load a model and execute it in response to user requests. Many inference frameworks also provide training and fine-tuning capabilities, but not all do. Many frameworks are open source, while others remain proprietary, and competition in this space is intense.
There is considerable overlap between the concept of a framework and that of a "deep learning compiler". There is also overlap with companies offering "AI cloud hosting" services, including both new startups and the major cloud providers (e.g., Amazon AWS, Microsoft Azure, and Google GCP), whose offerings typically include both training and inference features.
Software frameworks are only one part of the AI tech stack. Read more about inference optimization, training optimization, hardware accelerators, ML compilers, and our list of common and obscure AI optimization techniques.
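At its simplest, an inference framework exposes an API that accepts a prompt, runs the model, and returns the generated output. The sketch below shows this pattern using Hugging Face Transformers; the model name "gpt2", the generation settings, and the handle_request wrapper are illustrative assumptions rather than part of any particular serving stack.

```python
# Minimal sketch: answering one inference request with Hugging Face Transformers.
# Assumes the "transformers" package is installed; "gpt2" is just a small demo model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def handle_request(prompt: str) -> str:
    # The framework handles tokenization, model execution, and decoding.
    outputs = generator(prompt, max_new_tokens=50)
    return outputs[0]["generated_text"]

if __name__ == "__main__":
    print(handle_request("Inference frameworks are"))
```

Production serving frameworks such as vLLM, TGI, or TensorRT-LLM wrap this same basic loop with request batching, KV caching, and GPU scheduling so that many concurrent requests can be served efficiently.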
List of Machine Learning Frameworks
Some of the many frameworks include:
- TensorFlow, open-sourced by Google.
- PyTorch
- Torch
- MXNet
- Hugging Face Transformers
- LangChain
- GGML
- llama.cpp
- LLVM
- Caffe and Caffe2
- Theano
- RNN
- Keras
- Microsoft CNTK (Cognitive Toolkit)
- Amazon ML
- Google Cloud AutoML
- Microsoft Azure (various)
- scikit-learn
Features of ML Frameworks
Some of the desirable features include:
- GPU and hardware acceleration support
- Training optimizations
- Quantization (see the sketch after this list)
- Pruning
- Kernel operator fusion
- Server hosting support (i.e. deployment to run your model as a website backend service)
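As a concrete illustration of one of these features, the sketch below applies post-training dynamic quantization to a toy model using PyTorch. The choice of PyTorch, the toy layer sizes, and the variable names are illustrative assumptions; the feature list above is framework-agnostic.

```python
# Minimal sketch of post-training dynamic quantization (weights stored as 8-bit
# integers, activations quantized on the fly at inference time).
import torch
import torch.nn as nn

# Toy model standing in for a real network (layer sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Quantize only the Linear layers' weights to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```

Static quantization, pruning, and kernel operator fusion follow a similar pattern: the framework rewrites or replaces parts of the model graph so that inference runs faster or uses less memory.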
Survey Papers on ML Software Frameworks
Papers that review or survey software frameworks:
- G Menghani, 2023, Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233
- Saba Amiri, Sara Salimzadeh, A.S.Z. Belloum, 2019, A Survey of Scalable Deep Learning Frameworks, 2019 15th International Conference on eScience (eScience), https://ieeexplore.ieee.org/document/9041689, PDF: https://pure.uva.nl/ws/files/58721994/09041689.pdf (Short survey paper from 2019.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949, PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- M. M. Yapici, N. Topaloğlu, 2021, Performance comparison of deep learning frameworks, Computers and Informatics, https://dergipark.org.tr/en/pub/ci/issue/60236/769457, PDF: https://dergipark.org.tr/en/download/article-file/1201877 (Examines Torch, Theano, Caffe, Caffe2, MXNet, Keras, TensorFlow, and CNTK frameworks in terms of training speed.)
General Research on ML Software Frameworks
Research papers about general issues or specific frameworks:
- F Mince, D Dinh, J Kgomo, N Thompson, S Hooker, 2023, The Grand Illusion: The Myth of Software Portability and Implications for ML Progress, arXiv preprint arXiv:2309.07181, https://arxiv.org/pdf/2309.07181.pdf (Examines the ML software frameworks TensorFlow, PyTorch, and JAX, and their portability across hardware.)
- H Guan, Y Xiao, J Li, Y Liu, G Bai, May 2023, A comprehensive study of real-world bugs in machine learning model optimization, 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), https://ieeexplore.ieee.org/document/10172690, PDF: https://yepangliu.github.io/files/ICSE2023-MOB.pdf, PDF: https://baigd.github.io/files/ICSE23-MOB.pdf (Frameworks can have bugs? Who knew?)
- N Mungoli, Apr 2023, Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency, arXiv preprint arXiv:2304.13738, https://arxiv.org/abs/2304.13738 (Extending frameworks for distributed AI.)
- Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, Dhableswar K. DK Panda, 2019, Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters, 2019 IEEE International Conference on Cluster Computing (CLUSTER), https://ieeexplore.ieee.org/abstract/document/8891042, PDF Slides: http://nbcl.cse.ohio-state.edu/static/media/talks/slide/Arpan_booth_talk_2.pdf
- Marc-André Zöller, Marco F. Huber, Jan 2021, Benchmark and Survey of Automated Machine Learning Frameworks, https://arxiv.org/abs/1904.12054
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- MLC team. 2023. MLC-LLM. https://github.com/mlc-ai/mlc-llm
- tinygrad. 2023. Tinygrad. https://github.com/tinygrad/tinygrad
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Vince Lam, Mar 12, 2024, 50+ Open-Source Options for Running LLMs Locally, https://medium.com/thedeephub/50-open-source-options-for-running-llms-locally-db1ec6f5a54f
- Jason Perlow, Aug. 6, 2024, How to run dozens of AI models on your Mac or PC - no third-party cloud needed, https://www.zdnet.com/article/how-to-run-dozens-of-ai-models-on-your-mac-or-pc-no-third-party-cloud-needed/
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- The SGLang Team, Jul 25, 2024, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), https://lmsys.org/blog/2024-07-25-sglang-llama3/
- Anna Popovych, Sofiya Merenych, February 16, 2024, Top AI Frameworks in 2024: Comparison of Artificial Intelligence Frameworks, https://clockwise.software/blog/artificial-intelligence-framework/
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- ZML, Sep 2024, ZML: High performance AI inference stack. Built for production. https://docs.zml.ai/ https://github.com/zml/zml?tab=readme-ov-file
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Sebastian Petrus, Sep 4, 2024, Top 10 RAG Frameworks Github Repos 2024, https://sebastian-petrus.medium.com/top-10-rag-frameworks-github-repos-2024-12b2a81f4a49
- Rick Zhou, Larme Zhao, Bo Jiang, and Sean Sheng, June 5, 2024, Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI, https://www.bentoml.com/blog/benchmarking-llm-inference-backends
More AI Research
Read more about: