Aussie AI

Hardware Acceleration

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

It all started with the "math coprocessor" chips back in the 1980s. The modern-day version is the Graphics Processing Unit (GPU). As the name suggests, GPUs were originally intended to handle graphics calculations, and are certainly still used for the floating point calculations in gaming boxes that display the amazingly fast 3D first-person views found in games such as Fortnite and Minecraft. However, the role of GPUs has broadened into that of a general mathematical calculation engine, which has found extensive use in two other massive trends: cryptographic calculations (e.g. Bitcoin mining) and the matrix calculations inherent to neural networks and Transformer engines for AI. Such chips are more accurately called "General Purpose GPUs" or GPGPUs, but lately they are all simply called GPUs.

Hardware acceleration is by far the most successful method of optimization for AI engines to date. As the number of floating point operations used by AI models has grown into the billions, the fastest GPU chips have kept pace through numerous improvements in hardware acceleration techniques. The primary advancements have included raw on-chip speed increases to reduce response time, increased on-chip memory size and bandwidth, and the use of parallelization and pipelining methods for improved throughput.

Types of AI Hardware Acceleration

There are various types of hardware accelerators that can make a model run faster:

  • Graphics Processing Unit (GPU)
  • Application-Specific Integrated Circuit (ASIC)
  • Field-Programmable Gate Array (FPGA)
  • Central Processing Unit (CPU)
  • Neural Processing Unit (NPU)

Specific hardware acceleration architectural techniques include:

  • General Purpose GPUs (GPGPUs)
  • Caches (on-chip memory caching)
  • Multi-core CPUs
  • Multi-threaded CPUs (see the dot product sketch after this list)
  • Single-Instruction Multiple Data (SIMD)
  • Non-Uniform Memory Access (NUMA)
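
To make the multi-core and multi-threaded CPU items above more concrete, here is a minimal C++ sketch that splits a dot product across hardware threads using std::thread. The vector size, the float data type, and the even partitioning of work across threads are illustrative assumptions only, not tuned recommendations for any particular chip.

    // Minimal sketch: parallel dot product across CPU cores using std::thread.
    // Thread count comes from the hardware; vector sizes are illustrative only.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const size_t n = 1000000;
        std::vector<float> a(n, 1.0f), b(n, 2.0f);

        unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(num_threads, 0.0);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t]() {
                // Each thread handles one contiguous slice of the vectors.
                size_t begin = n * t / num_threads;
                size_t end = n * (t + 1) / num_threads;
                double sum = 0.0;
                for (size_t i = begin; i < end; ++i) {
                    sum += static_cast<double>(a[i]) * b[i];
                }
                partial[t] = sum;  // one write per thread at the end
            });
        }
        for (auto& w : workers) w.join();

        double dot = std::accumulate(partial.begin(), partial.end(), 0.0);
        std::printf("dot = %.1f\n", dot);  // expected 2000000.0
        return 0;
    }

Each thread computes a local sum over its own contiguous slice and writes a single result at the end, keeping the threads independent until the final reduction.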

Software Integrations to Hardware Accelerators

Software interfaces to hardware acceleration include:

  • BLAS (Basic Linear Algebra Subprograms; see the GEMM sketch after this list)
  • CUDA (NVIDIA's proprietary Compute Unified Device Architecture)
  • AVX (Advanced Vector Extensions; also AVX2, AVX-512 and AVX10)
  • OpenCL
  • cuBLAS (NVIDIA's BLAS implementation for CUDA GPUs)
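
To make the BLAS entry above concrete, here is a small sketch of a call to the standard CBLAS SGEMM routine (single-precision general matrix multiply), which is the same GEMM operation that cuBLAS exposes for NVIDIA GPUs. It assumes a CBLAS implementation such as OpenBLAS is installed and linked; the tiny matrix sizes are purely illustrative.

    // Minimal sketch: C = alpha*A*B + beta*C via the standard CBLAS interface.
    // Assumes a CBLAS library such as OpenBLAS is installed (link with e.g. -lopenblas).
    #include <cblas.h>
    #include <cstdio>

    int main() {
        const int M = 2, K = 3, N = 2;
        // Row-major storage: A is MxK, B is KxN, C is MxN.
        float A[M * K] = {1, 2, 3,
                          4, 5, 6};
        float B[K * N] = {7,  8,
                          9, 10,
                         11, 12};
        float C[M * N] = {0, 0,
                          0, 0};

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,      // alpha
                    A, K,      // leading dimension of row-major A
                    B, N,      // leading dimension of row-major B
                    0.0f,      // beta
                    C, N);     // leading dimension of row-major C

        // Expected result: [[58, 64], [139, 154]]
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                std::printf("%6.1f", C[i * N + j]);
            }
            std::printf("\n");
        }
        return 0;
    }

The same GEMM call pattern sits underneath most of the matrix multiplications in each Transformer layer, whether routed to a CPU BLAS library or to cuBLAS on a GPU.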

Software Strategies for Hardware Acceleration

General software acceleration strategies for maximizing the benefits from hardware-accelerated computation:

  • Pipelining. This refers to keeping the GPU busy with a continuous stream of data to chomp through, avoiding "bubbles" in the pipeline, which are periods when the GPU has nothing to do.
  • Partitioning and dataflow management. This is the software technique of organizing data so it is ready to send to the GPU quickly, usually in contiguous memory.
  • Cache management. Judicious use of the various levels of cache memory can help with pipelining efficiency (see the tiled matrix multiply sketch after this list).
  • Parallelizing. It's all parallel, isn't it? This point refers to writing the overarching algorithms in a parallelism-friendly manner, ensuring that no computation is left waiting on another.
  • Deep learning compilers. The full software stack that compiles model computations down to optimized kernels for the underlying hardware.
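
As a sketch of the partitioning and cache-management strategies above, here is a simple loop-tiled ("cache-blocked") matrix multiply in C++. The data is kept in contiguous row-major arrays and processed in small square tiles so the working set stays in cache; the tile size of 32 is an illustrative assumption that would normally be tuned per cache level and platform.

    // Minimal sketch: cache-blocked (tiled) matrix multiply on contiguous row-major data.
    // The tile size is an illustrative assumption, normally tuned per cache level.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    void matmul_tiled(const std::vector<float>& A,  // n x n, row-major
                      const std::vector<float>& B,  // n x n, row-major
                      std::vector<float>& C,        // n x n, row-major, pre-zeroed
                      int n, int tile = 32) {
        for (int ii = 0; ii < n; ii += tile) {
            for (int kk = 0; kk < n; kk += tile) {
                for (int jj = 0; jj < n; jj += tile) {
                    // Process one small block at a time so it stays resident in cache.
                    int i_end = std::min(ii + tile, n);
                    int k_end = std::min(kk + tile, n);
                    int j_end = std::min(jj + tile, n);
                    for (int i = ii; i < i_end; ++i) {
                        for (int k = kk; k < k_end; ++k) {
                            float a = A[i * n + k];
                            for (int j = jj; j < j_end; ++j) {
                                C[i * n + j] += a * B[k * n + j];  // contiguous inner loop
                            }
                        }
                    }
                }
            }
        }
    }

    int main() {
        const int n = 64;
        std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
        matmul_tiled(A, B, C, n);
        std::printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * n);
        return 0;
    }

The same blocking idea, applied at the level of GPU shared memory and registers rather than CPU caches, is how high-performance GEMM kernels keep the arithmetic units fed.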

Other software acceleration issues that are closely related to hardware efficiency:

For many other optimization strategies that are orthogonal to hardware acceleration, and can be used to further optimize a model, see the complete list of AI acceleration techniques.

Survey Papers on AI Hardware Accelerators

Papers that review hardware acceleration frameworks:

AI Announcements from Hardware Vendors

Hardware-Acceleration Research

Various papers on hardware acceleration, out of thousands, include:

GPU Research

Research papers on various GPU issues:

Multi-GPU Research

Research papers on various multi-GPU inference and scheduling issues:

GPU Software Platforms

The main GPU software acceleration frameworks include:

  • CUDA (NVIDIA)
  • ROCm (AMD)
  • Triton (open source, originally from OpenAI)
  • oneAPI (Intel)
  • Vulkan
  • SYCL

CPU Execution of AI Workloads

Although GPUs are the mainstay of LLM execution, there is increasing focus on using CPUs for inference. This arises from the need to run on-device inference on AI phones and AI PCs, some of which have an NPU, while others may have only limited SIMD capabilities such as x86 AVX intrinsics.
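
As a small illustration of the SIMD capability mentioned above, here is a sketch of a dot product written with x86 AVX intrinsics, processing 8 floats per 256-bit register with a scalar loop for the leftover tail. It assumes an x86-64 CPU with AVX support and a compiler flag such as -mavx; the vector length and data values are illustrative only.

    // Minimal sketch: dot product using x86 AVX intrinsics (8 floats per 256-bit register).
    // Assumes an x86-64 CPU with AVX; compile with e.g. -mavx (GCC/Clang).
    #include <immintrin.h>
    #include <cstdio>
    #include <vector>

    float dot_avx(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        // Vectorized main loop: 8 multiplies and 8 adds per iteration.
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        // Horizontal reduction of the 8 partial sums in the accumulator register.
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k) sum += lanes[k];
        // Scalar tail loop for any leftover elements.
        for (; i < n; ++i) sum += a[i] * b[i];
        return sum;
    }

    int main() {
        const size_t n = 1003;  // deliberately not a multiple of 8
        std::vector<float> a(n, 1.0f), b(n, 2.0f);
        std::printf("dot = %.1f (expected %.1f)\n", dot_avx(a.data(), b.data(), n), 2.0f * n);
        return 0;
    }

Wider variants go further: AVX-512 doubles the register width to 512 bits, and fused multiply-add (FMA) instructions combine the multiply and add into a single operation.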

Research on CPU execution of LLMs:

Neural Processing Unit (NPU)

An NPU is a hardware component designed specifically for AI workloads. The NPU is typically integrated into the CPU or provided as an add-on hardware component, but is inherently much less capable than a full GPU. Nevertheless, the NPU is the basis for hardware acceleration on AI phones and also on some AI PCs.

More AI Research

Read more about: