Aussie AI
Hardware Acceleration
Last Updated 7 December, 2024
by David Spuler, Ph.D.
It all started with the "math coprocessor" chips back in the 1980s. The modern-day version is the Graphics Processing Unit (GPU). As the name suggests, GPUs were originally intended to handle graphics calculations, and are certainly still used for floating point calculations in gaming boxes to display the amazingly fast 3D first-person views that are found in games such as Fortnite and Minecraft. However, the role of GPUs has broadened to become that of a general mathematical calculation engine, which has found extensive use in two other massive trends: cryptographic calculations (e.g. Bitcoin mining), and the matrix calculations inherent to neural networks and Transformer engines for AI. Such chips are more accurately called "General Purpose GPUs" or GPGPUs, but lately they are all simply called GPUs.
Hardware acceleration is by far the most successful method of optimization for AI engines to date. As the number of floating point operations used by AI models has grown into the billions, the fastest GPU chips have kept up with numerous improvements to hardware acceleration algorithms. The primary advancements have included raw on-chip speed increases to reduce response time, increased on-chip memory size and performance, and the use of parallelization and pipelining methods for improved throughput.
Types of AI Hardware Acceleration
There are various types of hardware acceleration that can make a model run faster.
- Graphics Processing Unit (GPU)
- Application-Specific Integrated Circuit (ASIC)
- Field-Programmable Gate Array (FPGA)
- Central Processing Unit (CPU)
- Neural Processing Unit (NPU)
Specific hardware acceleration architectural techniques include:
- General Purpose GPUs (GPGPUs)
- Caches (on-chip memory caching)
- Multi-core CPUs
- Multi-threaded CPUs
- Single-Instruction Multiple Data (SIMD)
- Non-Uniform Memory Access (NUMA)
Software Integrations to Hardware Accelerators
Software interfaces to hardware acceleration include:
- BLAS (Basic Linear Algebra Subprograms)
- CUDA (NVIDIA's proprietary Compute Unified Device Architecture)
- AVX (Advanced Vector Extensions; also AVX2, AVX-512 and AVX10)
- OpenCL
- cuBLAS (NVIDIA GPU BLAS version in CUDA)
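Much of what these interfaces accelerate comes down to one primitive: general matrix multiplication (GEMM), which computes C = alpha * A * B + beta * C. Tuned libraries expose it through calls such as `cblas_sgemm` (CBLAS) and `cublasSgemm` (cuBLAS). The following is only a minimal, unoptimized reference sketch of the arithmetic, assuming row-major storage. The function name `gemm_reference` is hypothetical and is not part of any BLAS library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (unoptimized) row-major GEMM: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, each stored contiguously in row-major order.
// gemm_reference is a hypothetical name used for illustration only.
void gemm_reference(std::size_t m, std::size_t n, std::size_t k,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C)
{
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            // Dot product of row i of A with column j of B.
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```

A hardware-accelerated library performs the same arithmetic, but spreads the loops across SIMD lanes, CPU cores, or GPU threads. One practical caveat: cuBLAS follows the Fortran BLAS convention of column-major storage, so row-major callers usually need to swap or transpose operands.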
Software Strategies for Hardware Acceleration
General software acceleration strategies for maximizing the benefits from hardware-accelerated computation:
- Pipelining. This refers to keeping the GPU busy with a stream of data to chomp through, and avoiding "bubbles" in the pipeline, which is time when the GPU has nothing to do.
- Partitioning and dataflow management. This is the software technique of organizing data so it's ready to send quickly to the GPU, usually in contiguous memory.
- Cache management. Judicious use of the various levels of cache memory can help with pipelining efficiency.
- Parallelizing. It's all parallel, isn't it? This point refers to writing the overarching algorithms in a parallelism-friendly manner, ensuring that nothing waits for nobody.
- Deep learning compilers. The full stack of software acceleration to maximize hardware.
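The partitioning and cache management points above can be illustrated on a CPU with loop tiling (cache blocking). This is a minimal sketch, assuming square row-major matrices. The name `matmul_tiled` and the `TILE` value of 32 are illustrative assumptions, not tuned settings or any library's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative tile size (an assumption); real code would tune this per cache level.
constexpr std::size_t TILE = 32;

// Cache-blocked (tiled) row-major matrix multiply: C += A * B, all n x n.
// The caller zero-initializes C if a plain product is wanted.
void matmul_tiled(std::size_t n, const std::vector<float>& A,
                  const std::vector<float>& B, std::vector<float>& C)
{
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    // Outer loops walk over tiles, so each small block of A, B and C
    // is reused many times while it is still resident in cache.
    for (std::size_t ii = 0; ii < n; ii += TILE) {
        for (std::size_t kk = 0; kk < n; kk += TILE) {
            for (std::size_t jj = 0; jj < n; jj += TILE) {
                const std::size_t i_end = std::min(ii + TILE, n);
                const std::size_t k_end = std::min(kk + TILE, n);
                const std::size_t j_end = std::min(jj + TILE, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a = A[i * n + k];
                        // Innermost loop reads B and writes C contiguously.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The innermost loop touches contiguous memory, which is also the access pattern that vectorizing compilers and GPU kernels prefer. GPU kernels commonly apply the same tiling idea using on-chip shared memory in place of CPU cache.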
Other software acceleration issues that are closely related to hardware efficiency:
- Model compression. Reducing total data size by making the whole model smaller (e.g. via quantization, pruning, and other model compression strategies).
- Lower-precision data. Using smaller byte sizes for data (e.g. quantization, end-to-end integer-only computations).
- Dataflow reduction. Reducing the amount of data being copied around, such as via caching and data reuse.
- Memory reduction and memory management algorithms. There's a lot of data being pumped through memory; see memory optimizations.
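As an illustration of the lower-precision idea, here is a minimal sketch of symmetric per-tensor INT8 quantization with an integer-only dot product. It shows one common scheme among many, and all names (`QuantizedTensor`, `quantize_int8`, `dot_int8`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-tensor INT8 quantization; names are hypothetical.
struct QuantizedTensor {
    std::vector<std::int8_t> data;  // quantized values in [-127, 127]
    float scale;                    // real value is approximately data[i] * scale
};

QuantizedTensor quantize_int8(const std::vector<float>& x)
{
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;  // avoid divide-by-zero
    q.data.reserve(x.size());
    for (float v : x) {
        long r = std::lround(v / q.scale);        // round to nearest integer
        r = std::max(-127L, std::min(127L, r));   // clamp to the INT8 range used
        q.data.push_back(static_cast<std::int8_t>(r));
    }
    return q;
}

// Dot product using integer multiply-accumulate; only the final
// result is scaled back to floating point.
float dot_int8(const QuantizedTensor& a, const QuantizedTensor& b)
{
    assert(a.data.size() == b.data.size());
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        acc += static_cast<std::int32_t>(a.data[i]) *
               static_cast<std::int32_t>(b.data[i]);
    }
    return static_cast<float>(acc) * a.scale * b.scale;
}
```

Each value shrinks from 4 bytes to 1, so four times as many values fit in the same cache line or memory transfer, at the cost of a small rounding error in the result.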
For many other optimization strategies that are orthogonal to hardware acceleration, and can be used to further optimize a model, see the complete list of AI acceleration techniques.
Survey Papers on AI Hardware Accelerators
Papers that review hardware acceleration frameworks:
- Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, Jeremy Kepner, AI and ML Accelerator Survey and Trends, Oct 2022, https://arxiv.org/abs/2210.04055
- C Åleskog, H Grahn, A Borg, 2022, Recent Developments in Low-Power AI Accelerators: A Survey, Algorithms 2022, 15, 419. https://doi.org/10.3390/a15110419, https://www.mdpi.com/1999-4893/15/11/419, PDF: https://www.mdpi.com/1999-4893/15/11/419/pdf
- Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J., AI Accelerator Survey and Trends. In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC), Virtual, 19–23 September 2022; pp. 1–9. http://dx.doi.org/10.1109/HPEC49654.2021.9622867, https://ieeexplore.ieee.org/document/9622867
- Talib, M.A.; Majzoub, S.; Nasir, Q.; Jamal, D., 2021, A systematic literature review on hardware implementation of artificial intelligence algorithms. J. Supercomput. 2021, 77, 1897–1938. http://dx.doi.org/10.1007/s11227-020-03325-8 https://link.springer.com/article/10.1007/s11227-020-03325-8
- Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J., 2019, Survey and Benchmarking of Machine Learning Accelerators. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 24–26 September 2019; pp. 1–9. http://dx.doi.org/10.1109/HPEC.2019.8916327, https://arxiv.org/abs/1908.11348
- Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J., 2020, Survey of Machine Learning Accelerators. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), Greater Boston Area, MA, USA, 22–24 September 2020; pp. 1–12, http://dx.doi.org/10.1109/HPEC43674.2020.9286149, https://arxiv.org/abs/2009.00993
- Li, W.; Liewig, M., A survey of AI accelerators for edge environment. In Proceedings of the World Conference on Information Systems and Technologies, Budva, Montenegro, 7–10 April 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–44. https://link.springer.com/chapter/10.1007/978-3-030-45691-7_4
- Lin, W.; Adetomi, A.; Arslan, T., Low-Power Ultra-Small Edge AI Accelerators for Image Recognition with Convolution Neural Networks: Analysis and Future Directions. Electronics 2021, 10, 2048. http://dx.doi.org/10.3390/electronics10172048, https://www.mdpi.com/2079-9292/10/17/2048
- Seo, J.s.; Saikia, J.; Meng, J.; He, W.; Suh, H.s.; Anupreetham; Liao, Y.; Hasssan, A.; Yeo, I., Digital Versus Analog Artificial Intelligence Accelerators: Advances, trends, and emerging designs. IEEE Solid-State Circuits Mag. 2022, 14, 65–79. https://ieeexplore.ieee.org/document/9864008
- M Capra, B Bussolino, A Marchisio, M Shafique, 2020, An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, https://www.mdpi.com/1999-5903/12/7/113/pdf
- J Zhong, Z Liu, X Chen, Apr 2023, Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, https://arxiv.org/abs/2304.10891
- L Capogrosso, F Cunico, DS Cheng, F Fummi, 2023, A Machine Learning-oriented Survey on Tiny Machine Learning, arXiv preprint arXiv:2309.11932, https://arxiv.org/pdf/2309.11932.pdf
- S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, and P. Kitsos, A survey on RISC-V-based machine learning ecosystem, Information, vol. 14, no. 2, p. 64, 2023, https://www.mdpi.com/2078-2489/14/2/64 PDF: https://www.academia.edu/98345984/A_Survey_on_RISC_V_Based_Machine_Learning_Ecosystem
- R. Sanchez-Iborra and A. F. Skarmeta, Tinyml-enabled frugal smart objects: Challenges and opportunities, IEEE Circuits and Systems Magazine, vol. 20, no. 3, pp. 4–18, 2020. https://ieeexplore.ieee.org/document/9166461 PDF: https://sci-hub.se/10.1109/MCAS.2020.3005467
- R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/
- M. Giordano, L. Piccinelli, and M. Magno, Survey and comparison of milliwatts micro controllers for tiny machine learning at the edge, in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2022, pp. 94–97. https://ieeexplore.ieee.org/document/9870017
- C. S. Lindsey and T. Lindblad, “Survey of Neural Network Hardware,” in SPIE 2492, Applications and Science of Artificial Neural Networks, S. K. Rogers and D. W. Ruck, Eds., vol. 2492. International Society for Optics and Photonics, apr 1995, pp. 1194–1205. http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1001095, https://www.semanticscholar.org/paper/Survey-of-neural-network-hardware-Lindsey-Lindblad/0729ff5b500565a23fc9faf7dba6df4f465a3b4b (An AI hardware survey paper from back in 1995.)
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
AI Announcements from Hardware Vendors
- Neal Vaidya, Nick Comly, Joe DeLaere, Ankit Patel and Fred Oh, Sep 08, 2023, NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs, NVIDIA Technical Blog, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
- Michael Kan, July 2023, Intel CEO: Get Ready for the 'AI PC', PCMag UK, https://uk.pcmag.com/laptops/147984/intel-ceo-get-ready-for-the-ai-pc
- Jesse Clayton, May 23, 2023, NVIDIA and Microsoft Drive Innovation for Windows PCs in New Era of Generative AI, NVIDIA Blog, https://blogs.nvidia.com/blog/2023/05/23/microsoft-build-nvidia-ai-windows-rtx/
- Dan Robinson, 26 July 2023, Intel adds fresh x86 and vector instructions for future chips, The Register, https://www.theregister.com/2023/07/26/intel_x86_vector_instructions/
- Mann, Tobias, August 15, 2023, Intel's AVX10 promises benefits of AVX-512 without baggage, The Register, https://www.theregister.com/2023/08/15/avx10_intel_interviews/
- Intel, July 2023, Intel® Advanced Vector Extensions 10 Architecture Specification, Revision 1.0, https://cdrdv2.intel.com/v1/dl/getContent/784267
- NVIDIA, 2023, CUDA Zone, NVIDIA Developer, https://developer.nvidia.com/cuda-zone
- Kevin Okemwa, Oct 2023, Microsoft may debut its first AI chip at Ignite 2023 to mitigate cost, https://www.windowscentral.com/software-apps/microsoft-may-debut-its-first-ai-chip-at-ignite-2023-to-mitigate-cost
- Kyle Wiggers, October 7, 2023, OpenAI said to be considering developing its own AI chips, TechCrunch, https://techcrunch.com/2023/10/06/openai-said-to-be-considering-developing-its-own-ai-chips/
- John Timmer, Oct 21, 2023, IBM has made a new, highly efficient AI processor, Ars Technica, https://arstechnica.com/science/2023/10/ibm-has-made-a-new-highly-efficient-ai-processor/
Hardware-Acceleration Research
Various papers on hardware acceleration, out of thousands, include:
- Vikram Jain, Marian Verhelst, Towards Heterogeneous Multi-core Systems-on-Chip for Edge Machine Learning: Journey from Single-core Acceleration to Multi-core Heterogeneous Systems, Springer Nature, 15 Sept 2023, https://link.springer.com/book/10.1007/978-3-031-38230-7
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” ISCA, 2016. https://ieeexplore.ieee.org/document/7551407, PDF: http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf, PDF Slides: https://eems.mit.edu/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf, Project: http://eyeriss.mit.edu/
- Ruizhe Zhao; Wayne Luk; Xinyu Niu; Huifeng Shi; Haitao Wang, 2017, Hardware acceleration for machine learning, 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), https://ieeexplore.ieee.org/abstract/document/7987595/, PDF: https://www.doc.ic.ac.uk/~wl/papers/17/vlsi17rz.pdf
- A Auten, M Tomei, R Kumar, 2020, Hardware acceleration of graph neural networks, 2020 57th ACM/IEEE Design Automation Conference (DAC), https://ieeexplore.ieee.org/abstract/document/9218751, PDF: http://rakeshk.web.engr.illinois.edu/dac20.pdf
- Nabavinejad, S.M.; Baharloo, M.; Chen, K.C.; Palesi, M.; Kogel, T.; Ebrahimi, M., An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 268–282. http://dx.doi.org/10.1109/JETCAS.2020.3022920, https://ieeexplore.ieee.org/abstract/document/9189825 (On-chip interconnection optimizations.)
- Gobieski, G.; Atli, A.O.; Mai, K.; Lucia, B.; Beckmann, N. Snafu: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 1027–1040. http://dx.doi.org/10.1109/ISCA52012.2021.00084
- Singh, S.; Sarma, A.; Jao, N.; Pattnaik, A.; Lu, S.; Yang, K.; Sengupta, A.; Narayanan, V.; Das, C.R., NEBULA: A Neuromorphic Spin-Based Ultra-Low Power Architecture for SNNs and ANNs. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 363–376. http://dx.doi.org/10.1109/ISCA45697.2020.00039
- Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan Thakkar, Ahmad Salehi, Todd Hastings, Feb 2023, SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs, https://arxiv.org/abs/2302.07036, Code: https://github.com/uky-UCAT/SC_ONN_SIM.git
- S Moon, HG Mun, H Son, JY Sim, 2023, Multipurpose Deep-Learning Accelerator for Arbitrary Quantization With Reduction of Storage, Logic, and Latency Waste, IEEE Journal of Solid-State Circuits, https://ieeexplore.ieee.org/abstract/document/10268412
- Shashank Prasanna, Oct 21, 2020, A complete guide to AI accelerators for deep learning inference — GPUs, AWS Inferentia and Amazon Elastic Inference, Towards Data Science, https://towardsdatascience.com/a-complete-guide-to-ai-accelerators-for-deep-learning-inference-gpus-aws-inferentia-and-amazon-7a5d6804ef1c (Good introduction to hardware acceleration, but from 2020, which is a few years ago now.)
- Robert Clausecker, Daniel Lemire. Dec 2022, Transcoding Unicode Characters with AVX-512 Instructions, https://arxiv.org/abs/2212.05098 (The use of AVX-512 bitwise operations to convert Unicode and UTF8 bytes much faster in parallel.)
- Matthew Kolbe, 2023, Lightning Talk: How to Leverage SIMD Intrinsics for Massive Slowdowns, CppNow, https://www.youtube.com/watch?v=GleC3SZ8gjU (Discusses how using the C++ intrinsics can actually worsen speed versus allowing the compiler to optimize it automatically.)
- Arjun Sha, February 22, 2024, Meet Groq, a Lightning Fast AI Accelerator that Beats ChatGPT and Gemini, https://beebom.com/groq-lpu-chip-ai-platform-beats-chatgpt-gemini/ (Groq is a startup that runs LLM inference on a special hardware chip called LPU for fast inference.)
- CNBC, Apr 10, 2024, Meta debuts new generation of AI chip, CNBC, https://www.cnbc.com/2024/04/10/meta-debuts-new-generation-of-ai-chip.html
- Kif Leswing, April 9, 2024, Intel unveils latest AI chip as Nvidia competition heats up, CNBC, https://www.cnbc.com/2024/04/09/intel-unveils-gaudi-3-ai-chip-as-nvidia-competition-heats-up-.html (Intel Gaudi 3 chip for high-end datacenter usage, competing with NVIDIA H100.)
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Tamador Mohaidat, Kasem Khalil, 2023, A Survey on Neural Network Hardware Accelerators, IEEE Transactions on Artificial Intelligence, Aug. pp. 1-21, vol. 1, https://www.computer.org/csdl/journal/ai/5555/01/10472723/1ViYSMvUFI4
- Doug Eadline, October 5, 2023, How AMD May Get Across the CUDA Moat, HPC Wire, https://www.hpcwire.com/2023/10/05/how-amd-may-get-across-the-cuda-moat/
- Arnab Raha, Raymond Sung, Soumendu Ghosh, Praveen Kumar Gupta, Deepak A. Mathaikutty, Umer I. Cheema, Kevin Hyland, Cormac Brick, Vijay Raghunathan, Efficient Hardware Acceleration of Emerging Neural Networks for Embedded Machine Learning: An Industry Perspective, 2023, In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-19568-6_5 https://link.springer.com/chapter/10.1007/978-3-031-19568-6_5
- S Tuli, NK Jha, 2023, TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference, IEEE Transactions on Computer-Aided Design, https://ieeexplore.ieee.org/abstract/document/10144614/ https://arxiv.org/pdf/2303.14882
- S Kalapothas, M Galetakis, G Flamis, F Plessas, 2023, A Survey on RISC-V-Based Machine Learning Ecosystem, Information, https://www.mdpi.com/2078-2489/14/2/64 PDF: https://www.mdpi.com/2078-2489/14/2/64/pdf
- V Sze, YH Chen, J Emer, A Suleiman, 2017, Hardware for machine learning: Challenges and opportunities, 2017 IEEE Custom Integrated Circuits Conference (CICC) https://ieeexplore.ieee.org/abstract/document/7993626/ https://arxiv.org/pdf/1612.07625
- L Du, Y Du, 2017, Hardware accelerator design for machine learning, https://books.google.com/books?hl=en&lr=&id=EG-QDwAAQBAJ&oi=fnd&pg=PA1&dq=hardware+acceleration+machine+learning&ots=UXH17LVbVv&sig=apwSxxHT82TQJg4H_rzceL9NSMU https://www.researchgate.net/publication/327781400_Hardware_Accelerator_Design_for_Machine_Learning
- S Bavikadi, A Dhavlle, A Ganguly, 2022, A survey on machine learning accelerators and evolutionary hardware platforms, IEEE Design & Test ( Volume: 39, Issue: 3, June 2022), https://ieeexplore.ieee.org/abstract/document/9739030/
- Z Pan, P Mishra, 2022, Hardware acceleration of explainable machine learning, 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), https://ieeexplore.ieee.org/abstract/document/9774739/ https://par.nsf.gov/servlets/purl/10354093
- Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, Muhammad Shafique, Apr 2023, SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers, https://arxiv.org/abs/2304.03986 Code: https://github.com/albertomarchisio/SwiftTron
- Francesco Ratto, Ángela Porras Máinez, Carlo Sau, Paolo Meloni, Gianfranco Deriu, Stefano Delucchi, Massimo Massa, Luigi Raffo, Francesca Palumbo, April 2023, An Automated Design Flow for Adaptive Neural Network Hardware Accelerators. Journal of Signal Processing Systems (2023): 1-23. https://link.springer.com/article/10.1007/s11265-023-01855-x (Adaptable inference for a CNN by dynamic modification of FPGA-accelerated hardware integrations.)
- Sara Hooker. The hardware lottery. Communications of the ACM, 64(12):58–65, November 2021. ISSN 0001-0782. doi: 10.1145/3467017. https://doi.org/10.1145/3467017
- Y. Liao, “Neural Networks in Hardware: A Survey,” Department of Computer Science, University of California, Tech. Rep., 2001. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.460.3235
- V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, dec 2017. https://doi.org/10.1109/JPROC.2017.2761740
- H. F. Langroudi, T. Pandit, M. Indovina, and D. Kudithipudi, “Digital Neuromorphic Chips for Deep Learning Inference: A Comprehensive Study,” in Applications of Machine Learning, M. E. Zelinski, T. M. Taha, J. Howe, A. A. Awwal, and K. M. Iftekharuddin, Eds. SPIE, sep 2019, p. 9. https://doi.org/10.1117/12.2529407
- J. L. Hennessy and D. A. Patterson, “A New Golden Age for Computer Architecture,” Communications of the ACM, vol. 62, no. 2, pp. 48–60, jan 2019. http://dl.acm.org/citation.cfm?doid=3310134.3282307
- W. J. Dally, Y. Turakhia, and S. Han, “Domain-Specific Hardware Accelerators,” Communications of the ACM, vol. 63, no. 7, pp. 48–57, jun 2020. https://dl.acm.org/doi/10.1145/3361682
- R. Smith, “NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder,” Mar 2022. https://www.anandtech.com/show/17327/nvidia-hopper-gpu-architecture-and-h100-accelerator-announced
- B. Funk, “NVIDIA Jetson AGX Orin: The Next-Gen Platform That Will Power Our AI Robot Overlords Unveiled,” mar 2022. https://hothardware.com/news/nvidia-jetson-agx-orin
- “NVIDIA Tesla P100.” https://www.nvidia.com/en-us/data-center/tesla-p100/
- N. P. Jouppi, C. Young, N. Patil, and D. Patterson, “A Domain-Specific Architecture for Deep Neural Networks,” Communications of the ACM, vol. 61, no. 9, pp. 50–59, aug 2018. http://doi.acm.org/10.1145/3154484
- Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, “A Survey of Accelerator Architectures for Deep Neural Networks,” Engineering, vol. 6, no. 3, pp. 264–274, mar 2020. https://doi.org/10.1016/j.eng.2020.01.007
- E. Wang, J. J. Davis, R. Zhao, H.-C. C. Ng, X. Niu, W. Luk, P. Y. K. Cheung, and G. A. Constantinides, “Deep Neural Network Approximation for Custom Hardware,” ACM Computing Surveys, vol. 52, no. 2, pp. 1–39, may 2019. https://dl.acm.org/doi/10.1145/3309551
- S. Khan and A. Mann, “AI Chips: What They Are and Why They Matter,” Georgetown Center for Security and Emerging Technology, Tech. Rep., apr 2020. https://cset.georgetown.edu/research/ai-chips-what-they-are-and-why-they-matter/
- U. Rueckert, “Digital Neural Network Accelerators,” in NANO-CHIPS 2030: On-Chip AI for an Efficient Data-Driven World, B. Murmann and B. Hoefflinger, Eds. Springer, Cham, 2020, ch. 12, pp. 181–202. https://link.springer.com/chapter/10.1007%2F978-3-030-18338-7_12
- T. Rogers and M. Khairy, “An Academic’s Attempt to Clear the Fog of the Machine Learning Accelerator War — SIGARCH,” aug 2021. https://www.sigarch.org/an-academics-attempt-to-clear-the-fog-of-the-machine-learning-accelerator-war/
- F. P. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, “A Survey on Silicon Photonics for Deep Learning,” ACM Journal on Emerging Technologies in Computing Systems, vol. 17, no. 4, oct 2021. https://dl.acm.org/doi/10.1145/3459009
- KAA Fuad, L Chen, 2023, A Survey on Sparsity Exploration in Transformer-Based Accelerators https://www.mdpi.com/2079-9292/12/10/2299
- Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
- S Bhowmick, 2023, Optimizing Transformer Inference on FPGA: A Study on Hardware Acceleration using Vitis HLS, Thesis, PDF: https://aaltodoc.aalto.fi/bitstream/handle/123456789/123155/master_Bhowmick_Soujanya_2023.pdf?sequence=1&isAllowed=y
- David Spuler, March 2024, Chapter 16. Hardware Acceleration, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Flegar G, Scheidegger F, Novaković V, Mariani G, Tomás AE, Malossi ACI, Quintana-Ortí ES, 2019, FloatX: a C++ library for customized floating-point arithmetic. ACM Trans Math Softw 45(4):40, https://dl.acm.org/doi/10.1145/3368086
- H.-Y. Wang and T.-S. Chang, 2022, “Row-wise accelerator for vision transformer,” in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 399–402. https://arxiv.org/abs/2205.03998
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Lucas Mearian, 05 Jun 2024, Can Intel’s new chips compete with Nvidia in the AI universe? https://www.computerworld.com/article/2138358/can-intels-new-chips-compete-with-nvidia-in-the-ai-universe.html
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer Oct 2014, cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Zhang, X., Wang, Q., and Chothia, Z., Openblas. 2014. http://xianyi.github.io/OpenBLAS
- Intel. 2020. Intel® math kernel library for deep learning networks, https://github.com/oneapi-src/oneDNN. [Online; accessed 3-Mar-2021].
- Christoforos Kachris, 18 Jan 2024, A Survey on Hardware Accelerators for Large Language Models, https://arxiv.org/abs/2401.09890
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Peter Guest, Oct 6, 2023, Graphcore Was the UK's AI Champion—Now It’s Scrambling to Survive, https://www.wired.com/story/graphcore-uk-ai-champion-scrambling-to-stay-afloat/ (An article about GraphCore's struggles against NVIDIA and GPUs with its IPUs.)
- Etched, June 25, 2024, Etched is Making the Biggest Bet in AI, https://www.etched.com/announcing-etched
- Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
- Mike Murphy, 26 Aug 2024, Enhancing enterprise AI with the IBM Spyre Accelerator, https://research.ibm.com/blog/spyre-for-z
- Tiernan Ray, Aug. 27, 2024, AI startup Cerebras debuts 'world's fastest inference' service - with a twist: The AI computer maker claims its inference service is dramatically faster and makes new kinds of 'agentic' AI possible, https://www.zdnet.com/article/ai-startup-cerebras-debuts-worlds-fastest-inference-with-a-twist/
- https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
- Dina Genkina, Aug 29, 2024, AI Inference Competition Heats Up: First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI, IEEE Spectrum, https://spectrum.ieee.org/new-inference-chips
- Sophia R. Cunningham, Dominique Archambault, Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- James Wang, August 27, 2024, Introducing Cerebras Inference: AI at Instant Speed, https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed
- Latent Space, Sep 03, 2024, Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation, https://www.latent.space/p/nyla
- Sean Hollister, Sep 4, 2024, Intel reveals first Lunar Lake laptop CPUs: everything you need to know, https://www.theverge.com/2024/9/3/24233957/intel-lunar-lake-core-ultra-200v-launch
- Marius Hobbhahn, Lennart Heim, Gökçe Aydos, Nov 09, 2023, Trends in Machine Learning Hardware, https://epochai.org/blog/trends-in-machine-learning-hardware
- Frederic Lardinois, September 9, 2024, Apple announces its new A18 and A18 Pro iPhone chips, https://techcrunch.com/2024/09/09/apple-announces-its-new-a18-iphone-chip/
- Nick Evanson, September 2, 2024, OpenAI plans to build its own AI chips on TSMC's forthcoming 1.6 nm A16 process node, https://www.yahoo.com/tech/openai-plans-build-own-ai-120921975.html
- Matthew S. Smith, Sep 2024, Challengers Are Coming for Nvidia’s Crown. In AI’s Game of Thrones, don’t count out the upstarts, https://spectrum.ieee.org/nvidia-ai
- Sean Hollister, Sep 10, 2024. AMD is turning its back on flagship gaming GPUs to chase AI first, https://www.theverge.com/2024/9/9/24240173/amd-udna-gpu-ai-gaming-rdna-cdna-jack-huynh
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Kif Leswing, Oct 10 2024, AMD launches AI chip to rival Nvidia’s Blackwell, https://www.cnbc.com/2024/10/10/amd-launches-mi325x-ai-chip-to-rival-nvidias-blackwell-.html
- Yu-Ching Hu, September 2024, Efficient Accelerator-Rich Computers for Future Applications, Ph.D. Thesis, Computer Science, https://escholarship.org/content/qt68w3z4vq/qt68w3z4vq.pdf
- Mahernaija, Sep 28, 2024, Update 2024 : The Best NVIDIA GPUs for LLM Inference: A Comprehensive Guide. Comparative Study of All NVIDIA GPU, https://medium.com/@mahernaija/the-best-nvidia-gpus-for-llm-inference-a-comprehensive-guide-56ff5b3e3b1f
- Arjun Kharpal, Nov 8 2024, How Samsung fell behind in the AI boom leading to a $126 billion wipeout, https://www.cnbc.com/2024/11/08/how-samsung-fell-behind-in-the-ai-boom-behind-rival-sk-hynix.html (About Samsung's HBM memory chips.)
- Maxwell Zeff, November 20, 2024, Nvidia’s CEO defends his moat as AI labs change how they improve their AI models, https://techcrunch.com/2024/11/20/nvidias-ceo-defends-his-moat-as-ai-labs-change-how-they-improve-their-ai-models/
- Don Clark, Dec. 3, 2024, The Furious Contest to Unseat Nvidia as King of A.I. Chips: Amazon, Advanced Micro Devices and several start-ups are beginning to offer credible alternatives to Nvidia’s chips, especially for a phase of A.I. development known as “inferencing.” https://www.nytimes.com/2024/12/03/technology/nvidia-ai-chips.html
- Andy Patrizio, Dec 02, 2024, MRDIMM: Why your next server will have a new kind of memory, MRDIMM promises faster memory with no hardware or software changes. https://www.networkworld.com/article/3615543/mrdimm-why-your-next-server-will-have-a-new-kind-of-memory.html
GPU Research
Research papers on various GPU issues:
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the compute-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023. https://arxiv.org/abs/2303.06865
- Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu, 21 Feb 2024, Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, https://arxiv.org/abs/2402.13499
- David Spuler, March 2024, Chapter 16. Hardware Acceleration, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
- Seungrok Jung, 15 Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Dina Genkina, Aug 29, 2024, AI Inference Competition Heats Up: First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI, IEEE Spectrum, https://spectrum.ieee.org/new-inference-chips
- David Spuler, March 2024, GPU Hardware Acceleration, in Generative AI in C++, https://www.aussieai.com/book/ch16-gpu-hardware-acceleration
- Latent Space, Sep 03, 2024 Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation, https://www.latent.space/p/nyla
- Florian Douetteau, September 7, 2024, Get ready for a tumultuous era of GPU cost volatility, https://venturebeat.com/ai/get-ready-for-a-tumultuous-era-of-gpu-cost-volitivity/
- M Davies, I McDougall, S Anandaraj, D Machchhar, April 2024, A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 20–36, https://doi.org/10.1145/3620665.3640367 https://dl.acm.org/doi/abs/10.1145/3620665.3640367 (Benchmarking analysis of GPU execution extending MLPerf.)
- Peter Guest, Oct 6, 2023, Graphcore Was the UK's AI Champion—Now It’s Scrambling to Survive, https://www.wired.com/story/graphcore-uk-ai-champion-scrambling-to-stay-afloat/ (An article about Graphcore's struggles to compete with its IPUs against NVIDIA's GPUs.)
- Etched, June 25, 2024 Etched is Making the Biggest Bet in AI, https://www.etched.com/announcing-etched
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Kif Leswing, Oct 10 2024, AMD launches AI chip to rival Nvidia’s Blackwell, https://www.cnbc.com/2024/10/10/amd-launches-mi325x-ai-chip-to-rival-nvidias-blackwell-.html
- Paul Delestrac, 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications, Ph.D. Thesis, Micro and nanotechnologies/Microelectronics, Université de Montpellier, NNT: 2024UMONS014, https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
- Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
- Mahernaija, Sep 28, 2024, Update 2024 : The Best NVIDIA GPUs for LLM Inference: A Comprehensive Guide. Comparative Study of All NVIDIA GPU, https://medium.com/@mahernaija/the-best-nvidia-gpus-for-llm-inference-a-comprehensive-guide-56ff5b3e3b1f
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Bagus Hanindhito and Lizy K. John. 2024. Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE '24). Association for Computing Machinery, New York, NY, USA, 178–189. https://doi.org/10.1145/3629526.3653835 https://dl.acm.org/doi/abs/10.1145/3629526.3653835 PDF: https://lca.ece.utexas.edu/pubs/Hanindhito_AcceleratingMLWorkloads.pdf
- C. Wang, P. Song, H. Zhao, F. Zhang, J. Wang and L. Zhang, "High-Utilization GPGPU Design for Accelerating GEMM Workloads: An Incremental Approach," 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, Singapore, 2024, pp. 1-5, doi: 10.1109/ISCAS58744.2024.10558334. https://ieeexplore.ieee.org/abstract/document/10558334
- Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
- Vasily Volkov, August 12, 2016, Understanding Latency Hiding on GPUs, Ph.D. Thesis, Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2016-143, http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
Multi-GPU Research
Research papers on various multi-GPU inference and scheduling issues:
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
- Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen, Z Zhang, et al., 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://www.usenix.org/system/files/nsdi24-jiang-ziheng.pdf
- A Ouyang, June 2023, Understanding the Performance of Transformer Inference, Masters Thesis, Electrical Engineering and Computer Science, MIT, https://dspace.mit.edu/handle/1721.1/151543 https://dspace.mit.edu/bitstream/handle/1721.1/151543/ouyang-aouyang-meng-eecs-2023-thesis.pdf?sequence=1&isAllowed=y (Detailed analysis of Transformer performance, including the techniques of KV caching.)
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet, 16 Jun 2024, Optimized Speculative Sampling for GPU Hardware Accelerators, https://arxiv.org/abs/2406.11016 (Speculative decoding accelerated with multiple GPUs using approaches such as tiling, and uses a fused sigmoid replacing Softmax.)
- Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang, 24 Jun 2024, Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars, https://arxiv.org/abs/2406.17812
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
- Tal Ben-Nun, Ely Levy, Amnon Barak, Eri Rubin, 2015, Memory access patterns: the missing piece of the multi-GPU puzzle, SC15: International Conference for High-Performance Computing, Networking, Storage and Analysis, pages 1-12, DOI: 10.1145/2807591.2807611, https://www.computer.org/csdl/proceedings-article/sc/2015/2807611/12OmNzaQoh1
- Ari Lotter, Jeffrey Quesnelle, Umer H. Adil, Dillon Rolnick, Esteban La Rocca, 2024, A Preliminary Report on DisTrO, https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf https://venturebeat.com/wp-content/uploads/2024/08/A_Preliminary_Report_on_DisTrO.pdf (Reducing the inter-GPU networking bandwidth cost during training.)
- Seungrok Jung, 15 Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
- Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://digitalassets.lib.berkeley.edu/techreports/ucb/incoming/EECS-2024-108.pdf
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
- Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt, 1 Sep 2023, FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference, https://arxiv.org/abs/2309.00558
- Hajer Ayadi, Jimmy X. Huang, Aijun An, Yiming Shao, Hao Zhou, and Hossein Pourmodheji. 2023. TAMG: Topology-Aware Multi-GPU Allocation via Deep Reinforcement Learning. In Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering (CASCON '23). IBM Corp., USA, 185–190. https://dl.acm.org/doi/10.5555/3615924.3615946
- Jiri Kraus, March 2024, Multi GPU Programming Models for HPC and AI, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61339/
- M. Gil et al., "TLP Balancer: Predictive Thread Allocation for Multi-Tenant Inference in Embedded GPUs," in IEEE Embedded Systems Letters, doi: 10.1109/LES.2024.3497587. https://ieeexplore.ieee.org/abstract/document/10753458/
GPU Software Platforms
The main GPU software acceleration frameworks include:
- CUDA (NVIDIA)
- ROCm (AMD)
- Triton (open source, originally by OpenAI)
- oneAPI (Intel)
- Vulkan
- SYCL
CPU Execution of AI Workloads
Although GPUs are the mainstay of LLM execution, there is increasing focus on using CPUs for inference. This arises from the need to run on-device inference on AI phones and AI PCs, some of which have an NPU, while others have only limited SIMD capabilities, such as the x86 AVX instructions.
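As a rough illustration of what CPU-side SIMD acceleration looks like, here is a minimal sketch of a vector dot product, the basic operation underneath matrix multiplication. The function names `dot_product` and `dot_product_scalar` are hypothetical and chosen only for this example. It assumes a compiler that defines `__AVX__` when AVX is enabled (for example, with `-mavx` on GCC or Clang), and it falls back to a plain scalar loop when AVX is not available, so it is not tied to any particular inference engine.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>  // x86 AVX intrinsics, only included when AVX is enabled
#endif

// Plain scalar dot product: one multiply and one add per loop iteration.
float dot_product_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Dot product that processes 8 floats at a time in 256-bit AVX registers
// when the build has AVX enabled, and uses the scalar loop otherwise.
float dot_product(const float* a, const float* b, std::size_t n) {
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();         // 8 running partial sums
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // unaligned load of 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);             // copy the 8 partial sums out
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        sum += lanes[k];                      // horizontal add of the lanes
    }
    for (; i < n; ++i) {
        sum += a[i] * b[i];                   // leftover elements (n % 8)
    }
    return sum;
#else
    return dot_product_scalar(a, b, n);       // no AVX in this build
#endif
}
```

Calling `dot_product(a, b, n)` on two `float` arrays of length `n` returns the same mathematical result as the scalar version, although floating-point rounding can differ slightly because the additions happen in a different order. Production kernels usually go further, with fused multiply-add, wider registers, or low-bit quantized inputs, as explored in the research papers in this section.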
Research on CPU execution of LLMs:
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Djip007, May 2024, llamafile 0.8.6 CPU benchmark #450, https://github.com/Mozilla-Ocho/llamafile/discussions/450 (Running llamafile at 20 tokens per second on a non-GPU commodity CPU.)
- J Cañete, F Bravo-Marquez, 2024, Speedy Gonzales: A Collection of Fast Task-Specific Models for Spanish, https://felipebravom.com/publications/starsem2024.pdf (Optimizing small models on CPU and GPU for the Spanish language, mostly using distillation.)
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
- C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
- V Vanhoucke, A Senior, MZ Mao, 2011, Improving the speed of neural networks on CPUs, Google Research, https://research.google/pubs/pub37631.pdf
- David Spuler, March 2024, Chapter 17. AVX Intrinsics, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Intel, 2018, Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN), https://github.com/intel/mkl-dnn
- Xianyi Zhang, Qian Wang, and Zaheer Chothia, 2014, OpenBLAS, http://xianyi.github.io/OpenBLAS
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 Code: https://github.com/tonyzhang617/nomad-dist (Converts 4-bit vector dot products to using SIMD registers as lookup tables on CPUs.)
- Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang, 25 Jun 2024, T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge, https://arxiv.org/abs/2407.00088 Code: https://github.com/microsoft/T-MAC (Table lookup for low-bit quantization on CPUs.)
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Pujiang He, Shan Zhou, Changqing Li, Wenhuan Huang, Weifei Yu, Duyi Wang, Chen Meng, Sheng Gui, 16 May 2024, Distributed Inference Performance Optimization for LLMs on CPUs, https://arxiv.org/abs/2407.00029
- Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng, 6 Jun 2024, Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp, https://arxiv.org/abs/2406.10816
- Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, Nam Sung Kim, 2024, Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference, IEEE Computer Architecture Letters, vol. 23, Jan.-Jun. 2024, pp. 117-120, DOI: 10.1109/LCA.2024.3397747, https://www.computer.org/csdl/journal/ca/2024/01/10538369/1XcOWKoKwfe
- Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng, 7 Dec 2023 (v2), Efficient LLM Inference on CPUs, https://arxiv.org/abs/2311.00502 https://github.com/intel/intel-extension-for-transformers
- Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
- David Spuler, March 2024, CPU Hardware Acceleration, in Generative AI in C++, https://www.aussieai.com/book/ch16-cpu-hardware-acceleration
- Sean Hollister, Sep 4, 2024, Intel reveals first Lunar Lake laptop CPUs: everything you need to know, https://www.theverge.com/2024/9/3/24233957/intel-lunar-lake-core-ultra-200v-launch
- Anonymous authors, 2024, Distributed Inference Performance Optimizations for LLMs on CPUs, ICLR 2024, https://openreview.net/pdf?id=oEbILBMvDS
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
- Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In NIPS Workshop, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766 PDF: https://citeseerx.ist.psu.edu/doc/10.1.1.308.2766
- Z. Zhang, Y. Chen, B. He and Z. Zhang, June 2023, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 6, pp. 1982-1995, June 2023, doi: 10.1109/TPDS.2023.3269530, https://ieeexplore.ieee.org/abstract/document/10107474
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Daon Park and Bernhard Egger. 2024. Improving Throughput-oriented LLM Inference with CPU Computations. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT '24). Association for Computing Machinery, New York, NY, USA, 233–245. https://doi.org/10.1145/3656019.3676949 https://dl.acm.org/doi/abs/10.1145/3656019.3676949 (Combining CPU and GPU computations.)
- Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
- S Na, G Jeong, BH Ahn, J Young, T Krishna, H Kim, 2024, Understanding Performance Implications of LLM Inference on CPUs, https://seonjinna.github.io/assets/pdf/iiswc24_CPULLM.pdf
Neural Processing Unit (NPU)
An NPU is a hardware component designed specifically for AI workloads. It is typically either integrated into the same chip as the CPU or supplied as an add-on hardware component, and is generally much less capable than a full GPU. Nevertheless, the NPU is the basis for hardware acceleration on AI phones and also on some AI PCs.
- Ken Yeung, May 21, 2024, Microsoft introduces Phi-Silica, a 3.3B parameter model made for Copilot+ PC NPUs, https://venturebeat.com/ai/microsoft-introduces-phi-silica-a-3-3b-parameter-model-made-for-copilot-pc-npus/
- Minseok Seo, Xuan Truong Nguyen, Seok Joong Hwang, Yongkee Kwon, Guhyun Kim, Chanwook Park, Ilkon Kim, Jaehan Park, Jeongbin Kim, Woojae Shin, Jongsoon Won, Haerang Choi, Kyuyoung Kim, Daehan Kwon, Chunseok Jeong, April 2024, IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, Pages 545–560, https://doi.org/10.1145/3620666.3651324 https://dl.acm.org/doi/abs/10.1145/3620666.3651324
- William Gallagher, Apr 16, 2024, When to expect every Mac to get the AI-based M4 processor, Apple Insider, https://appleinsider.com/articles/24/04/14/when-to-expect-every-mac-to-get-the-ai-based-m4-processor
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Donghyeon Han, Hoi-Jun Yoo, 2023, On-Chip Training NPU - Algorithm, Architecture and SoC Design, Springer (27 July 2023), https://www.amazon.com/dp/B0C6CTPB9K/
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu, 9 Mar 2024 (v4), AutoDroid: LLM-powered Task Automation in Android, https://arxiv.org/abs/2308.15272 Code: https://autodroid-sys.github.io/ (Integrates both on-device Vicuna and cloud-based GPT-4/GPT-3.5 into an Android phone app called AutoDroid.)
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor’s Thesis, Institut für Informatik der Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD.)
- Sam Rutherford, Wed, Oct 25, 2023, The Snapdragon X Elite is Qualcomm's most powerful chip to date https://www.engadget.com/the-snapdragon-x-elite-is-qualcomms-most-powerful-chip-to-date-190004830.html
- Steve Dent, Thu, Mar 28, 2024, Microsoft Copilot AI will soon run locally on PCs, https://www.engadget.com/microsoft-copilot-ai-will-soon-run-locally-on-pcs-130642514.html
- Matthijs Hollemans, April 2024 (accessed), The Neural Engine — what do we know about it? https://github.com/hollance/neural-engine
- Victor Hristov Sep 17, 2022 (updated), A16 Bionic explained: what's new in Apple's Pro-grade mobile chip? https://www.phonearena.com/news/A16-Bionic-explained-whats-new_id142438
- Mustafa Aljadery, 2024 (accessed), Lightning Whisper MLX, https://github.com/mustafaaljadery/lightning-whisper-mlx (Whisper model optimized for Apple MLX hardware acceleration.)
- Jeff Butts, Feb 16th, 2023, What Is the Apple Neural Engine and What Does It Do? https://www.macobserver.com/tips/deep-dive/what-is-apple-neural-engine/
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Dr. Tehseen Zia, June 20, 2024, The Rise of Neural Processing Units: Enhancing On-Device Generative AI for Speed and Sustainability, https://www.unite.ai/the-rise-of-neural-processing-units-enhancing-on-device-generative-ai-for-speed-and-sustainability/
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Jul 2024, Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU, https://arxiv.org/abs/2407.05858
- Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun, 3 Aug 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 Code: https://github.com/OpenBMB/MiniCPM-V
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, 27 April 2024, Tandem Processor: Grappling with Emerging Operators in Neural Networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Pages 1165 - 1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu. 2024. WiP: Efficient LLM Prefilling with Mobile NPU. In Proceedings of the Workshop on Edge and Mobile Foundation Models (EdgeFM '24). Association for Computing Machinery, New York, NY, USA, 33–35. https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066
- Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, Di Huang, Xinkai Song, Yifan Hao, Jie Zhang, Tian Zhi, Yongwei Zhao, Zidong Du, Xing Hu, Qi Guo, Tianshi Chen, 24 Sep 2024, Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM, https://arxiv.org/abs/2409.15654
- Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
- Lucas Mearian, 24 Oct 2024, 2025: The year of the AI PC, Computer World, https://www.computerworld.com/article/3583355/2025-the-year-of-the-ai-pc.html
More AI Research
Read more about:
- List of AI Optimizations
- Inference Optimizations
- Loop Optimizations
- Code Optimizations