Aussie AI
Partitioning
Last Updated 28 November, 2024
by David Spuler, Ph.D.
Partitioning is a model inference optimization technique that involves organizing data in memory, especially the ordering of vectors and tensors. In-memory partitioning can serve several goals:
- Faster memory access. Memory access is faster when data is laid out contiguously, or when data is retained in memory longer rather than repeatedly swapped in and out (see the sketch after this list).
- Pipelining operations to GPUs. Organizing the data before it is sent to the GPU keeps the GPU busy rather than stalled waiting on transfers.
- Parallelization of operations across multiple GPUs.
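To make the first goal concrete, here is a minimal C++ sketch (an illustration, not code from any engine or paper cited below) contrasting contiguous row-order traversal with strided column-order traversal of the same row-major matrix. The traversal order is the only difference between the two loops; the contiguous one is typically several times faster because every fetched cache line is fully used.

    // Minimal sketch: contiguous vs. strided access over a row-major matrix.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 4096;
        std::vector<float> mat(n * n, 1.0f);  // row-major n x n matrix

        // Contiguous: walk rows in memory order (cache lines fully used).
        auto t0 = std::chrono::steady_clock::now();
        float sum1 = 0.0f;
        for (std::size_t r = 0; r < n; ++r)
            for (std::size_t c = 0; c < n; ++c)
                sum1 += mat[r * n + c];
        auto t1 = std::chrono::steady_clock::now();

        // Strided: walk columns, jumping n floats per access (cache-hostile).
        float sum2 = 0.0f;
        for (std::size_t c = 0; c < n; ++c)
            for (std::size_t r = 0; r < n; ++r)
                sum2 += mat[r * n + c];
        auto t2 = std::chrono::steady_clock::now();

        auto ms = [](auto a, auto b) {
            return std::chrono::duration<double, std::milli>(b - a).count();
        };
        std::printf("contiguous: %.1f ms, strided: %.1f ms (sums %g %g)\n",
                    ms(t0, t1), ms(t1, t2), sum1, sum2);
        return 0;
    }

Compiled with optimizations (e.g., g++ -O2), the strided loop is usually noticeably slower once the matrix is too large for cache; partitioning data in memory aims to keep computations in the contiguous, cache-friendly case.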
Research Papers on Partitioning
GPU partitioning is a type of software acceleration that makes hardware acceleration more effective. Partitioning data well can improve throughput and efficiency when using multiple GPUs, as in the sketch below.
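As a rough analogy for this kind of data partitioning (an illustrative sketch with hypothetical names, not the method of any paper below), the following C++ code splits a matrix-vector product into contiguous row blocks, one per worker thread standing in for a GPU. Each worker owns a disjoint slice of the output vector, so the shards run fully in parallel with no synchronization until the final join; a real multi-GPU engine would additionally copy each shard to its device and gather the partial outputs.

    // Minimal sketch: row-wise partitioning of a matrix-vector product
    // across workers, each worker standing in for one GPU shard.
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    // Each worker computes only the output rows of its own partition.
    void worker(const std::vector<float>& mat, const std::vector<float>& vec,
                std::vector<float>& out, std::size_t n,
                std::size_t row_begin, std::size_t row_end)
    {
        for (std::size_t r = row_begin; r < row_end; ++r) {
            float sum = 0.0f;
            const float* row = &mat[r * n];  // contiguous row block
            for (std::size_t c = 0; c < n; ++c)
                sum += row[c] * vec[c];
            out[r] = sum;  // disjoint rows: no synchronization needed
        }
    }

    int main() {
        const std::size_t n = 1024, num_workers = 4;  // e.g., 4 GPUs
        std::vector<float> mat(n * n, 1.0f), vec(n, 2.0f), out(n, 0.0f);

        std::vector<std::thread> pool;
        const std::size_t rows_per = n / num_workers;
        for (std::size_t w = 0; w < num_workers; ++w) {
            std::size_t begin = w * rows_per;
            std::size_t end = (w + 1 == num_workers) ? n : begin + rows_per;
            pool.emplace_back(worker, std::cref(mat), std::cref(vec),
                              std::ref(out), n, begin, end);
        }
        for (auto& t : pool) t.join();

        std::printf("out[0] = %f\n", out[0]);  // expect 2048.0
        return 0;
    }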
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Code: https://github.com/inferflow/inferflow
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation
- Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136
- Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz, 14 May 2024, Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach, https://arxiv.org/abs/2405.08754
- Eishi Arima, Minjoon Kang, Issa Saba, Josef Weidendorfer, Carsten Trinitis, Martin Schulz, 6 May 2024, Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps, https://arxiv.org/abs/2405.03838
- Issa Saba, Eishi Arima, Dai Liu, Martin Schulz, 6 May 2024, Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning, https://arxiv.org/abs/2405.03831
- Houssam-Eddine Zahaf, Ignacio Sanudo Olmedo, Jayati Singh, Nicola Capodieci, Sebastien Faucou, 21 May 2021, Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads, https://arxiv.org/abs/2105.10312
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Paper with extensive coverage of numerous compiler auto-optimizations of program code.)
- V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural networks on CPUs, In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766 (This paper explores some general code optimizations in relation to CPU and GPU execution, including lazy evaluation, loop unrolling, parallel accumulators, and in-memory partitioning of data for hardware acceleration.)
- Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
- Sharada Yeluri, 20 Feb 2024, LLM Inference: HW/SW Optimizations, Juniper Networks community blog, https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
- Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
- Yikun Han, Chunjiang Liu, Pengfei Wang, 18 Oct 2023, A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
- Zheng Wang, Shu Xian Teo, Jieer Ouyang, Yongjun Xu, Wei Shi, 26 May 2024, M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions, https://arxiv.org/abs/2405.16420
- Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
- Y. Song, Y. Meng, B. Chen, S. Chen and Y. Kang, 2024, SALTM: Accelerating Large Transformers in Multi-device System with 2D Model Partitioning Method, Integrated Circuits and Systems, doi: 10.23919/ICS.2024.3458897, https://ieeexplore.ieee.org/abstract/document/10678935 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678935
- Dimitrios Kafetzis, Iordanis Koutsopoulos, Oct 2024, Demo: An Experimental Platform for AI Model Partitioning on Resource-constrained Devices, https://dl.acm.org/doi/pdf/10.1145/3641512.3690629
- Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu, 24 Nov 2024, Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems, https://arxiv.org/abs/2411.15715