Aussie AI

Partitioning

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

Partitioning is a model inference optimization technique that organizes data in memory, especially the ordering and layout of vectors and tensors. In-memory partitioning can serve several goals:

  • Faster memory access. Storing data contiguously, or keeping it resident in memory longer rather than swapping it in and out, reduces access costs.
  • Pipelining operations to GPUs. Organizing the data before it is sent to the GPU helps keep the GPU busy.
  • Parallelization of operations across multiple GPUs.
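
As a rough illustration of the contiguous-memory and multi-GPU goals above, the following sketch splits the rows of a weight matrix into contiguous blocks and computes a matrix-vector product one block at a time. It is a minimal pure-Python sketch, not a GPU implementation: the function names (`partition_rows`, `matvec`, `partitioned_matvec`) are hypothetical, and the loop over blocks stands in for work that would be dispatched to separate devices.

```python
def partition_rows(num_rows, num_parts):
    """Split range(num_rows) into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(num_rows, num_parts)
    bounds = []
    start = 0
    for p in range(num_parts):
        # The first `extra` partitions each take one additional row.
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def matvec(rows, x):
    """Plain matrix-vector product: one dot product per row."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def partitioned_matvec(weights, x, num_parts):
    """Compute weights @ x block by block over contiguous row partitions."""
    out = []
    for start, end in partition_rows(len(weights), num_parts):
        block = weights[start:end]  # contiguous slice; assumed to go to one device
        out.extend(matvec(block, x))  # concatenating in order restores the full result
    return out


weights = [[i + j for j in range(4)] for i in range(10)]
x = [1, 2, 3, 4]
result = partitioned_matvec(weights, x, num_parts=3)
```

Because each block covers a contiguous range of rows, the per-block results can simply be concatenated in order, and the partitioned result matches the unpartitioned `matvec(weights, x)`.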

Research Papers on Partitioning

GPU partitioning is a software acceleration technique that makes hardware acceleration more effective. A well-chosen partition of the data can improve throughput and efficiency when using multiple GPUs.
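
One way to make "partitioning optimally" concrete is to balance the load across devices. The sketch below assumes a model whose layers have known integer costs (for example, estimated milliseconds per layer) and splits them into contiguous stages so that the most heavily loaded device is as light as possible. The function name `split_layers` and the cost values are hypothetical, and this is only one of many partitioning strategies, not the method of any paper listed here.

```python
def split_layers(costs, num_devices):
    """Contiguous split of layers that minimizes the maximum per-device cost.

    Assumes non-negative integer costs. Returns (max_load, stages), where
    stages is a list of lists of layer indices, one list per device used.
    """
    def stages_needed(cap):
        # Greedily pack layers into stages without exceeding `cap`.
        stages, load = 1, 0
        for c in costs:
            if load + c > cap:
                stages += 1
                load = c
            else:
                load += c
        return stages

    # Binary search for the smallest cap that fits within num_devices stages.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_devices:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the actual stage assignment using the best cap found.
    stages, current, load = [], [], 0
    for i, c in enumerate(costs):
        if current and load + c > lo:
            stages.append(current)
            current, load = [], 0
        current.append(i)
        load += c
    stages.append(current)
    return lo, stages


layer_costs = [4, 2, 3, 5, 1, 6]  # hypothetical per-layer costs
max_load, stages = split_layers(layer_costs, num_devices=3)
```

In a pipeline, the slowest stage bounds overall throughput, which is why this sketch minimizes the maximum load rather than the total. The greedy rebuild may use fewer stages than there are devices when that already achieves the best possible maximum.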

  • Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
  • Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
  • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation
  • Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136
  • Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz, 14 May 2024, Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach, https://arxiv.org/abs/2405.08754
  • Eishi Arima, Minjoon Kang, Issa Saba, Josef Weidendorfer, Carsten Trinitis, Martin Schulz, 6 May 2024, Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps, https://arxiv.org/abs/2405.03838
  • Issa Saba, Eishi Arima, Dai Liu, Martin Schulz, 6 May 2024, Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning, https://arxiv.org/abs/2405.03831
  • Houssam-Eddine Zahaf, Ignacio Sanudo Olmedo, Jayati Singh, Nicola Capodieci, Sebastien Faucou, 21 May 2021, Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads, https://arxiv.org/abs/2105.10312
  • D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Paper with extensive coverage of numerous compiler auto-optimizations of program code.)
  • V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural networks on CPUs, In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766 (This paper explores some general code optimizations in relation to CPU and GPU execution, including lazy evaluation, loop unrolling, parallel accumulators, and in-memory partitioning of data for hardware acceleration.)
  • Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
  • https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
  • Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
  • Yikun Han, Chunjiang Liu, Pengfei Wang, 18 Oct 2023, A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
  • Zheng Wang, Shu Xian Teo, Jieer Ouyang, Yongjun Xu, Wei Shi, 26 May 2024, M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions, https://arxiv.org/abs/2405.16420
  • Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
  • Y. Song, Y. Meng, B. Chen, S. Chen and Y. Kang, 2024, SALTM: Accelerating Large Transformers in Multi-device System with 2D Model Partitioning Method, Integrated Circuits and Systems, doi: 10.23919/ICS.2024.3458897, https://ieeexplore.ieee.org/abstract/document/10678935 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678935
  • Dimitrios Kafetzis, Iordanis Koutsopoulos, Oct 2024, Demo: An Experimental Platform for AI Model Partitioning on Resource-constrained Devices, https://dl.acm.org/doi/pdf/10.1145/3641512.3690629
  • Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu, 24 Nov 2024, Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems, https://arxiv.org/abs/2411.15715
  • Jiale Liu, Yifan Zeng, Shaokun Zhang, Chi Zhang, Malte Højmark-Bertelsen, Marie Normann Gadeberg, Huazheng Wang, Qingyun Wu, 6 May 2025, Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale, https://arxiv.org/abs/2505.03973
  • Vikas Natesh, H.T. Kung, 12 Apr 2025, PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations, https://arxiv.org/abs/2504.09064 (Split vectors into positive and negatives to avoid overflow in vector dot product accumulators.)
  • Lucas Cardoso, Vitor Santos, José Ribeiro Filho, Ricardo Prudêncio, Regiane Kawasaki and Ronnie Alves, 14 Aug 2025, Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory, https://arxiv.org/abs/2508.10628
  • Michael Grosskopf, Kellin Rumsey, Ayan Biswas, Earl Lawrence, 22 Jul 2025, A Partitioned Sparse Variational Gaussian Process for Fast, Distributed Spatial Modeling, https://arxiv.org/abs/2507.16771
  • Lam Ngo, Huong Ha, Jeffrey Chan, Hongyu Zhang, 9 Aug 2025, MOCA-HESP: Meta High-dimensional Bayesian Optimization for Combinatorial and Mixed Spaces via Hyper-ellipsoid Partitioning, https://arxiv.org/abs/2508.06847
  • Qize Jiang, Linsey Pang, Alice Gatti, Mahima Aggarwal, Giovanna Vantini, Xiaosong Ma, Weiwei Sun, Sourav Medya, Sanjay Chawla, 11 Aug 2025, RIDGECUT: Learning Graph Partitioning with Rings and Wedges, https://arxiv.org/abs/2505.13986
  • Ahmed Shokry and Ayman Khalafallah, 27 Jul 2025, Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning, https://arxiv.org/abs/2507.20369
  • Lara Neves, Afonso Lourenço, Alberto Cano, Goreti Marreiros, 28 Jul 2025, Online hierarchical partitioning of the output space in extreme multi-label data stream, https://arxiv.org/abs/2507.20894
  • Yining Huang, Bin Li, Keke Tang, Meilian Chen, 28 Jul 2025, LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning, https://arxiv.org/abs/2507.20999
  • Ashok S. Kumar, Nancy Nayak, Sheetal Kalyani, Himal A. Suraweera, 26 Jul 2025, DRL-AdaPart: DRL-Driven Adaptive STAR-RIS Partitioning for Fair and Frugal Resource Utilization, https://arxiv.org/abs/2407.06868
  • Urban Eriksson, 29 Jul 2025, An Equal-Probability Partition of the Sample Space: A Non-parametric Inference from Finite Samples, https://arxiv.org/abs/2507.21712
  • Christopher Godwin Udomboso, Caston Sigauke and Ini Adinya, 2 Aug 2025, Fusion Sampling Validation in Data Partitioning for Machine Learning, https://arxiv.org/abs/2508.01325
  • Kun Peng, Cong Cao, Hao Peng, Zhifeng Hao, Lei Jiang, Kongjing Gu, Yanbing Liu and Philip S. Yu, 7 Aug 2025, Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning, https://arxiv.org/abs/2508.05023
  • Offa Kingsleigh, Alfred Abercrombie, David Woolstencroft, Beorhtric Meadowcroft, Marcus Irvin, 8 Aug 2025, Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration, https://arxiv.org/abs/2501.12901
  • Sowmini Devi Veeramachaneni, Ramamurthy Garimella, 18 Aug 2025, Constrained Centroid Clustering: A Novel Approach for Compact and Structured Partitioning, https://arxiv.org/abs/2508.12758
  • Michael E. Sander, Vincent Roulet, Tianlin Liu, Mathieu Blondel, 19 Aug 2025, Joint Learning of Energy-based Models and their Partition Function, https://arxiv.org/abs/2501.18528
  • Sami Alabed, Dominik Grewe, Norman Alexander Rink, Timur Sitdikov, Agnieszka Swietlik, Dimitrios Vytiniotis, Daniel Belov, 20 Aug 2025, TOAST: Fast and scalable auto-partitioning based on principled static analysis, https://arxiv.org/abs/2508.15010
  • Aparajithan Venkateswaran and Anirudh Sankar and Arun G. Chandrasekhar and Tyler H. McCormick, 19 Aug 2025, Robustly estimating heterogeneity in factorial data using Rashomon Partitions, https://arxiv.org/abs/2404.02141
