Aussie AI
Collaborative Inference
Last Updated 27 November, 2024
by David Spuler, Ph.D.
Collaborative inference is a multi-model ensemble AI optimization strategy in which two or more engines combine to perform the inference computation. There are two basic architectures:
- Multi-component partial inference
- Multi-component full inference
In multi-component partial inference, multiple sub-components contribute to a single inference computation. For example, parts of the inference computation can be spread across multiple machines or multiple GPUs, and then combined to complete the inference result. The output is a single prediction for decoding.
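The partial-inference idea can be sketched with a toy example: a single matrix-vector product is split column-wise across two "devices", and the partial results are summed to reconstruct the full computation. This is an illustrative sketch only; the function and variable names are not from any specific framework.

```python
# Multi-component partial inference sketch: split one matrix-vector
# product column-wise across two "devices", then combine the partial
# results by element-wise addition. All names are illustrative.

def matvec_partial(rows, x):
    """Compute a matrix-vector product for a slice of the weight columns."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

# Full weight matrix (2 outputs, 4 inputs) and input vector.
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 0.5, -1.0, 2.0]

# Device A holds the first two columns, device B the last two.
W_a = [row[:2] for row in W]
W_b = [row[2:] for row in W]

y_a = matvec_partial(W_a, x[:2])   # partial result from device A
y_b = matvec_partial(W_b, x[2:])   # partial result from device B

# Combine: element-wise sum reconstructs the full product.
y = [a + b for a, b in zip(y_a, y_b)]
full = matvec_partial(W, x)
assert y == full
```

Real systems apply the same column- or row-splitting idea (tensor parallelism, model partitioning) at the layer level, with network communication where the element-wise sum occurs here.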
The alternative is multi-component full inference, where multiple components (or entire models) each perform a full inference, with results combined at the end. All of the inference computations occur independently. Each model or component generates its own separate prediction of output tokens and their probabilities. A decision mechanism then analyzes the outputs of each model and decides which final token to output.
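A minimal sketch of the full-inference architecture: two independent "models" each produce a probability distribution over the next token, and a simple decision mechanism averages the distributions and emits the highest-probability token. The toy vocabulary and probability values are made up for illustration.

```python
# Multi-component full inference sketch: each model runs a complete,
# independent inference step (stubbed out here as fixed distributions),
# and a decision mechanism combines the per-token probabilities.

VOCAB = ["the", "cat", "sat"]

def decide(distributions):
    """Average per-token probabilities across models; return the best token."""
    n = len(distributions)
    avg = {tok: sum(d[tok] for d in distributions) / n for tok in VOCAB}
    return max(avg, key=avg.get)

# Stubbed outputs of two independently run models.
model_a_probs = {"the": 0.2, "cat": 0.5, "sat": 0.3}
model_b_probs = {"the": 0.1, "cat": 0.6, "sat": 0.3}

token = decide([model_a_probs, model_b_probs])
print(token)  # "cat" wins with an average probability of 0.55
```

Averaging is only one possible decision mechanism; majority voting, weighted combination, or handing the tie-break to a larger model are common variations.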
There are several variations on either of these two approaches. Particular types of collaborative inference include:
- Speculative Decoding
- Consensus-based decoding
- Mutually-guided decoding
- Big-Little Architectures
- Committee-based inference
- Ensemble Decoding
- Swarm inference (swarm decoding)
Research on Collaborative Inference (Generally)
Research papers on collaborative inference include:
- G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding (Twist decoding is a type of collaborative inference.)
- J Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (Impressive thesis includes twist decoding amid other topics.)
- Jinduo Song, Zhicheng Liu, Xiaofei Wang, Chao Qiu, Xu Chen, 2021, "Adaptive and Collaborative Edge Inference in Task Stream with Latency Constraint", ICC 2021, IEEE International Conference on Communications, pp.1-6, https://ieeexplore.ieee.org/document/9500892
- C Luo, J Chen, X Feng, J Zhang, J Li, 2023, Sustainable Collaborative Inference in Intelligent Transportation Systems, IEEE Transactions on Intelligent Transportation Systems, https://ieeexplore.ieee.org/document/10239242
- Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, Lingjia Tang, 2017, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Comput. Archit. News, vol. 52, no. 4, pp. 615–629, https://dl.acm.org/doi/10.1145/3037697.3037698
- Z. Hao, G. Xu, Y. Luo, H. Hu, J. An, and S. Mao, June 2022, “Multi-agent collaborative inference via dnn decoupling: Intermediate feature compression and edge learning,” IEEE Trans. Mob. Comput., 2022, https://arxiv.org/abs/2205.11854
- J. Kim, Y. Park, G. Kim, and S. J. Hwang, “Splitnet: Learning to semantically split deep networks for parameter reduction and model parallelization,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1866–1874. http://proceedings.mlr.press/v70/kim17b/kim17b.pdf
- Y. Kim, J. Kim, D. Chae, D. Kim, and J. Kim, “µLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization,” in Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, March 25-28, 2019, G. Candea, R. van Renesse, and C. Fetzer, Eds. ACM, 2019, pp. 45:1–45:15. https://dl.acm.org/doi/10.1145/3302424.3303950
- T. Mohammed, C. Joe-Wong, R. Babbar, and M. D. Francesco, “Distributed inference acceleration with adaptive DNN partitioning and offloading,” in 39th IEEE Conference on Computer Communications, INFOCOM 2020, Toronto, ON, Canada, July 6-9, 2020. IEEE, 2020, pp. 854–863, https://ieeexplore.ieee.org/document/9155237
- S. Yang, Z. Zhang, C. Zhao, X. Song, S. Guo, and H. Li, “CNNPC: end-edge-cloud collaborative CNN inference with joint model partition and compression,” IEEE Trans. Parallel Distributed Syst., vol. 33, no. 10, pp. 4039–4056, 2022. https://ieeexplore.ieee.org/document/9782528
- X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks, IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
- Dec 2023, Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation, Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, Ji-Rong Wen, https://arxiv.org/abs/2311.09049 Code: https://github.com/RUCAIBox/LC-Rec/
- Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk, Nov 2023, Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels, https://arxiv.org/abs/2311.18098 (Early exiting combined with collaborative inference.)
- Junho Wohn, February 2024, Optimizing Deep Learning Model Inference using Efficient Model Partitioning on Edge Devices, Thesis for the Master of Science, Graduate School of Hanyang University, https://repository.hanyang.ac.kr/handle/20.500.11754/188388, PDF: https://hanyang.dcollection.net/public_resource/pdf/200000726139_20240331200233.pdf (Compiles models using the TVM deep learning compiler and then partitions them across multiple edge devices for collaborative edge inference.)
- Nir Shlezinger; Erez Farhan; Hai Morgenstern; Yonina C. Eldar, 2021, Collaborative Inference via Ensembles on the Edge, ICASSP 2021, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://ieeexplore.ieee.org/abstract/document/9414740
- Nir Shlezinger; Ivan V. Bajić, 2022, Collaborative Inference for AI-Empowered IoT Devices, IEEE Internet of Things Magazine (Volume: 5, Issue: 4, December 2022), https://ieeexplore.ieee.org/abstract/document/10012474
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Emre Kilcioglu, March 2024, Collaborative On-device CNN Inference: Design and Optimization of Communication and Computation, Ph.D. thesis, Engineering Sciences and Technology, UCLouvain, PDF: https://dial.uclouvain.be/pr/boreal/object/boreal%3A286224/datastream/PDF_01/view
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao, June 2024, Hybrid SLM and LLM for Edge-Cloud Collaborative Inference, EdgeFM ’24, June 3–7, 2024, Minato-ku, Tokyo, Japan, https://dl.acm.org/doi/pdf/10.1145/3662006.3662067 (Small model on edge devices with large model in the cloud, performing collaborative inference.)
- Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
- Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King, 25 Jun 2024, Entropy-Based Decoding for Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2406.17519 (Enhanced decoding algorithm for multi-document RAG processing.)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
- Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
- Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou, 4 Jun 2024 (v2), Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems, https://arxiv.org/abs/2403.02419
- J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
- Akrit Mudvari, Yuang Jiang, Leandros Tassiulas, 16 Oct 2024 (v2), SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization, https://arxiv.org/abs/2410.10759
- Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
- Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang, 11 Nov 2024, WDMoE: Wireless Distributed Mixture of Experts for Large Language Models, https://arxiv.org/abs/2411.06681
- Yingxuan Yang, Qiuying Peng, Jun Wang, Weinan Zhang, 21 Nov 2024, Multi-LLM-Agent Systems: Techniques and Business Perspectives, https://arxiv.org/abs/2411.14033
Consensus Decoding
Consensus decoding is a type of collaborative inference where multiple models must form a "consensus" on the predicted output token. The idea is that two or more models perform inference independently, each predicting token probabilities, and then their results are combined to output a "best" token. Note that this differs from approaches such as speculative decoding (and other more generalized types of collaborative inference), where the models affect each other's inference while it is in progress.
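The consensus idea can be sketched as follows: each model independently predicts a token distribution, a token is accepted when a majority of models agree on the same top token, and otherwise the combined (summed) distribution breaks the tie. This is a hypothetical illustration, not a published algorithm's exact rule; the distributions are toy values.

```python
# Consensus decoding sketch: accept the token that a majority of
# independently run models agree on; otherwise fall back to the
# combined distribution. All names and values are illustrative.

from collections import Counter

def consensus_decode(distributions):
    """Return the consensus token across several independent models."""
    votes = Counter(max(d, key=d.get) for d in distributions)
    token, count = votes.most_common(1)[0]
    if count > len(distributions) // 2:
        return token                     # majority consensus reached
    # No majority: fall back to the summed per-token probabilities.
    tokens = distributions[0].keys()
    combined = {t: sum(d[t] for d in distributions) for t in tokens}
    return max(combined, key=combined.get)

# Three models, each having run a full independent inference (stubbed).
preds = [
    {"dog": 0.6, "cat": 0.4},
    {"dog": 0.3, "cat": 0.7},
    {"dog": 0.55, "cat": 0.45},
]
print(consensus_decode(preds))  # "dog": two of the three models agree
```

Because every model completes its own inference before the vote, this is a multi-component full-inference architecture, unlike speculative decoding where a draft model's in-progress output steers the larger model.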
Research papers on consensus decoding include:
- Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, Ji-Rong Wen, Dec 2023, Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation, https://arxiv.org/abs/2311.09049 Code: https://github.com/RUCAIBox/LC-Rec/
- Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk, Nov 2023, Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels, https://arxiv.org/abs/2311.18098 (Early exiting combined with collaborative inference.)
- Adam Pauls, John DeNero and Dan Klein, 2009, Consensus Training for Consensus Decoding in Machine Translation, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1418–1427, https://aclanthology.org/D09-1147.pdf
- Nir Shlezinger; Erez Farhan; Hai Morgenstern; Yonina C. Eldar, 2021, Collaborative Inference via Ensembles on the Edge, ICASSP 2021, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://ieeexplore.ieee.org/abstract/document/9414740
- Nir Shlezinger; Ivan V. Bajić, 2022, Collaborative Inference for AI-Empowered IoT Devices, IEEE Internet of Things Magazine (Volume: 5, Issue: 4, December 2022), https://ieeexplore.ieee.org/abstract/document/10012474
- Caelin Kaplan, Tareq Si Salem, Angelo Rodio, Chuan Xu, Giovanni Neglia, 7 May 2024, Federated Learning for Cooperative Inference Systems: The Case of Early Exit Networks, https://arxiv.org/abs/2405.04249
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding
- Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding
- Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
Read more about:
- Ensemble Model Architectures
- Speculative Decoding
- Inference Optimizations
- Loop Optimizations
- Code Optimizations