
Collaborative Inference

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Collaborative Inference

Speculative Decoding

Speculative execution is the general area of Computer Science theory from which speculative decoding is derived. Various algorithms benefit from speculatively executing one pathway in parallel with another. A particular example is “branch prediction” in the hardware execution of low-level machine code.

Applying this idea to inference yields “speculative decoding”, an ensemble architecture where a small model generates candidate output tokens (i.e., “speculating” possible outputs from its decoder), and a larger model verifies whether the output of the smaller model is correct. This optimizes inference speed because it is faster for a large model to verify suggested output tokens in parallel over an already-generated sequence than to fully generate its own new tokens autoregressively. If the small model predicts poorly, the bigger model vetoes the suggested tokens and has to “backtrack”, which makes the whole process slower. However, the smaller model should be correct most of the time, and it can generate multiple speculative tokens per iteration, giving an overall speedup across all of the tokens while staying very close to the accuracy of the bigger model.
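
To make the draft-and-verify loop concrete, here is a minimal C++ sketch. The draft_next_token and target_next_token functions are hypothetical stubs standing in for the small and large models, and the greedy exact-match acceptance rule is a simplification; real implementations verify all drafted positions with a single batched forward pass of the large model, and research versions often use probabilistic acceptance rather than exact matching.

    // Sketch of a speculative decoding loop (greedy acceptance variant).
    // The model calls are hypothetical stubs, not a real inference engine.
    #include <cstdio>
    #include <vector>

    using Token = int;

    // Hypothetical stub: small "draft" model proposes the next token.
    Token draft_next_token(const std::vector<Token>& context) {
        return context.empty() ? 1 : (context.back() + 1) % 50000;  // placeholder logic
    }

    // Hypothetical stub: large "target" model's choice for the next token.
    Token target_next_token(const std::vector<Token>& context) {
        return context.empty() ? 1 : (context.back() + 1) % 50000;  // placeholder logic
    }

    std::vector<Token> speculative_decode(std::vector<Token> tokens,
                                          int max_new_tokens, int k) {
        int generated = 0;
        while (generated < max_new_tokens) {  // may overshoot by up to k-1 tokens
            // 1. Small model autoregressively drafts up to k speculative tokens.
            std::vector<Token> draft = tokens;
            std::vector<Token> proposed;
            for (int i = 0; i < k; ++i) {
                Token t = draft_next_token(draft);
                draft.push_back(t);
                proposed.push_back(t);
            }
            // 2. Large model checks each drafted position (batched/parallel in practice).
            size_t accepted = 0;
            for (; accepted < proposed.size(); ++accepted) {
                std::vector<Token> prefix = tokens;
                prefix.insert(prefix.end(), proposed.begin(), proposed.begin() + accepted);
                if (target_next_token(prefix) != proposed[accepted]) break;  // veto
            }
            // 3. Keep the accepted prefix; on a veto, substitute the big model's token.
            tokens.insert(tokens.end(), proposed.begin(), proposed.begin() + accepted);
            generated += static_cast<int>(accepted);
            if (accepted < proposed.size() && generated < max_new_tokens) {
                tokens.push_back(target_next_token(tokens));  // big model's correction
                ++generated;
            }
        }
        return tokens;
    }

    int main() {
        std::vector<Token> out = speculative_decode({1, 2, 3}, 10, 4);
        std::printf("Sequence length after decoding: %zu\n", out.size());
        return 0;
    }

Because the toy stubs always agree, every drafted token is accepted here; the speedup in a real engine comes from the large model scoring all k drafted positions in one pass instead of k sequential decoding steps.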

Speculative decoding is technically a subtype of the “big-little architecture”. Another type of big-little architecture uses a heuristic to route “easy” queries to a small model and “hard” queries to the big model. Speculative decoding differs because all queries go first to the small model and are then checked by the larger model, with the big model sometimes overriding the small model's suggestions and re-generating its own tokens.
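
As a rough illustration of that second big-little style, the sketch below routes each query with a trivial difficulty heuristic. The length threshold and keyword check are purely illustrative assumptions; a production router might instead use a trained classifier or the small model's own confidence scores.

    // Sketch of heuristic query routing in a big-little architecture.
    // The "difficulty" heuristic here is an illustrative assumption only.
    #include <iostream>
    #include <string>

    enum class ModelChoice { Small, Big };

    ModelChoice route_query(const std::string& query, std::size_t hard_length_threshold = 200) {
        bool looks_hard = query.size() > hard_length_threshold
                          || query.find("explain") != std::string::npos;  // toy "hard query" test
        return looks_hard ? ModelChoice::Big : ModelChoice::Small;
    }

    int main() {
        bool easy = (route_query("What is 2 + 2?") == ModelChoice::Small);
        std::cout << "Routed to: " << (easy ? "small model" : "big model") << "\n";
        return 0;
    }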

Research papers on speculative decoding:

  1. Leviathan, Y., Kalman, M., and Matias, Y., May 2023, Fast inference from transformers via speculative decoding, International Conference on Machine Learning (ICML), pp. 19274–19286, PMLR, 2023, https://arxiv.org/abs/2211.17192
  2. D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
  3. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, DeepMind, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
  4. Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, Sep 2023 (original Feb 2023), Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 (Separates a “fallback policy”, for when the smaller model detects it needs the bigger model, from a “rollback policy”, for when the bigger model vetoes the output and intervenes; both decide when the bigger model takes control.)
  5. Heming Xia, Tao Ge, Si-Qing Chen, Furu Wei, and Zhifang Sui, 2022, Speculative decoding: Lossless speedup of autoregressive translation, OpenReview, 2022, https://openreview.net/forum?id=H-VlwsYvVi
  6. Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487 (Not pure speculative decoding, but an analogous method.)
  7. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia, Aug 2023, SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification, arXiv preprint arXiv:2305.09781, 2023, https://arxiv.org/abs/2305.09781, Code: https://github.com/flexflow/FlexFlow/tree/inference
  8. Burton, F. W., 1985, Speculative computation, parallelism, and functional programming, IEEE Transactions on Computers, C-34(12):1190–1193, 1985, doi: 10.1109/TC.1985.6312218, https://ieeexplore.ieee.org/document/6312218 (Algorithmic theory of “speculative computation” from 1985.)
  9. Hennessy, J. L. and Patterson, D. A., 2012, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, Amsterdam, 5th edition, 2012, ISBN 978-0-12-383872-8, https://dl.acm.org/doi/book/10.5555/1999263 (Includes coverage of speculative algorithms.)
  10. T. Ge, H. Xia, X. Sun, S. Chen, and F. Wei, 2022, Lossless acceleration for seq2seq generation with aggressive decoding, ArXiv, abs/2205.10350, 2022, https://arxiv.org/abs/2205.10350, Code: https://github.com/microsoft/unilm/tree/master/decoding (The generalized aggressive decoding method has a “draft-and-verify” algorithm that is similar to speculative decoding.)
  11. M. Stern, N. Shazeer, and J. Uszkoreit, 2018, Blockwise parallel decoding for deep autoregressive models, CoRR, abs/1811.03115, 2018, https://arxiv.org/abs/1811.03115 (Generates multiple outputs in parallel and uses a scoring method to confirm them.)
  12. Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding
  13. S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early exit with a “shallow-deep module” and parallel decoding.)
  14. Kaya Y., Hong S., Dumitras T., 2019, Shallow-deep networks: Understanding and mitigating network overthinking, Proceedings of the International Conference on Machine Learning, ICML 2019, pp. 3301–3310, https://arxiv.org/abs/1810.07052 (The shallow-deep method in a single model is analogous to speculative decoding.)
  15. Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461

For research papers on speculative decoding multi-model architectures, see https://www.aussieai.com/research/speculative-decoding.

Collaborative inference is a type of multi-model inference where two or more engines combine to perform inference calculations. There are two broad goals for these architectures: smarter or faster.

One of the goals of multi-model inference is to create a smarter AI engine overall by combining the calculations of two models. Some examples with this goal include the following (a minimal voting sketch appears after the list):

  • Consensus-based decoding
  • Mutually-guided decoding
  • Committee-based inference (“wisdom of committees”)
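
As a rough sketch of the first item above, consensus-based decoding can be pictured as a majority vote over the next-token choices of several models. The lambda “models” below are hypothetical placeholders; real ensembles might combine full probability distributions rather than single-token votes.

    // Sketch of consensus-based decoding as majority voting over the
    // next-token choices of several (toy, hypothetical) models.
    #include <functional>
    #include <map>
    #include <vector>

    using Token = int;
    using Model = std::function<Token(const std::vector<Token>&)>;

    Token consensus_next_token(const std::vector<Model>& models,
                               const std::vector<Token>& context) {
        std::map<Token, int> votes;
        for (const auto& model : models) {
            ++votes[model(context)];  // each model casts one vote
        }
        Token best = -1;
        int best_count = -1;
        for (const auto& entry : votes) {  // pick the most-voted token
            if (entry.second > best_count) { best = entry.first; best_count = entry.second; }
        }
        return best;
    }

    int main() {
        std::vector<Model> models = {
            [](const std::vector<Token>& ctx) { return ctx.back() + 1; },  // toy stand-in models
            [](const std::vector<Token>& ctx) { return ctx.back() + 1; },
            [](const std::vector<Token>& ctx) { return ctx.back() + 2; },
        };
        Token next = consensus_next_token(models, {10, 11, 12});
        return (next == 13) ? 0 : 1;  // two of the three models vote for 13
    }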

Surprisingly, some of these multi-model inference algorithms are actually speedup optimizations, where faster inference is achieved by having two engines work together. The reduced latency comes from parallel calculations and from using a small model in the mix. Particular types of parallel collaborative inference include:

  • Speculative Decoding
  • Big-Little Architectures

Research papers on collaborative inference:

  1. G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
  2. Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding (Twist decoding is a type of collaborative inference.)
  3. J Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (Impressive thesis includes twist decoding amid other topics.)
  4. Jinduo Song, Zhicheng Liu, Xiaofei Wang, Chao Qiu, Xu Chen, 2021, Adaptive and Collaborative Edge Inference in Task Stream with Latency Constraint, ICC 2021, IEEE International Conference on Communications, pp.1-6, https://ieeexplore.ieee.org/document/9500892
  5. C Luo, J Chen, X Feng, J Zhang, J Li, 2023, Sustainable Collaborative Inference in Intelligent Transportation Systems, IEEE Transactions on Intelligent Transportation, https://ieeexplore.ieee.org/document/10239242
  6. Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, Lingjia Tang, 2017, Neurosurgeon: Collaborative intelligence between the cloud and mobile edge, ACM SIGARCH Comput. Archit. News, vol. 52, no. 4, pp. 615–629, https://dl.acm.org/doi/10.1145/3037697.3037698
  7. Z. Hao, G. Xu, Y. Luo, H. Hu, J. An, and S. Mao, June 2022, Multi-agent collaborative inference via dnn decoupling: Intermediate feature compression and edge learning, IEEE Trans. Mob. Comput., 2022, https://arxiv.org/abs/2205.11854
  8. J. Kim, Y. Park, G. Kim, and S. J. Hwang, 2017, Splitnet: Learning to semantically split deep networks for parameter reduction and model parallelization, in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1866–1874. http://proceedings.mlr.press/v70/kim17b/kim17b.pdf
  9. Y. Kim, J. Kim, D. Chae, D. Kim, and J. Kim, 2019, µlayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization, in Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, March 25-28, 2019, G. Candea, R. van Renesse, and C. Fetzer, Eds. ACM, 2019, pp. 45:1–45:15. https://dl.acm.org/doi/10.1145/3302424.3303950
  10. T. Mohammed, C. Joe-Wong, R. Babbar, and M. D. Francesco, 2020, Distributed inference acceleration with adaptive DNN partitioning and offloading, in 39th IEEE Conference on Computer Communications, INFOCOM 2020, Toronto, ON, Canada, July 6-9, 2020. IEEE, 2020, pp. 854–863, https://ieeexplore.ieee.org/document/9155237
  11. S. Yang, Z. Zhang, C. Zhao, X. Song, S. Guo, and H. Li, 2022, CNNPC: end-edge-cloud collaborative CNN inference with joint model partition and compression, IEEE Trans. Parallel Distributed Syst., vol. 33, no. 10, pp. 4039–4056, 2022. https://ieeexplore.ieee.org/document/9782528
  12. X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks, IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387

For research papers on collaborative inference multi-model architectures, see https://www.aussieai.com/research/collaborative.

 
