Aussie AI
Easy-vs-Hard Queries
Last Updated 11 October, 2024
by David Spuler, Ph.D.
What is Easy-vs-Hard Query Optimization? This optimization is based on the observation in LLM theory that some queries are "easy" to answer, whereas others are "hard." The obvious optimization is to send the easy queries to a small model, and to reserve the full computation of a large model for the "hard" queries.
This idea is a type of "adaptive inference," where the model performs different computations depending on the input. Some of the ways to do this include:
- Mixture-of-experts
- Cascades and other dynamic routing methods
- Big-little architectures
- Early exiting
- Speculative decoding
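As a concrete illustration of the routing idea, below is a minimal C++ sketch of a confidence-based cascade: the small model answers first, and the query is escalated to the large model only when the small model's confidence is low. The model calls, the ModelResult structure, and the 0.8 threshold are all hypothetical placeholders for illustration, not an implementation from any particular paper; a real system would invoke actual small and large LLM inference engines and use a calibrated difficulty or confidence estimate.

#include <iostream>
#include <string>

// Hypothetical result type: generated text plus a confidence score in [0,1].
struct ModelResult {
    std::string text;
    double confidence;
};

// Placeholder stubs -- a real system would run small and large LLM inference here.
ModelResult run_small_model(const std::string& query) {
    return { "small-model answer to: " + query, 0.6 };
}

ModelResult run_large_model(const std::string& query) {
    return { "large-model answer to: " + query, 0.95 };
}

// Cascade router: try the small model first, and only escalate to the
// large model when the small model's confidence falls below the threshold.
std::string answer_query(const std::string& query, double threshold = 0.8) {
    ModelResult small = run_small_model(query);
    if (small.confidence >= threshold) {
        return small.text;               // "easy" query: small model suffices
    }
    return run_large_model(query).text;  // "hard" query: full large-model computation
}

int main() {
    std::cout << answer_query("What is the capital of France?") << std::endl;
    return 0;
}

The threshold is the main tuning knob in such a cascade: raising it sends more queries to the large model (higher accuracy, higher cost), while lowering it saves more computation at the risk of accepting weaker small-model answers.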
Research on Easy-Hard Architectures. Various papers have examined the easy-versus-hard query distinction and related optimizations:
- Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Chooses a tiny or small model after an initial layer of the larger model, combining early exit with easy-vs-hard query routing for multi-model inference.)
- Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
- M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023, SHARCS: Efficient Transformers Through Routing with Dynamic Width Sub-networks, https://arxiv.org/pdf/2310.12126.pdf (Directs queries to subnetworks with different widths.)
- Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe, 12 Jan 2024, The Unreasonable Effectiveness of Easy Training Data for Hard Tasks, https://arxiv.org/abs/2401.06751
- Avi Schwarzschild, October 2021, Easy-To-Hard, https://github.com/aks2203/easy-to-hard
- Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438, https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf (Dynamic routing to small or large LLMs based on the query.)
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024, Introducing Hard Prompts Category in Chatbot Arena, https://lmsys.org/blog/2024-05-17-category-hard/
- Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou, 4 Jun 2024 (v2), Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems, https://arxiv.org/abs/2403.02419
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang, 6 Oct 2024, EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? https://arxiv.org/abs/2410.04571