Aussie AI

Shallow Decoder Architecture

  • Last Updated 26 September, 2024
  • by David Spuler, Ph.D.

The full name of this technique is "deep encoder, shallow decoder" for an encoder-decoder Transformer. The encoder should be "deep" with many layers, but the decoder can be "shallow" with few layers.
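
As a rough illustration, here's what the layer asymmetry looks like in PyTorch, using the built-in nn.Transformer module (the layer counts are purely illustrative, not taken from any particular paper):

    # Minimal sketch (PyTorch): an encoder-decoder Transformer with a deep
    # encoder and a shallow decoder. Layer counts are illustrative only.
    import torch
    import torch.nn as nn

    model = nn.Transformer(
        d_model=512,
        nhead=8,
        num_encoder_layers=12,  # "deep" encoder
        num_decoder_layers=1,   # "shallow" decoder
        dim_feedforward=2048,
        batch_first=True,
    )

    src = torch.rand(2, 32, 512)  # (batch, source tokens, embedding)
    tgt = torch.rand(2, 16, 512)  # (batch, target tokens, embedding)
    out = model(src, tgt)         # shape: (2, 16, 512)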

Various research has found that an AI engine can afford to be shallow, but mostly in its decoder. For the encoder, it is more important to run all of its layers. What this might suggest is that, if you're an AI engine, reading is hard and writing is easy.

The shallow decoder idea is not really a hot research area anymore. It's closely related to "early exit" but requires an encoder-decoder architecture. Most modern LLMs are decoder-only, so there simply isn't an encoder. Hence, not many papers lately.

Deep Prefill in Decoder-Only Transformers

Although the original vanilla Transformer was encoder-decoder, most modern architectures are decoder-only (e.g. the GPT series, starting with GPT-2). It's a little tricky to do a "deep encoder" when there isn't an encoder.

However, decoder-only models have a "prefill" or "prompt processing" initialization phase that is very similar to an encoder phase. Hence, the generalization of this method is a "deep prefill, shallow decoder" architecture. It is trivial for the code to know whether it's in "prefill" or "decoding" mode, so we could implement a different exit policy for each phase (e.g., never exiting early during prefill).
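
Here's a hypothetical sketch of what such a phase-aware exit policy might look like in the layer loop of a decoder-only model (the helper names and the confidence test are illustrative, not from any real framework):

    # Hypothetical sketch of "deep prefill, shallow decoder" in a decoder-only
    # model: run every layer during prefill, but allow early exit while decoding.
    # The layer stack, confidence estimator, and threshold are all illustrative.

    def run_layers(layers, hidden, phase, confidence, exit_threshold=0.9):
        """Run the Transformer layer stack; exit early only when decoding."""
        for layer in layers:
            hidden = layer(hidden)
            if phase == "decode" and confidence(hidden) >= exit_threshold:
                break  # shallow decoding: skip the remaining layers
            # phase == "prefill": never exit early ("deep prefill")
        return hidden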

Although there are plenty of papers on "early exit" in decoder-only models, I haven't seen many that distinguish the prefill and decoding phases in the context of early exit (i.e., there seem to be no papers on "prefill-only early exit").

There is, however, a growing body of research on prefill in general: see research on Prefill Optimizations. One way to avoid prefill is to cache the KV values, which has in turn spawned research on making the KV cache smaller, called "KV cache compression," some of which involves depthwise optimizations (e.g., KV layer pruning and KV layer fusion). It's not quite the same, but it's in the neighborhood.

Shallow Decoder Transformer Research

Various research into "layer pruning" and "early exit" architectures has discovered that the Transformer's encoder layers are far more important than layers in the decoder. This suggested the concept of a "deep encoder, shallow decoder" architecture, where the encoder retains many layers, but the decoder has fewer, or even only a single layer. The "shallow decoder" terminology seems to have been introduced by Kasai et al. (2020), but is based on earlier research examining layer dependency in Transformers.

The shallow decoder architecture is a Transformer-specific type of layer pruning, which can be implemented as either static layer pruning (removing some layers permanently from the model) or dynamic layer pruning (skipping layers adaptively during inference execution).
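
For the static variant, pruning can be as simple as permanently truncating the decoder's layer stack of a trained model (typically followed by fine-tuning to recover accuracy). A minimal sketch, assuming the PyTorch nn.Transformer layout where the decoder layers live in an nn.ModuleList:

    # Illustrative static layer pruning for a shallow decoder: keep only the
    # first k decoder layers of a trained encoder-decoder model. Assumes the
    # PyTorch nn.Transformer attribute layout; other frameworks will differ.
    import torch.nn as nn

    def prune_decoder(model: nn.Transformer, keep_layers: int = 1) -> nn.Transformer:
        model.decoder.layers = model.decoder.layers[:keep_layers]
        model.decoder.num_layers = keep_layers
        return model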

Note that this research is also related to the papers showing that pruning attention heads in the decoder still leads to a usable Transformer (see "attention head pruning"). Some papers have even suggested that removing the Feed Forward Network (FFN) from the decoder is possible (see "FFN pruning").

Research papers on shallow decoder architectures include:

  • Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369 Code: https://github.com/jungokasai/deep-shallow
  • Bag of Tricks for Optimizing Transformer Efficiency, Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
  • Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6019– 6029. International Committee on Computational Linguistics. https://arxiv.org/abs/2011.03803v1 (This paper primarily does measurement of the importance of Transformer components.)
  • Wangchunshu Zhou, Ronan Le Bras, Yejin Choi, Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference June 2023, https://arxiv.org/abs/2306.02379 (An interesting paper that considers using two or more layers as "modules" that can be weaved into a new model somehow, which somewhat generalizes layer pruning or shallow decoder architectures.)
  • Cristóbal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto, DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference, Sep 2021, ArXiv preprint, abs/2109.11745, https://arxiv.org/abs/2109.11745
  • Antonio Valerio Miceli Barone, Jindrich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. Deep architectures for neural machine translation. In Proc. of WMT, 2017. https://arxiv.org/abs/1707.07631 (Different stacked architectures in RNNs.)
  • Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. From research to production and back: Ludicrously fast neural machine translation. In Proc. of WNGT, 2019. https://www.aclweb.org/anthology/D19-5632/, Code: https://github.com/marian-nmt/marian
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In Proc. of ICLR, 2019. https://arxiv.org/abs/1807.03819
  • Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation models. In Proc. of AAAI, 2019. https://arxiv.org/abs/1807.05353 (Examines stacking layers of Transformers, including increasing the layers with parameter sharing.)
  • Shazeer, N. M. Fast transformer decoding: One write-head is all you need. ArXiv, abs/1911.02150, 2019, https://arxiv.org/abs/1911.02150
  • Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106
  • Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li, 2022, Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders, https://arxiv.org/abs/2206.02079
  • Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder. In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648 (Investigates the depth of decoders; also concludes that the FFN can be removed from the decoder.)
  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
  • Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. Instantaneous grammatical error correction with shallow aggressive decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5937–5947, 2021. https://arxiv.org/abs/2106.04970, Code: https://github.com/AutoTemp/Shallow-Aggressive-Decoding (Aggressive decoding emits as many tokens as possible, combined with a shallow decoder architecture.)
  • J Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (Shallow decoding is only part of this wide-ranging and impressive Ph.D. thesis, by one of the early proponents of shallow decoding architectures.)
  • S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
  • Kaya Y., Hong S., Dumitras T., Shallow-deep networks: Understanding and mitigating network overthinking Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)
  • H Xia, T Ge, P Wang, SQ Chen, F Wei, 2023, Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation, https://arxiv.org/abs/2203.16487 https://aclanthology.org/2023.findings-emnlp.257.pdf Code: https://github.com/hemingkx/SpecDec (Uses a specially optimized deep-encoder shallow-decoder architecture as the drafting model.)
  • Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà, May 2023, Accelerating Transformer Inference for Translation via Parallel Decoding, https://arxiv.org/abs/2305.10427
  • Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. 2021 (updated Jan 2022). Scale efficiently: Insights from pre-training and fine-tuning transformers. ArXiv, abs/2109.10686, https://arxiv.org/abs/2109.10686
  • Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019, Learning deep transformer models for machine translation. In Proc. of ACL, 2019. https://arxiv.org/abs/1906.01787
  • Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard H. Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proc. of EMNLP, 2019. https://arxiv.org/abs/1909.02480.
  • Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2020, Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proc. of AAAI, 2020. https://arxiv.org/abs/1908.07181
  • Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen, 7 Mar 2024 (v2), ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, https://arxiv.org/abs/2403.03853
  • Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15

Shallow Decoder and KV Caching

One of the downsides of the shallow decoder approach, like all types of early exit, is that it can de-optimize KV caching. When execution exits early, the KV cache entries for the skipped layers are not computed and become out-of-date. It's a full-blown AI engine coding bug if nothing is done to repair the KV cache.

Caching the results of the KV computations is one of the earliest recognized optimizations for autoregressive decoding of output sequences. However, filling the cache requires executing all of the layers, and early exiting in a shallow decoder skips some of them. On the next decoding step, the cache is out-of-date for the unexecuted layers.

Hence, early exit saves computation but damages the KV cache, requiring extra computation to repair it. Researchers have examined this issue and found solutions including simple cache recomputation, propagating the last executed layer's KV values to the skipped layers, and avoiding the issue entirely by modifying the early exit method. For more details about this research, see: KV caching with early exit.
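
As a sketch of the propagation-style fix, the idea is to copy the last executed layer's K/V entries into the cache slots of the skipped layers, so later tokens see non-stale (if approximate) values at this position. This is illustrative pseudocode with an assumed cache layout, not code from any particular paper:

    # Hypothetical sketch: after an early exit at layer `exit_layer`, propagate
    # that layer's newly computed K/V entries into the cache slots of the
    # skipped layers, so the cache is not left stale at this token position.
    # Assumes kv_cache[layer] is a (keys, values) pair of tensors shaped
    # (batch, heads, sequence, head_dim).

    def patch_kv_cache(kv_cache, exit_layer, num_layers, position):
        last_k, last_v = kv_cache[exit_layer]
        for layer in range(exit_layer + 1, num_layers):
            k, v = kv_cache[layer]
            k[..., position, :] = last_k[..., position, :]
            v[..., position, :] = last_v[..., position, :]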
