Aussie AI

Shallow Decoder Transformer Architecture

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Shallow Decoder Transformer Architecture

Various research has found that it's fine for an AI engine to be shallow, but mostly in its decoder. For the encoder, it matters much more that all of its layers are run. What this might suggest is that, if you're an AI engine, reading is hard and writing is easy.

The discovery of the relative importance of the different layers arose from research into “layer pruning” and “early exit” architectures. The finding that a Transformer's encoder layers are far more important than its decoder layers suggested the concept of a “deep encoder, shallow decoder” architecture, where the encoder retains many layers but the decoder has far fewer, or even only a single layer. The “shallow decoder” terminology seems to have been introduced by Kasai et al. (2020), building on earlier research examining layer dependency in Transformers.
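
As a rough illustration, the deep-encoder, shallow-decoder idea is mostly just an asymmetric choice of layer counts. The minimal C++ sketch below uses hypothetical placeholder names (Tensor, EncoderLayer, DecoderLayer, DeepShallowTransformer) invented for this example, not the API of any real engine; the only point is that the encoder loop runs many layers while the decoder loop runs very few.

    // Minimal sketch of a deep-encoder, shallow-decoder Transformer stack.
    // All types here are hypothetical placeholders, not a real engine's API.
    #include <vector>

    struct Tensor { /* activations, e.g. tokens x embedding dimension */ };

    struct EncoderLayer {
        Tensor forward(const Tensor& x) const { return x; }                             // stub layer
    };
    struct DecoderLayer {
        Tensor forward(const Tensor& x, const Tensor& /*enc_out*/) const { return x; }  // stub layer
    };

    struct DeepShallowTransformer {
        std::vector<EncoderLayer> encoder;   // deep: e.g. 12 layers
        std::vector<DecoderLayer> decoder;   // shallow: e.g. only 1 layer

        DeepShallowTransformer(int enc_layers = 12, int dec_layers = 1)
            : encoder(enc_layers), decoder(dec_layers) {}

        Tensor encode(Tensor x) const {
            for (const auto& layer : encoder) x = layer.forward(x);           // every encoder layer runs
            return x;
        }
        Tensor decode_step(Tensor x, const Tensor& enc_out) const {
            for (const auto& layer : decoder) x = layer.forward(x, enc_out);  // few layers per output token
            return x;
        }
    };

Since the decoder loop is repeated for every output token, whereas the encoder runs once per input, trimming the decoder's layer count is where most of the speedup comes from.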

The shallow decoder architecture is a Transformer-specific type of layer pruning, which can be implemented as either static layer pruning (removing some layers permanently from the model) or dynamic layer pruning (skipping layers adaptively during inference execution).
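
In implementation terms, the difference is where the decoder's depth gets decided. Below is a hedged sketch reusing the placeholder Tensor and DecoderLayer types from the example above; confident_enough is a hypothetical exit test standing in for whatever adaptive criterion an early-exit scheme would use.

    // Sketch of static vs. dynamic layer pruning in the decoder loop.
    #include <cstddef>
    #include <vector>

    // Hypothetical placeholder for an adaptive exit criterion.
    static bool confident_enough(const Tensor& /*x*/) { return false; }

    // Static layer pruning: the model was built or trimmed with fewer decoder
    // layers, so the inference loop itself is unchanged.
    Tensor decode_static(const std::vector<DecoderLayer>& layers,   // already shortened
                         Tensor x, const Tensor& enc_out) {
        for (const auto& layer : layers) x = layer.forward(x, enc_out);
        return x;
    }

    // Dynamic layer pruning: every layer stays in the model, but inference
    // may skip the tail of the stack adaptively.
    Tensor decode_dynamic(const std::vector<DecoderLayer>& layers,
                          Tensor x, const Tensor& enc_out, std::size_t max_layers) {
        for (std::size_t i = 0; i < layers.size() && i < max_layers; ++i) {
            x = layers[i].forward(x, enc_out);
            if (confident_enough(x)) break;   // adaptive early exit
        }
        return x;
    }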

An interesting question about early exiting relates to decoder-only architectures. Although the original 2017 Transformer was an encoder-decoder, many modern Transformers, such as the GPT family, are decoder-only. Can the deep encoder, shallow decoder architecture be emulated in a decoder-only architecture by dynamically early-exiting at different depths in the prefill phase versus the later decoding phases? I'm not sure I've seen a research paper on that.
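
Purely as a speculative sketch of that idea, not something taken from a particular paper: a decoder-only engine could run its full layer stack while prefilling the prompt, then early-exit at a shallower depth for each autoregressively decoded token. Layer and Tensor are again hypothetical placeholder types.

    // Speculative sketch: a decoder-only model that runs "deep" during prefill
    // and "shallow" during the autoregressive decoding phase.
    #include <cstddef>
    #include <vector>

    struct Tensor { };
    struct Layer {
        Tensor forward(const Tensor& x) const { return x; }   // stub decoder-only layer
    };

    static Tensor run_layers(const std::vector<Layer>& layers, Tensor x, std::size_t depth) {
        for (std::size_t i = 0; i < layers.size() && i < depth; ++i)
            x = layers[i].forward(x);
        return x;
    }

    // Prefill ("reading" the prompt): run the full stack, like a deep encoder.
    Tensor prefill(const std::vector<Layer>& layers, Tensor prompt) {
        return run_layers(layers, prompt, layers.size());
    }

    // Decoding ("writing" each token): exit at a shallower, configurable depth.
    Tensor decode_token(const std::vector<Layer>& layers, Tensor x, std::size_t shallow_depth) {
        return run_layers(layers, x, shallow_depth);
    }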

Note that this shallow decoder research is also related to papers showing that pruning attention heads in the decoder still leaves a usable Transformer (see “attention head pruning” in Chapter 48). Some papers have even suggested that the Feed Forward Network (FFN) can be removed from the decoder entirely (see “FFN pruning” in Chapter 34). Again, there's a question as to whether pruning these components generalizes to decoder-only architectures.
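
To make that connection concrete, here is a hedged sketch of a decoder layer in which both prunings are simply configuration choices: a reduced attention head count, and a flag that drops the FFN sub-layer entirely. The Tensor, Attention, FFN and PrunedDecoderLayer types are illustrative placeholders, not code from the cited papers.

    // Sketch: a decoder layer where head pruning and FFN removal are config options.
    #include <memory>

    struct Tensor { };
    struct Attention {
        explicit Attention(int num_heads) : heads(num_heads) {}   // fewer heads = head pruning
        int heads;
        Tensor forward(const Tensor& x) const { return x; }       // stub attention
    };
    struct FFN {
        Tensor forward(const Tensor& x) const { return x; }       // stub feed-forward network
    };

    struct PrunedDecoderLayer {
        Attention attn;
        std::unique_ptr<FFN> ffn;   // null means the FFN sub-layer has been pruned away

        PrunedDecoderLayer(int num_heads, bool keep_ffn)
            : attn(num_heads), ffn(keep_ffn ? std::make_unique<FFN>() : nullptr) {}

        Tensor forward(Tensor x) const {
            x = attn.forward(x);            // attention with a pruned head count
            if (ffn) x = ffn->forward(x);   // skipped entirely if the FFN was removed
            return x;
        }
    };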

Research papers on shallow-decoder architectures:

  1. Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation, CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369 Code: https://github.com/jungokasai/deep-shallow
  2. Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, 2021, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
  3. Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components, In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6019–6029. International Committee on Computational Linguistics. https://arxiv.org/abs/2011.03803v1 (This paper primarily does measurement of the importance of Transformer components.)
  4. Wangchunshu Zhou, Ronan Le Bras, Yejin Choi, June 2023, Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference, https://arxiv.org/abs/2306.02379 (An interesting paper that considers using two or more layers as “modules” that can be weaved into a new model somehow, which somewhat generalizes layer pruning or shallow decoder architectures.)
  5. Cristóbal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto, 2021, DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference, Sep 2021, ArXiv preprint, abs/2109.11745, https://arxiv.org/abs/2109.11745
  6. Antonio Valerio Miceli Barone, Jindrich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017, Deep architectures for neural machine translation, In Proc. of WMT, 2017. https://arxiv.org/abs/1707.07631 (Different stacked architectures in RNNs.)
  7. Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019, From research to production and back: Ludicrously fast neural machine translation, In Proc. of WNGT, 2019. https://www.aclweb.org/anthology/D19-5632/, Code: https://github.com/marian-nmt/marian
  8. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2019, Universal transformers, In Proc. of ICLR, 2019. https://arxiv.org/abs/1807.03819
  9. Raj Dabre and Atsushi Fujita. 2019, Recurrent stacking of layers for compact neural machine translation models, In Proc. of AAAI, 2019. https://arxiv.org/abs/1807.05353 (Examines stacking layers of Transformers, including increasing the layers with parameter sharing.)
  10. Noam Shazeer, 2019, Fast transformer decoding: One write-head is all you need, ArXiv, abs/1911.02150, https://arxiv.org/abs/1911.02150
  11. Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang, 2021, Instantaneous grammatical error correction with shallow aggressive decoding, ArXiv, abs/2106.04970, https://arxiv.org/abs/2106.04970
  12. Ankur Bapna, Naveen Arivazhagan, and Orhan Firat, 2020, Controlling computation versus quality for neural sequence models, ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106
  13. Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li, 2022, Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders, https://arxiv.org/abs/2206.02079
  14. Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder, In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648 (Investigates the depth of decoders; also concludes that the FFN can be removed from the decoder.)
  15. Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  16. Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes “shared layers” with shared decoder FFN weights.)
  17. Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
  18. Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. 2021, Instantaneous grammatical error correction with shallow aggressive decoding, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5937–5947, 2021. https://arxiv.org/abs/2106.04970, Code: https://github.com/AutoTemp/Shallow-Aggressive-Decoding (Aggressive decoding emits as many tokens as possible, combined with a shallow decoder architecture.)
  19. Jungo Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (Shallow decoding is only part of this wide-ranging and impressive Ph.D. thesis, by one of the early proponents of shallow decoding architectures.)
  20. S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a “shallow-deep module” and parallel decoding.)
  21. Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras, 2019, Shallow-deep networks: Understanding and mitigating network overthinking, Proceedings of the International Conference on Machine Learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)

For more research on the shallow decoder architecture, refer to https://www.aussieai.com/research/shallow-decoder.

 
