Aussie AI
Transformer Architectures
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
The Transformer was itself a major architectural advance in 2017. Since then, numerous modified Transformer architectures have been tested, and many ways to optimize Transformers have been found. Discussion of the major architectural changes is given below; see also Transformer code optimizations, inference optimization techniques, and a very long list of Transformer optimizations.
Global Transformer Architecture Changes
Since the introduction of the vanilla Transformer in 2017, researchers have been searching for optimizations up and down the Transformer's tech stack.
Global Transformer optimizations: Some of the architectural-level optimizations to Transformer inference engines include:
- Quantization: clearly the most popular method, with endless research papers and many practical implementations in modern toolkits; see the quantization page (a minimal sketch follows this list).
- Pruning: There is depth pruning, width pruning, and length pruning, and then some bright spark thought of combining them, so now there's dual pruning and triple pruning.
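As a concrete illustration, here is a minimal sketch of per-tensor symmetric INT8 weight quantization in Python/NumPy. It is illustrative only; production toolkits add per-channel scales, calibration, and fused integer kernels.

```python
# Minimal sketch of per-tensor symmetric INT8 weight quantization (NumPy only).
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto int8 using a single per-tensor scale."""
    scale = max(np.max(np.abs(weights)) / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights (the rounding error stays small)."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.max(np.abs(w - dequantize_int8(q, scale))))  # worst-case quantization error
```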
Depth-wise optimizations (layers):
- Layer pruning / early exit: cutting short the layers in the encoder and/or decoder is a successful optimization strategy; see layer pruning and early exit inference (a rough sketch follows this list).
- Shallow decoder architecture. This idea for modifying the Transformer's decoder architecture uses layer pruning in the decoder to achieve a "deep encoder/shallow decoder" architecture, as reported in several research papers, such as Kasai et al. (2021) and Hsu et al. (2020); see shallow decoder architectures, and also depth pruning.
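A rough sketch of the early-exit idea (not any particular paper's method): test a confidence threshold after each layer and stop as soon as it is met. The layer and classifier callables here are hypothetical stand-ins.

```python
# Rough sketch of confidence-based early exit over a toy stack of layers.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def early_exit_forward(layers, classifier, x, threshold=0.9):
    """Run layers in order, exiting as soon as the prediction is confident enough."""
    probs = None
    for i, layer in enumerate(layers):
        x = layer(x)
        probs = softmax(classifier(x))
        if probs.max() >= threshold:       # confident enough: skip the remaining layers
            return probs, i + 1
    return probs, len(layers)              # no early exit: all layers were used

# Toy usage: three random "layers" and a random classifier head.
rng = np.random.default_rng(0)
layers = [lambda h, W=rng.normal(size=(8, 8)): np.tanh(h @ W) for _ in range(3)]
classifier = lambda h, W=rng.normal(size=(8, 4)): h @ W
probs, used = early_exit_forward(layers, classifier, rng.normal(size=8))
print(used, probs)
```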
Width-wise optimizations (attention heads):
- Attention head pruning: not all of the attention heads are important, esp. in the decoder (numerous research papers; see head pruning, and the toy masking sketch after this list). There's some irony here when you consider the title of the original 2017 Transformer paper!
- Flash Attention: Of all the various attention optimizations, Flash Attention (Dao et al., June 2022), and particularly Flash Attention 2 (Dao, July 2023), seems to have emerged as the most popular. See attention optimization methods.
- Smaller Attention Head Components. It is possible to use simplified attention head components, esp. in the decoder, as in Kasai et al. (2021); see approximate attention heads and head pruning. Another method is "weight sharing" for attention heads (or "fused" heads), such as in Zhai et al. (2023).
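A toy sketch of head pruning using a 0/1 head mask; real implementations physically remove the pruned heads' weights rather than skipping them at runtime.

```python
# Toy sketch of attention head pruning via a binary head mask.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pruned_multi_head_attention(Q, K, V, head_mask):
    """Q, K, V: [heads, seq, d_head]; head_mask: [heads] array of 0s and 1s."""
    heads, seq, d_head = Q.shape
    out = np.zeros_like(Q)
    for h in range(heads):
        if head_mask[h] == 0:
            continue                                  # pruned head: no computation at all
        scores = Q[h] @ K[h].T / np.sqrt(d_head)
        out[h] = softmax(scores) @ V[h]
    return out                                        # output projection omitted for brevity

heads, seq, d_head = 8, 16, 64
Q = K = V = np.random.randn(heads, seq, d_head)
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0])             # keep half the heads
print(pruned_multi_head_attention(Q, K, V, mask).shape)
```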
Lengthwise optimizations (input sequences):
- Length pruning. Removing padding from the input vectors for short queries avoids redundant computations; for example, see ByteTransformer in Zhai et al. (2023). Read more about length pruning and zero padding byte removal (a small sketch follows this list).
- Auto-regression optimizations such as semi-autoregressive and non-autoregressive Transformer architectures.
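A small sketch of zero-padding removal for a right-padded batch, assuming pad token id 0; only the real tokens then need to be processed.

```python
# Small sketch of padding removal ("length pruning") for a right-padded batch.
import numpy as np

PAD_ID = 0
batch = np.array([
    [12,  7, 99,  0, 0, 0],   # 3 real tokens, 3 padding
    [ 5, 42,  0,  0, 0, 0],   # 2 real tokens, 4 padding
])

lengths = (batch != PAD_ID).sum(axis=1)     # true length of each sequence
trimmed = batch[:, :lengths.max()]          # drop columns that are padding for every row
packed = [row[:n] for row, n in zip(batch, lengths)]  # or remove padding entirely
print(trimmed.shape, [p.tolist() for p in packed])
```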
Too Much of a Smart Thing: Just like in Highlander, there can be only one. No, wait, that's incorrect! It's called "multi-AI" or "ensemble" AI:
- Ensemble architectures. Most architectures with two or more Transformers aim to achieve more advanced reasoning (usually at a worse speed), but the "big-small" dual architecture aims to improve inference speed by sending common queries to the smaller model (a toy routing sketch follows below). See ensemble architectures.
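A hypothetical sketch of big-little routing, with a deliberately naive difficulty heuristic and stand-in models, just to show the control flow.

```python
# Hypothetical sketch of a big-little router: easy queries go to the small model.
def route_query(query, small_model, large_model, max_easy_words=12):
    if len(query.split()) <= max_easy_words:   # toy "is this query easy?" heuristic
        return small_model(query)
    return large_model(query)

small = lambda q: f"[small model] {q}"
large = lambda q: f"[large model] {q}"
print(route_query("What is the capital of France?", small, large))
```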
Component-Level Transformer Architecture Changes
Attention heads are addressed above under width pruning, and layers under depth pruning, but various other Transformer components can also be optimized:
Normalization optimizations:
- Norm merging (operator fusion). The normalization component can often be merged with another component. This is a type of "kernel fusion" involving the LayerNorm. See "fused LayerNorm" in kernel operator fusion methods.
- Norm pruning (removal). Some research also suggests removal of normalization; see pruning normalization components.
- Norm placement. See pre-norm vs post-norm; the two orderings are contrasted in the sketch below.
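The two norm placements, contrasted in a simplified sketch (no learned scale/bias, and the sublayer is a stand-in):

```python
# Sketch contrasting post-norm (original 2017 ordering) with pre-norm residual blocks.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    return layer_norm(x + sublayer(x))      # normalize after the residual add

def pre_norm_block(x, sublayer):
    return x + sublayer(layer_norm(x))      # normalize before the sublayer

x = np.random.randn(4, 8)
ffn = lambda h: np.maximum(h, 0.0)          # trivial stand-in sublayer
print(post_norm_block(x, ffn).shape, pre_norm_block(x, ffn).shape)
```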
Activation function optimizations:
- Optimizing activations. See overview of activation function optimizations.
- Approximate activation functions. See activation function approximation methods; one example, the tanh-based GELU approximation, is sketched below.
- Fused activations. See "fused RELU" and others in kernel operator fusion methods.
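One concrete example of activation approximation is the well-known tanh-based approximation of GELU, which avoids the exact erf computation at a small accuracy cost:

```python
# Sketch of activation approximation: exact GELU versus its tanh-based approximation.
import math
import numpy as np

def gelu_exact(x):
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def gelu_tanh_approx(x):
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh_approx(x))))  # the difference is tiny
```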
Decoder algorithms:
- Faster decoding algorithms. Research in Transformers includes beam search decoding and greedy decoding. There's also aggressive decoding, speculative decoding and collaborative decoding.
- Speculative decoding (supervised dual decoding). A parallelization method in which a small model decodes ahead to generate candidate tokens, and a larger model confirms or vetoes the suggested tokens, which is basically the same plot as Terminator II. No, but, I'm just checking whether you're reading this stuff like an AI, rather than scanning and skipping like a human. If the small model is usually correct, this speeds up the overall process compared to running only the large model. This is similar to big-little architectures, but differs because both models are still running. See speculative decoding; a rough sketch follows below.
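A greedy-matching sketch of the speculative decoding control flow; the published methods verify the draft tokens with rejection sampling and a single batched forward pass of the large model, which this toy version does not attempt.

```python
# Greedy-matching sketch of speculative decoding: draft k tokens, then verify them.
def speculative_decode_step(draft_model, verify_model, prefix, k=4):
    """Both models are callables mapping a token list to the next greedy token."""
    ctx = list(prefix)
    draft = []
    for _ in range(k):                    # small model runs ahead by k tokens
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    ctx = list(prefix)
    accepted = []
    for t in draft:                       # large model confirms or vetoes each draft token
        expected = verify_model(ctx)      # (in practice: one batched forward pass)
        if expected != t:
            accepted.append(expected)     # substitute the correction and stop
            break
        accepted.append(t)
        ctx.append(t)
    return list(prefix) + accepted

# Toy usage: the "models" just count tokens, so every draft token is accepted.
draft = lambda toks: len(toks)
verify = lambda toks: len(toks)
print(speculative_decode_step(draft, verify, [101, 102], k=3))
```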
Feed-Forward Network optimizations:
- FFN Pruning. Decoders can be simplified by removing FFN components, as in Kasai et al. (2021), although the benefit may depend on the use case; see the "FFN pruning" section.
MatMul optimizations: Also known as GEMM and various other dumb names. It's matrix multiplication and vector dot products like you did in high school.
- Approximate MatMul. There is much research about using approximate multiplication algorithms.
- Matrix multiplication improvements: See matrix algebra, sparsification, and low-rank matrices; a small low-rank factorization sketch follows this list.
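A quick sketch of why low-rank factorization helps: replacing a d x d weight matrix with a product of d x r and r x d factors cuts the multiply-add count roughly from d*d to 2*d*r.

```python
# Sketch of low-rank factorization: W (d x d) replaced by A (d x r) times B (r x d).
import numpy as np

d, r = 512, 32
W = np.random.randn(d, d)
A, B = np.random.randn(d, r), np.random.randn(r, d)
x = np.random.randn(d)

full = x @ W              # ~262,144 multiply-adds
low_rank = (x @ A) @ B    # ~32,768 multiply-adds (8x fewer in this example)
print(full.shape, low_rank.shape)
```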
Softmax optimizations: Softmax occurs less frequently than MatMul, but it can still be optimized:
- Softmax approximation. The use of simplified approximate Softmax components.
- Softmax removal. See Softmax pruning.
- Softmax replacement. See Softmax alternatives and substitutes; one illustrative substitute is sketched below.
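For illustration, here is the standard numerically-stabilized Softmax alongside one simple substitute (a normalized ReLU); the substitute is purely illustrative, not a drop-in replacement.

```python
# Sketch of standard Softmax versus a cheap normalized-ReLU substitute.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max to avoid overflow
    return e / e.sum()

def relu_normalize(x, eps=1e-9):
    r = np.maximum(x, 0.0)           # no exponentials at all
    return r / (r.sum() + eps)

x = np.array([2.0, 1.0, 0.1])
print(softmax(x), relu_normalize(x))
```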
Positional encoding optimizations: Not usually considered a bottleneck, but even the PE can be optimized:
- PE Optimizations. See positional encoding optimizations; the original sinusoidal scheme is sketched after this list.
- PE Pruning (Removal). Positional encoding modules may not be as essential as assumed; see positional encoding pruning.
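As an example, here is the original sinusoidal positional encoding table (assuming an even d_model); many newer models use learned or rotary positional encodings instead.

```python
# Sketch of the original sinusoidal positional encoding table (even d_model assumed).
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                    # [seq_len, 1]
    i = np.arange(d_model // 2)[None, :]                 # [1, d_model/2]
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions get cosine
    return pe

print(sinusoidal_positional_encoding(4, 8).shape)        # (4, 8)
```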
But wait, there's more: refer to the complete list of Transformer optimizations.
Survey Papers on Transformer Architectures
Several papers have surveyed the literature for the latest Transformer ideas:
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (An extensive and useful survey of Transformer architectures.)
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey (v2). arXiv preprint arXiv:2009.06732, 2022, https://arxiv.org/abs/2009.06732
- Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun, May 2023, A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT, https://arxiv.org/abs/2302.09419
- Q Fournier, GM Caron, D Aloise, 2023, A practical survey on faster and lighter transformers, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3586074, https://arxiv.org/abs/2103.14636
- Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. SCIENCE CHINA Technological Sciences 63, 10 (2020), 1872–1897. https://doi.org/10.1007/s11431-020-1647-3, https://arxiv.org/abs/2003.08271 (Good survey of Transformer architectures in 2020.)
- Y Chang, X Wang, J Wang, Y Wu, K Zhu, 2023, A survey on evaluation of large language models, arXiv preprint, https://arxiv.org/abs/2307.03109
- N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, Y Bai, A Chen, T Conerly, et al. 2021. A mathematical framework for transformer circuits. https://transformer-circuits.pub/2021/framework/index.html (Detailed theoretical examination of how various Transformer components work.)
- W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing on edge devices.)
- J Zhong, Z Liu, X Chen, Apr 2023, Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, https://arxiv.org/abs/2304.10891
- Y Li, S Wang, H Ding, H Chen, 2023, Large Language Models in Finance: A Survey, PDF: https://www.researchgate.net/profile/Yinheng-Li/publication/374546790_Large_Language_Models_in_Finance_A_Survey/links/6523988afc5c2a0c3bc534fc/Large-Language-Models-in-Finance-A-Survey.pdf
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233
- Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria, Oct 2023, A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics, https://arxiv.org/abs/2310.05694
- Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique, 2024, Survey of Different Large Language Model Architectures: Trends, Benchmarks, and Challenges, https://www.researchgate.net/profile/Minghao_Shao2/publication/383976933_Survey_of_different_Large_Language_Model_Architectures_Trends_Benchmarks_and_Challenges/links/66e2d320f84dd1716ce79f85/Survey-of-different-Large-Language-Model-Architectures-Trends-Benchmarks-and-Challenges.pdf
Decoder-Only Architectures
Decoder-only architectures, such as GPT, are the modern form of the Transformer. It was discovered that the encoder in the original 2017 encoder-decoder Transformer was not needed for generative tasks, and was actually an inefficiency. Decoder-only models are faster and require fewer weights.
Research on the decoder-only transformer architectures:
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Jesse Roberts, 2 Feb 2024 (v3), How Powerful are Decoder-Only Transformer Neural Models? https://arxiv.org/abs/2305.17026
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel, Apr 2022, What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? https://arxiv.org/abs/2204.05832
- Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, Lukasz Kaiser, 21 May 2019, Sample Efficient Text Summarization Using a Single Pre-Trained Transformer, https://arxiv.org/abs/1905.08836
- Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer, 2018, Generating Wikipedia by Summarizing Long Sequences, In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), https://arxiv.org/abs/1801.10198
- Yumo Bai, Feb 3, 2024 Why are most LLMs decoder-only? Dive into the rabbit hole of recent advancement in Large Language Models, https://medium.com/@yumo-bai/why-are-most-llms-decoder-only-590c903e4789
- M Fujitake, 2023 DTrOCR: Decoder-only Transformer for Optical Character Recognition, https://arxiv.org/pdf/2308.15996.pdf
- Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Meta, July 23, 2024, Introducing Llama 3.1: Our most capable models to date, https://ai.meta.com/blog/meta-llama-3-1/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
Encoder-Decoder Architectures
Encoder-decoder Transformers are the older architecture from 2017. Decoder-only architectures have largely superseded this version, but it is still used in some use cases, such as machine translation (foreign language translation).
Research on encoder-decoder architectures:
- Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, Tie-Yan Liu, 2018, Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation, Advances in Neural Information Processing Systems 31 (NeurIPS 2018) https://papers.nips.cc/paper/2018/hash/4fb8a7a22a82c80f2c26fe6c1e0dcbb3-Abstract.html
- Nadeem Vidhya Mar 11, 2021, Encoders-Decoders, Sequence to Sequence Architecture, Analytics Vidhya, Medium, https://medium.com/analytics-vidhya/encoders-decoders-sequence-to-sequence-architecture-5644efbb3392
- Yumo Bai, Feb 3, 2024 Why are most LLMs decoder-only? Dive into the rabbit hole of recent advancement in Large Language Models, https://medium.com/@yumo-bai/why-are-most-llms-decoder-only-590c903e4789
- Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
Encoder-Only Architectures
Encoder-only architectures lack a decoder, and are only used where the output is not a full text sequence. This makes sense in models where the output can be an embedding vector. Most modern LLMs do not use this architecture.
Research on encoder-only architectures:
- Ting Hu, Christoph Meinel, Haojin Yang, 2024, A flexible BERT model enabling width- and depth-dynamic inference, Computer Speech & Language 4 April 2024, 101646, https://www.sciencedirect.com/science/article/pii/S0885230824000299 (Dual pruning method with layerwise "neural grafting" that gives dynamic width models, and combined with early exit on the depth dimension.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
Hybrid Transformer Architectures
Transformer architectures have been merged with aspects of previous neural network theory to create hybrid architectures. Examples include:
- Vision Transformer (ViT)
- Transformer-RNN hybrid architectures
- Transformer-CNN hybrid architectures
Research papers on hybrid transformer architectures appear in the combined list below.
Innovative New Transformer Architecture Research Papers
Since the original Transformer paper in 2017, and various other Transformer milestone papers, there have been numerous architectural variations proposed to alleviate efficiency or accuracy concerns. Research papers on specific modifications to the Transformer architecture include:
- Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018, https://arxiv.org/abs/1804.09849 (Hybrid Transformer architectures.)
- David So, Quoc Le, and Chen Liang. The evolved transformer. In International Conference on Machine Learning, pages 5877–5886. PMLR, 2019. https://arxiv.org/abs/1901.11117
- Piotr Nawrot, Szymon Tworkowski, Michael Tyrolski, Lukasz Kaiser, Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical transformers are more efficient language models. arXiv preprint arXiv:2110.13711, 2021. https://arxiv.org/abs/2110.13711
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (An advanced new architecture.)
- Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369 Code: https://github.com/jungokasai/deep-shallow
- Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv, 2019, https://arxiv.org/abs/1901.02860
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv e-prints (2020), arXiv:2004.05150. https://arxiv.org/abs/2004.05150
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal Transformers. In ICLR, https://arxiv.org/abs/1807.03819
- William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv e-prints (2021), arXiv:2101.03961. https://arxiv.org/abs/2101.03961
- Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, et al. 2020. Big Bird: Transformers for Longer Sequences. In NeurIPS, Vol. 33. 17283–17297, https://arxiv.org/abs/2007.14062
- Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. Proceedings of EMNLP, 2020, https://arxiv.org/abs/2004.05150
- Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-transformer. Proceedings of NAACL, 2019, https://arxiv.org/abs/1902.09113
- Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021. https://arxiv.org/abs/2103.14030
- Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. HAT: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187, 2020. https://arxiv.org/abs/2005.14187 Code: https://github.com/mit-han-lab/hardware-aware-transformers.git
- Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In NeurIPS. https://arxiv.org/abs/2006.03236
- A Agrawal, A Panwar, J Mohan, N Kwatra, BS Gulavani, 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, arXiv preprint, https://arxiv.org/abs/2308.16369
- NVIDIA, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
- Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a Diet. arXiv e-prints (2020), arXiv:2002.06170. https://arxiv.org/abs/2002.06170
- Aminabadi, R. Y.; Rajbhandari, S.; Zhang, M.; Awan, A. A.; Li, C.; Li, D.; Zheng, E.; Rasley, J.; Smith, S.; Ruwase, O.; and He, Y. 2022. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032. https://arxiv.org/abs/2207.00032
- Sheng, Y.; Zheng, L.; Yuan, B.; Li, Z.; Ryabinin, M.; Fu, D. Y.; Xie, Z.; Chen, B.; Barrett, C.; Gonzalez, J. E.; Liang, P.; Re, C.; Stoica, I.; and Zhang, C. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865 https://arxiv.org/abs/2303.06865, Code: https://github.com/FMInference/FlexGen
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (RWKV transformers are a hybrid RNN-Transformer that replaces QKV attention with Receptance Weighted Key Value (RWKV)).
- A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv:2004.10934 [cs, eess], Apr. 2020. https://arxiv.org/abs/2004.10934, Code: https://github.com/AlexeyAB/darknet
- Keyu An, Shiliang Zhang, Sep 2023, Exploring RWKV for Memory Efficient and Low Latency Streaming ASR, arXiv preprint arXiv:2309.14758, https://arxiv.org/pdf/2309.14758.pdf (Analysis of the RWKV Transformer-RNN hybrid architecture.)
- CC Atabansi, J Nie, H Liu, Q Song, L Yan, X Zhou, 2023, A survey of Transformer applications for histopathological image analysis: New developments and future directions, BioMedical Engineering OnLine, https://link.springer.com/article/10.1186/s12938-023-01157-0 (Massive survey of medical imaging analysis use case, including discussion of various hybrid Transformer models.)
- Jan Kocon, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocon, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radlinski, Konrad Wojtasik, Stanisław Woźniak, and Przemysław Kazienko. June 2023. ChatGPT: Jack of all trades, master of none. https://arxiv.org/abs/2302.10724 (A detailed analysis of ChatGPT, including GPT-4, in various use cases.)
- Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. March 2022. Training compute-optimal large language models. https://arxiv.org/abs/2203.15556 (This paper presents the 70B Chinchilla model.)
- Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Josh Susskind, Sep 2021, An Attention Free Transformer, https://arxiv.org/abs/2105.14103
- Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao, 2023, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 7464-7475, PDF: http://openaccess.thecvf.com/content/CVPR2023/papers/Wang_YOLOv7_Trainable_Bag-of-Freebies_Sets_New_State-of-the-Art_for_Real-Time_Object_Detectors_CVPR_2023_paper.pdf
- Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei, July 2023, Retentive Network: A Successor to Transformer for Large Language Models, https://arxiv.org/abs/2307.08621, Code: https://aka.ms/retnet (A proposed new architecture called "Retentive Network" intended to supersede Transformers; includes some analysis of KV cache memory usage, though that is not the paper's primary focus.)
- Hanting Chen, Yunhe Wang, Jianyuan Guo, and Dacheng Tao, May 2023, Vanillanet: the power of minimalism in deep learning, https://arxiv.org/abs/2305.12972, Code: https://github.com/huawei-noah/VanillaNet, Code: https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet
- Tobias Domhan. 2018. How much attention do you need? a granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1799–1808. PDF: https://aclanthology.org/P18-1167.pdf (Examines in detail the various components of the early Transformer architectures, using a pre-norm architecture, based on Tensor2Tensor.)
- Davis Yoshida, Allyson Ettinger, and Kevin Gimpel. 2020. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size. CoRR abs/2008.07027 (2020). arXiv:2008.07027 https://arxiv.org/abs/2008.07027 (Hybrid RNN-Transformer architecture.)
- Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, Oct 2023 ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564 (Recommends reinstating the simpler RELU rather than GELU or SiLU, with a focus on inference efficiency.)
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses grouped-query attention and sliding window attention for long context handling.)
- A Langedijk, H Mohebbi, G Sarti, W Zuidema, J Jumelet, Oct 2023, DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers, https://arxiv.org/abs/2310.03686, https://pure.rug.nl/ws/portalfiles/portal/799424386/2310.03686v1.pdf (Allows the decoder to cross-attend to earlier layers of the encoder, rather than only the final output layer.)
- nostalgebraist. 2020. Interpreting GPT: The logit lens. AI Alignment Forum. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (Detailed analysis of how Transformers seem to actually work.)
- J Alman, Z Song, Oct 2023, How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation, arXiv preprint arXiv:2310.04064, https://arxiv.org/abs/2310.04064 (Uses more advanced QKV attention mechanism with even more computations than vanilla Transformer.)
- Sharan Narang, Hyung Won Chung, Yi Tay, Liam Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. 2021, Do transformer modifications transfer across implementations and applications? Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, 7-11 November, 2021, pages 5758–5773. Association for Computational Linguistics, 2021. https://arxiv.org/abs/2102.11972 (Paper examines various Transformer variants.)
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Fast training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707 (Multiple submodels inside a large model.)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, M Xia, T Gao, Z Zeng, D Chen, arXiv preprint arXiv:2310.06694, Oct 2023, https://arxiv.org/pdf/2310.06694.pdf, Code: https://github.com/princeton-nlp/LLM-Shearing
- C Wang, 2023, Applied Intelligence volume 53, pages 19990–20006, HCT-Det: a hybrid CNN-transformer architecture for 3D object detection from point clouds, https://link.springer.com/article/10.1007/s10489-023-04570-z, https://www.spiedigitallibrary.org/conference-proceedings-of-spie/12799/127993I/HCT-Det--a-hybrid-CNN-transformer-architecture-for-3D/10.1117/12.3005832.short?SSO=1, Code: https://github.com/yuzh2022/HCT-Net
- P Shamsolmoali, M Zareapoor, H Zhou, X Li, Y Lu, Oct 2023, Distance-based Weighted Transformer Network for Image Completion, arXiv preprint arXiv:2310.07440, https://arxiv.org/abs/2310.07440
- S Tan, Y Shen, Z Chen, A Courville, C Gan, Oct 2023, Sparse Universal Transformer, arXiv preprint arXiv:2310.07096, https://arxiv.org/pdf/2310.07096.pdf
- H Xu, Y Song, Q Liu, J van Genabith, D Xiong, 2024, Rewiring the Transformer with Depth-Wise LSTMs, LREC-COLING 2024, pages 14122–14133, 20-25 May, 2024, https://aclanthology.org/2024.lrec-main.1231.pdf
- Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar, 10 May 2024, Linearizing Large Language Models, https://arxiv.org/abs/2405.06640 Code: https://github.com/TRI-ML/linear_open_lm
- Lu Ma, Zeang Sheng, Xunkai Li, Xinyi Gao, Zhezheng Hao, Ling Yang, Wentao Zhang, Bin Cui, 7 May 2024, Acceleration Algorithms in GNNs: A Survey, https://arxiv.org/abs/2405.04114
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Badri Narayana Patro, Vijay Srinivas Agneeswaran, 24 Apr 2024, Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges, https://arxiv.org/abs/2404.16112
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F. Wong, Shuming Shi, Zhaopeng Tu, 17 Jan 2024 (v2), Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models, https://arxiv.org/abs/2401.08350 Code: https://github.com/pangjh3/LLM4MT
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839
- Panjie Qi; Edwin Hsing-Mean Sha; Qingfeng Zhuge; Hongwu Peng; Shaoyi Hua, 2021, Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/document/9643586
- SwitchGPT: Adapting Large Language Models for Non-Text Outputs X Wang, B Zhuang, Q Wu - arXiv preprint arXiv:2309.07623, 2023, https://arxiv.org/pdf/2309.07623.pdf
- Staphord Bengesi, Hoda El-Sayed, Md Kamruzzaman Sarker, Yao Houkpati, John Irungu, Timothy Oladunni, 2023, Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers, 21 Nov 2023, https://arxiv.org/abs/2311.10242
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- Steve Yadlowsky, Lyric Doshi, Nilesh Tripuraneni, Nov 2023, Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models, https://arxiv.org/abs/2311.00871
- Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, Apr 2023, Hyena Hierarchy: Towards Larger Convolutional Language Models, https://arxiv.org/pdf/2302.10866.pdf
- Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun, May 2023, A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT, https://arxiv.org/abs/2302.09419
- Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer. arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Different Transformer architecture that includes removing attention heads and simplifies the FFN.)
- Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html https://arxiv.org/abs/2006.03236 Code: https://github.com/laiguokun/Funnel-Transformer
- Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. SCIENCE CHINA Technological Sciences 63, 10 (2020), 1872–1897. https://doi.org/10.1007/s11431-020-1647-3 https://arxiv.org/abs/2003.08271 (Good survey of Transformer architectures in 2020.)
- A Ouyang, June 2023, Understanding the Performance of Transformer Inference, Masters Thesis, Electrical Engineering and Computer Science, MIT, https://dspace.mit.edu/handle/1721.1/151543 https://dspace.mit.edu/bitstream/handle/1721.1/151543/ouyang-aouyang-meng-eecs-2023-thesis.pdf?sequence=1&isAllowed=y (Detailed analysis of Transformer performance, including the techniques of KV caching.)
- Sandeep Subramanian, Ronan Collobert, Marc’Aurelio Ranzato, and Y-Lan Boureau. Multi-scale transformer language models. arXiv preprint arXiv:2005.00581, 2020. https://arxiv.org/abs/2005.00581
- Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993, 2020. https://arxiv.org/abs/2004.12993
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019, https://arxiv.org/abs/1910.01108
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020. https://arxiv.org/abs/2004.02984
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
- Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020. https://arxiv.org/abs/2001.04246
- Forrest N Iandola, Albert E Shaw, Ravi Krishna, and Kurt W Keutzer. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? arXiv preprint arXiv:2006.11316, 2020. https://arxiv.org/abs/2006.11316
- Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà, May 2023, Accelerating Transformer Inference for Translation via Parallel Decoding, https://arxiv.org/abs/2305.10427
- Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà, 2 May 2024 (v2), A Primer on the Inner Workings of Transformer-based Language Models, https://arxiv.org/pdf/2405.00208 (Analyzes the theory of the Transformer architecture, including an interesting separation of the effects of attention versus FFNs on logits to give attributions.)
- Simeon Emanuilov, Apr 4, 2024 LLM agent operating system (AIOS) and the future of LLM-powered agents, https://medium.com/@simeon.emanuilov/llm-agent-operating-system-aios-and-the-future-of-llm-powered-agents-3d08b4e91c34 https://unfoldai.com/aios-llm-powered-agents/
- CAMERON R. WOLFE, PH.D. MAR 04, 2024, Decoder-Only Transformers: The Workhorse of Generative LLMs, https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse
- Rachel Gordon, Publication Date:March 21, 2024, AI generates high-quality images 30 times faster in a single step, MIT News, https://news.mit.edu/2024/ai-generates-high-quality-images-30-times-faster-single-step-0321 (MIT's new image generation framework called "distribution matching distillation" is faster than diffusion models.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang, 22 Mar 2024, Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference, https://arxiv.org/abs/2403.14520 Code: https://sites.google.com/view/cobravlm (Multimodal version of the new Mamba architecture.)
- Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim, 18 Jan 2024, Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, https://arxiv.org/abs/2401.08417
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 1 Dec 2023, The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678 Project: https://github.com/tding1/Efficient-LLM-Survey
- Jesse Roberts, 2 Feb 2024 (v3), How Powerful are Decoder-Only Transformer Neural Models? https://arxiv.org/abs/2305.17026
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- Yumo Bai, Feb 3, 2024 Why are most LLMs decoder-only? Dive into the rabbit hole of recent advancement in Large Language Models, https://medium.com/@yumo-bai/why-are-most-llms-decoder-only-590c903e4789
- Sergey Levine, 2023, UC Berkeley Transformers: CS W182/282A (slides), accessed 3rd Oct 2023, PDF Slides: https://cs182sp21.github.io/static/slides/lec-12.pdf
- Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
- Wang, X.; Zhang, L.L.; Wang, Y.; Yang, M. Towards Efficient Vision Transformer Inference: A First Study of Transformers on Mobile Devices. In Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications, HotMobile 2022, Orange County, CA, USA, 22–23 February 2022; pp. 1–7. http://dx.doi.org/10.1145/3508396.3512869
- Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12239–12249. http://dx.doi.org/10.1109/ICCV48922.2021.01204
- Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. arXiv 2021. http://dx.doi.org/10.48550/arXiv.2111.14330
- Li, Y.; Yuan, G.; Wen, Y.; Hu, E.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. EfficientFormer: Vision Transformers at MobileNet Speed. arXiv 2022. http://dx.doi.org/10.48550/arXiv.2206.01191
- David Spuler, March 2024, Chapter 2. Transformers & LLMs, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen, June 2023, A Survey of Large Language Models, https://arxiv.org/abs/2303.18223
- Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, 2023, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548 PDF: https://arxiv.org/pdf/2308.06548.pdf
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention is all you need, 2017, arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
- azhar, Dec 29, 2023, Decoding Mamba: The Next Big Leap in AI Sequence Modeling, https://medium.com/ai-insights-cobet/decoding-mamba-the-next-big-leap-in-ai-sequence-modeling-ef3908060cb8
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Louis-François Bouchard, Louie Peters, May 2024, Chapter 2: Architectures, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
- Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
- Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Yorick Sens, Henriette Knopp, Sven Peldszus, Thorsten Berger, 12 Aug 2024, A Large-Scale Study of Model Integration in ML-Enabled Software Systems, https://arxiv.org/abs/2408.06226
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon, 22 Aug 2024, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. arXiv preprint arXiv:2004.04037, 2020. https://arxiv.org/abs/2004.04037
- Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021, EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4814– 4823, 2021. https://aclanthology.org/2021.findings-acl.425/
- Bobby He, Thomas Hofmann, 31 May 2024 (v2), Simplifying Transformer Blocks, https://arxiv.org/abs/2311.01906 (Examines the removal of various Transformer sublayer components including skip connections, projection/value parameters, and normalization.)
- Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique, 2024, Survey of Different Large Language Model Architectures: Trends, Benchmarks, and Challenges, https://www.researchgate.net/profile/Minghao_Shao2/publication/383976933_Survey_of_different_Large_Language_Model_Architectures_Trends_Benchmarks_and_Challenges/links/66e2d320f84dd1716ce79f85/Survey-of-different-Large-Language-Model-Architectures-Trends-Benchmarks-and-Challenges.pdf
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping, 17 Sep 2024, NVLM: Open Frontier-Class Multimodal LLMs, NVIDIA, https://arxiv.org/abs/2409.11402 https://huggingface.co/nvidia/NVLM-D-72B https://nvlm-project.github.io/
- Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo, 17 Oct 2024, Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, https://arxiv.org/abs/2410.13848 https://github.com/deepseek-ai/Janus?tab=readme-ov-file
That's a Lot of BERTs!
BERT was an early (2018) Transformer architecture that was significantly innovative. Since then, there have been a great many variants of "BERT" (e.g. FastBERT, MobileBERT, DistilBERT, etc.). Research papers on variants of BERT include:
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, May 2019, https://arxiv.org/abs/1810.04805, Code: https://github.com/google-research/bert (The one BERT to rule them all.)
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020. https://arxiv.org/abs/2004.02984
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
- Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993, 2020. https://arxiv.org/abs/2004.12993
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019, https://arxiv.org/abs/1910.01108
- Forrest N Iandola, Albert E Shaw, Ravi Krishna, and Kurt W Keutzer. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? arXiv preprint arXiv:2006.11316, 2020. https://arxiv.org/abs/2006.11316
- Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020. https://arxiv.org/abs/2001.04246
- Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. arXiv preprint arXiv:2004.04037, 2020. https://arxiv.org/abs/2004.04037
- Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4814– 4823, 2021. https://aclanthology.org/2021.findings-acl.425/
- Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, 2019. https://arxiv.org/abs/1907.11692
- Z. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, and S. Yan, “Convbert: Improving BERT with span-based dynamic convolution,” in NeurIPS, 2020, https://arxiv.org/abs/2008.02496
- H. Bao, L. Dong, S. Piao, and F. Wei, “BEit: BERT pre-training of image transformers,” in International Conference on Learning Representations, 2022. https://arxiv.org/abs/2106.08254
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, Oct 2020, TinyBERT: Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
Next-Generation Architectures
What comes after Transformers? Maybe the answer is: more Transformers! Certainly, the newer multi-modal Transformers are gaining momentum, and there are other advanced Transformers:
- Vision Transformer (ViT)
- Multimodal transformer
- Ensemble architectures (multi-AI, such as MoE)
- Agent architectures (e.g., function calling, autonomous agents)
- Advanced RAG architectures
- Tool-Augmented Language Models (TALM)
- Retrieval-Augmented Language Models (RALM)
- Compound AI architectures
However, there are some alternatives to Transformers that have been gathering steam. Here are a few newer architectures already being worked on:
- State Space Models (SSMs)
- RWKV (Transformer-RNN hybrid)
- Mamba (a type of SSM)
- Graph Neural Networks and Knowledge Graph extensions
- S4 and Hyena architectures
- Spiking Neural Networks (SNNs) and Spiking Transformers
- Weightless Neural Networks (WNNs)
- Liquid Neural Networks (LNNs)
- Hybrid Transformer-RNN architectures
- Hybrid Transformer-CNN architectures
Research papers on next-gen architectures:
- Rob Toews, Sep 3, 2023, Transformers Revolutionized AI. What Will Replace Them? Forbes, https://www.forbes.com/sites/robtoews/2023/09/03/transformers-revolutionized-ai-what-will-replace-them/
- Badri Narayana Patro, Vijay Srinivas Agneeswaran, 24 Apr 2024, Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges, https://arxiv.org/abs/2404.16112
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- David Spuler, March 2024, Chapter 43. Overview of AI Research, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Johannes Schneider, 1 Aug 2024, What comes after transformers? -- A selective survey connecting ideas in deep learning, https://arxiv.org/abs/2408.00386
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
- Bobby He, Thomas Hofmann, 31 May 2024 (v2), Simplifying Transformer Blocks, https://arxiv.org/abs/2311.01906 (Examines the removal of various Transformer sublayer components including skip connections, projection/value parameters, and normalization.)
- Roy Lo, June 13, 2024, Defining AI 2.0: Beyond Generative AI, https://www.linkedin.com/pulse/defining-ai-20-beyond-generative-roy-lo-tbvie/
- Ryan McNeal, Aug 27, 2024, ChatGPT and GPT-4 could get a sweet upgrade this fall with 'strawberry', https://www.androidauthority.com/openai-strawberry-ai-3475682/
- Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou, 26 May 2024, Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers, https://arxiv.org/abs/2405.16411 (Higher-order attention using tensors to generalize QKV matrices.)
- Joanne Chen, July 23, 2024, What’s Next After Transformers, https://foundationcapital.com/whats-next-after-transformers/
- Martin_Casado, Aug 31, 2024, Tweet (State of LLMs) https://threadreaderapp.com/thread/1829905130512400775.html
- Anil Ananthaswamy, August 30, 2024, A new way to build neural networks could make AI more understandable, https://www.technologyreview.com/2024/08/30/1103385/a-new-way-to-build-neural-networks-could-make-ai-more-understandable/?tpcc=NL_Marketing (About Kolmogorov-Arnold Networks or KANs.)
- Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela, 17 Apr 2024 (v2), Generative Representational Instruction Tuning, https://arxiv.org/abs/2402.09906
- Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla, 9 Mar 2024, Algorithmic progress in language models, https://arxiv.org/abs/2403.05812
- Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy, 20 Aug 2024, Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model, https://www.arxiv.org/abs/2408.11039 (Merging Transformer architectures with diffusion in training multimodal models.)
- Cobus Greyling, Sep 2024, An AI Agent Architecture & Framework Is Emerging, https://cobusgreyling.medium.com/an-ai-agent-architecture-framework-is-emerging-addae3804f23
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping, 17 Sep 2024, NVLM: Open Frontier-Class Multimodal LLMs, NVIDIA, https://arxiv.org/abs/2409.11402 https://huggingface.co/nvidia/NVLM-D-72B https://nvlm-project.github.io/
- Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo, 17 Oct 2024, Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, https://arxiv.org/abs/2410.13848 https://github.com/deepseek-ai/Janus?tab=readme-ov-file
- Carl Franzen, October 23, 2024, OpenAI researchers develop new model that speeds up media generation by 50X, https://venturebeat.com/ai/openai-researchers-develop-new-model-that-speeds-up-media-generation-by-50x/
- Dr. Ashish Bamania, Nov 2024, XNets Are Here To Outcompete MLPs & KANs: A deep dive into XNets, a new neural network architecture that outperforms MLPs, KANs, and PINNs across various benchmarks, along with a guide to building one from scratch. https://levelup.gitconnected.com/xnets-are-here-to-outcompete-mlps-kans-3ff569819165
- Xin Li, Zhihong Xia, Hongkun Zhang, 28 Sep 2024, Cauchy activation function and XNet, https://arxiv.org/abs/2409.19221
- Felix Petersen, Hilde Kuehne, Christian Borgelt, Julian Welzel, Stefano Ermon, 7 Nov 2024, Convolutional Differentiable Logic Gate Networks, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://arxiv.org/abs/2411.04732
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
- Gil Dibner, Sep 25, 2024, Am I thinking about AI the right way? Angular Ventures, https://medium.com/angularventures/am-i-thinking-about-ai-the-right-way-4513760cd83e
RWKV Architecture
The RWKV architecture is a hybrid Transformer-RNN architecture: it replaces quadratic self-attention with a linear "time-mixing" recurrence, so per-token inference cost stays constant and overall cost grows linearly with sequence length.
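The core of RWKV is the "WKV" time-mixing operator, which can be evaluated as a recurrence over tokens instead of a quadratic attention matrix. Below is a minimal, illustrative sketch of that recurrence in Python, assuming a single channel, a naive (non-log-space) formulation, and made-up decay values; this is not the official RWKV implementation, which uses per-channel decays, token-shift mixing, and a numerically stable formulation.

```python
# Illustrative sketch only: a simplified, numerically naive version of the
# RWKV "WKV" time-mixing recurrence for a single channel.
import numpy as np

def wkv_recurrence(k, v, w=0.5, u=0.1):
    """k, v: arrays of shape (seq_len,); w: decay rate; u: bonus weight for the current token."""
    num, den = 0.0, 0.0          # running weighted sums over past tokens
    outputs = []
    for t in range(len(k)):
        # Output mixes the accumulated past state with a "bonus"-weighted current token.
        out = (num + np.exp(u + k[t]) * v[t]) / (den + np.exp(u + k[t]))
        outputs.append(out)
        # Update the recurrent state: decay the past, then add the current token.
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return np.array(outputs)

print(wkv_recurrence(np.array([0.1, 0.3, -0.2]), np.array([1.0, 2.0, 3.0])))
```

Because only the two running sums are carried forward, the state per channel is constant-size, which is what gives RWKV its RNN-like inference profile. Research papers on RWKV include: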
- Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar, 10 May 2024, Linearizing Large Language Models, https://arxiv.org/abs/2405.06640 Code: https://github.com/TRI-ML/linear_open_lm
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng, 11 Jun 2024. RWKV-CLIP: A Robust Vision-Language Representation Learner, https://arxiv.org/abs/2406.06973 Code: https://github.com/deepglint/RWKV-CLIP
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Joanne Chen, July 23, 2024, What’s Next After Transformers, https://foundationcapital.com/whats-next-after-transformers/
- Théodor Lemerle, Harrison Vanderbyl, Vaibhav Srivastav, Nicolas Obin, Axel Roebel, 30 Oct 2024, Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis, https://arxiv.org/abs/2410.23320 https://theodorblackbird.github.io/blog/demo_lina/
- Akul Datta, 5 Nov 2024, The Evolution of RWKV: Advancements in Efficient Language Modeling, https://arxiv.org/abs/2411.02795
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
State Space Models (SSMs)
- Badri Narayana Patro, Vijay Srinivas Agneeswaran, 24 Apr 2024, Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges, https://arxiv.org/abs/2404.16112
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Karan Goel, August 27, 2024, The On‑Device Intelligence Update https://cartesia.ai/blog/2024-08-27-on-device (On-device state space models.)
- Nicolas Stellwag, 2024, Structured State Space Models, https://nicolasstellwag.com/download/structured_SSMs.pdf
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
- Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah, 26 Nov 2024, Attamba: Attending To Multi-Token States, https://arxiv.org/abs/2411.17685
- Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, Yida Wang, 28 Nov 2024, Marconi: Prefix Caching for the Era of Hybrid LLMs, https://arxiv.org/abs/2411.19379 (Prefix caching applied to hybrid SSM-Transformer LLMs.)
Hyena Architecture
- Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
Mamba Architecture
The Mamba architecture is an advanced sequence model built on the State Space Model (SSM) family, using an input-dependent ("selective") state space recurrence in place of attention, which keeps per-token inference cost constant rather than growing with context length.
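The key idea in Mamba is the "selective scan": the SSM parameters that control how the hidden state is updated are themselves functions of the current input. The following is a minimal, illustrative sketch of a single-channel selective scan with randomly initialized toy projections (an assumption for demonstration only); the real Mamba implementation uses learned projections, gating, and a hardware-aware parallel scan.

```python
# Minimal sketch of a selective state space scan in the spirit of Mamba
# (illustrative only, single channel, diagonal state matrix of size N).
import numpy as np

def selective_ssm_scan(x, N=4, seed=0):
    """x: input sequence of shape (seq_len,). Returns the output sequence y."""
    rng = np.random.default_rng(seed)
    A = -np.abs(rng.standard_normal(N))           # stable continuous-time (diagonal) state matrix
    W_B, W_C, W_dt = rng.standard_normal((3, N))  # toy input-dependent projections
    h = np.zeros(N)                               # recurrent hidden state
    y = np.empty_like(x)
    for t, xt in enumerate(x):
        dt = np.log1p(np.exp(W_dt * xt))          # softplus step size (input-dependent)
        A_bar = np.exp(dt * A)                    # discretized state transition
        B_bar = dt * (W_B * xt)                   # input-dependent input matrix (simplified discretization)
        C = W_C * xt                              # input-dependent output matrix
        h = A_bar * h + B_bar * xt                # linear recurrence: O(seq_len) overall
        y[t] = C @ h
    return y

print(selective_ssm_scan(np.array([0.5, -1.0, 2.0, 0.1])))
```

Research papers on Mamba include: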
- Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar, 10 May 2024, Linearizing Large Language Models, https://arxiv.org/abs/2405.06640 Code: https://github.com/TRI-ML/linear_open_lm
- Badri Narayana Patro, Vijay Srinivas Agneeswaran, 24 Apr 2024, Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges, https://arxiv.org/abs/2404.16112
- Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu, 9 Jun 2024, Mamba YOLO: SSMs-Based YOLO For Object Detection, https://arxiv.org/abs/2406.05835
- Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung, 5 Jun 2024, Audio Mamba: Bidirectional State Space Model for Audio Representation Learning, https://arxiv.org/abs/2406.03344
- Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang, 3 Jun 2024, Dimba: Transformer-Mamba Diffusion Models, https://arxiv.org/abs/2406.01159
- Radar AI, Mar 2024, An Introduction to the Mamba LLM Architecture: A New Paradigm in Machine Learning, https://www.datacamp.com/tutorial/introduction-to-the-mamba-llm-architecture
- Albert Gu, Tri Dao, 31 May 2024 (v2), Mamba: Linear-Time Sequence Modeling with Selective State Spaces, https://arxiv.org/abs/2312.00752
- Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, Gerhard Neumann, 12 Jun 2024, MaIL: Improving Imitation Learning with Mamba, https://arxiv.org/abs/2406.08234
- Marko Vidrih, Jun 7, 2024, Mamba-2 is Out: Can it replace Transformers? https://vidrihmarko.medium.com/mamba-2-is-out-can-it-replace-transformers-6cfb3372ea39
- Albert Gu, Tri Dao, State Space Duality (Mamba-2) Part I - The Model, May 31, 2024, https://goombalab.github.io/blog/2024/mamba2-part1-model/
- azhar, Dec 29, 2023, Decoding Mamba: The Next Big Leap in AI Sequence Modeling, https://medium.com/ai-insights-cobet/decoding-mamba-the-next-big-leap-in-ai-sequence-modeling-ef3908060cb8
- Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro, June 2024, An Empirical Study of Mamba-based Language Models, https://arxiv.org/abs/2406.07887 https://ui.adsabs.harvard.edu/abs/2024arXiv240607887W/abstract
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Xin Xu, Qing Li, 2 Aug 2024, A Survey of Mamba, https://arxiv.org/abs/2408.01129
- Jingwei Zuo, Maksim Velikanov, Dhiya Eddine, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, August 12, 2024, Welcome FalconMamba: The first strong attention-free 7B model, https://huggingface.co/blog/falconmamba
- Jamba Team, 22 Aug 2024, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570
- Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao, 9 Aug 2024 (v2), ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
- Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen, 18 Nov 2024, Bi-Mamba: Towards Accurate 1-Bit State Space Models, https://arxiv.org/abs/2411.11843
Knowledge Graph AI Architectures
Knowledge graphs represent structured information as a graph of entities and relationships, typically a directed graph (often a Directed Acyclic Graph, or DAG). This additional structural information can improve LLM results, but graph-structured data is not easy to integrate into the sequential text input that an LLM expects. One particular usage of Knowledge Graphs is to extend RAG architectures, commonly called a "Graph RAG" (or "GraphRAG") architecture.
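As a toy illustration of the linearization step (hypothetical data and function names, not any particular GraphRAG library's API), the sketch below stores facts as subject-relation-object triples, retrieves the triples touching entities mentioned in a query, and flattens them into plain text that can be prepended to an LLM prompt.

```python
# Toy Graph RAG sketch: retrieve graph facts relevant to a query and
# linearize them into the sequential text an LLM expects.
TRIPLES = [
    ("Mamba", "is_a", "state space model"),
    ("Mamba", "alternative_to", "Transformer"),
    ("RWKV", "is_a", "Transformer-RNN hybrid"),
]

def retrieve_subgraph(query: str, triples=TRIPLES):
    """Return all triples whose subject or object is mentioned in the query."""
    q = query.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]

def linearize(triples) -> str:
    """Flatten the graph structure into plain sentences."""
    return " ".join(f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triples)

query = "How does Mamba differ from a Transformer?"
context = linearize(retrieve_subgraph(query))
prompt = f"Context: {context}\nQuestion: {query}"
print(prompt)
```

A full Graph RAG system would typically use entity linking and multi-hop graph traversal rather than simple substring matching, but the overall shape is the same: retrieve a subgraph, turn it into text, and feed it to the LLM alongside the question.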
Research papers on Knowledge Graphs in AI include:
- Shenzhe Zhu, 6 May 2024, Exploring knowledge graph-based neural-symbolic system from application perspective, https://arxiv.org/abs/2405.03524 (Integrate knowledge graph and symbolic reasoning into neural networks.)
- GG Klager, March 12, 2024, Is GPT fit for KGQA? Masters Thesis, Department of Information Systems & Operations Management, Vienna University of Economics and Business, https://aic.ai.wu.ac.at/~polleres/supervised_theses/Gerhard_Klager_MSc_2024.pdf
- Louis-François Bouchard, Aug 12, 2024, When to Use GraphRAG, https://louisbouchard.substack.com/p/when-to-use-graphrag
- Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta, 9 Aug 2024, HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction, https://arxiv.org/abs/2408.04948
- Dr. Ashish Bamania, Aug 2024, ‘MedGraphRAG’ Is A Complete Game Changer For AI In Medicine: A deep-dive into how RAG, GraphRAG, and MedGraphRAG work and how they significantly improve the performance of LLM responses in Medicine, https://levelup.gitconnected.com/medgraphrag-is-a-complete-game-changer-for-ai-in-medicine-c6b41b0effd6
- Junde Wu, Jiayuan Zhu, Yunli Qi, 8 Aug 2024, Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2408.04187 Code: https://github.com/MedicineToken/Medical-Graph-RAG/tree/main
- Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, Liang Zhao, 26 May 2024, GRAG: Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2405.16506
- Philip Rathle, Jul 11, 2024, The GraphRAG Manifesto: Adding Knowledge to GenAI, https://neo4j.com/blog/graphrag-manifesto/
- Microsoft, Aug 2024 (accessed), GraphRAG: A modular graph-based Retrieval-Augmented Generation (RAG) system, https://github.com/microsoft/graphrag
- Harry Li, Gabriel Appleby, Ashley Suh, 7 Jun 2024, LinkQ: An LLM-Assisted Visual Interface for Knowledge Graph Question-Answering, https://arxiv.org/abs/2406.06621
- Xuan Chen, Tong Lu, Zhichun Wang, 6 Dec 2024, LLM-Align: Utilizing Large Language Models for Entity Alignment in Knowledge Graphs, https://arxiv.org/abs/2412.04690
Graph Neural Networks
Research papers on Graph Neural Networks (GNNs):
- Zhichun Guo, April 2024, Empowering Graph Neural Networks for Real-World Tasks, Ph.D. Thesis, Computer Science and Engineering, University of Notre Dame, Indiana, https://doi.org/10.7274/25608504.v1 https://curate.nd.edu/articles/dataset/Empowering_Graph_Neural_Networks_for_Real-World_Tasks/25608504/1 PDF: https://curate.nd.edu/ndownloader/files/46035312/1
- Lu Ma, Zeang Sheng, Xunkai Li, Xinyi Gao, Zhezheng Hao, Ling Yang, Wentao Zhang, Bin Cui, 7 May 2024, Acceleration Algorithms in GNNs: A Survey, https://arxiv.org/abs/2405.04114
- Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang, 28 Jan 2024, Efficient Tuning and Inference for Large Language Models on Textual Graphs, https://arxiv.org/abs/2401.15569 (Optimizing Graph Neural Networks on textual graphs using caching and early exit inference.)
- Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024 (v2), Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization, https://arxiv.org/abs/2309.11856
- Weishu Deng, Jia Rao, 2024, Mega: More Efficient Graph Attention for GNNs, 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), Year: 2024, Pages: 71-81, DOI Bookmark: 10.1109/ICDCS60910.2024.00016, https://www.computer.org/csdl/proceedings-article/icdcs/2024/860500a071/1ZCgMaVLfRm
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Lena Sasal, Daniel Busby, Abdenour Hadid, 29 Aug 2024, TempoKGAT: A Novel Graph Attention Network Approach for Temporal Graph Analysis, https://arxiv.org/abs/2408.16391
- Xinke Jiang, Rihong Qiu, Yongxin Xu, Wentao Zhang, Yichen Zhu, Ruizhe Zhang, Yuchen Fang, Xu Chu, Junfeng Zhao, Yasha Wang, 31 Oct 2024, RAGraph: A General Retrieval-Augmented Graph Learning Framework, https://arxiv.org/abs/2410.23855
Compound AI Architectures
Compound AI architectures are a newer category that generalizes both RAG and multi-AI ensemble architectures. The general idea is that additional components can be placed around an LLM, or multiple LLM calls can be chained and combined, in a variety of ways. RAG is the best-known subcategory, along with its extensions using Knowledge Graphs.
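As one illustrative compound pattern (a sketch only; call_llm and score_answer are hypothetical placeholders rather than any real LLM API), the snippet below issues several LLM calls for the same query and uses a verifier to select the best candidate answer, in the spirit of the repeated-sampling approach in the first paper listed below.

```python
# Minimal compound AI sketch: sample multiple candidate answers, then verify.
import random

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder for a real LLM call (e.g., an HTTP request to an inference server).
    return f"answer-{random.randint(0, 3)}"

def score_answer(prompt: str, answer: str) -> float:
    # Placeholder verifier: in practice this could be a reward model, a unit test,
    # or a second LLM asked to judge the candidate answer.
    return random.random()

def compound_query(prompt: str, n_samples: int = 5) -> str:
    candidates = [call_llm(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda ans: score_answer(prompt, ans))

print(compound_query("Summarize the benefits of ensemble AI architectures."))
```

The same skeleton generalizes to other compound patterns, such as routing easy queries to a smaller model, or inserting a retrieval step before generation.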
Research on Compound AI architectures:
- Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini, 31 Jul 2024, Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, https://arxiv.org/abs/2407.21787 (Generating multiple answers by repeated inference queries, and then using a verifier to choose the best one, which is shown to greatly increase overall accuracy.)
- Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz, 31 Jul 2024, Adaptive Retrieval-Augmented Generation for Conversational Systems, https://arxiv.org/abs/2407.21712 (Deciding whether or not to include a RAG external data request in the inference of a chatbot in a multi-turn conversation.)
- Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, Ali Ghodsi, Feb 18, 2024, The Shift from Models to Compound AI Systems, https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
- Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, Matei Zaharia, 23 Jul 2024, Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design, https://www.arxiv.org/abs/2407.16831
- Sherry Ruan, Tian Zhao, 28 May 2024, JungleGPT: Designing and Optimizing Compound AI Systems for E-Commerce, https://arxiv.org/abs/2407.00038
- Cognine, 2024, Why 2024 is the Year of AI Agents and Compound AI Systems? https://cognine.com/why-2024-is-the-year-of-ai-agents-and-compound-ai-systems/
- Sean Sheng and Sherlock Xu, August 15, 2024, A Guide to Compound AI Systems, https://www.bentoml.com/blog/a-guide-to-compound-ai-systems
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- Muhammad Shahir Abdurrahman, Stanford University, Stanford, California, USA, An Efficient Network Orchestrator for Distributed Compound Language Model Systems, https://www.scs.stanford.edu/24sp-cs244b/projects/An_Efficient_Network_Orchestrator_for_Distributed_Compound_Language_Model_Systems.pdf
- Melissa Malec, June 5, 2024, AI Orchestration Explained: The What, Why & How for 2024, https://hatchworks.com/blog/gen-ai/ai-orchestration/
- Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou, 20 Jul 2024, On the Design and Analysis of LLM-Based Algorithms, https://arxiv.org/abs/2407.14788 https://github.com/modelscope/agentscope/tree/main/examples/paper_llm_based_algorithm
- Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou, 4 Jun 2024 (v2), Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems, https://arxiv.org/abs/2403.02419
- Latent Space, Nov 2024, Why Compound AI + Open Source will beat Closed AI, https://www.latent.space/p/fireworks
Research on Efficient Architectures
- Canwen Xu, 2024, Efficient Natural Language Processing for Language Models, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/uc/item/9dv1k5xv PDF: https://escholarship.org/content/qt9dv1k5xv/qt9dv1k5xv.pdf?t=sc34ay (Evaluates several acceleration methods including early-exit, PEFT, and distillation.)
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, Dec 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory Apple Research, https://arxiv.org/abs/2312.11514
- Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, and Yuxiong He. 2022b. Extreme compression for pre-trained transformers made simple and efficient. In Advances in Neural Information Processing Systems https://arxiv.org/abs/2206.01859
- Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q. V., and Adam, H. (2019). Searching for mobilenetv3. CoRR, abs/1905.02244. URL: http://arxiv.org/abs/1905.02244
- Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform_Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html.
- Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114, Long Beach, California, USA. PMLR. URL: http://proceedings.mlr.press/v97/tan19a.html
- Tan, M. and Le, Q. (2021). Efficientnetv2: Smaller models and faster training. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10096–10106. PMLR. URL: https://proceedings.mlr.press/v139/tan21a.html
- Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR, abs/1602.07360. URL: http://arxiv.org/abs/1602.07360
- Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., and Keutzer, K. (2018). Squeezenext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. URL: https://openaccess.thecvf.com/content_cvpr_2018_workshops/w33/html/Gholami_SqueezeNext_Hardware_Aware_Neural_CVPR_2018_paper.html
- Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861. URL: http://arxiv.org/abs/1704.04861
- Vivienne Sze, Yu-Hsin Chen, et al., Jun 24, 2020, Efficient Processing of Deep Neural Networks (Synthesis Lectures on Computer Architecture), Part of: Synthesis Lectures on Computer Architecture (7 books), https://www.amazon.com/Efficient-Processing-Networks-Synthesis-Architecture/dp/1681738317/
- Samsul Ariffin Abdul Karim, Oct 12, 2022, Intelligent Systems Modeling and Simulation II: Machine Learning, Neural Networks, Efficient Numerical Algorithm and Statistical Methods (Studies in Systems, Decision and Control Book 444) https://www.amazon.com/Intelligent-Systems-Modeling-Simulation-Statistical-ebook/dp/B0BJ1P94WC/
- Manpreet Singh Ghotra and Rajdeep Dua, Nov 10, 2017, Neural Network Programming with TensorFlow: Unleash the power of TensorFlow to train efficient neural networks, https://www.amazon.com/Neural-Network-Programming-TensorFlow-efficient-ebook/dp/B077DFVV43/
- Lukas Arno Jakob Cavigelli, Qiuting Huang, et al., Jul 26, 2019, Towards Energy-Efficient Convolutional Neural Network Inference, https://www.amazon.com/Towards-Energy-Efficient-Convolutional-Network-Inference/dp/3866286511/
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- H Xu, Y Song, Q Liu, J van Genabith, D Xiong, 2024, Rewiring the Transformer with Depth-Wise LSTMs, LREC-COLING 2024, pages 14122–14133, 20-25 May, 2024, https://aclanthology.org/2024.lrec-main.1231.pdf
- Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar, 10 May 2024, Linearizing Large Language Models, https://arxiv.org/abs/2405.06640 Code: https://github.com/TRI-ML/linear_open_lm
- Lu Ma, Zeang Sheng, Xunkai Li, Xinyi Gao, Zhezheng Hao, Ling Yang, Wentao Zhang, Bin Cui, 7 May 2024, Acceleration Algorithms in GNNs: A Survey, https://arxiv.org/abs/2405.04114
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Badri Narayana Patro, Vijay Srinivas Agneeswaran, 24 Apr 2024, Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges, https://arxiv.org/abs/2404.16112
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F. Wong, Shuming Shi, Zhaopeng Tu, 17 Jan 2024 (v2), Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models, https://arxiv.org/abs/2401.08350 Code: https://github.com/pangjh3/LLM4MT
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 1 Dec 2023, The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678 Project: https://github.com/tding1/Efficient-LLM-Survey
- Jesse Roberts, 2 Feb 2024 (v3), How Powerful are Decoder-Only Transformer Neural Models? https://arxiv.org/abs/2305.17026
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
- Staphord Bengesi, Hoda El-Sayed, Md Kamruzzaman Sarker, Yao Houkpati, John Irungu, Timothy Oladunni, 2023, Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers, 21 Nov 2023, https://arxiv.org/abs/2311.10242
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- Steve Yadlowsky, Lyric Doshi, Nilesh Tripuraneni, Nov 2023, Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models, https://arxiv.org/abs/2311.00871
- Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, Apr 2023, Hyena Hierarchy: Towards Larger Convolutional Language Models, https://arxiv.org/pdf/2302.10866.pdf
- Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà, 2 May 2024 (v2), A Primer on the Inner Workings of Transformer-based Language Models, https://arxiv.org/pdf/2405.00208 (Analyzes the theory of the Transformer architecture, including an interesting separation of the effects of attention versus FFNs on logits to give attributions.)
- Simeon Emanuilov, Apr 4, 2024 LLM agent operating system (AIOS) and the future of LLM-powered agents, https://medium.com/@simeon.emanuilov/llm-agent-operating-system-aios-and-the-future-of-llm-powered-agents-3d08b4e91c34 https://unfoldai.com/aios-llm-powered-agents/
- Cameron R. Wolfe, Ph.D., Mar 04, 2024, Decoder-Only Transformers: The Workhorse of Generative LLMs, https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse
- Rachel Gordon, March 21, 2024, AI generates high-quality images 30 times faster in a single step, MIT News, https://news.mit.edu/2024/ai-generates-high-quality-images-30-times-faster-single-step-0321 (MIT's new image generation framework called "distribution matching distillation" is faster than diffusion models.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang, 22 Mar 2024, Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference, https://arxiv.org/abs/2403.14520 Code: https://sites.google.com/view/cobravlm (Multimodal version of the new Mamba architecture.)
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
- Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim, 18 Jan 2024, Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, https://arxiv.org/abs/2401.08417
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- Yumo Bai, Feb 3, 2024 Why are most LLMs decoder-only? Dive into the rabbit hole of recent advancement in Large Language Models, https://medium.com/@yumo-bai/why-are-most-llms-decoder-only-590c903e4789
- Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
- David Spuler, March 2024, Chapter 2. Transformers & LLMs, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, 2023, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548 PDF: https://arxiv.org/pdf/2308.06548.pdf
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention is all you need, 2017, arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
- azhar, Dec 29, 2023, Decoding Mamba: The Next Big Leap in AI Sequence Modeling, https://medium.com/ai-insights-cobet/decoding-mamba-the-next-big-leap-in-ai-sequence-modeling-ef3908060cb8
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Louis-François Bouchard, Louie Peters, May 2024, Chapter 2: Architectures, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
- Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
- MongoDB, Jun 20, 2024, Understanding the AI Stack In the Era of Generative AI: Exploring the Layers and Components of Today’s AI Applications https://medium.com/mongodb/understanding-the-ai-stack-in-the-era-of-generative-ai-f1fcd66e1393
- Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Yorick Sens, Henriette Knopp, Sven Peldszus, Thorsten Berger, 12 Aug 2024, A Large-Scale Study of Model Integration in ML-Enabled Software Systems, https://arxiv.org/abs/2408.06226
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon, 22 Aug 2024, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique, Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges, https://www.researchgate.net/profile/Minghao_Shao2/publication/383976933_Survey_of_different_Large_Language_Model_Architectures_Trends_Benchmarks_and_Challenges/links/66e2d320f84dd1716ce79f85/Survey-of-different-Large-Language-Model-Architectures-Trends-Benchmarks-and-Challenges.pdf
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping, 17 Sep 2024, NVLM: Open Frontier-Class Multimodal LLMs, NVIDIA, https://arxiv.org/abs/2409.11402 https://huggingface.co/nvidia/NVLM-D-72B https://nvlm-project.github.io/
- Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo, 17 Oct 2024, Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, https://arxiv.org/abs/2410.13848 https://github.com/deepseek-ai/Janus?tab=readme-ov-file
- Lak Lakshmanan, Oct 4, 2024, How to Choose the Architecture for Your GenAI Application. A framework to select the simplest, fastest, cheapest architecture that will balance LLMs’ creativity and risk, https://towardsdatascience.com/how-to-choose-the-architecture-for-your-genai-application-6053e862c457
- Dr. Ashish Bamania, Nov 2024, Vision Transformers Completely Redefine How AI Perceives The Real World: A deep dive into the Vision Transformer (ViT) architecture that transformed Computer Vision and learning to build one from scratch, https://levelup.gitconnected.com/vision-transformers-completely-redefine-how-ai-perceives-the-real-world-e3a06b826760
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
- Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
More AI Research
Read more about:
- List of AI Optimizations
- Transformer Optimizations
- Inference Optimizations
- Shallow decoder architecture
- Inference Cache
- Zero-Multiplication Models
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Loop Optimizations
- Code Optimizations