Aussie AI
Positional Encoding Optimization
-
Last Updated 22 October, 2024
-
by David Spuler, Ph.D.
Positional Encoding (PE) is the algorithm by which information about the relative positions of words is encoded into the "embeddings" that are input into the AI model. The term is often used synonymously with "positional embeddings", but technically, positional encoding is the algorithm (i.e. code) used to create a vector of positional embeddings (i.e. data).
The positional encoding algorithm was an important part of the vanilla 2017 Transformer architecture, which used a sinusoidal positional encoding. Many alternative positional encoding methods have since been explored, with the aim of improving both perplexity (prediction accuracy) and computation speed. Positional encoding is not usually a major CPU bottleneck, but it can nevertheless be optimized via improved algorithms, approximations (including integer-only versions), and, surprisingly, by removing PE entirely with a "NoPE" algorithm.
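As a concrete illustration, here is a minimal Python/NumPy sketch of the sinusoidal positional encoding from the vanilla 2017 Transformer (the function name and sizes are illustrative, not from any particular library). It computes a fixed sine/cosine vector for each token position, which is added to the token embeddings; because the values depend only on position and dimension, they can be precomputed once and cached, which is one reason PE is rarely the main bottleneck.
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Vanilla Transformer PE:  PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
                                PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]                  # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions get cosine
    return pe

# Usage: add the precomputed positional embeddings to the token embeddings.
token_embeddings = np.random.randn(16, 512)                  # hypothetical (seq_len=16, d_model=512)
x = token_embeddings + sinusoidal_positional_encoding(16, 512)
```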
Research on Positional Encoding Optimizations
Research on faster positional encoding algorithms includes:
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Includes a section discussing positional embedding improvements.)
- Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c (Identifies three bottlenecks including positional embeddings.)
- Meta, August 24, 2023, Introducing Code Llama, a state-of-the-art large language model for coding, Meta Blog, https://ai.meta.com/blog/code-llama-large-language-model-coding/ (100k context window size for Meta's Code Llama model.)
- Mike Young, Sep 22, 2023, LongLoRA: A New, More Efficient Way to Fine-Tune LLMs, https://notes.aimodels.fyi/longlora-a-new-efficient-fine-tuning-of-long-context-llms/
- Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137 (Has a useful review of the positional encoding issues.)
- Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, March 2021. https://arxiv.org/abs/2006.15595
- Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, Aug 2022, Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, https://arxiv.org/abs/2104.09864
- Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge, May 2023, Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis, https://arxiv.org/abs/2212.10356
- Mingxu Tao, Yansong Feng, Dongyan Zhao, May 2023, A Frustratingly Easy Improvement for Position Embeddings via Random Padding, https://arxiv.org/abs/2305.04859
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with different positional encoding methods.)
- Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In Proceedings of ICLR. https://openreview.net/forum?id=Hke-WTVtwr, https://arxiv.org/abs/1912.12333 (Replaces positional encoding with continuous word functions.)
- Georgi Gerganov, June 2023, Extending context size via RoPE scaling #1965, llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf (Uses RoPE for long contexts in training.)
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
- Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein, 27 May 2024, Transformers Can Do Arithmetic with the Right Embeddings, https://arxiv.org/abs/2405.17399 (Positional encoding of numeric digits improves math arithmetic accuracy.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- A Haviv, O Ram, O Press, P Izsak, O Levy, 2022, Transformer language models without positional encodings still learn positional information, https://arxiv.org/abs/2203.16634
- Karim Lasri, Alessandro Lenci, Thierry Poibeau, Nov 2022, Word Order Matters when you Increase Masking, https://arxiv.org/abs/2211.04427
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
- Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui, Nov 2021, SHAPE: Shifted Absolute Position Embedding for Transformers, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2021.emnlp-main.266/ PDF: https://aclanthology.org/2021.emnlp-main.266.pdf
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
- Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer Quality in Linear Time. In Proceedings of the 39th International Conference on Machine Learning, pp. 9099–9117. PMLR, June 2022. https://proceedings.mlr.press/v162/hua22a.html
- Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking Embedding Coupling in Pre-trained Language Models. In International Conference on Learning Representations, September 2020. https://openreview.net/forum?id=xpFFI_NtgpW
- Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
- Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Pruning Positional Encoding ("NoPE")
Although positional encoding was a key component of the vanilla 2017 Transformer (Vaswani et al., 2017), some recent research suggests it can be removed entirely (Kazemnejad et al., 2023).
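As a minimal sketch of the idea (not code from the cited papers), "NoPE" simply skips the addition of positional embeddings and feeds the raw token embeddings into the first layer; in a decoder-only Transformer, the causal attention mask still allows the model to learn positional information implicitly. The table sizes below are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 1000, 64, 8                  # small illustrative sizes
embedding_table = rng.standard_normal((vocab_size, d_model))
positional_table = rng.standard_normal((seq_len, d_model))  # learned absolute positional embeddings
token_ids = rng.integers(0, vocab_size, size=seq_len)

token_embeddings = embedding_table[token_ids]               # (seq_len, d_model)

# Standard absolute positional embeddings: add a position vector to each token embedding.
x_standard = token_embeddings + positional_table[:seq_len]

# "NoPE": no positional embeddings at all; the decoder-only model relies on its
# causal attention mask to learn positional information implicitly.
x_nope = token_embeddings
```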
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers, including none, which they styled "NoPE".)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with "implicit" positional encodings.)
- Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Conditional Positional Encodings for Vision Transformers. arXiv:2102.10882 [cs.CV] https://arxiv.org/abs/2102.10882
- Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2019. R-Transformer: Recurrent Neural Network Enhanced Transformer. CoRR abs/1907.05572 (2019). arXiv:1907.05572 https://arxiv.org/abs/1907.05572, Code: https://github.com/DSE-MSU/R-transformer
RoPE (Rotary Position Embedding)
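RoPE (see the RoFormer paper by Su et al., listed above) encodes position by rotating consecutive pairs of dimensions of the query and key vectors through an angle proportional to the token position, so that the attention dot product depends only on the relative distance between tokens. The following is a minimal NumPy sketch under common assumptions (interleaved even/odd pairing, base 10000, even head dimension); it is not the implementation from any particular library.
```python
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, d_head).
    Each (even, odd) pair of dimensions is rotated by angle = pos * base^(-2i/d_head)."""
    seq_len, d_head = x.shape
    half = d_head // 2
    inv_freq = base ** (-2.0 * np.arange(half) / d_head)        # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]    # (seq_len, d_head/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]                      # split dimensions into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin                   # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate queries and keys before the attention dot product; values are untouched.
q = apply_rope(np.random.randn(16, 64))                         # hypothetical (seq_len=16, d_head=64)
k = apply_rope(np.random.randn(16, 64))
scores = q @ k.T                                                # relative-position-aware attention logits
```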
Research papers on RoPE:
- Jesus Rodriguez, Apr 22, 2024, Some Technical Notes About Llama 3: New tokenizer, optimized pretraining and some other details about Meta AI’s new model, Towards AI, https://pub.towardsai.net/some-technical-notes-about-llama-3-042c0b19db14
- Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun, 20 Mar 2024, Rotary Position Embedding for Vision Transformer, https://arxiv.org/abs/2403.13298 Code: https://github.com/naver-ai/rope-vit
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, NVIDIA Blog, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ (NVIDIA releases Nemotron-4 340B model, under an open source license, for the creation of synthetic data, with a decoder-only architecture using grouped-query attention and RoPE.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Seungrok Jung, 15 Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Papers With Code, 2024, Relative Position Encodings, https://paperswithcode.com/method/relative-position-encodings
- Hugging Face, 2024, Optimizing LLMs for Speed and Memory, https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
- Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
- Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 Code: http://github.com/linkedin/Liger-Kernel
More AI Research
Read more about: