Aussie AI

Positional Encoding Optimization

  • Last Updated 22 October, 2024
  • by David Spuler, Ph.D.

Positional Encoding (PE) is the algorithm that encodes information about the relative positions of words (i.e. where they sit in the sequence in relation to each other) into the "embeddings" that are input into the AI model. The term is often used synonymously with "positional embeddings", but technically, positional encoding is the algorithm (i.e. code) used to create a vector of positional embeddings (i.e. data).

The positional encoding algorithm was one of the important components of the vanilla 2017 Transformer architecture, which used a sinusoidal positional encoding. Various other positional encoding methods have since been tried, with the aim of optimizing both perplexity (prediction accuracy) and computation speed. Positional encoding is not usually a major CPU bottleneck, but it can nevertheless be optimized via improved algorithms, approximations (including integer-only versions), and, surprisingly, by removing PE entirely with a "NoPE" approach.
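
As a point of reference, the vanilla sinusoidal scheme is cheap to compute and depends only on the sequence length and model dimension, so it can be precomputed once and cached. The sketch below is a minimal illustration of that scheme (assuming an even model dimension), not the code of any particular framework:

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
        # Vanilla Transformer sinusoidal encoding (Vaswani et al., 2017):
        # even columns use sine, odd columns use cosine, with geometrically
        # spaced wavelengths. Assumes d_model is even.
        positions = np.arange(seq_len)[:, None]            # [seq_len, 1]
        dims = np.arange(0, d_model, 2)[None, :]           # [1, d_model/2]
        angles = positions / (base ** (dims / d_model))    # [seq_len, d_model/2]
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                                          # added to token embeddings

Since this table never changes for a given model, precomputing it (or folding it into the embedding lookup) removes the trigonometric calls from the inference path entirely.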

Research on Positional Encoding Optimizations

Research on faster position encoding algorithms includes:

  • Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Includes a section discussing positional embedding improvements.)
  • Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c (Identifies three bottlenecks including positional embeddings.)
  • Meta, August 24, 2023, Introducing Code Llama, a state-of-the-art large language model for coding, Meta Blog, https://ai.meta.com/blog/code-llama-large-language-model-coding/ (100k context window size for Meta's Code Llama model.)
  • Mike Young, Sep 22, 2023, LongLoRA: A New, More Efficient Way to Fine-Tune LLMs, https://notes.aimodels.fyi/longlora-a-new-efficient-fine-tuning-of-long-context-llms/
  • Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137 (Has a useful review of the positional encoding issues.)
  • Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, March 2021. https://arxiv.org/abs/2006.15595
  • Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, Aug 2022, Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, https://arxiv.org/abs/2104.09864
  • Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge, May 2023, Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis, https://arxiv.org/abs/2212.10356
  • Mingxu Tao, Yansong Feng, Dongyan Zhao, May 2023, A Frustratingly Easy Improvement for Position Embeddings via Random Padding, https://arxiv.org/abs/2305.04859
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with different positional encoding methods.)
  • Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In Proceedings of ICLR. https://openreview.net/forum?id=Hke-WTVtwr, https://arxiv.org/abs/1912.12333 (Replaces positional encoding with continuous word functions.)
  • Georgi Gerganov, June 2023, Extending context size via RoPE scaling #1965, Llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
  • H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf (Uses RoPE for long contexts in training.)
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
  • Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein, 27 May 2024, Transformers Can Do Arithmetic with the Right Embeddings, https://arxiv.org/abs/2405.17399 (Positional encoding of numeric digits improves math arithmetic accuracy.)
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • A Haviv, O Ram, O Press, P Izsak, O Levy, 2022, Transformer language models without positional encodings still learn positional information, https://arxiv.org/abs/2203.16634
  • Karim Lasri, Alessandro Lenci, Thierry Poibeau, Nov 2022, Word Order Matters when you Increase Masking, https://arxiv.org/abs/2211.04427
  • Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
  • Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui, Nov 2021, SHAPE: Shifted Absolute Position Embedding for Transformers, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2021.emnlp-main.266/ PDF: https://aclanthology.org/2021.emnlp-main.266.pdf
  • Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
  • Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer Quality in Linear Time. In Proceedings of the 39th International Conference on Machine Learning, pp. 9099–9117. PMLR, June 2022. https://proceedings.mlr.press/v162/hua22a.html
  • Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking Embedding Coupling in Pre-trained Language Models. In International Conference on Learning Representations, September 2020. https://openreview.net/forum?id=xpFFI_NtgpW
  • Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
  • Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
  • Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
  • Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038

Pruning Positional Encoding ("NoPE")

Whereas positional encoding was an important component of the vanilla 2017 Transformer (Vaswani et al., 2017), some recent research suggests it can be removed entirely (Kazemnejad et al., 2023); a minimal sketch of what this means in practice appears after the reference list below.

  • Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers, including none, which they styled "NoPE".)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with "implicit" positional encodings.)
  • Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Conditional Positional Encodings for Vision Transformers. arXiv:2102.10882 [cs.CV] https://arxiv.org/abs/2102.10882
  • Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2019. R-Transformer: Recurrent Neural Network Enhanced Transformer. CoRR abs/1907.05572 (2019). arXiv:1907.05572 https://arxiv.org/abs/1907.05572, Code: https://github.com/DSE-MSU/R-transformer
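
As a rough, hedged illustration (not taken from any of the papers above), the sketch below contrasts input preparation with and without positional encoding: under "NoPE", no positional term is added to the token embeddings, and the causal attention mask of a decoder-only model is left to supply position information implicitly. The function and variable names are illustrative only:

    import numpy as np

    def prepare_inputs(input_ids, token_embedding, pos_encoding=None):
        # Build the first-layer input of a decoder-only Transformer.
        # NoPE: pos_encoding is None, so the token embeddings are used directly.
        # Vanilla: a precomputed positional table is added elementwise.
        x = token_embedding[input_ids]            # [seq_len, d_model]
        if pos_encoding is not None:
            x = x + pos_encoding[: x.shape[0]]    # vanilla path
        return x                                  # NoPE path: embeddings as-is

    # Illustrative usage with random weights:
    vocab, d_model = 100, 16
    emb = np.random.randn(vocab, d_model)
    ids = np.array([3, 17, 42, 7])
    x_nope = prepare_inputs(ids, emb)             # no positional encoding at all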

RoPE (Rotary Positional Encoding)
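
RoPE, introduced in the Roformer paper (Su et al., listed above) and used in models such as PaLM, encodes position by rotating pairs of query and key vector components through position-dependent angles, so that the attention dot product depends only on the relative distance between tokens. The sketch below follows the original interleaved-pair formulation and is a hedged illustration only; many implementations instead pair dimension i with dimension i + d/2 and apply the rotation to whole query/key tensors at once:

    import numpy as np

    def rope_rotate(x, pos, base=10000.0):
        # Rotate consecutive pairs (x[2i], x[2i+1]) of a single query or key
        # vector by the angle pos * base**(-2i/d), per the Roformer paper.
        # x must be a float vector with an even dimension d.
        d = x.shape[-1]
        freqs = base ** (-2.0 * np.arange(d // 2) / d)    # theta_i
        angles = pos * freqs                              # m * theta_i
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin
        out[1::2] = x1 * sin + x2 * cos
        return out

    # The rotation is applied to queries and keys (not values) before the
    # attention dot product, so a query at position m and a key at position n
    # interact only through the relative offset (m - n).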

Research papers on RoPE include the original Roformer paper (Su et al.), the PaLM architecture paper, and the llama.cpp RoPE-scaling discussion, all listed in the sections above.

More AI Research

Read more about: