Aussie AI
Positional Encoding Optimization
-
Last Updated 22 October, 2024
-
by David Spuler, Ph.D.
Positional Encoding (PE) is the algorithm by which information about the relative positions of words is encoded into the "embeddings" that are input into the AI model. The term is often used synonymously with "positional embeddings", but technically, positional encoding is the algorithm (i.e. code) used to create a vector of positional embeddings (i.e. data).
The positional encoding algorithm was an important part of the vanilla 2017 Transformer architecture, which used a sinusoidal positional encoding. Many alternative positional encoding methods have since been explored, with the aim of improving both perplexity (prediction accuracy) and computation speed. Positional encoding is not usually a major CPU bottleneck, but it can nevertheless be optimized via improved algorithms, approximations (including integer-only versions), and, surprisingly, by removing PE entirely with a "NoPE" algorithm.
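As a concrete illustration, here is a minimal Python/NumPy sketch of the sinusoidal positional encoding from the vanilla 2017 Transformer (the function name and sizes are illustrative, not from any particular library). It computes a fixed sine/cosine vector for each token position, which is added to the token embeddings; because the values depend only on position and dimension, they can be precomputed once and cached, which is one reason PE is rarely the main bottleneck.
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Vanilla Transformer PE:  PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
                                PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]                  # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions get cosine
    return pe

# Usage: add the precomputed positional embeddings to the token embeddings.
token_embeddings = np.random.randn(16, 512)                  # hypothetical (seq_len=16, d_model=512)
x = token_embeddings + sinusoidal_positional_encoding(16, 512)
```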
Research on Positional Encoding Optimizations
Research on faster positional encoding algorithms includes:
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Includes a section discussing positional embedding improvements.)
- Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c (Identifies three bottlenecks including positional embeddings.)
- Meta, August 24, 2023, Introducing Code Llama, a state-of-the-art large language model for coding, Meta Blog, https://ai.meta.com/blog/code-llama-large-language-model-coding/ (100k context window size for Meta's Code Llama model.)
- Mike Young, Sep 22, 2023, LongLoRA: A New, More Efficient Way to Fine-Tune LLMs, https://notes.aimodels.fyi/longlora-a-new-efficient-fine-tuning-of-long-context-llms/
- Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137 (Has a useful review of the positional encoding issues.)
- Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, March 2021. https://arxiv.org/abs/2006.15595
- Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, Aug 2022, Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, https://arxiv.org/abs/2104.09864
- Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge, May 2023, Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis, https://arxiv.org/abs/2212.10356
- Mingxu Tao, Yansong Feng, Dongyan Zhao, May 2023, A Frustratingly Easy Improvement for Position Embeddings via Random Padding, https://arxiv.org/abs/2305.04859
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with different positional encoding methods.)
- Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In Proceedings of ICLR. https://openreview.net/forum?id=Hke-WTVtwr, https://arxiv.org/abs/1912.12333 (Replaces positional encoding with continuous word functions.)
- Georgi Gerganov, June 2023, Extending context size via RoPE scaling #1965, llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf (Uses RoPE for long contexts in training.)
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
- Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein, 27 May 2024, Transformers Can Do Arithmetic with the Right Embeddings, https://arxiv.org/abs/2405.17399 (Positional encoding of numeric digits improves math arithmetic accuracy.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- A Haviv, O Ram, O Press, P Izsak, O Levy, 2022, Transformer language models without positional encodings still learn positional information, https://arxiv.org/abs/2203.16634
- Karim Lasri, Alessandro Lenci, Thierry Poibeau, Nov 2022, Word Order Matters when you Increase Masking, https://arxiv.org/abs/2211.04427
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
- Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui, Nov 2021, SHAPE: Shifted Absolute Position Embedding for Transformers, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2021.emnlp-main.266/ PDF: https://aclanthology.org/2021.emnlp-main.266.pdf
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
- Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer Quality in Linear Time. In Proceedings of the 39th International Conference on Machine Learning, pp. 9099–9117. PMLR, June 2022. https://proceedings.mlr.press/v162/hua22a.html
- Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking Embedding Coupling in Pre-trained Language Models. In International Conference on Learning Representations, September 2020. https://openreview.net/forum?id=xpFFI_NtgpW
- Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
- Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Pruning Positional Encoding ("NoPE")
Although positional encoding was a key component of the vanilla 2017 Transformer (Vaswani et al., 2017), some recent research suggests it can be removed entirely (Kazemnejad et al., 2023).
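As a minimal sketch of the idea (not code from the cited papers), "NoPE" simply skips the addition of positional embeddings and feeds the raw token embeddings into the first layer; in a decoder-only Transformer, the causal attention mask still allows the model to learn positional information implicitly. The table sizes below are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 1000, 64, 8                  # small illustrative sizes
embedding_table = rng.standard_normal((vocab_size, d_model))
positional_table = rng.standard_normal((seq_len, d_model))  # learned absolute positional embeddings
token_ids = rng.integers(0, vocab_size, size=seq_len)

token_embeddings = embedding_table[token_ids]               # (seq_len, d_model)

# Standard absolute positional embeddings: add a position vector to each token embedding.
x_standard = token_embeddings + positional_table[:seq_len]

# "NoPE": no positional embeddings at all; the decoder-only model relies on its
# causal attention mask to learn positional information implicitly.
x_nope = token_embeddings
```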
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers, including none, which they styled "NoPE".)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with "implicit" positional encodings.)
- Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Conditional Positional Encodings for Vision Transformers. arXiv:2102.10882 [cs.CV] https://arxiv.org/abs/2102.10882
- Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2019. R-Transformer: Recurrent Neural Network Enhanced Transformer. CoRR abs/1907.05572 (2019). arXiv:1907.05572 https://arxiv.org/abs/1907.05572, Code: https://github.com/DSE-MSU/R-transformer
RoPE (Rotary Position Embedding)
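RoPE (see the RoFormer paper by Su et al., listed above) encodes position by rotating consecutive pairs of dimensions of the query and key vectors through an angle proportional to the token position, so that the attention dot product depends only on the relative distance between tokens. The following is a minimal NumPy sketch under common assumptions (interleaved even/odd pairing, base 10000, even head dimension); it is not the implementation from any particular library.
```python
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, d_head).
    Each (even, odd) pair of dimensions is rotated by angle = pos * base^(-2i/d_head)."""
    seq_len, d_head = x.shape
    half = d_head // 2
    inv_freq = base ** (-2.0 * np.arange(half) / d_head)        # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]    # (seq_len, d_head/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]                      # split dimensions into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin                   # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate queries and keys before the attention dot product; values are untouched.
q = apply_rope(np.random.randn(16, 64))                         # hypothetical (seq_len=16, d_head=64)
k = apply_rope(np.random.randn(16, 64))
scores = q @ k.T                                                # relative-position-aware attention logits
```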
Research papers on RoPE:
- Jesus Rodriguez, Apr 22, 2024, Some Technical Notes About Llama 3: New tokenizer, optimized pretraining and some other details about Meta AI’s new model, Towards AI, https://pub.towardsai.net/some-technical-notes-about-llama-3-042c0b19db14
- Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun, 20 Mar 2024, Rotary Position Embedding for Vision Transformer, https://arxiv.org/abs/2403.13298 Code: https://github.com/naver-ai/rope-vit
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, NVIDIA Blog, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ (NVIDIA releases Nemotron-4 340B model, under an open source license, for the creation of synthetic data, with a decoder-only architecture using grouped-query attention and RoPE.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Seungrok Jung, 15 Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Papers With Code, 2024, Relative Position Encodings, https://paperswithcode.com/method/relative-position-encodings
- Hugging Face, 2024, Optimizing LLMs for Speed and Memory, https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
- Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
- Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 Code: http://github.com/linkedin/Liger-Kernel
More AI Research
Read more about: