Aussie AI
Long and Ultralong Context LLMs
-
Last Updated 2 March, 2025
-
by David Spuler, Ph.D.
What are Long Context LLMs?
Long context LLMs are model architectures that can accept long input texts consisting of many tokens. Long context models are generally defined as those handling more than 100k tokens, such as the 128k window sizes that have become common in major models. The latest SOTA models that can handle over a million tokens in their context window are called "ultralong models" in recent terminology.
Context window size is the number of input tokens that a model can process. Early models, even ChatGPT, had small context windows of about 2,048 tokens. Each token is usually a part-word or a whole word, so such models could only process inputs of about 1,000-2,000 words.
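As a rough illustration of how words, tokens, and the context window relate, here is a minimal Python sketch; the 4-characters-per-token ratio and the 128k window size are illustrative assumptions, not figures for any particular model or tokenizer.
# Crude token estimate and context-window check (illustrative constants only).

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Real tokenizers (BPE, SentencePiece, etc.) vary; ~4 chars/token is a
    # common rule of thumb for English text.
    return max(1, round(len(text) / chars_per_token))

def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    return estimate_tokens(text) <= context_window

if __name__ == "__main__":
    novel = "word " * 100_000            # roughly a 100,000-word novel
    print(estimate_tokens(novel))        # ~125,000 estimated tokens
    print(fits_in_context(novel))        # True for a 128k window, but only just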
Why Long Context?
Why seek a longer input size? Or output size? Well, because a report is going to be 5,000 words, and a full-length novel is 100,000 words, or even 200,000 words if it's in the "epic sci-fi" genre. These are all tokens that must be output (for generation) or input (for editing).
Context covers a lot more than single documents, too. The input and output of an LLM may include:
- The conversational history for a chatbot
- Chunks of multiple documents in a RAG system
- Video streams of almost unfathomable sizes
- All of the PDF receipt documents on your laptop (an AI tax return app)
- An entire software repository of many code files (an AI coding assistant)
- Combinations of text and images (multimodal LLMs)
There are many reasons we might want an LLM to be able to handle a lot of input, or to create a lot of output. But the main problem has been processing this many tokens efficiently.
Newer models have been increasing the context window size. For example, GPT-4 has a 32k window size (32,000 tokens), enough to handle a small novella or novelette of maybe 15,000-20,000 words. Anthropic reportedly has a Claude model with a 100k context window. MosaicML has an open-source model called "MPT-7B-StoryWriter-65k+" with a 65,000-token window size.
Optimizing the Context Window Size
Why is there this context size limitation? One of the main bottlenecks is the "quadratic" cost of the self-attention mechanism (sketched in code below), and there are various ways to optimize attention to overcome it. However, it's not the only bottleneck. Alperovich (2023) describes the "secret sauce" for long contexts as fixing three main bottlenecks:
- Quadratic attention cost (in the input token size)
- Quadratic size of internal tensors (in the model dimension)
- Positional embedding cost
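To make the first bottleneck concrete, here is a minimal sketch of single-head attention in Python/NumPy; the n-by-n score matrix is what makes both compute and memory grow quadratically with the number of input tokens. The dimensions are toy values, not those of any real model.
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention over (n, d) arrays; the cost is dominated by the
    (n, n) score matrix, i.e. quadratic in the sequence length n."""
    n, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)                  # (n, n) -- the quadratic part
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) output

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)   # the scores alone need n*n floats of memory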
Some of the techniques relevant to allowing the processing and creation of longer token sequences include (a toy sketch of token pruning follows this list):
- Attention optimization
- Autoregression optimizations
- Tokenization algorithms
- Token pruning
- Embeddings pruning
- Length pruning
- Positional encoding optimization
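As a toy illustration of one technique on this list, the sketch below prunes tokens using the column sums of an attention matrix as an importance score; this is a simplified stand-in for the scoring heuristics used in the token pruning literature, with made-up array shapes.
import numpy as np

def prune_tokens(hidden, attn_weights, keep_ratio=0.5):
    """hidden: (n, d) token embeddings; attn_weights: (n, n) attention matrix.
    Keep only the most-attended tokens, shrinking the work in later layers."""
    importance = attn_weights.sum(axis=0)         # how much each token is attended to
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])   # top-k token indices, in sequence order
    return hidden[keep], keep

n, d = 1024, 64
hidden = np.random.randn(n, d)
attn = np.random.rand(n, n)
pruned, kept_indices = prune_tokens(hidden, attn, keep_ratio=0.25)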
Ultra-Long Context Models
Ultra-long context models are LLMs that accept a context window of more than 1M tokens. Most of the largest models do not yet accept such long contexts (128k is common), but there are a few commercial models that accept over 1M tokens in their context window.
Research papers on ultra-long context LLMs:
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Context window over 1 million tokens.)
- MiniMax, Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf
- MiniMax, Jan 2025, MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era, https://www.minimaxi.com/en/news/minimax-01-series-2
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
- Demis Hassabis, Jan 2025, X post: Announcing Gemini 2.0 Flash https://x.com/demishassabis/status/1881844417746632910 (Gemini 2.0 Flash from Google is a Large Reasoning Model with a 1M ultra-long context.)
- Manpreet Singh, Feb 2025, Goodbye RAG? Gemini 2.0 Flash Have Just Killed It! https://ai.gopubby.com/goodbye-rag-gemini-2-0-flash-have-just-killed-it-96301113c01f
- Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang, 13 Feb 2025, InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU, https://arxiv.org/abs/2502.08910 (Using dynamic token pruning and KV cache data offloading to CPU memory.)
- Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng, 26 Feb 2025, From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens, https://arxiv.org/abs/2502.18890 (Extending speculative decoding to address three bottlenecks in ultra-long context: frequent model reloading, KV cache size, and repetitive output content generation. Uses techniques such as KV cache eviction and decoding penalties to avoid repetition.) https://github.com/bigai-nlco/TokenSwift
Length Generalization: Accuracy with Longer Contexts
Speed is not the only problem with long contexts. Vanilla Transformers are also not particularly good at generalizing their results over a long context window. This ability is known as "length generalization" (or "length extrapolation"), and improving accuracy on longer inputs and longer outputs is an area of active research.
One of the methods being analyzed to improve length generalization is the "scratchpad" or "chain-of-thought" family of algorithms. The idea is that the AI inference engine emits rough summaries to an internal scratchpad at regular intervals, which are merged into subsequent inference, so that the AI helps itself keep track of its own thoughts over a longer output sequence.
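A minimal sketch of this scratchpad idea is below; the generate() and summarize() functions are placeholders standing in for whatever LLM calls an engine would actually make, so only the control flow is being illustrated.
def generate(prompt: str, max_tokens: int) -> str:
    # Placeholder: substitute a real LLM call here.
    return f"[next segment, continuing from {len(prompt)} characters of context]"

def summarize(text: str) -> str:
    # Placeholder: substitute a real LLM call here.
    return text[:80]

def long_generation(task: str, segments: int = 10, seg_tokens: int = 1000) -> str:
    scratchpad = []   # rolling summaries of everything written so far
    output = []
    for _ in range(segments):
        context = task + "\nNotes so far:\n" + "\n".join(scratchpad)
        segment = generate(context, max_tokens=seg_tokens)
        output.append(segment)
        scratchpad.append(summarize(segment))   # merged into subsequent inference
    return "".join(output)

print(long_generation("Write a long report on context windows.", segments=3))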
Research papers on "length generalization" include:
- Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur, Nov 2022, Exploring Length Generalization in Large Language Models, https://arxiv.org/abs/2207.04901
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
- Jishnu Ray Chowdhury, Cornelia Caragea, May 2023, Monotonic Location Attention for Length Generalization, https://arxiv.org/abs/2305.20019
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
- Nan Yang, Laicheng Zhong, Fan Huang, Dong Yuan, Wei Bao, Feb 2023, Random Padding Data Augmentation, https://arxiv.org/abs/2302.08682
- Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning, Oct 2020, The EOS Decision and Length Extrapolation, https://arxiv.org/abs/2010.07174
- Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness, May 2023, Randomized Positional Encodings Boost Length Generalization of Transformers, https://arxiv.org/abs/2305.16843
- Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning, Oct 2020, The EOS Decision and Length Extrapolation, https://arxiv.org/abs/2010.07174v1
- Mirelle Bueno, Carlos Gemmell, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira, Nov 2022, Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models, https://arxiv.org/abs/2208.11445 Code: https://github.com/MirelleB/induced-rationales-markup-tokens
- Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian, Feb 2023, A Study on ReLU and Softmax in Transformer, https://arxiv.org/abs/2302.06461
- David Chiang, Peter Cholak, Feb 2022, Overcoming a Theoretical Limitation of Self-Attention, https://arxiv.org/abs/2202.12172
- Amirkeivan Mohtashami, Martin Jaggi, May 2023, Landmark Attention: Random-Access Infinite Context Length for Transformers, https://arxiv.org/abs/2305.16300
- kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
- Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
- Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- MohammadReza Ebrahimi, Sunny Panchal, Roland Memisevic, 10 Aug 2024, Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers, https://arxiv.org/abs/2408.05506 (Explores how the sequential nature of token access in attention reduces the accuracy in long context analysis.)
- Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055
- David Spuler, March 2024, Length Generalization, in Generative AI in C++, https://www.aussieai.com/book/ch20-length-generalization
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Minghan Li, Eric Gaussier, Juntao Li, Guodong Zhou, 9 Nov 2024, KeyB2: Selecting Key Blocks is Also Important for Long Document Ranking with Large Language Models, https://arxiv.org/abs/2411.06254
- Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
- Yansheng Mao, Jiaqi Li, Fanxu Meng, Jing Xiong, Zilong Zheng, Muhan Zhang, 18 Dec 2024, LIFT: Improving Long Context Understanding Through Long Input Fine-Tuning, https://arxiv.org/abs/2412.13626
- Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 19 Dec 2024, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, https://arxiv.org/abs/2412.15204 https://longbench2.github.io/
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang, 20 Jan 2025, RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? https://arxiv.org/abs/2501.11284 https://huggingface.co/RedStar-Reasoning
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Industry Articles on Long Context Length
Real-world industry models have started offering longer context windows:
- Anthropic, Introducing 100K Context Windows, May 11, 2023, https://www.anthropic.com/index/100k-context-windows
- The MosaicML NLP Team, May 5, 2023, Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs, https://www.mosaicml.com/blog/mpt-7b (See long context model MPT-7B-StoryWriter-65k+)
- Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
- Carl Franzen, September 29, 2023, Meta quietly unveils Llama 2 Long AI that beats GPT-3.5 Turbo and Claude 2 on some tasks, Venture Beat, https://venturebeat.com/ai/meta-quietly-releases-llama-2-long-ai-that-outperforms-gpt-3-5-and-claude-2-on-some-tasks/ (Context windows up to 32k.)
- Reddit user pseudonerv, June 2023, A simple way to "Extending Context to 8K"?! Reddit LocalLLaMA group, https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/a_simple_way_to_extending_context_to_8k/
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou, 16 Apr 2024 (v2), Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, https://arxiv.org/abs/2404.08801 Code: https://github.com/XuezheMax/megalodon
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 10 Apr 2024, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
- Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
- Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, Dec 2023, Efficient Streaming Language Models with Attention Sinks https://arxiv.org/abs/2309.17453 Code: https://github.com/mit-han-lab/streaming-llm
- Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, Nov 2023, YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/abs/2309.00071 Code: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k
- kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers.)
- D Mamakas, P Tsotsi, I Androutsopoulos, 2022, Processing long legal documents with pre-trained transformers: Modding legalbert and longformer, https://arxiv.org/abs/2211.00974
- A Askari, S Verberne, O Alonso, S Marchesin, 2021, Combining Lexical and Neural Retrieval with Longformer-based Summarization for Effective Case Law Retrieval. DESIRES, 2021, https://desires.dei.unipd.it/2021/papers/paper-02.pdf.pdf
- Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by Summarizing Long Sequences. In Proceedings of the 6th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1801.10198
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860 Code: https://github.com/kimiyoung/transformer-xl
- Cao, Q.; Trivedi, H.; Balasubramanian, A.; and Balasubramanian, N. 2020. Deformer: Decomposing pre-trained transformers for faster question answering. arXiv preprint arXiv:2005.00697, https://arxiv.org/abs/2005.00697 Code: https://github.com/StonyBrookNLP/deformer
- 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
- Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong, 27 Feb 2024, Training-Free Long-Context Scaling of Large Language Models, https://arxiv.org/abs/2402.17463 Code: https://github.com/hkunlp/chunkllama
- Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
- Emma Järvinen, 2023, Aalto University, Finland, Master’s thesis in Mathematics and Operations Research, Long-input summarization using Large Language Models, https://aaltodoc.aalto.fi/server/api/core/bitstreams/cd47964e-5b5e-4af0-9af4-731970184358/content
- Akruti Acharya, May 25, 2023, MEGABYTE, Meta AI’s New Revolutionary Model Architecture, Explained, https://encord.com/blog/meta-ai-megabyte-model-architecture-explained/
- Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
- Jingyu Wang; Lu Zhang; Xueqing Li; Huazhong Yang; Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
- Yuzhen Mao, Martin Ester, Ke Li, 2023, IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs, NeurIPS 2023, https://neurips2023-enlsp.github.io/papers/paper_24.pdf
- Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
- Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin, 5 Jan 2024, Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, https://arxiv.org/abs/2401.02669 (Long context processing by a modification to the QKV caching mechanisms.)
- QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, https://arxiv.org/abs/2210.17114 (Intel labs paper. Low-bit quantization, distillation, and Length-Adaptive Transformer (LAT) technique. )
- Amazon, Oct 2023, MistralLite Model, https://huggingface.co/amazon/MistralLite
- DY Fu, S Arora, J Grogan, I Johnson, S Eyuboglu, Oct 2023, Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture https://arxiv.org/pdf/2310.12109.pdf
- C Hao, P Zhang, M Xie, D Zhao, 2023, Recurrent Transformers for Long Document Understanding, CCF International Conference on Natural Language, https://dl.acm.org/doi/abs/10.1007/978-3-031-44693-1_5
- Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang, Oct 2023, LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers, https://arxiv.org/abs/2310.03294v1
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- S Yang, S Zhang, M Fang, F Yang, S Liu, 2022, A hierarchical representation model based on longformer and transformer for extractive summarization https://www.mdpi.com/2079-9292/11/11/1706
- J Ding, S Ma, L Dong, X Zhang, S Huang 2023, Longnet: Scaling transformers to 1,000,000,000 tokens, https://arxiv.org/abs/2307.02486
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, 8 Mar 2024 (v3), LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 Code: https://github.com/dvlab-research/LongLoRA
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- Emilia David, April 30, 2024, ChatGPT’s AI ‘memory’ can remember the preferences of paying customers, The Verge, https://www.theverge.com/2024/4/29/24144680/chatgpt-plus-memory-chatbot-subscription-details-preferences-personal-assistant
- Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- J. Su. 2023, Rectified rotary position embeddings. https://github.com/bojone/rerope
- B Peng, J Quesnelle, H Fan, E Shippole, 2023 YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/pdf/2309.00071.pdf
- Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055 Code: https://github.com/THUDM/LongWriter
- Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
- Magic Team, August 29, 2024, 100M Token Context Windows: Research update on ultra-long context models, our partnership with Google Cloud, and new funding, https://magic.dev/blog/100m-token-context-windows
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 1 Dec 2023 (v3), HyperAttention: Long-context Attention in Near-Linear Time, https://arxiv.org/abs/2310.05869
- Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
- KVCache.AI, 2024, K Transformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations, https://github.com/kvcache-ai/ktransformers
- Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao, 9 Aug 2024 (v2), ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
- David Spuler, March 2024, Long Context Research, in Generative AI in C++, https://www.aussieai.com/book/ch20-long-context-research
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Michael Nuñez, September 4, 2024, 500,000 tokens: How Anthropic’s Claude Enterprise is pushing AI boundaries, https://venturebeat.com/ai/500000-tokens-how-anthropics-claude-enterprise-is-pushing-ai-boundaries/
- Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li, 4 Sep 2024, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA, https://arxiv.org/abs/2409.02897
- Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
- Asif Razzaq, September 5, 2024, Yi-Coder Released by 01.AI: A Powerful Small-Scale Code LLM Series, Delivering Exceptional Performance in Code Generation, Editing, and Long-Context Comprehension, https://www.marktechpost.com/2024/09/05/yi-coder-released-by-01-ai-a-powerful-small-scale-code-llm-series-delivering-exceptional-performance-in-code-generation-editing-and-long-context-comprehension/
- Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin, 16 Apr 2024, Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, https://arxiv.org/abs/2404.10308 https://github.com/alinlab/HOMER
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
- Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao, 3 Sep 2024, You Only Use Reactive Attention Slice For Long Context Retrieval, https://arxiv.org/abs/2409.13695
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
- Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
- Anonymous authors, Oct 2024, LooongLlava: Scaling Multi-Modal LLMs to 1000 Images Efficiently Via a Hybrid Architecture, https://openreview.net/pdf?id=wqA7QmpUwa
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
- Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
- Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu, 12 Dec 2024, V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding, https://arxiv.org/abs/2412.09616 https://github.com/OpenGVLab/V2PE
- Jérôme DIAZ, Dec 2024, Why Retrieval-Augmented Generation Is Still Relevant in the Era of Long-Context Language Models. In this article we will explore why 128K tokens (and more) models can’t fully replace using RAG. https://towardsdatascience.com/why-retrieval-augmented-generation-is-still-relevant-in-the-era-of-long-context-language-models-e36f509abac5
- Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky, 17 Oct 2024 (v2), Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, https://arxiv.org/abs/2407.16833
- Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 20 Nov 2023 (v3), Lost in the Middle: How Language Models Use Long Contexts, https://arxiv.org/abs/2307.03172 (Information is best placed at the start, or otherwise at the end, of a long context.)
- Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu, 13 Dec 2024, SCBench: A KV Cache-Centric Analysis of Long-Context Methods, https://arxiv.org/abs/2412.10319 https://aka.ms/SCBench
- Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan, 12 Dec 2024, VCA: Video Curious Agent for Long Video Understanding, https://arxiv.org/abs/2412.10471
- Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian, 18 Dec 2024 (v2), Boosting Long-Context Management via Query-Guided Activation Refilling, https://arxiv.org/abs/2412.12486 (Maintaining two KV caches, one global, one local.)
- Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou, 18 Dec 2024, SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation, https://arxiv.org/abs/2412.13649 (Different KV cache optimizations for prefill and decoding phases.)
- Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 19 Dec 2024, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, https://arxiv.org/abs/2412.15204 https://longbench2.github.io/
- Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
- Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
- Jiaan Wang, Fandong Meng, Yunlong Liang, Jie Zhou, 23 Dec 2024, DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, https://arxiv.org/abs/2412.17498 https://github.com/krystalan/DRT-o1 (Examines similes and metaphors in literature using long CoT.)
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi, 28 Dec 2024, LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System, https://arxiv.org/abs/2412.20166
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Context window over 1 million tokens.)
- Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu, 15 Jan 2025, MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents, https://arxiv.org/abs/2501.08828
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han, 22 Jan 2025, Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference, https://arxiv.org/abs/2501.12959 (Efficient input token scanning using early exit during prefill to prune tokens for the decoding phase.)
- Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
- Cristian Leo, Feb 2025, Don’t Do RAG: Cache is the future: CAG or RAG? Let’s explore Cached Augmented Generation, its math, and trade-offs. Let’s dig into its research paper to see what it excels at, and how you could leverage it. https://levelup.gitconnected.com/dont-do-rag-cache-is-the-future-d1e995f0c76f
- Nathaniel Tomczak, Sanmukh Kuppannagari, 31 Jan 2025, Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques, https://arxiv.org/abs/2502.01659 (Approaching attention optimization as a graph-theoretical problem.)
- Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, and Bin Cui. 2025. MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training. Proc. ACM Manag. Data 3, 1, Article 53 (February 2025), 28 pages. https://doi.org/10.1145/3709703 https://dl.acm.org/doi/abs/10.1145/3709703
- Jack Wallen, Feb. 13, 2025, How I feed my files to a local AI for better, more relevant responses Msty is one of the best apps for interacting with the Ollama local AI tool and it contains a feature you'll want to use to help provide contextuality to its responses. https://www.zdnet.com/article/how-i-feed-my-files-to-a-local-ai-for-better-more-relevant-responses/
- Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, (authors omitted), 22 Jan 2025, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs/2501.12599 (Includes a "length penalty" to address token reduction.)
- Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu, 18 Feb 2025, MoBA: Mixture of Block Attention for Long-Context LLMs, https://arxiv.org/abs/2502.13189 https://github.com/MoonshotAI/MoBA
- Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja, 11 Feb 2025, Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification, https://arxiv.org/abs/2502.09647 (Analysis of how attention works in long context scenarios.)
- Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang, 19 Feb 2025, MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads, https://arxiv.org/abs/2502.13963
- Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang, 27 Feb 2025, LongRoPE2: Near-Lossless LLM Context Window Scaling, https://arxiv.org/abs/2502.20082 https://github.com/microsoft/LongRoPE (Addresses RopE issues with long context optimization.)
- Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, Christopher Re, 21 Feb 2025, Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models, https://arxiv.org/abs/2502.15964 (Reading long documents using on-device small models, by breaking the document into small chunks processed by local LLMs, and only using the cloud LLMs for finalization tasks.)
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Research on Quadratic Attention Cost
Linearizing the attention algorithm to avoid the quadratic cost of attention processing is an area with a massive research base and numerous proposed algorithms. Faster attention algorithms include sparse attention and Flash Attention. See research on attention optimization methods.
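As one concrete flavor of the sparse-attention idea, the sketch below restricts each token to a fixed local window of earlier tokens, so the cost grows as n times the window size rather than n squared; it is a simplified illustration with made-up sizes, not Flash Attention or any specific published kernel.
import numpy as np

def sliding_window_attention(Q, K, V, window=256):
    """Local (sliding-window) attention: each token attends only to the
    previous `window` tokens, giving O(n * window) cost instead of O(n^2)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        start = max(0, i - window + 1)                  # only look back `window` tokens
        scores = Q[i] @ K[start:i + 1].T / np.sqrt(d)
        scores -= scores.max()
        w = np.exp(scores)
        w /= w.sum()
        out[i] = w @ V[start:i + 1]
    return out

n, d = 2048, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = sliding_window_attention(Q, K, V, window=128)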
Research on Positional Encoding Optimization
One of the less obvious bottlenecks for long contexts is the positional encoding algorithm. See research on positional encoding optimizations and removing positional encoding.
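For a concrete example, the sketch below applies rotary position embeddings (RoPE) with simple position interpolation, rescaling positions so that a longer context stays within the position range seen during training (in the spirit of the positional interpolation work cited below); the constants and the half-split rotation layout are illustrative choices, not an exact match for any particular model.
import numpy as np

def rope(x, positions, base=10000.0, scale=1.0):
    """Apply rotary position embedding to x of shape (n, d), with d even.
    scale < 1.0 implements position interpolation, e.g. squeezing 32k
    positions into a 4k-trained range."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # (d/2,) inverse frequencies
    angles = np.outer(positions * scale, freqs)     # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

n, d = 32768, 64
q = np.random.randn(n, d)
positions = np.arange(n)
q_rot = rope(q, positions, scale=4096 / 32768)   # interpolate 32k positions into a 4k range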
Research on Context Length
Research on making "longer" Transformer models includes:
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation Ofir Press, Noah A. Smith, Mike Lewis, Apr 2022, https://arxiv.org/abs/2108.12409 (Alibi for longer context length.)
- Siyu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang, May 2021, ERNIE-Doc: A Retrospective Long-Document Modeling Transformer, https://arxiv.org/abs/2012.15688
- Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020. https://arxiv.org/abs/2007.14062 (Sparse linear attention algorithm.)
- Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In The International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/1911.05507
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860, Code: https://github.com/kimiyoung/transformer-xl
- Longformer: The Long-Document Transformer Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao, May 2021, Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, https://arxiv.org/abs/2103.15358, Code: https://github.com/microsoft/vision-longformer
- Qiuhui Chen, Yi Hong, May 2023, Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs, https://arxiv.org/abs/2302.00901, Code: https://github.com/Qybc/LongFormer
- Chun-Fu Chen, Quanfu Fan, Rameswar Panda, Aug 2021, CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, https://arxiv.org/abs/2103.14899, Code: https://github.com/IBM/CrossViT
- Qiuhui Chen, Yi Hong, May 2023, Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs, https://arxiv.org/abs/2302.00901, Code: https://github.com/Qybc/LongFormer
- Qiqi Zhou, Yichen Zhu, July 2023, Make A Long Image Short: Adaptive Token Length for Vision Transformers https://arxiv.org/abs/2307.02092
- Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, Aug 2023, LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, https://arxiv.org/abs/2308.14508, Code: https://github.com/THUDM/LongBench
- Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei, July 2023, LongNet: Scaling Transformers to 1,000,000,000 Tokens, https://arxiv.org/abs/2307.02486
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (An efficient low-rank attention method as part of long context optimizations.)
- Ivan Sekulić, Amir Soleimani, Mohammad Aliannejadi, Fabio Crestani, Sep 2020, Longformer for MS MARCO Document Re-ranking Task, https://arxiv.org/abs/2009.09392, Code: https://github.com/isekulic/longformer-marco
- Yikuan Li, Ramsey M. Wehbe, Faraz S. Ahmad, Hanyin Wang, Yuan Luo, Apr 2022, Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences, https://arxiv.org/abs/2201.11838, Code: https://github.com/luoyuanlab/Clinical-Longformer
- Anant Khandelwal, 2021, Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity, CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD), January 2021, Pages 10–19, https://dl.acm.org/doi/abs/10.1145/3430984.3431007 https://arxiv.org/abs/2007.07803
- Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137
- Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length?, June 2023. https://lmsys.org/blog/2023-06-29-longchat/
- Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, June 2023, Extending Context Window of Large Language Models via Positional Interpolation, https://arxiv.org/abs/2306.15595 (Introduces "positional interpolation" for long contexts, extending to 32k windows.)
- Y Chen, Y Li, A Xu, Q Sun, X Chen, C Xu, 2023, WAG-NAT: Window Attention and Generator Based Non-Autoregressive Transformer for Time Series Forecasting, ICANN 2023: Artificial Neural Networks and Machine Learning, pp. 293–304, https://link.springer.com/chapter/10.1007/978-3-031-44223-0_24, Code: https://github.com/cybisolated/WAG-NAT
- kaiokendev, 2023, Things im learning while training superhot. https://kaiokendev.github.io/til#extending-context-to-8k
- Together AI, Jul 28, 2023, Preparing for the era of 32K context: Early learnings and explorations, https://together.ai/blog/llama-2-7b-32k (Uses position interpolation and Flash Attention.)
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that replaces QKV attention with Receptance Weighted Key Value (RWKV)).
- Georgi Gerganov, June 2023 Extending context size via RoPE scaling #1965, Llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
- Hao Liu, Matei Zaharia, Pieter Abbeel, Oct 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
- Davis Yoshida, Allyson Ettinger, and Kevin Gimpel. 2020. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size. CoRR abs/2008.07027 (2020). arXiv:2008.07027 https://arxiv.org/abs/2008.07027 (Hybrid RNN-Transformer architecture with increased context size.)
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses grouped-query attention and sliding window attention.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 13, 2023, Flash-Decoding for long-context inference, PyTorch Blog, https://pytorch.org/blog/flash-decoding/
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf
- Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
- Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf