Aussie AI

Long and Ultralong Context LLMs

  • Last Updated 17 November, 2025
  • by David Spuler, Ph.D.

What are Long Context LLMs?

Long context LLMs are model architectures that can accept long input texts consisting of many tokens. Long context models are generally defined as those with context windows beyond 100k tokens, such as the 128k windows that have become common in major models. The latest SOTA models that can handle over a million tokens in their context window are called "ultralong" models in recent terminology.

Context window size is the number of input tokens that a model can process. Early models, including the original ChatGPT, had small context windows of about 2,048 tokens. Each token is usually a part-word or a whole word, so this meant inputs of only about 1,000-2,000 words.
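
As a rough rule of thumb, one token corresponds to about three-quarters of an English word. The Python sketch below is a back-of-the-envelope estimate only; the 0.75 words-per-token ratio is an assumption, and real ratios depend on the tokenizer and the text:

    # Back-of-the-envelope estimate of how many English words fit in a context window.
    # Assumes roughly 0.75 words per token (an assumed rule of thumb, not a tokenizer fact).
    WORDS_PER_TOKEN = 0.75

    def approx_word_capacity(context_tokens: int) -> int:
        return int(context_tokens * WORDS_PER_TOKEN)

    for window in (2048, 32_000, 128_000, 1_000_000):
        print(f"{window:>9,} tokens is roughly {approx_word_capacity(window):>8,} words")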

Why Long Context?

Why seek a longer input size? Or output size? Well, because a report is going to be 5,000 words, and a full-length novel is 100,000 words, or even 200,000 words if it's in the "epic sci-fi" genre. All of these words become tokens that must be output (for generation) or input (for editing).

Context is lots of other things, too. The "context" of an LLM is not just single documents; its input and output may include:

  • The conversational history for a chatbot
  • Chunks of multiple documents in a RAG system
  • Video streams of almost unfathomable sizes
  • All of the PDF receipt documents on your laptop (an AI tax return app)
  • An entire software repository of many code files (an AI coding assistant)
  • Combinations of text and images (multimodal LLMs)

There are many reasons we might want an LLM to handle a lot of input or to create a lot of output. But the main problem has been processing this many tokens efficiently.

Newer models have been increasing the context window size. For example, GPT-4 has a 32k window size, which is 32,000 tokens and will handle a small novella or novelette of maybe 15,000-20,000 words. Anthropic reportedly has a Claude model with a 100k context window, and MosaicML has an open-source model called "MPT-7B-StoryWriter-65k+" with a 65,000-token window size.

Optimizing the Context Window Size

Why is there this context size limitation? One of the main bottlenecks is the "quadratic" cost of the self-attention mechanism, and there are various ways to optimize attention to overcome it (see the sketch after the list below). However, it is not the only bottleneck. Alperovich (2023) describes the "secret sauce" for long contexts as fixing three main bottlenecks:

  • Quadratic attention cost (in the input token size)
  • Quadratic size of internal tensors (in the model dimension)
  • Positional embedding cost.
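
To see where the quadratic attention bottleneck comes from, here is a minimal NumPy sketch of single-head self-attention. This is illustrative only, not any particular model's kernel: the point is that the n-by-n score matrix grows quadratically with the number of input tokens n:

    import numpy as np

    def naive_self_attention(X, Wq, Wk, Wv):
        """Single-head self-attention; the (n, n) score matrix is the quadratic cost."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                # each (n, d)
        scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n): quadratic in sequence length n
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # (n, d)

    d = 64
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]
    out = naive_self_attention(rng.standard_normal((128, d)), Wq, Wk, Wv)  # fine for short inputs

    # Memory for the (n, n) float32 score matrix alone at longer context lengths:
    for n in (10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,} tokens: score matrix is about {n * n * 4 / 1e9:,.1f} GB")

Optimized attention methods, such as the sparse, linear, and memory-aware Flash-style attention algorithms covered in the research below, aim to avoid materializing or storing this full matrix.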

Many techniques are relevant to processing and generating longer token sequences, including optimized attention algorithms, KV cache compression, and improved positional encodings; the research lists below cover these and other approaches.

Ultra-Long Context Models

Ultra-long context models are LLMs that accept a context window of more than 1M tokens. Most of the largest models do not yet accept such a long context (128k is common), but there are a few commercial models that accept over 1M tokens in their context window.

Research papers on ultra-long context LLMs:

  • MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Content window over 1 million tokens.)
  • MiniMax, Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf
  • MiniMax, Jan 2025, MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era, https://www.minimaxi.com/en/news/minimax-01-series-2
  • Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
  • Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
  • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
  • Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
  • Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
  • Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
  • Demis Hassabis, Jan 2025, X post: Announcing Gemini 2.0 Flash https://x.com/demishassabis/status/1881844417746632910 (Gemini 2.0 Flash from Google is a Large Reasoning Model with a 1M ultra-long context.)
  • Manpreet Singh, Feb 2025, Goodbye RAG? Gemini 2.0 Flash Have Just Killed It! https://ai.gopubby.com/goodbye-rag-gemini-2-0-flash-have-just-killed-it-96301113c01f
  • Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang, 13 Feb 2025, InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU, https://arxiv.org/abs/2502.08910 (Using dynamic token pruning and KV cache data offloading to CPU memory.)
  • Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng, 26 Feb 2025, From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens, https://arxiv.org/abs/2502.18890 (Extending speculative decoding to address three bottlenecks in ultra-long context: frequent model reloading, KV cache size, and repetitive output content generation. Uses techniques such as KV cache eviction and decoding penalties to avoid repetition.) https://github.com/bigai-nlco/TokenSwift
  • Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu, 28 Feb 2025, ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs, https://arxiv.org/abs/2502.21231 (Addressing training inefficiencies when training data ranges from short to very long queries, including via hybrid data parallelism and communications optimizations.)
  • Joe DeLaere, Kirthi Devleker and Eduardo Alvarez, Sep 09, 2025, NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads, https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/
  • Microsoft, 17 Sep, 2025, GPT-5 vs GPT-4.1: choosing the right model for your use case https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/how-to/model-choice-guide
  • Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu, 21 Oct 2025, MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training, https://arxiv.org/abs/2510.18830

Length Generalization: Accuracy with Longer Contexts

Speed is not the only problem with long contexts. Vanilla Transformers are also not particularly good at generalizing their results to longer context windows. This ability is known as "length generalization" (or "length extrapolation"), and improving accuracy on longer inputs and longer outputs is an area of active research.

One class of methods being analyzed to improve length generalization is "scratchpad" or "chain-of-thought" algorithms. The idea is that the AI inference engine emits rough summaries to an internal scratchpad at regular intervals, and these are merged into subsequent inference, so the AI helps itself keep track of its own thoughts over a longer output sequence.
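
As a concrete illustration, here is a minimal sketch of a scratchpad-style generation loop. The llm() function is a hypothetical stand-in for a real inference call, and this is only the general shape of the idea, not the algorithm from any particular paper:

    # Minimal sketch of a scratchpad loop; llm() is a hypothetical stand-in
    # for a real inference call (e.g., an API or a local engine).
    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in a real LLM inference call here")

    def generate_with_scratchpad(task: str, num_chunks: int = 10) -> str:
        scratchpad = ""    # rolling summary of the output produced so far
        parts = []
        for i in range(num_chunks):
            prompt = (
                f"Task: {task}\n"
                f"Scratchpad notes so far: {scratchpad}\n"
                f"Write part {i + 1} of the output:"
            )
            part = llm(prompt)
            parts.append(part)
            # Periodically condense everything so far back into the scratchpad,
            # so later steps can recall earlier ones without re-reading the full text.
            scratchpad = llm(f"Summarize briefly, keeping key facts:\n{scratchpad}\n{part}")
        return "\n".join(parts)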

Research papers on "length generalization" include:

  • Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur, Nov 2022, Exploring Length Generalization in Large Language Models, https://arxiv.org/abs/2207.04901
  • Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466
  • H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
  • Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
  • Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
  • Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
  • Jishnu Ray Chowdhury, Cornelia Caragea, May 2023, Monotonic Location Attention for Length Generalization, https://arxiv.org/abs/2305.20019
  • Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
  • Nan Yang, Laicheng Zhong, Fan Huang, Dong Yuan, Wei Bao, Feb 2023, Random Padding Data Augmentation, https://arxiv.org/abs/2302.08682
  • Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning, Oct 2020, The EOS Decision and Length Extrapolation, https://arxiv.org/abs/2010.07174
  • Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness, May 2023, Randomized Positional Encodings Boost Length Generalization of Transformers, https://arxiv.org/abs/2305.16843
  • Mirelle Bueno, Carlos Gemmell, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira, Nov 2022, Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models, https://arxiv.org/abs/2208.11445 Code: https://github.com/MirelleB/induced-rationales-markup-tokens
  • Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian, Feb 2023, A Study on ReLU and Softmax in Transformer, https://arxiv.org/abs/2302.06461
  • David Chiang, Peter Cholak, Feb 2022, Overcoming a Theoretical Limitation of Self-Attention, https://arxiv.org/abs/2202.12172
  • Amirkeivan Mohtashami, Martin Jaggi, May 2023, Landmark Attention: Random-Access Infinite Context Length for Transformers, https://arxiv.org/abs/2305.16300
  • kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
  • Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
  • Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
  • David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • MohammadReza Ebrahimi, Sunny Panchal, Roland Memisevic, 10 Aug 2024, Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers, https://arxiv.org/abs/2408.05506 (Explores how the sequential nature of token access in attention reduces the accuracy in long context analysis.)
  • Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055
  • David Spuler, March 2024, Length Generalization, in Generative AI in C++, https://www.aussieai.com/book/ch20-length-generalization
  • Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
  • Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
  • Minghan Li, Eric Gaussier, Juntao Li, Guodong Zhou, 9 Nov 2024, KeyB2: Selecting Key Blocks is Also Important for Long Document Ranking with Large Language Models, https://arxiv.org/abs/2411.06254
  • Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
  • Yansheng Mao, Jiaqi Li, Fanxu Meng, Jing Xiong, Zilong Zheng, Muhan Zhang, 18 Dec 2024, LIFT: Improving Long Context Understanding Through Long Input Fine-Tuning, https://arxiv.org/abs/2412.13626
  • Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 19 Dec 2024, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, https://arxiv.org/abs/2412.15204 https://longbench2.github.io/
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang, 20 Jan 2025, RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? https://arxiv.org/abs/2501.11284 https://huggingface.co/RedStar-Reasoning
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang, 20 Mar 2025, A Comprehensive Survey on Long Context Language Modeling, https://arxiv.org/abs/2503.17407
  • Kelly Hong, Anton Troynikov, Jeff Huber, July 14, 2025, Context Rot: How Increasing Input Tokens Impacts LLM Performance, Chroma Technical Report, https://research.trychroma.com/context-rot
  • Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos, 4 Aug 2025, Extrapolation by Association: Length Generalization Transfer in Transformers, https://arxiv.org/abs/2506.09251
  • Buu Phan, Reza Ebrahimi, Sanjay Haresh, Roland Memisevic, 30 Sep 2025, Delayed Attention Training Improves Length Generalization in Transformer--RNN Hybrids, https://arxiv.org/abs/2510.00258
  • Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui, 23 Sep 2025, Mamba Modulation: On the Length Generalization of Mamba, https://arxiv.org/abs/2509.19633
  • Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu, 20 Oct 2025, Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models, https://arxiv.org/abs/2510.17196
  • Yang Chen, Yitao Liang, Zhouchen Lin, 5 Oct 2025, On the Limitations and Capabilities of Position Embeddings for Length Generalization, https://arxiv.org/abs/2510.04130
  • Yang Chen, Long Yang, Yitao Liang, Zhouchen Lin, 5 Oct 2025, Low-Dimension-to-High-Dimension Generalization And Its Implications for Length Generalization, https://arxiv.org/abs/2410.08898
  • Ricardo Buitrago Ruiz and Albert Gu, 12 Oct 2025, Understanding and Improving Length Generalization in Recurrent Models, https://arxiv.org/abs/2507.02782
  • Pál Zsámboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang, 9 Oct 2025, Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization, https://arxiv.org/abs/2510.08341
  • Eran Malach, Omid Saremi, Sinead Williamson, Arwen Bradley, Aryo Lotfi, Emmanuel Abbe, Josh Susskind, Etai Littwin, 16 Oct 2025, To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models, https://arxiv.org/abs/2510.14826

Industry Articles on Long Context Length

Real-world industry models have started offering longer context windows:

  • Anthropic, Introducing 100K Context Windows, May 11, 2023, https://www.anthropic.com/index/100k-context-windows
  • The MosaicML NLP Team, May 5, 2023, Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs, https://www.mosaicml.com/blog/mpt-7b (See long context model MPT-7B-StoryWriter-65k+)
  • Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
  • Carl Franzen, September 29, 2023, Meta quietly unveils Llama 2 Long AI that beats GPT-3.5 Turbo and Claude 2 on some tasks, Venture Beat, https://venturebeat.com/ai/meta-quietly-releases-llama-2-long-ai-that-outperforms-gpt-3-5-and-claude-2-on-some-tasks/ (Context windows up to 32k.)
  • Reddit user pseudonerv, June 2023, A simple way to "Extending Context to 8K"?! Reddit LocalLLaMA group, https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/a_simple_way_to_extending_context_to_8k/
  • Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
  • Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
  • Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou, 16 Apr 2024 (v2), Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, https://arxiv.org/abs/2404.08801 Code: https://github.com/XuezheMax/megalodon
  • Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
  • Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 10 Apr 2024, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
  • Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
  • Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, Dec 2023, Efficient Streaming Language Models with Attention Sinks https://arxiv.org/abs/2309.17453 Code: https://github.com/mit-han-lab/streaming-llm
  • Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, Nov 2023, YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/abs/2309.00071 Code: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k
  • kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
  • Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers.)
  • D Mamakas, P Tsotsi, I Androutsopoulos, 2022, Processing long legal documents with pre-trained transformers: Modding legalbert and longformer, https://arxiv.org/abs/2211.00974
  • A Askari, S Verberne, O Alonso, S Marchesin, 2021, Combining Lexical and Neural Retrieval with Longformer-based Summarization for Effective Case Law Retrieval. DESIRES, 2021, https://desires.dei.unipd.it/2021/papers/paper-02.pdf.pdf
  • Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by Summarizing Long Sequences. In Proceedings of the 6th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1801.10198
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860 Code: https://github.com/kimiyoung/transformer-xl
  • Cao, Q.; Trivedi, H.; Balasubramanian, A.; and Balasubramanian, N. 2020. Deformer: Decomposing pre-trained transformers for faster question answering. arXiv preprint arXiv:2005.00697, https://arxiv.org/abs/2005.00697 Code: https://github.com/StonyBrookNLP/deformer
  • Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
  • Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
  • Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong, 27 Feb 2024, Training-Free Long-Context Scaling of Large Language Models, https://arxiv.org/abs/2402.17463 Code: https://github.com/hkunlp/chunkllama
  • Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
  • Emma Järvinen, 2023, Aalto University, Finland, Master’s thesis in Mathematics and Operations Research, Long-input summarization using Large Language Models, https://aaltodoc.aalto.fi/server/api/core/bitstreams/cd47964e-5b5e-4af0-9af4-731970184358/content
  • Akruti Acharya, May 25, 2023, MEGABYTE, Meta AI’s New Revolutionary Model Architecture, Explained, https://encord.com/blog/meta-ai-megabyte-model-architecture-explained/
  • Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
  • Jingyu Wang; Lu Zhang; Xueqing Li; Huazhong Yang; Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
  • Yuzhen Mao, Martin Ester, Ke Li, 2023, IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs, NeurIPS 2023, https://neurips2023-enlsp.github.io/papers/paper_24.pdf
  • Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
  • Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin, 5 Jan 2024, Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, https://arxiv.org/abs/2401.02669 (Long context processing by a modification to the QKV caching mechanisms.)
  • Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, https://arxiv.org/abs/2210.17114 (Intel Labs paper. Low-bit quantization, distillation, and Length-Adaptive Transformer (LAT) technique.)
  • Amazon, Oct 2023, MistralLite Model, https://huggingface.co/amazon/MistralLite
  • DY Fu, S Arora, J Grogan, I Johnson, S Eyuboglu, Oct 2023, Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture https://arxiv.org/pdf/2310.12109.pdf
  • C Hao, P Zhang, M Xie, D Zhao, 2023, Recurrent Transformers for Long Document Understanding, CCF International Conference on Natural Language, https://dl.acm.org/doi/abs/10.1007/978-3-031-44693-1_5
  • Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang, Oct 2023, LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers, https://arxiv.org/abs/2310.03294v1
  • Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
  • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
  • S Yang, S Zhang, M Fang, F Yang, S Liu, 2022, A hierarchical representation model based on longformer and transformer for extractive summarization https://www.mdpi.com/2079-9292/11/11/1706
  • J Ding, S Ma, L Dong, X Zhang, S Huang 2023, Longnet: Scaling transformers to 1,000,000,000 tokens, https://arxiv.org/abs/2307.02486
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2; uses tiling and memory-aware kernels to optimize attention.)
  • David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, 8 Mar 2024 (v3), LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 Code: https://github.com/dvlab-research/LongLoRA
  • Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
  • Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
  • Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
  • Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
  • Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
  • An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
  • Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
  • Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
  • Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
  • Emilia David, April 30, 2024, ChatGPT’s AI ‘memory’ can remember the preferences of paying customers, The Verge, https://www.theverge.com/2024/4/29/24144680/chatgpt-plus-memory-chatbot-subscription-details-preferences-personal-assistant
  • Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
  • J. Su. 2023, Rectified rotary position embeddings. https://github.com/bojone/rerope
  • Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055 Code: https://github.com/THUDM/LongWriter
  • Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
  • Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
  • Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
  • Magic Team, August 29, 2024, 100M Token Context Windows: Research update on ultra-long context models, our partnership with Google Cloud, and new funding, https://magic.dev/blog/100m-token-context-windows
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
  • Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 1 Dec 2023 (v3), HyperAttention: Long-context Attention in Near-Linear Time, https://arxiv.org/abs/2310.05869
  • Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
  • KVCache.AI, 2024, K Transformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations, https://github.com/kvcache-ai/ktransformers
  • Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao, 9 Aug 2024 (v2), ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
  • David Spuler, March 2024, Long Context Research, in Generative AI in C++, https://www.aussieai.com/book/ch20-long-context-research
  • Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
  • Michael Nuñez, September 4, 2024, 500,000 tokens: How Anthropic’s Claude Enterprise is pushing AI boundaries, https://venturebeat.com/ai/500000-tokens-how-anthropics-claude-enterprise-is-pushing-ai-boundaries/
  • Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li, 4 Sep 2024, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA, https://arxiv.org/abs/2409.02897
  • Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
  • Asif Razzaq, September 5, 2024, Yi-Coder Released by 01.AI: A Powerful Small-Scale Code LLM Series, Delivering Exceptional Performance in Code Generation, Editing, and Long-Context Comprehension, https://www.marktechpost.com/2024/09/05/yi-coder-released-by-01-ai-a-powerful-small-scale-code-llm-series-delivering-exceptional-performance-in-code-generation-editing-and-long-context-comprehension/
  • Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin, 16 Apr 2024, Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, https://arxiv.org/abs/2404.10308 https://github.com/alinlab/HOMER
  • Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
  • Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
  • Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
  • Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
  • Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
  • Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao, 3 Sep 2024, You Only Use Reactive Attention Slice For Long Context Retrieval, https://arxiv.org/abs/2409.13695
  • Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
  • Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
  • Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
  • Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
  • Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
  • Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
  • Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
  • Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
  • Anonymous authors, Oct 2024, LooongLlava: Scaling Multi-Modal LLMs to 1000 Images Efficiently Via a Hybrid Architecture, https://openreview.net/pdf?id=wqA7QmpUwa
  • Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
  • Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
  • Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
  • Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
  • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
  • Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
  • Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
  • M. Xu, D. Cai, W. Yin, S. Wang, X. Jin, X. Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu, 12 Dec 2024, V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding, https://arxiv.org/abs/2412.09616 https://github.com/OpenGVLab/V2PE
  • Jérôme DIAZ, Dec 2024, Why Retrieval-Augmented Generation Is Still Relevant in the Era of Long-Context Language Models. In this article we will explore why 128K tokens (and more) models can’t fully replace using RAG. https://towardsdatascience.com/why-retrieval-augmented-generation-is-still-relevant-in-the-era-of-long-context-language-models-e36f509abac5
  • Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky, 17 Oct 2024 (v2), Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, https://arxiv.org/abs/2407.16833
  • Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 20 Nov 2023 (v3), Lost in the Middle: How Language Models Use Long Contexts, https://arxiv.org/abs/2307.03172 (Information is best placed at the start, or otherwise at the end, of a long context.)
  • Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu, 13 Dec 2024, SCBench: A KV Cache-Centric Analysis of Long-Context Methods, https://arxiv.org/abs/2412.10319 https://aka.ms/SCBench
  • Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan, 12 Dec 2024, VCA: Video Curious Agent for Long Video Understanding, https://arxiv.org/abs/2412.10471
  • Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian, 18 Dec 2024 (v2), Boosting Long-Context Management via Query-Guided Activation Refilling, https://arxiv.org/abs/2412.12486 (Maintaining two KV caches, one global, one local.)
  • Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou, 18 Dec 2024, SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation, https://arxiv.org/abs/2412.13649 (Different KV cache optimizations for prefill and decoding phases.)
  • Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 19 Dec 2024, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, https://arxiv.org/abs/2412.15204 https://longbench2.github.io/
  • Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
  • Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
  • Jiaan Wang, Fandong Meng, Yunlong Liang, Jie Zhou, 23 Dec 2024, DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, https://arxiv.org/abs/2412.17498 https://github.com/krystalan/DRT-o1 (Examines similes and metaphors in literature using long CoT.)
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi, 28 Dec 2024, LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System, https://arxiv.org/abs/2412.20166
  • MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Content window over 1 million tokens.)
  • Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu, 15 Jan 2025, MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents, https://arxiv.org/abs/2501.08828
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han, 22 Jan 2025, Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference, https://arxiv.org/abs/2501.12959 (Efficient input token scanning using early exit during prefill to prune tokens for the decoding phase.)
  • Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
  • Cristian Leo, Feb 2025, Don’t Do RAG: Cache is the future: CAG or RAG? Let’s explore Cached Augmented Generation, its math, and trade-offs. Let’s dig into its research paper to see what it excels at, and how you could leverage it. https://levelup.gitconnected.com/dont-do-rag-cache-is-the-future-d1e995f0c76f
  • Nathaniel Tomczak, Sanmukh Kuppannagari, 31 Jan 2025, Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques, https://arxiv.org/abs/2502.01659 (Approaching attention optimization as a graph-theoretical problem.)
  • Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, and Bin Cui. 2025. MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training. Proc. ACM Manag. Data 3, 1, Article 53 (February 2025), 28 pages. https://doi.org/10.1145/3709703 https://dl.acm.org/doi/abs/10.1145/3709703
  • Jack Wallen, Feb. 13, 2025, How I feed my files to a local AI for better, more relevant responses Msty is one of the best apps for interacting with the Ollama local AI tool and it contains a feature you'll want to use to help provide contextuality to its responses. https://www.zdnet.com/article/how-i-feed-my-files-to-a-local-ai-for-better-more-relevant-responses/
  • Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, (authors omitted), 22 Jan 2025, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs/2501.12599 (Includes a "length penalty" to address token reduction.)
  • Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu, 18 Feb 2025, MoBA: Mixture of Block Attention for Long-Context LLMs, https://arxiv.org/abs/2502.13189 https://github.com/MoonshotAI/MoBA
  • Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja, 11 Feb 2025, Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification, https://arxiv.org/abs/2502.09647 (Analysis of how attention works in long context scenarios.)
  • Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang, 19 Feb 2025, MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads, https://arxiv.org/abs/2502.13963
  • Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang, 27 Feb 2025, LongRoPE2: Near-Lossless LLM Context Window Scaling, https://arxiv.org/abs/2502.20082 https://github.com/microsoft/LongRoPE (Addresses RopE issues with long context optimization.)
  • Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, Christopher Re, 21 Feb 2025, Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models, https://arxiv.org/abs/2502.15964 (Reading long documents using on-device small models, by breaking the document into small chunks processed by local LLMs, and only using the cloud LLMs for finalization tasks.)
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Asif Razzaq, March 5, 2025, Qwen Releases QwQ-32B: A 32B Reasoning Model that Achieves Significantly Enhanced Performance in Downstream Task, https://www.marktechpost.com/2025/03/05/qwen-releases-qwq-32b-a-32b-reasoning-model-that-achieves-significantly-enhanced-performance-in-downstream-task/ (Features 32B parameters, 32K context length, 64 layers, RoPE, SwiGLU, RMSNorm, and attention enhancements.)
  • Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu, 17 Mar 2025, A Survey on Transformer Context Extension: Approaches and Evaluation, https://arxiv.org/abs/2503.13299
  • Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang, 20 Mar 2025, A Comprehensive Survey on Long Context Language Modeling, https://arxiv.org/abs/2503.17407
  • Ghadir Alselwi, Hao Xue, Shoaib Jameel, Basem Suleiman, Flora D. Salim, Imran Razzak, 19 Mar 2025, Long Context Modeling with Ranked Memory-Augmented Retrieval, https://arxiv.org/abs/2503.14800
  • Anonymous ACL submission, 2025, TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection, https://openreview.net/pdf?id=l7i2gtDKdU
  • Chen Wu, Yin Song, 13 May 2025, Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing, Mistral, https://arxiv.org/abs/2505.08651 https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-512k
  • Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes, 12 May 2025, Overflow Prevention Enhances Long-Context Recurrent LLMs, https://arxiv.org/abs/2505.07793
  • Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang, 5 May 2025, RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference, https://arxiv.org/abs/2505.02922
  • Kelly Hong, Anton Troynikov, Jeff Huber, July 14, 2025, Context Rot: How Increasing Input Tokens Impacts LLM Performance, Chroma Technical Report, https://research.trychroma.com/context-rot
  • Ethan Ding, Aug 01, 2025, tokens are getting more expensive: "language models will get cheaper by 10x" will not save ai subscriptions from the short squeeze, https://ethanding.substack.com/p/ai-subscriptions-get-short-squeezed
  • Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)
  • Kimi Team: Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, (many more authors), 23 Jun 2025 (v3), Kimi-VL Technical Report https://arxiv.org/abs/2504.07491 https://github.com/MoonshotAI/Kimi-VL
  • Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar, 1 Aug 2025, Systematic Evaluation of Optimization Techniques for Long-Context Language Models, https://arxiv.org/abs/2508.00305
  • Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur, 18 Jul 2025, Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark, https://arxiv.org/abs/2507.15882
  • Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, Raia Hadsell, Sangnie Bhardwaj, Pawel Janus, Tero Rissa, Dan Horgan, Sharon Silver, Ayzaan Wahid, Sergey Brin, Yves Raimond, Klemen Kloboves, et al. (3255 additional authors not shown), 22 Jul 2025, Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, https://arxiv.org/abs/2507.06261
  • Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon, 19 Jul 2025, Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length, https://arxiv.org/abs/2507.12442
  • Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin, 14 Jul 2025, LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models, https://arxiv.org/abs/2507.14204
  • Junqi Yin, Mijanur Palash, M. Paul Laiu, Muralikrishnan Gopalakrishnan Meena, John Gounley, Stephen M. de Bruyn Kops, Feiyi Wang, Ramanan Sankaran, Pei Zhang, 22 Jul 2025, Pixel-Resolved Long-Context Learning for Turbulence at Exascale: Resolving Small-scale Eddies Toward the Viscous Limit, https://arxiv.org/abs/2507.16697
  • Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, Mao Yang, 14 Aug 2025, BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache, https://arxiv.org/abs/2503.18773
  • Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li and Mingkui Tan, 14 Aug 2025, Curse of High Dimensionality Issue in Transformer for Long-context Modeling, https://arxiv.org/abs/2505.22107
  • Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang, 27 Jul 2025, What is Wrong with Perplexity for Long-context Language Modeling?, https://arxiv.org/abs/2410.23771
  • Hyeonseok Moon, Heuiseok Lim, 30 Jul 2025, NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models, https://arxiv.org/abs/2507.22411
  • Xianda Zheng, Zijian Huang, Meng-Fen Chiang, Michael J. Witbrock, Kaiqi Zhao, 2 Aug 2025, KCR: Resolving Long-Context Knowledge Conflicts via Reasoning in LLMs, https://arxiv.org/abs/2508.01273
  • Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan, 4 Aug 2025, Core Context Aware Transformers for Long Context Language Modeling, https://arxiv.org/abs/2412.12465
  • Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel, 5 Aug 2025, Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning, https://arxiv.org/abs/2508.03501
  • Herbert Ullrich, Jan Drchal, 5 Aug 2025, AIC CTU@FEVER 8: On-premise fact checking through long context RAG, https://arxiv.org/abs/2508.04390
  • Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim, 12 Aug 2025, Retrospective Sparse Attention for Efficient Long-Context Generation, https://arxiv.org/abs/2508.09001
  • Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov, 13 Aug 2025, RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression, https://arxiv.org/abs/2502.14051
  • Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang, 16 Aug 2025, Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention, https://arxiv.org/abs/2507.00449
  • Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun, 19 Aug 2025, Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization, https://arxiv.org/abs/2508.13993
  • Skatje Myers, Dmitriy Dligach, Timothy A. Miller, Samantha Barr, Yanjun Gao, Matthew Churpek, Anoop Mayampurath, Majid Afshar, 20 Aug 2025, Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs, https://arxiv.org/abs/2508.14817
  • Dong Liu, Yanxuan Yu, 21 Aug 2025, SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling, https://arxiv.org/abs/2508.15190
  • Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin, 24 Aug 2025, TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving, https://arxiv.org/abs/2508.17219
  • Yanming Liu, Xinyue Peng, Jiannan Cao, Yanxin Shen, Tianyu Du, Sheng Cheng, Xun Wang, Jianwei Yin, Xuhong Zhang, 15 Aug 2025, Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding, https://arxiv.org/abs/2410.01671
  • Duygu Altinok, 18 Aug 2025, Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts, https://arxiv.org/abs/2508.13376
  • Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Mukta Takalikar, and Raviraj Joshi, 24 Aug 2025, Efficient Zero-Shot Long Document Classification by Reducing Context Through Sentence Ranking, https://arxiv.org/abs/2508.17490
  • Dulhan Jayalath, James Bradley Wendt, Nicholas Monath, Sandeep Tata, Beliz Gunel, 24 Aug 2025, PRISM: Efficient Long-Range Reasoning With Short-Context LLMs, https://arxiv.org/abs/2412.18914
  • Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou, 14 Aug 2025, PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts, https://arxiv.org/abs/2508.09848
  • Chao Ma, Yikai Hou, Xiang Li, Yinggang Sun, Haining Yu, Zhou Fang, Jiaxing Qu, 4 Sep 2025, Breaking the Context Bottleneck on Long Time Series Forecasting, https://arxiv.org/abs/2412.16572
  • Seganrasan Subramanian, Abhigya Verma, 4 Sep 2025, Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation, https://arxiv.org/abs/2509.01185
  • Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao, 26 Aug 2025, UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning, https://arxiv.org/abs/2508.18756
  • Renat Sergazinov, Shao-An Yin, 30 Aug 2025, Chunked TabPFN: Exact Training-Free In-Context Learning for Long-Context Tabular Data, https://arxiv.org/abs/2509.00326
  • Kesen Wang, Daulet Toibazar, Pedro J. Moreno, 2 Sep 2025, A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation, https://arxiv.org/abs/2509.02864
  • Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang, 3 Sep 2025, SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention, https://arxiv.org/abs/2406.15486
  • Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian, 8 Sep 2025, Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning, https://arxiv.org/abs/2509.06436
  • Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan, 1 Sep 2025, REFRAG: Rethinking RAG based Decoding, https://www.arxiv.org/abs/2509.01092 https://www.alphaxiv.org/pdf/2509.01092 (Separates the attention computations across RAG chunks, which is effectively the same as "fused KV" or "concatenated KV" approaches with pre-computed per-chunk KV caches.)
  • Microsoft, 17 Sep, 2025, GPT-5 vs GPT-4.1: choosing the right model for your use case https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/how-to/model-choice-guide
  • Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong, 9 Sep 2025, TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection, https://arxiv.org/abs/2411.02886
  • Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang, 11 Sep 2025, LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering, https://arxiv.org/abs/2509.09614
  • Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo, 19 Sep 2025, LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs, https://arxiv.org/abs/2509.15568
  • Jinwen Tang, Qiming Guo, Wenbo Sun and Yi Shang, 19 Sep 2025, A Layered Multi-Expert Framework for Long-Context Mental Health Assessments, https://arxiv.org/abs/2501.13951
  • Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani, 16 Sep 2025, Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-k, https://arxiv.org/abs/2506.08479
  • Ye Qiao, Sitao Huang, 17 Sep 2025, Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs, https://arxiv.org/abs/2509.14391
  • Sami Ul Haq, Chinonso Cynthia Osuji, Sheila Castilho, Brian Davis, 17 Sep 2025, Long-context Reference-based MT Quality Estimation, https://arxiv.org/abs/2509.13980
  • Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang, 14 Oct 2025, Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks, https://arxiv.org/abs/2510.12635
  • Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung, 14 Oct 2025, APCE: Adaptive Progressive Context Expansion for Long Context Processing, https://arxiv.org/abs/2510.12051
  • Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen, 13 Oct 2025, Scaling Long-Horizon LLM Agent via Context-Folding, https://arxiv.org/abs/2510.11967
  • Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan, 1 Oct 2025, ACON: Optimizing Context Compression for Long-horizon LLM Agents, https://arxiv.org/abs/2510.00615
  • Bosung Kim and Prithviraj Ammanabrolu, 1 Oct 2025, Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning, https://arxiv.org/abs/2505.16928
  • Yingming Zheng, Hanqi Li, Kai Yu and Lu Chen, 24 Sep 2025, When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models, https://arxiv.org/abs/2509.18762
  • Tenghui Li and Guoxu Zhou and Xuyang Zhao and Yuning Qiu and Qibin Zhao, 25 Oct 2025, Efficient Low Rank Attention for Long-Context Inference in Large Language Models, https://arxiv.org/abs/2510.23649
  • Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang, 28 Oct 2025, AgentFold: Long-Horizon Web Agents with Proactive Context Management, https://arxiv.org/abs/2510.24699
  • Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim, 28 Oct 2025, FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation, https://arxiv.org/abs/2502.01068
  • J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard, 22 Oct 2025, Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention, https://arxiv.org/abs/2510.19875
  • Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou, 23 Oct 2025, Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning, https://arxiv.org/abs/2510.19338
  • Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity, 19 Oct 2025, LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding, https://arxiv.org/abs/2510.16783
  • Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You, 20 Oct 2025, AcademicEval: Live Long-Context LLM Benchmark, https://arxiv.org/abs/2510.17725
  • Anmol Mekala, Anirudh Atmakuru, Yixiao Song, Marzena Karpinska, Mohit Iyyer, 20 Sep 2025, Does quantization affect models' performance on long-context tasks?, https://arxiv.org/abs/2505.20276
  • Taejong Joo, Diego Klabjan, 25 Oct 2025, Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context, https://arxiv.org/abs/2502.04580
  • Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner, 24 Oct 2025, DocFinQA: A Long-Context Financial Reasoning Dataset, https://arxiv.org/abs/2401.06915
  • Taejong Joo, Shu Ishida, Ivan Sosnovik, Bryan Lim, Sahand Rezaei-Shoshtari, Adam Gaier, Robert Giaquinto, 26 Sep 2025, Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration, https://arxiv.org/abs/2509.21848
  • Seong-Woong Shim, Myunsoo Kim, Jae Hyeon Cho, Byung-Jun Lee, 26 Sep 2025, Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding, https://arxiv.org/abs/2509.21865
  • Zhuo Yang, Daolang Wang, Lingli Ge, Beilun Wang, Tianfan Fu, Yuqiang Li, 26 Sep 2025, Reasoning BO: Enhancing Bayesian Optimization with Long-Context Reasoning Power of LLMs, https://arxiv.org/abs/2505.12833
  • Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang, 26 Sep 2025, InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models, https://arxiv.org/abs/2503.06692
  • Yingfa Chen, Yutong Wu, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun, 26 Sep 2025, Cost-Optimal Grouped-Query Attention for Long-Context Modeling, https://arxiv.org/abs/2503.09579
  • Peize He, Zichen Wen, Yubo Wang, Yuxuan Wang, Xiaoqian Liu, Jiajie Huang, Zehui Lei, Zhuangcheng Gu, Xiangqi Jin, Jiabing Yang, Kai Li, Zhifei Liu, Weijia Li, Cunxiang Wang, Conghui He, Linfeng Zhang, 8 Oct 2025, AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs, https://arxiv.org/abs/2510.07293
  • Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, Lai Wei, 8 Oct 2025, Artificial Hippocampus Networks for Efficient Long-Context Modeling, https://arxiv.org/abs/2510.07318
  • Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu, 8 Oct 2025, HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models, https://arxiv.org/abs/2505.20444
  • Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, Zijia Lin, Debing Zhang, Songlin Hu, Binghui Guo, 26 Sep 2025, EntropyLong: Effective Long-Context Training via Predictive Uncertainty, https://arxiv.org/abs/2510.02330
  • Xuan Xu, Haolun Li, Zhongliang Yang, Beilin Chu, Jia Song, Moxuan Xu, Linna Zhou, 3 Oct 2025, Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?, https://arxiv.org/abs/2510.03174
  • Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu, 19 Oct 2025, Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism, https://arxiv.org/abs/2510.17896
  • Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang, 20 Oct 2025, Efficient Long-context Language Model Training by Core Attention Disaggregation, https://arxiv.org/abs/2510.18121
  • Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu, 21 Oct 2025, MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training, https://arxiv.org/abs/2510.18830
  • Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata, 25 Sep 2025, Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles, https://arxiv.org/abs/2509.21028
  • Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma, 25 Sep 2025, Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training, https://arxiv.org/abs/2509.21275
  • Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang, 27 Sep 2025, Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents, https://arxiv.org/abs/2509.23040
  • Min Liu, Deepak Pathak, Ananye Agarwal, 28 Sep 2025, LocoFormer: Generalist Locomotion via Long-context Adaptation, https://arxiv.org/abs/2509.23745
  • Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso, 27 Sep 2025, Long-Context Generalization with Sparse Attention, https://arxiv.org/abs/2506.16640
  • Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul, 17 Oct 2025, Extending Audio Context for Long-Form Understanding in Large Audio-Language Models, https://arxiv.org/abs/2510.15231
  • Siddharth Chaudhary, Dev Patel, Maheep Chaudhary, Bennett Browning, 16 Oct 2025, Hydra: A Modular Architecture for Efficient Long-Context Reasoning, https://arxiv.org/abs/2508.15099
  • Naman Gupta, Shreeyash Gowaikar, Arun Iyer, Kirankumar Shiragur, Ramakrishna B Bairi, Rishikesh Maurya, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta, 6 Oct 2025, COSMIR: Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context, https://arxiv.org/abs/2510.04568
  • Xin Liu, Xudong Wang, Pei Liu, Guoming Tang, 5 Oct 2025, ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs, https://arxiv.org/abs/2503.10714
  • Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, and Zizhao Zhang, 9 Oct 2025, COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context, https://arxiv.org/abs/2510.08790
  • Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li, 10 Oct 2025, Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation, https://arxiv.org/abs/2510.07414
  • Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić, 24 Oct 2025, L²M: Mutual Information Scaling Law for Long-Context Language Modeling, https://arxiv.org/abs/2503.04725
  • Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram, 10 Oct 2025, DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning, https://arxiv.org/abs/2510.09883
  • Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, Chuan Wu, 12 Oct 2025, DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism, https://arxiv.org/abs/2510.10620
  • Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu, 13 Oct 2025, Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM, https://arxiv.org/abs/2503.07680
  • Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang, 12 Oct 2025, SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization, https://arxiv.org/abs/2505.11166
  • Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang, 8 Oct 2025, When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs, https://arxiv.org/abs/2510.07499
  • Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari, 8 Oct 2025, OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs, https://arxiv.org/abs/2510.07535
  • Yuzhe Gu, Xiyu Liang, Jiaojiao Zhao, Enmao Diao, 9 Oct 2025, OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference, https://arxiv.org/abs/2510.07651
  • Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel, 22 Sep 2025, LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling, https://arxiv.org/abs/2509.18467
  • Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet, 7 Oct 2025, Critical attention scaling in long-context transformers, https://arxiv.org/abs/2510.05554
  • Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang, 7 Oct 2025, Revisiting Long-context Modeling from Context Denoising Perspective, https://arxiv.org/abs/2510.05862

Research on Quadratic Attention Cost

Linearizing the attention algorithm to avoid the quadratic cost of self-attention is an area with a massive research base, and numerous algorithms have been proposed. Faster attention algorithms include sparse attention and Flash Attention. See research on attention optimization methods.
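To make the quadratic bottleneck concrete, here is a minimal NumPy sketch (a generic illustration, not code from any of the papers listed above) comparing standard softmax attention, which materializes an n×n score matrix for n input tokens, with a kernel-feature-map linearization in the style of the linear attention literature, which reorders the matrix products so the cost grows only linearly in n. The single head, absence of masking, and the ELU+1 feature map are simplifying assumptions.

```python
# Minimal sketch: quadratic softmax attention vs. a linearized variant.
# Assumptions: single attention head, no causal masking, ELU(x)+1 feature map.
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the (n, n) score matrix is the quadratic bottleneck.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])              # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (n, d)

def linear_attention(Q, K, V, eps=1e-6):
    # Kernel-feature-map linearization: reassociate the products so that no
    # (n, n) matrix is ever formed; cost is O(n*d^2) instead of O(n^2*d).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # ELU(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                                         # (d, d) key-value summary
    Z = Qf @ Kf.sum(axis=0) + eps                         # (n,) normalizer
    return (Qf @ KV) / Z[:, None]                         # (n, d)

if __name__ == "__main__":
    n, d = 4096, 64                                       # context length, head dim
    rng = np.random.default_rng(0)
    Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
    print(softmax_attention(Q, K, V).shape)               # (4096, 64)
    print(linear_attention(Q, K, V).shape)                # (4096, 64)
```

Flash Attention takes a different route: it keeps the exact quadratic softmax computation but tiles it so the full score matrix is never materialized in GPU memory.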

Research on Positional Encoding Optimization

One of the less obvious bottlenecks for long contexts is the positional encoding algorithm. See research on positional encoding optimizations and removing positional encoding.
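As a concrete illustration (a minimal sketch under simplifying assumptions, not any specific model's implementation), the snippet below applies rotary positional embedding (RoPE) to a block of query vectors and shows the simple "position interpolation" trick of rescaling positions, one of the basic ideas behind context-window extension methods such as the LongRoPE line of work cited above. The base frequency of 10000 and the scale factor are illustrative defaults.

```python
# Minimal RoPE sketch with optional position interpolation (illustrative only;
# base=10000 and the scale factor are assumed defaults, not a specific model's).
import numpy as np

def rope(x, positions, base=10000.0, scale=1.0):
    """Apply rotary position embedding to x of shape (n, d), with d even.

    scale > 1 compresses positions ("position interpolation"), a simple way
    to stretch a trained context window over a longer input sequence."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)             # per-pair rotation rates
    angles = np.outer(positions / scale, freqs)           # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,           # rotate each (x1, x2) pair
                           x1 * sin + x2 * cos], axis=-1)

if __name__ == "__main__":
    n, d = 8, 16
    q = np.ones((n, d))
    q_rot = rope(q, np.arange(n))                # positions as seen in training
    q_interp = rope(q, np.arange(n), scale=4.0)  # 4x position interpolation
    print(q_rot.shape, q_interp.shape)           # (8, 16) (8, 16)
```

Compressing positions this way keeps the rotation angles within the range seen during training, which is one reason interpolation-style methods can extend the usable context window with relatively little fine-tuning.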

Research on Context Length

Research on making "longer" Transformer models includes:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: