Aussie AI
Long and Ultralong Context LLMs
-
Last Updated 2 March, 2025
-
by David Spuler, Ph.D.
What are Long Context LLMs?
Long context LLMs are model architectures that can accept long input texts consisting of many tokens. Long context models are generally defined as those handling more than 100k tokens, such as the 128k window sizes that have become common in major models. The latest SOTA models that can handle over a million tokens in their context window are called "ultralong models" in recent terminology.
Context window size is the number of input tokens that a model can process. Early models, even ChatGPT, had small context windows of about 2,048 tokens. Each token is usually a part-word or a whole word, so such models could only process inputs of about 1,000-2,000 words.
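As a rough illustration of how words, tokens, and the context window relate, here is a minimal Python sketch; the 4-characters-per-token ratio and the 128k window size are illustrative assumptions, not figures for any particular model or tokenizer.
# Crude token estimate and context-window check (illustrative constants only).

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Real tokenizers (BPE, SentencePiece, etc.) vary; ~4 chars/token is a
    # common rule of thumb for English text.
    return max(1, round(len(text) / chars_per_token))

def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    return estimate_tokens(text) <= context_window

if __name__ == "__main__":
    novel = "word " * 100_000            # roughly a 100,000-word novel
    print(estimate_tokens(novel))        # ~125,000 estimated tokens
    print(fits_in_context(novel))        # True for a 128k window, but only just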
Why Long Context?
Why seek a longer input size? Or output size? Well, because a report is going to be 5,000 words, and a full-length novel is 100,000 words, or even 200,000 words if it's in the "epic sci-fi" genre. These are all tokens that must be output (for generation) or input (for editing).
Context covers a lot more than single documents, too. The input and output of an LLM may include:
- The conversational history for a chatbot
- Chunks of multiple documents in a RAG system
- Video streams of almost unfathomable sizes
- All of the PDF receipt documents on your laptop (an AI tax return app)
- An entire software repository of many code files (an AI coding assistant)
- Combinations of text and images (multimodal LLMs)
There are many reasons we might want an LLM to be able to handle a lot of input, or to create a lot of output. But the main problem has been processing this many tokens efficiently.
Newer models have been increasing the context window size. For example, GPT-4 has a 32k window size (32,000 tokens), enough to handle a small novella or novelette of maybe 15,000-20,000 words. Anthropic reportedly has a Claude model with a 100k context window. MosaicML has an open-source model called "MPT-7B-StoryWriter-65k+" with a 65,000-token window size.
Optimizing the Context Window Size
Why is there this context size limitation? One of the main bottlenecks is the "quadratic" cost of the self-attention mechanism (sketched in code below), and there are various ways to optimize attention to overcome it. However, it's not the only bottleneck. Alperovich (2023) describes the "secret sauce" for long contexts as fixing three main bottlenecks:
- Quadratic attention cost (in the input token size)
- Quadratic size of internal tensors (in the model dimension)
- Positional embedding cost
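To make the first bottleneck concrete, here is a minimal sketch of single-head attention in Python/NumPy; the n-by-n score matrix is what makes both compute and memory grow quadratically with the number of input tokens. The dimensions are toy values, not those of any real model.
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention over (n, d) arrays; the cost is dominated by the
    (n, n) score matrix, i.e. quadratic in the sequence length n."""
    n, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)                  # (n, n) -- the quadratic part
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) output

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)   # the scores alone need n*n floats of memory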
Some of the techniques relevant to allowing the processing and creation of longer token sequences include (a toy sketch of token pruning follows this list):
- Attention optimization
- Autoregression optimizations
- Tokenization algorithms
- Token pruning
- Embeddings pruning
- Length pruning
- Positional encoding optimization
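As a toy illustration of one technique on this list, the sketch below prunes tokens using the column sums of an attention matrix as an importance score; this is a simplified stand-in for the scoring heuristics used in the token pruning literature, with made-up array shapes.
import numpy as np

def prune_tokens(hidden, attn_weights, keep_ratio=0.5):
    """hidden: (n, d) token embeddings; attn_weights: (n, n) attention matrix.
    Keep only the most-attended tokens, shrinking the work in later layers."""
    importance = attn_weights.sum(axis=0)         # how much each token is attended to
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])   # top-k token indices, in sequence order
    return hidden[keep], keep

n, d = 1024, 64
hidden = np.random.randn(n, d)
attn = np.random.rand(n, n)
pruned, kept_indices = prune_tokens(hidden, attn, keep_ratio=0.25)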
Ultra-Long Context Models
Ultra-long context models are LLMs that accept a context window of more than 1M tokens. Most of the largest models do not yet accept such long contexts (128k is common), but there are a few commercial models that accept over 1M tokens in their context window.
Research papers on ultra-long context LLMs:
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Context window over 1 million tokens.)
- MiniMax, Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf
- MiniMax, Jan 2025, MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era, https://www.minimaxi.com/en/news/minimax-01-series-2
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
- Demis Hassabis, Jan 2025, X post: Announcing Gemini 2.0 Flash https://x.com/demishassabis/status/1881844417746632910 (Gemini 2.0 Flash from Google is a Large Reasoning Model with a 1M ultra-long context.)
- Manpreet Singh, Feb 2025, Goodbye RAG? Gemini 2.0 Flash Have Just Killed It! https://ai.gopubby.com/goodbye-rag-gemini-2-0-flash-have-just-killed-it-96301113c01f
- Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang, 13 Feb 2025, InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU, https://arxiv.org/abs/2502.08910 (Using dynamic token pruning and KV cache data offloading to CPU memory.)
- Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng, 26 Feb 2025, From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens, https://arxiv.org/abs/2502.18890 (Extending speculative decoding to address three bottlenecks in ultra-long context: frequent model reloading, KV cache size, and repetitive output content generation. Uses techniques such as KV cache eviction and decoding penalties to avoid repetition.) https://github.com/bigai-nlco/TokenSwift
Length Generalization: Accuracy with Longer Contexts
Speed is not the only problem with long contexts. Vanilla Transformers are also not particularly good at generalizing their results over a long context window. This ability is known as "length generalization" (or "length extrapolation"), and improving accuracy on longer inputs and longer outputs is an area of active research.
One of the methods being analyzed to improve length generalization is the "scratchpad" or "chain-of-thought" family of algorithms. The idea is that the AI inference engine emits rough summaries to an internal scratchpad at regular intervals, which are merged into subsequent inference, so that the AI helps itself keep track of its own thoughts over a longer output sequence.
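A minimal sketch of this scratchpad idea is below; the generate() and summarize() functions are placeholders standing in for whatever LLM calls an engine would actually make, so only the control flow is being illustrated.
def generate(prompt: str, max_tokens: int) -> str:
    # Placeholder: substitute a real LLM call here.
    return f"[next segment, continuing from {len(prompt)} characters of context]"

def summarize(text: str) -> str:
    # Placeholder: substitute a real LLM call here.
    return text[:80]

def long_generation(task: str, segments: int = 10, seg_tokens: int = 1000) -> str:
    scratchpad = []   # rolling summaries of everything written so far
    output = []
    for _ in range(segments):
        context = task + "\nNotes so far:\n" + "\n".join(scratchpad)
        segment = generate(context, max_tokens=seg_tokens)
        output.append(segment)
        scratchpad.append(summarize(segment))   # merged into subsequent inference
    return "".join(output)

print(long_generation("Write a long report on context windows.", segments=3))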
Research papers on "length generalization" include:
- Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur, Nov 2022, Exploring Length Generalization in Large Language Models, https://arxiv.org/abs/2207.04901
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
- Jishnu Ray Chowdhury, Cornelia Caragea, May 2023, Monotonic Location Attention for Length Generalization, https://arxiv.org/abs/2305.20019
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
- Nan Yang, Laicheng Zhong, Fan Huang, Dong Yuan, Wei Bao, Feb 2023, Random Padding Data Augmentation, https://arxiv.org/abs/2302.08682
- Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning, Oct 2020, The EOS Decision and Length Extrapolation, https://arxiv.org/abs/2010.07174
- Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness, May 2023, Randomized Positional Encodings Boost Length Generalization of Transformers, https://arxiv.org/abs/2305.16843
- Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning, Oct 2020, The EOS Decision and Length Extrapolation, https://arxiv.org/abs/2010.07174v1
- Mirelle Bueno, Carlos Gemmell, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira, Nov 2022, Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models, https://arxiv.org/abs/2208.11445 Code: https://github.com/MirelleB/induced-rationales-markup-tokens
- Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian, Feb 2023, A Study on ReLU and Softmax in Transformer, https://arxiv.org/abs/2302.06461
- David Chiang, Peter Cholak, Feb 2022, Overcoming a Theoretical Limitation of Self-Attention, https://arxiv.org/abs/2202.12172
- Amirkeivan Mohtashami, Martin Jaggi, May 2023, Landmark Attention: Random-Access Infinite Context Length for Transformers, https://arxiv.org/abs/2305.16300
- kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
- Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
- Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- MohammadReza Ebrahimi, Sunny Panchal, Roland Memisevic, 10 Aug 2024, Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers, https://arxiv.org/abs/2408.05506 (Explores how the sequential nature of token access in attention reduces the accuracy in long context analysis.)
- Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055
- David Spuler, March 2024, Length Generalization, in Generative AI in C++, https://www.aussieai.com/book/ch20-length-generalization
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Minghan Li, Eric Gaussier, Juntao Li, Guodong Zhou, 9 Nov 2024, KeyB2: Selecting Key Blocks is Also Important for Long Document Ranking with Large Language Models, https://arxiv.org/abs/2411.06254
- Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
- Yansheng Mao, Jiaqi Li, Fanxu Meng, Jing Xiong, Zilong Zheng, Muhan Zhang, 18 Dec 2024, LIFT: Improving Long Context Understanding Through Long Input Fine-Tuning, https://arxiv.org/abs/2412.13626
- Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 19 Dec 2024, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, https://arxiv.org/abs/2412.15204 https://longbench2.github.io/
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang, 20 Jan 2025, RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? https://arxiv.org/abs/2501.11284 https://huggingface.co/RedStar-Reasoning
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Industry Articles on Long Context Length
Real-world industry models have started offering longer context windows:
- Anthropic, Introducing 100K Context Windows, May 11, 2023, https://www.anthropic.com/index/100k-context-windows
- The MosaicML NLP Team, May 5, 2023, Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs, https://www.mosaicml.com/blog/mpt-7b (See long context model MPT-7B-StoryWriter-65k+)
- Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
- Carl Franzen, September 29, 2023, Meta quietly unveils Llama 2 Long AI that beats GPT-3.5 Turbo and Claude 2 on some tasks, Venture Beat, https://venturebeat.com/ai/meta-quietly-releases-llama-2-long-ai-that-outperforms-gpt-3-5-and-claude-2-on-some-tasks/ (Context windows up to 32k.)
- Reddit user pseudonerv, June 2023, A simple way to "Extending Context to 8K"?! Reddit LocalLLaMA group, https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/a_simple_way_to_extending_context_to_8k/
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou, 16 Apr 2024 (v2), Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, https://arxiv.org/abs/2404.08801 Code: https://github.com/XuezheMax/megalodon
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 10 Apr 2024, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
- Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
- Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, Dec 2023, Efficient Streaming Language Models with Attention Sinks https://arxiv.org/abs/2309.17453 Code: https://github.com/mit-han-lab/streaming-llm
- Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, Nov 2023, YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/abs/2309.00071 Code: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k
- kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers.)
- D Mamakas, P Tsotsi, I Androutsopoulos, 2022, Processing long legal documents with pre-trained transformers: Modding legalbert and longformer, https://arxiv.org/abs/2211.00974
- A Askari, S Verberne, O Alonso, S Marchesin, 2021, Combining Lexical and Neural Retrieval with Longformer-based Summarization for Effective Case Law Retrieval. DESIRES, 2021, https://desires.dei.unipd.it/2021/papers/paper-02.pdf.pdf
- Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by Summarizing Long Sequences. In Proceedings of the 6th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1801.10198
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860 Code: https://github.com/kimiyoung/transformer-xl
- Cao, Q.; Trivedi, H.; Balasubramanian, A.; and Balasubramanian, N. 2020. Deformer: Decomposing pre-trained transformers for faster question answering. arXiv preprint arXiv:2005.00697, https://arxiv.org/abs/2005.00697 Code: https://github.com/StonyBrookNLP/deformer
- 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
- Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong, 27 Feb 2024, Training-Free Long-Context Scaling of Large Language Models, https://arxiv.org/abs/2402.17463 Code: https://github.com/hkunlp/chunkllama
- Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
- Emma Järvinen, 2023, Aalto University, Finland, Master’s thesis in Mathematics and Operations Research, Long-input summarization using Large Language Models, https://aaltodoc.aalto.fi/server/api/core/bitstreams/cd47964e-5b5e-4af0-9af4-731970184358/content
- Akruti Acharya, May 25, 2023, MEGABYTE, Meta AI’s New Revolutionary Model Architecture, Explained, https://encord.com/blog/meta-ai-megabyte-model-architecture-explained/
- Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
- Jingyu Wang; Lu Zhang; Xueqing Li; Huazhong Yang; Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
- Yuzhen Mao, Martin Ester, Ke Li, 2023, IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs, NeurIPS 2023, https://neurips2023-enlsp.github.io/papers/paper_24.pdf
- Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
- Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin, 5 Jan 2024, Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, https://arxiv.org/abs/2401.02669 (Long context processing by a modification to the QKV caching mechanisms.)
- QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, https://arxiv.org/abs/2210.17114 (Intel labs paper. Low-bit quantization, distillation, and Length-Adaptive Transformer (LAT) technique. )
- Amazon, Oct 2023, MistralLite Model, https://huggingface.co/amazon/MistralLite
- DY Fu, S Arora, J Grogan, I Johnson, S Eyuboglu, Oct 2023, Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture https://arxiv.org/pdf/2310.12109.pdf
- C Hao, P Zhang, M Xie, D Zhao, 2023, Recurrent Transformers for Long Document Understanding, CCF International Conference on Natural Language, https://dl.acm.org/doi/abs/10.1007/978-3-031-44693-1_5
- Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang, Oct 2023, LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers, https://arxiv.org/abs/2310.03294v1
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- S Yang, S Zhang, M Fang, F Yang, S Liu, 2022, A hierarchical representation model based on longformer and transformer for extractive summarization https://www.mdpi.com/2079-9292/11/11/1706
- J Ding, S Ma, L Dong, X Zhang, S Huang 2023, Longnet: Scaling transformers to 1,000,000,000 tokens, https://arxiv.org/abs/2307.02486
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, 8 Mar 2024 (v3), LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 Code: https://github.com/dvlab-research/LongLoRA
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- Emilia David, April 30, 2024, ChatGPT’s AI ‘memory’ can remember the preferences of paying customers, The Verge, https://www.theverge.com/2024/4/29/24144680/chatgpt-plus-memory-chatbot-subscription-details-preferences-personal-assistant
- Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- J. Su. 2023, Rectified rotary position embeddings. https://github.com/bojone/rerope
- B Peng, J Quesnelle, H Fan, E Shippole, 2023 YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/pdf/2309.00071.pdf
- Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055 Code: https://github.com/THUDM/LongWriter
- Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
- Magic Team, August 29, 2024, 100M Token Context Windows: Research update on ultra-long context models, our partnership with Google Cloud, and new funding, https://magic.dev/blog/100m-token-context-windows
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 1 Dec 2023 (v3), HyperAttention: Long-context Attention in Near-Linear Time, https://arxiv.org/abs/2310.05869
- Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
- KVCache.AI, 2024, K Transformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations, https://github.com/kvcache-ai/ktransformers
- Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao, 9 Aug 2024 (v2), ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
- David Spuler, March 2024, Long Context Research, in Generative AI in C++, https://www.aussieai.com/book/ch20-long-context-research
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Michael Nuñez, September 4, 2024, 500,000 tokens: How Anthropic’s Claude Enterprise is pushing AI boundaries, https://venturebeat.com/ai/500000-tokens-how-anthropics-claude-enterprise-is-pushing-ai-boundaries/
- Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li, 4 Sep 2024, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA, https://arxiv.org/abs/2409.02897
- Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
- Asif Razzaq, September 5, 2024, Yi-Coder Released by 01.AI: A Powerful Small-Scale Code LLM Series, Delivering Exceptional Performance in Code Generation, Editing, and Long-Context Comprehension, https://www.marktechpost.com/2024/09/05/yi-coder-released-by-01-ai-a-powerful-small-scale-code-llm-series-delivering-exceptional-performance-in-code-generation-editing-and-long-context-comprehension/
- Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin, 16 Apr 2024, Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, https://arxiv.org/abs/2404.10308 https://github.com/alinlab/HOMER
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
- Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao, 3 Sep 2024, You Only Use Reactive Attention Slice For Long Context Retrieval, https://arxiv.org/abs/2409.13695
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
- Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
- Anonymous authors, Oct 2024, LooongLlava: Scaling Multi-Modal LLMs to 1000 Images Efficiently Via a Hybrid Architecture, https://openreview.net/pdf?id=wqA7QmpUwa
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
- Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
- Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu, 12 Dec 2024, V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding, https://arxiv.org/abs/2412.09616 https://github.com/OpenGVLab/V2PE
- Jérôme DIAZ, Dec 2024, Why Retrieval-Augmented Generation Is Still Relevant in the Era of Long-Context Language Models. In this article we will explore why 128K tokens (and more) models can’t fully replace using RAG. https://towardsdatascience.com/why-retrieval-augmented-generation-is-still-relevant-in-the-era-of-long-context-language-models-e36f509abac5
- Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky, 17 Oct 2024 (v2), Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, https://arxiv.org/abs/2407.16833
- Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 20 Nov 2023 (v3), Lost in the Middle: How Language Models Use Long Contexts, https://arxiv.org/abs/2307.03172 (Information is best placed at the start, or otherwise at the end, of a long context.)
- Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu, 13 Dec 2024, SCBench: A KV Cache-Centric Analysis of Long-Context Methods, https://arxiv.org/abs/2412.10319 https://aka.ms/SCBench
- Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan, 12 Dec 2024, VCA: Video Curious Agent for Long Video Understanding, https://arxiv.org/abs/2412.10471
- Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian, 18 Dec 2024 (v2), Boosting Long-Context Management via Query-Guided Activation Refilling, https://arxiv.org/abs/2412.12486 (Maintaining two KV caches, one global, one local.)
- Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou, 18 Dec 2024, SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation, https://arxiv.org/abs/2412.13649 (Different KV cache optimizations for prefill and decoding phases.)
- Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 19 Dec 2024, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, https://arxiv.org/abs/2412.15204 https://longbench2.github.io/
- Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
- Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
- Jiaan Wang, Fandong Meng, Yunlong Liang, Jie Zhou, 23 Dec 2024, DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, https://arxiv.org/abs/2412.17498 https://github.com/krystalan/DRT-o1 (Examines similes and metaphors in literature using long CoT.)
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi, 28 Dec 2024, LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System, https://arxiv.org/abs/2412.20166
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Context window over 1 million tokens.)
- Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu, 15 Jan 2025, MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents, https://arxiv.org/abs/2501.08828
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han, 22 Jan 2025, Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference, https://arxiv.org/abs/2501.12959 (Efficient input token scanning using early exit during prefill to prune tokens for the decoding phase.)
- Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
- Cristian Leo, Feb 2025, Don’t Do RAG: Cache is the future: CAG or RAG? Let’s explore Cached Augmented Generation, its math, and trade-offs. Let’s dig into its research paper to see what it excels at, and how you could leverage it. https://levelup.gitconnected.com/dont-do-rag-cache-is-the-future-d1e995f0c76f
- Nathaniel Tomczak, Sanmukh Kuppannagari, 31 Jan 2025, Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques, https://arxiv.org/abs/2502.01659 (Approaching attention optimization as a graph-theoretical problem.)
- Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, and Bin Cui. 2025. MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training. Proc. ACM Manag. Data 3, 1, Article 53 (February 2025), 28 pages. https://doi.org/10.1145/3709703 https://dl.acm.org/doi/abs/10.1145/3709703
- Jack Wallen, Feb. 13, 2025, How I feed my files to a local AI for better, more relevant responses Msty is one of the best apps for interacting with the Ollama local AI tool and it contains a feature you'll want to use to help provide contextuality to its responses. https://www.zdnet.com/article/how-i-feed-my-files-to-a-local-ai-for-better-more-relevant-responses/
- Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, (authors omitted), 22 Jan 2025, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs/2501.12599 (Includes a "length penalty" to address token reduction.)
- Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu, 18 Feb 2025, MoBA: Mixture of Block Attention for Long-Context LLMs, https://arxiv.org/abs/2502.13189 https://github.com/MoonshotAI/MoBA
- Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja, 11 Feb 2025, Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification, https://arxiv.org/abs/2502.09647 (Analysis of how attention works in long context scenarios.)
- Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang, 19 Feb 2025, MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads, https://arxiv.org/abs/2502.13963
- Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang, 27 Feb 2025, LongRoPE2: Near-Lossless LLM Context Window Scaling, https://arxiv.org/abs/2502.20082 https://github.com/microsoft/LongRoPE (Addresses RopE issues with long context optimization.)
- Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, Christopher Re, 21 Feb 2025, Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models, https://arxiv.org/abs/2502.15964 (Reading long documents using on-device small models, by breaking the document into small chunks processed by local LLMs, and only using the cloud LLMs for finalization tasks.)
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Research on Quadratic Attention Cost
Linearizing the attention algorithm to avoid the quadratic cost of attention processing is an area with a massive research base and numerous proposed algorithms. Faster attention algorithms include sparse attention and Flash Attention. See research on attention optimization methods.
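As one concrete flavor of the sparse-attention idea, the sketch below restricts each token to a fixed local window of earlier tokens, so the cost grows as n times the window size rather than n squared; it is a simplified illustration with made-up sizes, not Flash Attention or any specific published kernel.
import numpy as np

def sliding_window_attention(Q, K, V, window=256):
    """Local (sliding-window) attention: each token attends only to the
    previous `window` tokens, giving O(n * window) cost instead of O(n^2)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        start = max(0, i - window + 1)                  # only look back `window` tokens
        scores = Q[i] @ K[start:i + 1].T / np.sqrt(d)
        scores -= scores.max()
        w = np.exp(scores)
        w /= w.sum()
        out[i] = w @ V[start:i + 1]
    return out

n, d = 2048, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = sliding_window_attention(Q, K, V, window=128)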
Research on Positional Encoding Optimization
One of the less obvious bottlenecks for long contexts is the positional encoding algorithm. See research on positional encoding optimizations and removing positional encoding.
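For a concrete example, the sketch below applies rotary position embeddings (RoPE) with simple position interpolation, rescaling positions so that a longer context stays within the position range seen during training (in the spirit of the positional interpolation work cited below); the constants and the half-split rotation layout are illustrative choices, not an exact match for any particular model.
import numpy as np

def rope(x, positions, base=10000.0, scale=1.0):
    """Apply rotary position embedding to x of shape (n, d), with d even.
    scale < 1.0 implements position interpolation, e.g. squeezing 32k
    positions into a 4k-trained range."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # (d/2,) inverse frequencies
    angles = np.outer(positions * scale, freqs)     # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

n, d = 32768, 64
q = np.random.randn(n, d)
positions = np.arange(n)
q_rot = rope(q, positions, scale=4096 / 32768)   # interpolate 32k positions into a 4k range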
Research on Context Length
Research on making "longer" Transformer models includes:
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation Ofir Press, Noah A. Smith, Mike Lewis, Apr 2022, https://arxiv.org/abs/2108.12409 (Alibi for longer context length.)
- Siyu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang, May 2021, ERNIE-Doc: A Retrospective Long-Document Modeling Transformer, https://arxiv.org/abs/2012.15688
- Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020. https://arxiv.org/abs/2007.14062 (Sparse linear attention algorithm.)
- Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In The International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/1911.05507
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860, Code: https://github.com/kimiyoung/transformer-xl
- Longformer: The Long-Document Transformer Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao, May 2021, Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, https://arxiv.org/abs/2103.15358, Code: https://github.com/microsoft/vision-longformer
- Qiuhui Chen, Yi Hong, May 2023, Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs, https://arxiv.org/abs/2302.00901, Code: https://github.com/Qybc/LongFormer
- Chun-Fu Chen, Quanfu Fan, Rameswar Panda, Aug 2021, CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, https://arxiv.org/abs/2103.14899, Code: https://github.com/IBM/CrossViT
- Qiuhui Chen, Yi Hong, May 2023, Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs, https://arxiv.org/abs/2302.00901, Code: https://github.com/Qybc/LongFormer
- Qiqi Zhou, Yichen Zhu, July 2023, Make A Long Image Short: Adaptive Token Length for Vision Transformers https://arxiv.org/abs/2307.02092
- Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, Aug 2023, LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, https://arxiv.org/abs/2308.14508, Code: https://github.com/THUDM/LongBench
- Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei, July 2023, LongNet: Scaling Transformers to 1,000,000,000 Tokens, https://arxiv.org/abs/2307.02486
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (An efficient low-rank attention method as part of long context optimizations.)
- Ivan Sekulić, Amir Soleimani, Mohammad Aliannejadi, Fabio Crestani, Sep 2020, Longformer for MS MARCO Document Re-ranking Task, https://arxiv.org/abs/2009.09392, Code: https://github.com/isekulic/longformer-marco
- Yikuan Li, Ramsey M. Wehbe, Faraz S. Ahmad, Hanyin Wang, Yuan Luo, Apr 2022, Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences, https://arxiv.org/abs/2201.11838, Code: https://github.com/luoyuanlab/Clinical-Longformer
- Anant Khandelwal, 2021, Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity, CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD), January 2021, Pages 10–19, https://dl.acm.org/doi/abs/10.1145/3430984.3431007 https://arxiv.org/abs/2007.07803
- Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137
- Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length?, June 2023. https://lmsys.org/blog/2023-06-29-longchat/
- Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, June 2023, Extending Context Window of Large Language Models via Positional Interpolation, https://arxiv.org/abs/2306.15595 (Introduces "positional interpolation" for long contexts, extending to 32k windows.)
- Y Chen, Y Li, A Xu, Q Sun, X Chen, C Xu, 2023, WAG-NAT: Window Attention and Generator Based Non-Autoregressive Transformer for Time Series Forecasting, ICANN 2023: Artificial Neural Networks and Machine Learning, pp. 293–304, https://link.springer.com/chapter/10.1007/978-3-031-44223-0_24, Code: https://github.com/cybisolated/WAG-NAT
- kaiokendev, 2023, Things im learning while training superhot. https://kaiokendev.github.io/til#extending-context-to-8k
- Together AI, Jul 28, 2023, Preparing for the era of 32K context: Early learnings and explorations, https://together.ai/blog/llama-2-7b-32k (Uses position interpolation and Flash Attention.)
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that replaces QKV attention with Receptance Weighted Key Value (RWKV)).
- Georgi Gerganov, June 2023 Extending context size via RoPE scaling #1965, Llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
- Hao Liu, Matei Zaharia, Pieter Abbeel, Oct 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
- Davis Yoshida, Allyson Ettinger, and Kevin Gimpel. 2020. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size. CoRR abs/2008.07027 (2020). arXiv:2008.07027 https://arxiv.org/abs/2008.07027 (Hybrid RNN-Transformer architecture with increased context size.)
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses grouped-query attention and sliding window attention.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 13, 2023, Flash-Decoding for long-context inference, PyTorch Blog, https://pytorch.org/blog/flash-decoding/
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf
- Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
- Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf