Aussie AI
Long Context
-
Last Updated 3 December, 2024
-
by David Spuler, Ph.D.
Context window size is the number of input tokens that a model can process. Early models, even the original ChatGPT, had small context windows of about 2,048 tokens. Each token is usually a part-word or a whole word, so this meant inputs of roughly 1,000-2,000 words.
Why seek a longer input size? Because a report can easily run to 5,000 words, and a full-length novel is around 100,000 words, or even 200,000 words if it's in the "epic sci-fi" genre.
Newer models have been increasing the context window size. For example, GPT-4 has a 32k window size (32,000 tokens), which will handle a small novella or novelette of maybe 15,000-20,000 words. Anthropic reportedly has a Claude model with a 100k context window. MosaicML has an open-source model called "MPT-7B-StoryWriter-65k+" with a 65,000-token window size.
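As a rough rule of thumb for English text, one token is about three-quarters of a word, so converting a window size in tokens to an approximate word count is simple arithmetic. Here is a minimal sketch; the 0.75 words-per-token ratio is an assumed average that varies by tokenizer and language.

```python
def approx_words(context_tokens, words_per_token=0.75):
    """Rough estimate of how many English words fit in a context window.
    The 0.75 words-per-token ratio is a rule of thumb, not an exact figure."""
    return int(context_tokens * words_per_token)

for size in (2048, 32_000, 65_000, 100_000):
    print(f"{size:>7} tokens ~= {approx_words(size):>6} words")
# e.g., 2048 tokens ~= 1536 words; 100,000 tokens ~= 75,000 words
```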
Why is there this context size limitation? One of the main bottlenecks is the "quadratic" cost of the self-attention mechanism, and there are various ways to optimize attention to overcome this limitation. However, it's not the only bottleneck. Alperovich (2023) describes the "secret sauce" for long contexts as fixing three main bottlenecks (a code sketch of the first one follows the list below):
- Quadratic attention cost (in the input token size)
- Quadratic size of internal tensors (in the model dimension)
- Positional embedding cost.
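To make the first bottleneck concrete, here is a minimal NumPy sketch of vanilla scaled dot-product attention for a single head. The intermediate scores matrix has shape (n, n), so its memory and compute grow quadratically with the number of input tokens n. This is an illustrative sketch only, not code from any particular inference engine.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Vanilla scaled dot-product attention for one head.
    Q, K, V have shape (n, d) for n tokens and head dimension d.
    The (n, n) score matrix is the source of the quadratic cost."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # shape (n, n) -- quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # shape (n, d)

# Doubling the context length quadruples the score matrix:
n, d = 2048, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)   # (2048, 64)
print(n * n)       # 4,194,304 score entries for a 2k context
```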
Some of the techniques with relevance to allowing the processing and creation of longer token sequences include:
- Attention optimization
- Autoregression optimizations
- Tokenization algorithms
- Token pruning
- Embeddings pruning
- Length pruning
- Positional encoding optimization
Length Generalization: Accuracy with Longer Contexts
Speed is not the only problem with long contexts. Vanilla Transformers are also not particularly good at generalizing their results to longer context windows. This ability is known as "length generalization" (or "length extrapolation"), and improving accuracy on long inputs and longer outputs is an area of active research.
One class of methods being analyzed to improve length generalization is "scratchpad" or "chain-of-thought" algorithms. The idea is that the AI inference engine emits rough summaries to an internal scratchpad at regular intervals, and these summaries are merged into subsequent inference, so the AI helps itself keep track of its own thoughts over a longer output sequence. A rough sketch of this loop appears below.
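As a rough illustration of the scratchpad idea, the following sketch generates a long output in segments and folds a running summary back into each subsequent prompt. The generate and summarize callables are hypothetical placeholders for whatever LLM API is in use; they are not part of any specific library.

```python
def generate_with_scratchpad(task, generate, summarize, segments=5):
    """Hypothetical scratchpad loop: the model periodically summarizes its
    own output so far, and that summary is fed back into the next
    generation step to help it stay coherent over a long output.
    `generate(prompt)` and `summarize(text)` are assumed LLM wrappers."""
    scratchpad = ""   # running summary of everything written so far
    output = []
    for i in range(segments):
        prompt = (
            f"Task: {task}\n"
            f"Notes so far: {scratchpad}\n"
            f"Continue with part {i + 1}:"
        )
        output.append(generate(prompt))
        # Refresh the scratchpad from the full output so far.
        scratchpad = summarize("\n".join(output))
    return "\n".join(output)
```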
Research papers on "length generalization" include:
- Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur, Nov 2022, Exploring Length Generalization in Large Language Models, https://arxiv.org/abs/2207.04901
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
- Jishnu Ray Chowdhury, Cornelia Caragea, May 2023, Monotonic Location Attention for Length Generalization, https://arxiv.org/abs/2305.20019
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
- Nan Yang, Laicheng Zhong, Fan Huang, Dong Yuan, Wei Bao, Feb 2023, Random Padding Data Augmentation, https://arxiv.org/abs/2302.08682
- Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning, Oct 2020, The EOS Decision and Length Extrapolation, https://arxiv.org/abs/2010.07174
- Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness, May 2023, Randomized Positional Encodings Boost Length Generalization of Transformers, https://arxiv.org/abs/2305.16843
- Mirelle Bueno, Carlos Gemmell, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira, Nov 2022, Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models, https://arxiv.org/abs/2208.11445 Code: https://github.com/MirelleB/induced-rationales-markup-tokens
- Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian, Feb 2023, A Study on ReLU and Softmax in Transformer, https://arxiv.org/abs/2302.06461
- David Chiang, Peter Cholak, Feb 2022, Overcoming a Theoretical Limitation of Self-Attention, https://arxiv.org/abs/2202.12172
- Amirkeivan Mohtashami, Martin Jaggi, May 2023, Landmark Attention: Random-Access Infinite Context Length for Transformers, https://arxiv.org/abs/2305.16300
- kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
- Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
- Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- MohammadReza Ebrahimi, Sunny Panchal, Roland Memisevic, 10 Aug 2024, Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers, https://arxiv.org/abs/2408.05506 (Explores how the sequential nature of token access in attention reduces the accuracy in long context analysis.)
- Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055
- David Spuler, March 2024, Length Generalization, in Generative AI in C++, https://www.aussieai.com/book/ch20-length-generalization
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Minghan Li, Eric Gaussier, Juntao Li, Guodong Zhou, 9 Nov 2024, KeyB2: Selecting Key Blocks is Also Important for Long Document Ranking with Large Language Models, https://arxiv.org/abs/2411.06254
Industry Articles on Long Context Length
Real-world industry models have started offering longer context windows:
- Anthropic, Introducing 100K Context Windows, May 11, 2023, https://www.anthropic.com/index/100k-context-windows
- The MosaicML NLP Team, May 5, 2023, Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs, https://www.mosaicml.com/blog/mpt-7b (See long context model MPT-7B-StoryWriter-65k+)
- Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
- Carl Franzen, September 29, 2023, Meta quietly unveils Llama 2 Long AI that beats GPT-3.5 Turbo and Claude 2 on some tasks, VentureBeat, https://venturebeat.com/ai/meta-quietly-releases-llama-2-long-ai-that-outperforms-gpt-3-5-and-claude-2-on-some-tasks/ (Context windows up to 32k.)
- Reddit user pseudonerv, June 2023, A simple way to "Extending Context to 8K"?! Reddit LocalLLaMA group, https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/a_simple_way_to_extending_context_to_8k/
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou, 16 Apr 2024 (v2), Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, https://arxiv.org/abs/2404.08801 Code: https://github.com/XuezheMax/megalodon
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 10 Apr 2024, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
- Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen, 10 Apr 2024 (v2), Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, https://arxiv.org/abs/2404.06480 Code: https://github.com/open-compass/Ada-LEval
- Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, Dec 2023, Efficient Streaming Language Models with Attention Sinks https://arxiv.org/abs/2309.17453 Code: https://github.com/mit-han-lab/streaming-llm
- Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, Nov 2023, YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/abs/2309.00071 Code: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k
- kaiokendev.github.io, Sep 2023 (accessed), Extending Context is Hard…but not Impossible, https://kaiokendev.github.io/context
- Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers.)
- D Mamakas, P Tsotsi, I Androutsopoulos, 2022, Processing long legal documents with pre-trained transformers: Modding legalbert and longformer, https://arxiv.org/abs/2211.00974
- A Askari, S Verberne, O Alonso, S Marchesin, 2021, Combining Lexical and Neural Retrieval with Longformer-based Summarization for Effective Case Law Retrieval. DESIRES, 2021, https://desires.dei.unipd.it/2021/papers/paper-02.pdf.pdf
- Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by Summarizing Long Sequences. In Proceedings of the 6th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1801.10198
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860 Code: https://github.com/kimiyoung/transformer-xl
- Cao, Q.; Trivedi, H.; Balasubramanian, A.; and Balasubramanian, N. 2020. Deformer: Decomposing pre-trained transformers for faster question answering. arXiv preprint arXiv:2005.00697, https://arxiv.org/abs/2005.00697 Code: https://github.com/StonyBrookNLP/deformer
- Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
- Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong, 27 Feb 2024, Training-Free Long-Context Scaling of Large Language Models, https://arxiv.org/abs/2402.17463 Code: https://github.com/hkunlp/chunkllama
- Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis, 19 May 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/abs/2305.07185
- Emma Järvinen, 2023, Aalto University, Finland, Master’s thesis in Mathematics and Operations Research, Long-input summarization using Large Language Models, https://aaltodoc.aalto.fi/server/api/core/bitstreams/cd47964e-5b5e-4af0-9af4-731970184358/content
- Akruti Acharya, May 25, 2023, MEGABYTE, Meta AI’s New Revolutionary Model Architecture, Explained, https://encord.com/blog/meta-ai-megabyte-model-architecture-explained/
- Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das, 15 Jan 2024, The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, https://arxiv.org/abs/2401.07872
- Jingyu Wang, Lu Zhang, Xueqing Li, Huazhong Yang, Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
- Yuzhen Mao, Martin Ester, Ke Li, 2023, IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs, NeurIPS 2023, https://neurips2023-enlsp.github.io/papers/paper_24.pdf
- Akarsh Kumar, 2023, Long-Range Memory Transformers, Massachusetts Institute of Technology, https://akarshkumar.com/downloads/nlp_longrange_memory_transformer.pdf https://github.com/akarshkumar0101/transformer-memory/tree/master/skip (Separates gradient information into explicitly long-range and short-range components.)
- Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin, 5 Jan 2024, Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, https://arxiv.org/abs/2401.02669 (Long context processing by a modification to the QKV caching mechanisms.)
- Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, QuaLA-MiniLM: A Quantized Length Adaptive MiniLM, https://arxiv.org/abs/2210.17114 (Intel Labs paper. Low-bit quantization, distillation, and the Length-Adaptive Transformer (LAT) technique.)
- Amazon, Oct 2023, MistralLite Model, https://huggingface.co/amazon/MistralLite
- DY Fu, S Arora, J Grogan, I Johnson, S Eyuboglu, Oct 2023, Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture https://arxiv.org/pdf/2310.12109.pdf
- C Hao, P Zhang, M Xie, D Zhao, 2023, Recurrent Transformers for Long Document Understanding, CCF International Conference on Natural Language, https://dl.acm.org/doi/abs/10.1007/978-3-031-44693-1_5
- Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang, Oct 2023, LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers, https://arxiv.org/abs/2310.03294v1
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- S Yang, S Zhang, M Fang, F Yang, S Liu, 2022, A hierarchical representation model based on longformer and transformer for extractive summarization https://www.mdpi.com/2079-9292/11/11/1706
- J Ding, S Ma, L Dong, X Zhang, S Huang, 2023, LongNet: Scaling transformers to 1,000,000,000 tokens, https://arxiv.org/abs/2307.02486
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, 8 Mar 2024 (v3), LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 Code: https://github.com/dvlab-research/LongLoRA
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- Emilia David, April 30, 2024, ChatGPT’s AI ‘memory’ can remember the preferences of paying customers, The Verge, https://www.theverge.com/2024/4/29/24144680/chatgpt-plus-memory-chatbot-subscription-details-preferences-personal-assistant
- Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- J. Su. 2023, Rectified rotary position embeddings. https://github.com/bojone/rerope
- Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 13 Aug 2024, LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, https://arxiv.org/abs/2408.07055 Code: https://github.com/THUDM/LongWriter
- Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
- Magic Team, August 29, 2024, 100M Token Context Windows: Research update on ultra-long context models, our partnership with Google Cloud, and new funding, https://magic.dev/blog/100m-token-context-windows
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 1 Dec 2023 (v3), HyperAttention: Long-context Attention in Near-Linear Time, https://arxiv.org/abs/2310.05869
- Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
- KVCache.AI, 2024, K Transformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations, https://github.com/kvcache-ai/ktransformers
- Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao, 9 Aug 2024 (v2), ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
- David Spuler, March 2024, Long Context Research, in Generative AI in C++, https://www.aussieai.com/book/ch20-long-context-research
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Michael Nuñez, September 4, 2024, 500,000 tokens: How Anthropic’s Claude Enterprise is pushing AI boundaries, https://venturebeat.com/ai/500000-tokens-how-anthropics-claude-enterprise-is-pushing-ai-boundaries/
- Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li, 4 Sep 2024, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA, https://arxiv.org/abs/2409.02897
- Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
- Asif Razzaq, September 5, 2024, Yi-Coder Released by 01.AI: A Powerful Small-Scale Code LLM Series, Delivering Exceptional Performance in Code Generation, Editing, and Long-Context Comprehension, https://www.marktechpost.com/2024/09/05/yi-coder-released-by-01-ai-a-powerful-small-scale-code-llm-series-delivering-exceptional-performance-in-code-generation-editing-and-long-context-comprehension/
- Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin, 16 Apr 2024, Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, https://arxiv.org/abs/2404.10308 https://github.com/alinlab/HOMER
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
- Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao, 3 Sep 2024, You Only Use Reactive Attention Slice For Long Context Retrieval, https://arxiv.org/abs/2409.13695
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
- Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
- Anonymous authors, Oct 2024, LooongLlava: Scaling Multi-Modal LLMs to 1000 Images Efficiently Via a Hybrid Architecture, https://openreview.net/pdf?id=wqA7QmpUwa
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
- Jonathan Roberts, Kai Han, Samuel Albanie, 7 Nov 2024, Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? https://arxiv.org/abs/2411.05000
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
- Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
- Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
Research on Quadratic Attention Cost
Linearizing the attention algorithm to avoid the quadratic cost of attention processing is an area with a massive research base and numerous proposed algorithms. Faster attention algorithms include sparse attention and Flash Attention. See research on attention optimization methods.
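As one simple example of the sparse-attention idea, sliding-window (local) attention restricts each token to attending only to its nearest neighbors, reducing the score computation from O(n^2) to O(n*w) for window size w. The following is a minimal sketch of local attention in the spirit of Longformer-style windows; real systems implement this inside fused, memory-aware kernels (as in FlashAttention) rather than a Python loop.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=128):
    """Local attention sketch: each query attends only to keys within
    `window` positions on either side, so the work per token is
    proportional to the window size rather than the full sequence length."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # at most 2*window+1 scores
        scores -= scores.max()                    # numerical stability
        w = np.exp(scores)
        w /= w.sum()
        out[i] = w @ V[lo:hi]
    return out
```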
Research on Positional Encoding Optimization
One of the less obvious bottlenecks for long contexts is the positional encoding algorithm. See research on positional encoding optimizations and removing positional encoding.
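To show why positional encodings matter for context extension, here is a minimal sketch of rotary position embeddings (RoPE) with an optional scaling factor in the style of position interpolation, which compresses positions so that a longer input maps back into the range the model was trained on (see the positional interpolation and RoPE-scaling entries below). This is an illustrative sketch only and omits the details of any particular model.

```python
import numpy as np

def rope(x, positions, scale=1.0, base=10000.0):
    """Rotary position embedding (RoPE) applied to vectors x of shape (n, d),
    with d even. Setting scale > 1 is a simple form of position interpolation:
    positions are compressed so a longer context fits the trained range."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # (d/2,) rotation frequencies
    angles = np.outer(positions / scale, freqs)    # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Example: map an 8k-token input into a 2k-trained position range with scale=4.
x = np.random.randn(8192, 64)
x_rot = rope(x, np.arange(8192), scale=4.0)
```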
Research on Context Length
Research on making "longer" Transformer models includes:
- Ofir Press, Noah A. Smith, Mike Lewis, Apr 2022, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, https://arxiv.org/abs/2108.12409 (ALiBi for longer context length.)
- Siyu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang, May 2021, ERNIE-Doc: A Retrospective Long-Document Modeling Transformer, https://arxiv.org/abs/2012.15688
- Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020. https://arxiv.org/abs/2007.14062 (Sparse linear attention algorithm.)
- Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In The International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/1911.05507
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860, Code: https://github.com/kimiyoung/transformer-xl
- Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020), https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao, May 2021, Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, https://arxiv.org/abs/2103.15358, Code: https://github.com/microsoft/vision-longformer
- Qiuhui Chen, Yi Hong, May 2023, Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs, https://arxiv.org/abs/2302.00901, Code: https://github.com/Qybc/LongFormer
- Chun-Fu Chen, Quanfu Fan, Rameswar Panda, Aug 2021, CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, https://arxiv.org/abs/2103.14899, Code: https://github.com/IBM/CrossViT
- Qiqi Zhou, Yichen Zhu, July 2023, Make A Long Image Short: Adaptive Token Length for Vision Transformers https://arxiv.org/abs/2307.02092
- Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, Aug 2023, LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, https://arxiv.org/abs/2308.14508, Code: https://github.com/THUDM/LongBench
- Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei, July 2023, LongNet: Scaling Transformers to 1,000,000,000 Tokens, https://arxiv.org/abs/2307.02486
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (An efficient low-rank attention method as part of long context optimizations.)
- Ivan Sekulić, Amir Soleimani, Mohammad Aliannejadi, Fabio Crestani, Sep 2020, Longformer for MS MARCO Document Re-ranking Task, https://arxiv.org/abs/2009.09392, Code: https://github.com/isekulic/longformer-marco
- Yikuan Li, Ramsey M. Wehbe, Faraz S. Ahmad, Hanyin Wang, Yuan Luo, Apr 2022, Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences, https://arxiv.org/abs/2201.11838, Code: https://github.com/luoyuanlab/Clinical-Longformer
- Anant Khandelwal, 2021, Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity, CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD), January 2021, Pages 10–19, https://dl.acm.org/doi/abs/10.1145/3430984.3431007 https://arxiv.org/abs/2007.07803
- Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137
- Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length?, June 2023. https://lmsys.org/blog/2023-06-29-longchat/
- Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, June 2023, Extending Context Window of Large Language Models via Positional Interpolation, https://arxiv.org/abs/2306.15595 (Introduces "positional interpolation" for long contexts, extending to 32k windows.)
- Y Chen, Y Li, A Xu, Q Sun, X Chen, C Xu, 2023, WAG-NAT: Window Attention and Generator Based Non-Autoregressive Transformer for Time Series Forecasting, ICANN 2023: Artificial Neural Networks and Machine Learning, pp. 293–304, https://link.springer.com/chapter/10.1007/978-3-031-44223-0_24, Code: https://github.com/cybisolated/WAG-NAT
- kaiokendev, 2023, Things I'm Learning While Training SuperHOT, https://kaiokendev.github.io/til#extending-context-to-8k
- Together AI, Jul 28, 2023, Preparing for the era of 32K context: Early learnings and explorations, https://together.ai/blog/llama-2-7b-32k (Uses position interpolation and Flash Attention.)
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that replaces QKV attention with Receptance Weighted Key Value (RWKV)).
- Georgi Gerganov, June 2023 Extending context size via RoPE scaling #1965, Llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
- Hao Liu, Matei Zaharia, Pieter Abbeel, Oct 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889
- H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf
- Davis Yoshida, Allyson Ettinger, and Kevin Gimpel. 2020. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size. CoRR abs/2008.07027 (2020). arXiv:2008.07027 https://arxiv.org/abs/2008.07027 (Hybrid RNN-Transformer architecture with increased context size.)
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses grouped-query attention and sliding window attention.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 13, 2023, Flash-Decoding for long-context inference, PyTorch Blog, https://pytorch.org/blog/flash-decoding/
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf
- Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
- Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev, March 2024, Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16: AAAI-24 Technical Tracks 16, https://ojs.aaai.org/index.php/AAAI/article/view/29722
- Shuaiqi Liu, 2024, Neural Abstractive Summarization for Long Documents, Ph.D. Thesis, The Hong Kong Polytechnic University, https://theses.lib.polyu.edu.hk/bitstream/200/12810/3/7260.pdf
More AI Research
Read more about: