Transformer Component Memory Optimization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Some of the memory management improvements that are specific to Transformer components and low-level programming include:

  • Memory-efficient Transformer components (e.g. Flash Attention)
  • Kernel fusion
  • KV cache management
  • Shared KV caches

KV Cache Management: We examined the speedup available from KV caching in Chapter 29 (Caching). However, this internal cache also has distinctive memory requirements. Unlike most other data in the Transformer, the KV cache's memory usage grows and shrinks dynamically: it gains one key vector and one value vector per layer for every token processed, and the space can be released when the query completes. There are various research papers on how to optimally manage the KV cache and its memory needs. The KV cache can also be shared across queries and across different computations on the same prompt.
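As an illustration, here is a minimal sketch of a per-layer KV cache whose memory footprint grows as tokens are decoded and is freed when the query finishes. This is not the book's implementation; the class and member names (KVCache, append, bytes_used) are invented for this example.

// Minimal sketch of a per-layer KV cache that grows as tokens are decoded.
// Names are illustrative only, not from the book.
#include <vector>
#include <cstddef>

struct KVCache {
    int num_layers;
    int head_dim;   // size of each key/value vector
    // One growing buffer of keys and one of values per layer.
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;

    KVCache(int layers, int dim)
        : num_layers(layers), head_dim(dim), keys(layers), values(layers) {}

    // Append the K and V vectors computed for the newest token in one layer.
    // The cache grows by head_dim floats per buffer, per layer, per token.
    void append(int layer, const float* k, const float* v) {
        keys[layer].insert(keys[layer].end(), k, k + head_dim);
        values[layer].insert(values[layer].end(), v, v + head_dim);
    }

    // Current memory footprint of the cached keys and values.
    std::size_t bytes_used() const {
        std::size_t total = 0;
        for (int l = 0; l < num_layers; ++l)
            total += (keys[l].size() + values[l].size()) * sizeof(float);
        return total;
    }

    // Release all cached entries (e.g., when the query completes).
    void clear() {
        for (auto& k : keys) { k.clear(); k.shrink_to_fit(); }
        for (auto& v : values) { v.clear(); v.shrink_to_fit(); }
    }
};

A cache built for a shared prompt prefix could be held behind a std::shared_ptr so that multiple queries reference the same key and value vectors without copying them.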

Kernel Fusion: Kernel fusion is the merging of two sequential operations into one combined operation. Although this is usually done to improve speed, it has the secondary memory benefit of avoiding the interim buffer that would otherwise hold the first operator's results until the second operator re-loads them. See Chapter 31 for more about kernel fusion.
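To make the memory saving concrete, the sketch below contrasts an unfused matrix-vector multiply followed by RELU, which stores an interim vector, with a fused version that applies the activation while the dot product result is still in a register. The function names are illustrative, not the book's code.

// Hedged sketch of kernel fusion: matrix-vector multiply followed by RELU.
#include <vector>

// Unfused: two passes, with the intermediate vector written out and re-read.
// W is rows x cols (row-major); out must already have size rows.
void matvec_then_relu(const std::vector<float>& W, const std::vector<float>& x,
                      std::vector<float>& out, int rows, int cols) {
    std::vector<float> interim(rows);            // extra temporary buffer
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) sum += W[r * cols + c] * x[c];
        interim[r] = sum;                        // store interim result
    }
    for (int r = 0; r < rows; ++r)               // re-load interim result
        out[r] = interim[r] > 0.0f ? interim[r] : 0.0f;
}

// Fused: RELU is applied while the dot product is still in a register,
// so no intermediate vector is ever allocated, stored, or re-loaded.
void fused_matvec_relu(const std::vector<float>& W, const std::vector<float>& x,
                       std::vector<float>& out, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) sum += W[r * cols + c] * x[c];
        out[r] = sum > 0.0f ? sum : 0.0f;        // fused activation
    }
}

In this sketch the fused version saves rows * sizeof(float) bytes of temporary storage and one extra pass over that memory, in addition to the usual speed benefit of fusion.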

 
