Aussie AI

What is Length Pruning?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Length pruning refers to pruning along the model dimension that corresponds to the user's input sequence and its propagation through the model. The terminology in the research on length pruning and token pruning is confusing, but I have attempted to categorize the main types of pruning along the “lengthwise” model dimension as follows:

  • Token pruning
  • Embeddings pruning
  • Prompt compression

Other non-pruning AI model optimization techniques that operate on the same “lengthwise” dimension include:

  • Long context window optimizations
  • Length generalization
  • Input padding removal
  • Batching of multiple prompt queries (without padding)
  • Attention linearization optimizations
  • Non-autoregressive optimizations

Length pruning is structured weight pruning on one of the three axes of pruning. The other two axes are width pruning (e.g. attention head pruning) and depth pruning (e.g. layer pruning and early exit). All three types of pruning are mostly orthogonal to each other and can be combined into triple pruning.


Buy: Generative AI in C++: Coding Transformers and LLMs
