Aussie AI Blog
Sequential Speculative Decoding
-
25th August, 2024
-
by David Spuler, Ph.D.
Summary
This article presents the concept of broadening the speculative decoding optimization to sequential execution platforms. Early exiting of layers is identified as one type of optimization that can benefit from draft tokens in sequential mode. In addition to early exit, we examine a number of other inference optimizations to see whether they can be combined with speculative decoding, so as to give an overall speedup for on-device inference.
Speculative decoding is usually considered a parallel programming optimization. However, from an information-theoretic point of view, the improvement arises because additional information is discovered at minimal cost (i.e., by drafting), and that information is then used to efficiently direct the resources of the overarching inference algorithm. Most papers on speculative decoding aim to improve this extra information, minimizing drafting cost and maximizing its accuracy for higher acceptance rates, whereas we examine broadening the use of this extra information beyond parallelism.
Broadly, the question is how to leverage the additional information in the draft token sequence to speed up any type of LLM inference execution. Most forms of speculative decoding do so by improving parallelization of the verifier model, and this works extremely well. However, if we take a step back, it becomes clear that parallelization is only one of many possible LLM inference efficiency improvements in the literature. Examining other LLM optimizations broadens the possibilities for speedups to sequential execution. We examine several of these inference optimization techniques: early exiting and layer skipping (dynamic layer pruning), width pruning (attention head pruning), input token pruning (lengthwise pruning), and embedding dimension pruning.
What is (Parallel) Speculative Decoding?
Speculative decoding, also called speculative sampling, is a well-known research optimization that reduces latency by doing more parallel execution. The idea is a "draft-and-verify" strategy. To start off, a small model generates a "draft" sequence of multiple tokens, and then a larger model can "verify" multiple tokens in parallel. If the verification succeeds, then we've done multiple tokens in the time it would have taken for a single token, thereby speeding up the wall clock time. But if the verifier rejects the draft tokens, then we've incurred some extra overhead. The optimization relies on a high "acceptance rate" for the draft tokens, so that there is an overall speedup.
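To make the draft-and-verify idea concrete, here is a minimal Python sketch of the classic greedy variant. The functions draft_next_token() and verifier_logits_parallel() are hypothetical stand-ins for the small draft LLM and the large verifier LLM, not any particular library's API.

def speculative_decode_step(context, k=4):
    """Generate up to k+1 tokens per verifier call via draft-and-verify."""
    # 1. Draft: the small model proposes k tokens autoregressively (cheap).
    draft = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next_token(ctx)              # hypothetical small draft model
        draft.append(t)
        ctx.append(t)

    # 2. Verify: the large model scores all k draft positions in one parallel pass.
    logits = verifier_logits_parallel(context, draft)   # hypothetical; shape [k, vocab]

    # 3. Accept the longest prefix of draft tokens that the verifier agrees with.
    accepted = []
    for i, t in enumerate(draft):
        verifier_token = int(logits[i].argmax())
        if verifier_token == t:                # greedy acceptance test
            accepted.append(t)
        else:
            accepted.append(verifier_token)    # take the verifier's token and stop
            break
    return accepted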
Speculative decoding is not merely a theoretical algorithm; it is widely used in industry. In fact, there are many further optimizations of speculative decoding, mostly about making the draft model either faster (less extra overhead) or more accurate (higher acceptance rates).
Several variations of speculative decoding have been advanced with differences in how the draft token sequences are created. Whereas classic speculative decoding involves training a smaller LLM as the draft model, it has been noted that any method of generating likely draft tokens can be used, and the term "generalized speculative decoding" describes such broader cases. For example, heuristic methods can be used to generate predicted draft tokens without any LLM. Self-speculative decoding uses the large verifier model with an early exit optimization as equivalent to a smaller draft model, thereby avoiding the need to train a new draft model. Aggressive decoding, which applies only to the Grammatical Error Correction (GEC) use case, applies the input document's token sequence as draft tokens. Prompt lookup decoding uses draft tokens from within the input prompt or its context. Retrieval lookup decoding uses text sections in a retrieved document chunk as its draft.
One of the best things about speculative decoding is that it is a "lossless" optimization. It's one of those rare inference optimizations that doesn't reduce accuracy of the model. The output is the same as the larger LLM would have output without using the optimization, or if not exactly the same, is at least one of the possible valid outputs that the larger model could have produced. Overall, it's faster, but not less accurate.
But that's all in parallel.
Is Sequential Speculative Decoding Possible?
Sequential speculative decoding is the general idea of attempting to use the same draft-and-verify optimizations for sequential execution, rather than parallel programming. Almost all of the prior research on speculative decoding has been about using more GPU power in parallel to reduce user latency. However, with the advent of on-device inference, such as chips with a Neural Processing Unit (NPU), there isn't the same amount of spare GPU juice. In particular, this applies to the "AI phones" and "AI PCs" that are all the rage now.
If we examine the issue naively, we could try to use draft tokens, and then do the best we can to verify the drafts with the full model running in sequential mode, accelerated by SIMD CPU instructions or NPU capabilities. But that's not a win, and it will actually run slower. In cases where the draft is rejected, we've added more computation and more latency than would have occurred without the draft. And even in cases where the draft is accepted, it doesn't appear that we've done any better than a simple autoregressive decoding algorithm running on a CPU, NPU, or other sequential platform.
To see the potential benefit, we need to step back a little and think about the draft tokens more generally. What they offer is extra information about the likelihood of the correct next token. Rather than examine multiple draft tokens, which we cannot parallelize anyway, maybe there is benefit from just a single draft token. Hence, the idea is now to see whether a single draft token can somehow help optimize the decoding step for the next token of a larger model.
The optimization won't work for various types of "static" model optimizations. I can't see any way to use a draft token to improve model compression such as quantization, static pruning, weight sharing, and so on. We don't know in advance which token the draft will be, so any benefit must come from a dynamic optimization.
The question is then to examine all the types of "adaptive inference" optimizations. Can any of them use an extra draft token to go faster? Some of the various techniques whereby computation is reduced dynamically include:
- Depth pruning of layers — early exit, dynamic layer pruning, layer skipping, layer fusion.
- Width pruning across layers and sub-layer components — attention head pruning, slimmable networks, FFN pruning, channel pruning, filter pruning, etc.
- Lengthwise pruning on the token prompt dimension — dynamic token pruning, token merging, token skipping, salient tokens, token sinks, etc.
- Embedding dimension pruning — dynamic reduction along the internal dimension of the model.
There are a lot of options!
Related Work on Sequential Speculative Decoding
Despite its prominence in both industry and research as an LLM inference optimization, speculative decoding has not been fully examined for sequential execution optimization, with the vast majority of papers discussing parallel GPU execution. Our review of the literature on speculative decoding found only one paper, Xu et al (2023), with discussion of sequential execution and on-device inference with NPUs.
Their optimization is to use non-autoregressive verification, so that the larger model verifies multiple draft tokens in parallel. Other aspects of their research included managing memory for on-device platforms, and carefully scheduling the draft model versus the verifier model. Our approach is an additional optimization that uses the draft tokens with adaptive inference, such as early exiting, and is thus orthogonal to this research.
Early Exit and Speculative Decoding
Early exit seems like a good candidate for further optimization in combination with speculative decoding. Indeed, there is already research on using early exit with speculative decoding, albeit mostly applicable to parallel execution. Early exit has been combined with speculative decoding in two main ways:
(a) the draft model, and
(b) the verifier model.
Extensions to speculative decoding's draft model introduced self-speculative decoding, where an early exited version of the verifier model was used as the smaller draft model. This has several benefits over training a separate small draft model, including: (a) no need for a separate draft LLM, and (b) faster pipelined computation because early layers of the verifier model are effectively pre-executed during the drafting phase. However, this idea is not an improvement for sequential execution.
The second type of integration is to use early exit in the verifier model, which is closest to what we propose here. Various research has combined speculative decoding with early exit in the verifier model to gain a secondary speedup. However, early exit is only added in an orthogonal manner that allows both gains, without using the extra information from drafting to improve performance beyond what the two optimizations achieve separately. Hence, this use of early exit is not improved by speculative decoding itself.
Improving Early Exit with Draft Tokens
Careful consideration of the various techniques has identified early exiting as the optimization most likely to further benefit from the draft-and-verify strategy. Embedding dimension pruning seemed promising, but is perhaps unworkable, as discussed below. No real opportunity was identified in width pruning (e.g., attention head pruning) or in lengthwise token pruning. However, the issues are complex, and the search for more ways is left to the reader.
For simplicity, let us initially consider the case where the draft model produces only a single draft token. The question is whether any of the optimizations can use this extra draft token as additional information to allow a further speedup.
How is early exit optimized by a draft token?
The first point is that early exit is already an optimization, so we are looking for a further optimization. As discussed above, some papers use early exit with speculative decoding, but in a way that is simply orthogonal. The benefit is two-fold, but neither technique is sped up by the other.
The way to use a draft token to benefit early exiting is to use it as "extra information" in the exit decision criteria. Early exit requires evaluation of criteria to determine whether or not to exit at a given layer. For example, we might exit at a layer if the predicted token has a probability of at least 80%. In other words, we evaluate the "confidence" of the model in the best token at that layer, and decide whether we are confident enough to exit, or whether to process further layers.
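To make this concrete, here is a minimal Python sketch of a confidence-based exit test at one layer. The unembedding matrix W_unembed and the NumPy-based activations are assumptions for illustration, not any particular model's API.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def should_exit_early(hidden_state, W_unembed, threshold=0.8):
    # Unembedding: project the layer's hidden state onto the vocabulary.
    logits = hidden_state @ W_unembed          # W_unembed: [d_model, vocab_size]
    probs = softmax(logits)
    # Exit if the most likely token at this layer is confident enough.
    return probs.max() >= threshold, probs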
When we have a draft token, we can modify the decision criteria to consider the probability of that draft token at a layer. Since we already know what the draft model suggests as the best token, we can then have an increased confidence in that particular token at an exit decision. Hence, we might have modified early exit criteria in the larger verifier model, such that:
1. If the layer logits have any token at 80% probability, we exit.
2. Otherwise, if the layer logits have the draft token at 50% probability, we exit.
There are various ways to merge the probability of the particular draft token, versus the other tokens, into our early exit decision logic. The benefit of the draft token is that it allows us to set the thresholds lower, which allows exiting more frequently, and thereby reduces the number of layers executed by the model. The improvement is greater than using a simple early exit method without the draft token.
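A minimal sketch of such a modified exit criterion follows, reusing the per-layer probabilities computed in the previous sketch; the two thresholds (80% and 50%) are illustrative values only, not tuned settings.

def should_exit_with_draft(probs, draft_token,
                           any_token_threshold=0.8,
                           draft_token_threshold=0.5):
    """Return (exit, token) using both the layer logits and the draft token."""
    top_token = int(probs.argmax())
    # Rule 1: any token is confident enough on its own.
    if probs[top_token] >= any_token_threshold:
        return True, top_token
    # Rule 2: the draft token clears a lower bar, because two methods agree on it.
    if probs[draft_token] >= draft_token_threshold:
        return True, int(draft_token)
    return False, None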
Evaluation
An important consideration is the question: is this any better than just using early exit in the large model? If we are going to the trouble of calculating the unembeddings for the early exit decision logic, why do we need the draft token? The answer is that we actually have two separate sources of information in this approach:
(a) the draft token, and
(b) the predicted logits for the early-exit layer.
By combining them in the decision to accept or reject a token, this approach gives us greater accuracy than early exit alone. Another opportunity for greater efficiency arises because we can set the early-exit acceptance criteria lower than we otherwise would, based on the knowledge that the draft token has already been recommended by a draft model. Furthermore, the requirement that the draft token be highly probable in each layer's logit distribution allows for an earlier rejection at a lower layer in some cases. Finally, this method can speed up inference using even a single draft token, whereas most speculative decoding methods require accurate drafting of two or more tokens to benefit from parallel verification.
In the types of speculative decoding using a small LLM for the draft model, we would also have information about the probability of the draft token (and others), as predicted by the smaller LLM. This gives additional information that could also be incorporated into our acceptance logic. In other words, since there are two LLMs with predictions about this token, and other tokens, the early exit criteria could consider both sets of probabilities.
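One possible way to combine the two probability estimates, sketched here as an illustrative weighted average; the weighting and threshold values are assumptions, not tuned settings.

def should_exit_combined(verifier_probs, draft_probs, draft_token,
                         combined_threshold=0.6, weight=0.5):
    """Blend the two models' confidence in the draft token."""
    p_verifier = verifier_probs[draft_token]   # partial verifier model, this layer
    p_draft = draft_probs[draft_token]         # small draft LLM's own estimate
    combined = weight * p_verifier + (1.0 - weight) * p_draft
    return combined >= combined_threshold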
Limitations
The idea of using the draft tokens in early exit criteria has many advantages, but is an imperfect optimization. Several limitations of this improvement to early exit using speculative decoding include:
- Accuracy loss (not lossless)
- Circular logic in self-speculative decoding
- Only a single draft token is used
- Cost overhead of early exit decisions
- KV cache recomputation for skipped layers
An important distinction arises in this optimized sequential approach versus the traditional parallelized version of speculative decoding. The use of a full verifier model in speculative decoding usually ensures that the overall technique is "lossless" and would be identical to running the verifier model alone. However, our approach with early exiting can cause the verifier model to approve a token that might have been superseded by the later layers of the large model. Hence, there are cases where the output is "lossy" and is an approximation of the token sequence that the full verifier model would have computed, had it run in full autoregressive mode. Nevertheless, two distinct methods must agree on the same token, with acceptance of the draft token requiring both: (a) generation by the draft model and (b) approval by the partially-executed verifier model. Hence, this cooperation ensures that the overall accuracy should still be higher than either the draft model alone or the early-exited large model alone.
A circularity in the logic of this optimization arises in the case of self-speculative decoding, where the draft model is an early-exited version of the larger verifier model; using the draft tokens in the verifier's early exit criteria is then somewhat redundant. In this case, the benefit of having two distinct methods cooperating to recommend a particular token is lost, and the system would be effectively the same as the large model with early exit alone.
Only a single draft token is used by this early exit optimization in sequential execution. Hence, the speedup is limited and does not merge the computations of multiple future tokens. Thus, the overall speedup is not as large as is possible from the parallel versions of speculative decoding.
Adding early exit to a model is not without cost overhead, because it requires "unembedding" logic to be performed at any layer where early exit is being evaluated. Unembedding involves an unembedding matrix and Softmax normalization to convert the embedding activation values into output token probabilities. Furthermore, there is the evaluation of the decision criteria on whether to exit at this layer, which can also involve somewhat costly computations. We can mitigate this cost by not testing every layer, since the unembedding and decision-criteria cost is incurred at every layer that is a candidate for early exiting.
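For example, here is a minimal sketch of checking the exit criteria only at every fourth layer, reusing the helper functions sketched earlier and assuming the layers are simple callables over the hidden state (a simplification of real layer APIs).

def decode_with_sparse_exit_checks(layers, hidden, W_unembed, draft_token,
                                   check_every=4):
    for i, layer in enumerate(layers):
        hidden = layer(hidden)                 # normal layer computation
        if (i + 1) % check_every != 0:
            continue                           # skip the costly exit test here
        _, probs = should_exit_early(hidden, W_unembed)
        do_exit, token = should_exit_with_draft(probs, draft_token)
        if do_exit:
            return token, i                    # early exit at layer i
    # No early exit: use the full model's final prediction.
    _, probs = should_exit_early(hidden, W_unembed)
    return int(probs.argmax()), len(layers) - 1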
Early exit also has a problem when the model is using the "KV caching" optimization, which is so prevalent that it's almost always used. The problem is that any layer skipped by early exiting won't have its KV cache entries computed, which are needed when that layer is processed for the next token. There are various solutions in the literature, such as propagating the prior layer's KV cache or recomputing the missing entries dynamically.
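As one illustrative sketch of the propagation approach, assuming a hypothetical per-layer KV cache layout (a list of dicts indexed by layer and position).

def propagate_kv_after_exit(kv_cache, exit_layer, num_layers, position):
    # kv_cache[layer]["k"] and kv_cache[layer]["v"] hold per-position tensors.
    k_last = kv_cache[exit_layer]["k"][position]
    v_last = kv_cache[exit_layer]["v"][position]
    for layer in range(exit_layer + 1, num_layers):
        # Approximation: skipped layers reuse the exit layer's K/V entries,
        # so the next token can still attend to this position at those layers.
        kv_cache[layer]["k"][position] = k_last
        kv_cache[layer]["v"][position] = v_last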
Embedding Dimension Pruning
Embedding dimension pruning involves reducing the number of values in the internal model dimension, or equivalently, the length of the embedding vectors, throughout all the model's layerwise computations. This area seems promising at first thought, based on the following logic: If we have a single draft token, we can quickly determine which embedding categories relate to that token, usually as a row from the embedding matrix. Hence, one optimization that suggests itself is to truncate this list to a subset, say 100 embedding categories, and then only perform model computations on the slices of data for those top-100 categories in the embedding vectors that are calculated (i.e., the activations). This reduced computation can certainly be performed, and it would be much faster than full inference, because it excludes a great many tensor slices.
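A minimal sketch of this (ultimately unworkable) dimension-slicing idea, assuming a NumPy embedding matrix and selecting the largest-magnitude dimensions of the draft token's embedding row as the subset to compute.

import numpy as np

def select_draft_dimensions(embedding_matrix, draft_token, top_k=100):
    # Pick the top-k largest-magnitude dimensions of the draft token's embedding row.
    row = embedding_matrix[draft_token]        # shape: [d_model]
    return np.argsort(-np.abs(row))[:top_k]

def sliced_activation(activation, dims):
    # Keep only the selected slice; all other dimensions are never computed (zeroed here).
    sliced = np.zeros_like(activation)
    sliced[dims] = activation[dims]
    return sliced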
However, the problem arises at the end of the computations. How do we know whether to accept or reject the draft token? We have calculated numbers for each of its top-100 embedding categories, but we have zeros for all other embedding categories. There is no way to know whether any other embedding values would have been larger (or smaller), nor how they would have compared to each other. There is no way to know from these partial computations whether or not the draft token is better or worse than any other tokens. The only way to address this seems to be to increase the number of embedding categories to calculate, but it's not possible to reliably know which extra ones to prioritize, and we end up needing to compute them all for full accuracy. Hence, this approach seems to be a non-starter.
Conclusions
This article introduces a new sequential speculative decoding optimization for LLM inference, using the draft-and-verify strategy in combination with early exiting. The improvement is based on the following techniques:
- By combining early exit of the verifier model with speculative decoding concepts, many layers of computation in the large verifier model can be avoided, and inference is faster not only in parallel, but also notably with sequential execution.
- Draft tokens are used directly in the decision criteria of the early exit optimization, allowing greater accuracy than naive early exit methods alone.
- Although the new algorithm becomes "lossy," the combination of two systems recommending the same token alleviates this risk.
It should be noted that much of this analysis, although focused on sequential execution, is still applicable to optimizing parallel execution of LLMs, such as for data center multi-GPU platforms. Software optimizations such as early exit are also applicable to parallel execution, and mostly orthogonal to parallelization. However, the software-based techniques herein may be more directly valuable to on-device inference where parallel processing capacity is limited.
Extensions
Many of the speedups to parallel speculative decoding may also be valid for sequential execution and could further extend this early exit method. These modifications include faster drafting methods, heuristic drafters rather than LLMs, multiple draft tokens, and other non-LLM sources of draft tokens, such as prompt lookup decoding, retrieval lookup decoding, and aggressive decoding for grammatical error correction.
We have discovered an extra speedup for early exiting of layers, which is a type of dynamic pruning, via the draft-and-verify idea in speculative decoding. Other types of dynamic pruning, such as embedding dimension pruning and lengthwise pruning, seemed less promising. In addition to dynamic pruning, various other general types of LLM optimizations were considered without success, including both static model compression methods and other adaptive inference methods. No obvious way was discovered to use draft tokens to speed up inference optimization techniques such as quantization, weight sharing, knowledge distillation, KV caching, or various other methods.
Given the success of speculative decoding in parallel execution, further research is warranted into its application for sequential execution on resource-constrained platforms such as phones, PCs, and other edge devices. One improvement to our algorithm would be to consider predicted probabilities of all the tokens from the draft LLM, and incorporate these values into the early exit criteria. Furthermore, although we identified early exit as one candidate technique that can be further optimized using draft tokens, and examined some other methods with negative results, there are literally dozens more such techniques in the literature, and there are likely several additional ways that draft tokens could be used to optimize inference for on-device platforms.
References
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255