Notable: efficient inference

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

Shikhar Shukla

Published: May 4, 2026, 17:55 UTC
Summary length: 450 words
Relevance score: 80%

Problem
This paper addresses the limitations of fixed speculation lengths in speculative decoding for large language model (LLM) inference, a method that accelerates token generation by having a smaller draft model propose candidate tokens for verification by a larger target model. Existing systems predominantly use a static speculation length (typically 4), which does not account for variations in task type or in the compression level of the target model. The authors present SpecKV, an adaptive controller that dynamically selects the speculation length based on real-time signals from the draft model. This work is a preprint and has not yet undergone peer review.
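
To make the mechanism concrete, below is a minimal greedy sketch of one draft-and-verify step with a fixed speculation length. The HF-style model interface (a callable returning `.logits`) and the exact accept/reject rule are illustrative assumptions, not the paper's implementation; production systems typically use rejection sampling over the two distributions rather than greedy agreement.

```python
import torch

def speculative_step(draft_model, target_model, ctx, gamma=4):
    """One draft-and-verify step: the draft model greedily proposes
    `gamma` tokens; the target model scores the extended sequence in a
    single forward pass and keeps the longest prefix on which the two
    models agree. Toy greedy variant; the bonus token emitted on full
    acceptance in real systems is omitted for brevity."""
    proposal = ctx.clone()
    for _ in range(gamma):  # draft phase: cheap autoregressive proposals
        logits = draft_model(proposal).logits[:, -1, :]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=-1)

    # verify phase: one target forward pass over all proposed tokens
    tgt_logits = target_model(proposal).logits
    n_ctx = ctx.shape[-1]
    accepted = ctx
    for i in range(gamma):
        # position n_ctx - 1 + i predicts the token at index n_ctx + i
        tgt_tok = tgt_logits[:, n_ctx - 1 + i, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, tgt_tok], dim=-1)
        if not torch.equal(tgt_tok, proposal[:, n_ctx + i : n_ctx + i + 1]):
            break  # first disagreement: keep the target's token and stop
    return accepted
```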

Method
The core contribution of this paper is the SpecKV framework, which employs a lightweight multi-layer perceptron (MLP) to adaptively select the speculation length γ at each decoding step. The authors conducted extensive profiling across four task categories, four speculation lengths, and three compression levels (FP16, INT8, NF4), resulting in a dataset of 5,112 step-level records. Key features extracted include per-step acceptance rates, draft model entropy, and draft model confidence. The MLP is trained to maximize expected tokens per speculation step, leveraging the correlation (approximately 0.56) between draft model confidence/entropy and acceptance rates. The adaptive selection process incurs a minimal overhead of 0.34 ms per decision, less than 0.5% of the total step time.
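
A minimal sketch of what such a controller could look like, assuming the three features named above and a small discrete set of candidate γ values; the layer widths, candidate set, and training target here are illustrative guesses, not details from the paper.

```python
import torch
import torch.nn as nn

GAMMA_CHOICES = [1, 2, 4, 8]  # assumed candidate speculation lengths

class GammaSelector(nn.Module):
    """Lightweight MLP that scores candidate speculation lengths from
    step-level draft-model signals. Hidden size is illustrative only."""
    def __init__(self, n_features=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(GAMMA_CHOICES)),
        )

    def forward(self, feats):
        # feats: [batch, 3] = (recent acceptance rate, draft entropy,
        # draft confidence). Output: one score per candidate gamma,
        # trained against measured tokens-per-step so that the argmax
        # approximates the throughput-maximizing choice.
        return self.net(feats)

def pick_gamma(selector, acceptance_rate, entropy, confidence):
    feats = torch.tensor([[acceptance_rate, entropy, confidence]])
    with torch.no_grad():
        idx = selector(feats).argmax(-1).item()
    return GAMMA_CHOICES[idx]
```

For intuition on why adapting γ helps: under the common i.i.d. acceptance approximation with per-token acceptance rate α, a draft of length γ yields (1 − α^(γ+1)) / (1 − α) accepted tokens in expectation, so the best γ shifts with the acceptance rate and hence with the compression level. Whether SpecKV optimizes this exact expression is not stated in the summary.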

Results
SpecKV demonstrates a significant performance improvement over the fixed γ = 4 baseline, achieving a 56.0% increase in efficiency. The improvement is statistically significant (p < 0.001, paired bootstrap test). The results indicate that the optimal speculation length varies with the compression level of the target model, underscoring the necessity of adaptive mechanisms in speculative decoding.
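
For reference, a generic paired bootstrap significance test over per-prompt throughput pairs might look like the following; the resampling unit (prompts) and the metric are assumptions, as the summary does not specify them.

```python
import numpy as np

def paired_bootstrap_pvalue(adaptive, baseline, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: resample per-example score
    differences with replacement and estimate P(mean difference <= 0)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(adaptive) - np.asarray(baseline)  # paired by prompt
    means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(n_resamples)
    ])
    return float((means <= 0).mean())

# e.g. paired_bootstrap_pvalue(tps_speckv, tps_fixed_gamma4) < 0.001
```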

Limitations
The authors acknowledge that the performance gains of SpecKV may be contingent on the specific architectures of the draft and target models used in their experiments. Additionally, the study is limited to the task categories and compression levels tested, which may not generalize to all LLM applications. The reliance on draft model signals for speculation length selection may also introduce biases if the draft model is not sufficiently representative of the target model’s capabilities.

Why it matters
The implications of this work are significant for the optimization of LLM inference, particularly in resource-constrained environments where latency and computational efficiency are critical. By enabling adaptive speculation lengths, SpecKV enhances the flexibility and performance of speculative decoding, paving the way for more efficient LLM applications across diverse tasks and model configurations. This research contributes to the broader discourse on improving inference speed without sacrificing model accuracy, which is essential for real-time applications in natural language processing.

Authors: Shikhar Shukla
Source: arXiv:2605.02888
URL: https://arxiv.org/abs/2605.02888v1

Turing Wire
Author: Turing Wire editorial staff