Notable efficiency inference Meta

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Kexin Chu, Yang Zhou, Wei Zhang

Published: May 28, 2026 — 16:50 UTC

Problem
This preprint addresses the issue of non-deterministic outputs in temperature-zero BF16 LLM inference, where identical requests can yield different tokens depending on their context within a batch. Existing solutions, such as batch-invariant operators and LLM-42’s per-token verification, introduce overhead even when the majority of decoding steps are stable. The authors investigate whether verification can be selectively applied to only those tokens that exhibit variability, thereby optimizing the verification process.

Method
The core technical contribution is the MarginGate framework, which leverages the observation that batch-induced token flips are sparse across various models. The authors identify that on the MATH500 benchmark, Llama-3.1-8B experiences token flips in only 0.48% of synchronous decode steps, with other models maintaining a flip rate between 0.3% and 1.3%. MarginGate employs a verifier policy that distinguishes between high-margin and low-margin decoding steps. High-margin steps are processed without verification, while low-margin steps undergo verification to detect and correct mismatches by replacing the current key/value (K/V) column. The framework is evaluated on four datasets, with calibration performed on MATH500 and transfer to GSM8K, SharedGPT, and HumanEval.

Results
MarginGate achieves 100% sequence-level deterministic decoding for Llama-3.1-8B and Qwen2.5-14B, with verifier trigger rates of 18.56% and 15.05%, respectively. This results in a significant reduction in latency, with a 2.23x and 1.99x decrease compared to the always-on verification approach used in LLM-42. Additionally, on the more challenging DSR1-Distill-Qwen-7B, the same policy achieves determinism with a higher trigger rate of 49.50%. These results demonstrate the effectiveness of the proposed method in maintaining deterministic outputs while minimizing computational overhead.

Limitations
The authors acknowledge that the effectiveness of MarginGate may vary across different model architectures and datasets, as the evaluation is limited to five models. They do not address potential scalability issues when applied to larger models or more complex tasks. Furthermore, the reliance on margin-based verification may not generalize well to all types of LLMs or inference scenarios, particularly those with different token distributions or decoding strategies.

Why it matters
The implications of this work are significant for the deployment of LLMs in production environments where deterministic outputs are critical. By reducing the computational burden associated with verification while maintaining high accuracy, MarginGate presents a viable solution for enhancing the efficiency of LLM inference. This approach could pave the way for more robust and scalable LLM applications, particularly in real-time systems where latency is a concern. Future research may explore the generalizability of MarginGate across diverse architectures and its integration with other optimization techniques.

Authors: Kexin Chu, Yang Zhou, Wei Zhang
Source: arXiv:2605.30218
URL: https://arxiv.org/abs/2605.30218v1

By Callan Zhang · May 28, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG