Notable efficiency inference

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs

Wenhui Tan, Minghao Li, Xiaoqian Ma, Siqi Fan, Xiusheng Huang, Liujie Zhang, Ruihua Song, Weihang Chen

Published: May 26, 2026 — 16:31 UTC

Problem
This paper targets the cost of autoregressive decoding in large language models (LLMs), which is most acute in long chain-of-thought reasoning. Prior work has attacked the problem from two sides: input-side latent compression, and output-side speculative decoding and multi-token prediction (MTP). The two lines have developed independently. Output-side methods also typically need a costly verification step to validate the draft tokens that MTP generates, which adds to overall inference cost. The work is a preprint and has not yet undergone peer review.

Method
The authors propose Pair-In, Pair-Out (PIPO), a framework that combines latent compression and MTP in a single architecture. On the input side, a latent compressor merges two input tokens into a single latent representation. On the output side, an MTP head expands one hidden state into an additional output token. To remove the verification pass, PIPO adds a lightweight confidence head that estimates whether each draft token is reliable. The confidence head is trained with On-Policy Distillation (OPD), which the authors align with the rejection-sampling criterion of speculative decoding, so training adds little computational overhead. PIPO is built on Qwen3.5-4B and 9B backbones; training compute is not disclosed.
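The data flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the mean-pooling compressor, the single-matrix heads, the sigmoid confidence score, the 0.5 threshold, and every function name are hypothetical placeholders chosen only to show the shape of one pair-in, pair-out step.

```python
import math

def compress_pair(emb_a, emb_b):
    """Pair-in: fold two token embeddings into one latent.
    Mean pooling is a placeholder; the paper's compressor is learned."""
    return [(a + b) / 2.0 for a, b in zip(emb_a, emb_b)]

def linear(weights, vec):
    """Multiply a (rows x dim) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pipo_step(emb_a, emb_b, backbone, lm_head, mtp_head, conf_head,
              threshold=0.5):
    """One hypothetical decode step: two tokens in, up to two tokens out."""
    latent = compress_pair(emb_a, emb_b)       # two inputs -> one latent
    hidden = backbone(latent)                  # stand-in for the LLM forward pass
    first = argmax(linear(lm_head, hidden))    # regular next-token prediction
    draft = argmax(linear(mtp_head, hidden))   # extra token from the MTP head
    # Assumed form of the confidence head: a linear probe plus sigmoid.
    confidence = sigmoid(sum(w * h for w, h in zip(conf_head, hidden)))
    if confidence >= threshold:
        return [first, draft], confidence      # keep the draft, no verify pass
    return [first], confidence                 # drop the draft token

# Toy usage with an identity "backbone", 2-d embeddings, and a 3-token vocabulary.
tokens, conf = pipo_step(
    [1.0, 0.0], [0.0, 1.0],
    backbone=lambda latent: latent,
    lm_head=[[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
    mtp_head=[[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]],
    conf_head=[4.0, 4.0],
)
print(tokens, round(conf, 3))
```

The point of the sketch is the control flow: the draft token is kept or dropped on the strength of a cheap score computed from the same hidden state, rather than by a second forward pass of the full model.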

Results
The authors report that PIPO outperforms regular decoding on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2, raising pass@4 by up to 7.15 points. They also report a 2.64× speedup in first-token latency and a 2.07× speedup in per-token latency. Taken together, the results suggest that PIPO improves both the efficiency and the reliability of LLM inference.
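To get a feel for what the two reported speedups mean for a whole response, the arithmetic can be sketched as below. Only the 2.64× and 2.07× factors come from the paper. The baseline latencies and the response length are hypothetical, and the sketch assumes both speedups hold uniformly at every response length, which the summary does not establish.

```python
FIRST_TOKEN_SPEEDUP = 2.64  # reported first-token latency speedup
PER_TOKEN_SPEEDUP = 2.07    # reported per-token latency speedup

def end_to_end_latency(first_token_s, per_token_s, n_tokens):
    """Total time: first token, then every later token at the per-token rate."""
    return first_token_s + per_token_s * (n_tokens - 1)

def projected_speedup(base_first_s, base_per_token_s, n_tokens):
    """Return (baseline seconds, projected PIPO seconds, overall speedup)."""
    baseline = end_to_end_latency(base_first_s, base_per_token_s, n_tokens)
    pipo = end_to_end_latency(
        base_first_s / FIRST_TOKEN_SPEEDUP,
        base_per_token_s / PER_TOKEN_SPEEDUP,
        n_tokens,
    )
    return baseline, pipo, baseline / pipo

# Hypothetical baseline: 1.0 s to first token, 20 ms per token, 2000 tokens.
baseline, pipo, speedup = projected_speedup(1.0, 0.02, 2000)
print(f"baseline {baseline:.2f} s, projected {pipo:.2f} s, {speedup:.2f}x overall")
```

Under these assumptions the overall speedup always falls between the two reported factors. For long chain-of-thought outputs the per-token term dominates, so the end-to-end figure sits close to 2.07× rather than 2.64×.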

Limitations
The authors acknowledge that although PIPO reduces the need for verification, relying on the confidence head introduces a new point of failure if the head is poorly calibrated. They do not discuss how model size affects performance or whether the approach generalizes to other architectures. The paper also lacks extensive ablation studies that would isolate the contribution of each PIPO component.
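What "properly calibrated" means here can be made concrete with a standard metric, expected calibration error (ECE). The sketch below is an illustration, not something the paper is known to report: it assumes a hypothetical evaluation set where each draft token has a confidence score and a ground-truth label saying whether a full verification pass would have accepted it.

```python
def expected_calibration_error(confidences, accepted, n_bins=10):
    """Binned ECE: weighted gap between mean confidence and acceptance rate.

    confidences: scores in [0, 1] from a (hypothetical) confidence head.
    accepted:    booleans, True if verification would have kept the draft.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, accepted):
        index = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accept_rate = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accept_rate)
    return ece

# A head that says 0.9 but is right only half the time is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
print(overconfident, well_calibrated)
```

A large gap of this kind matters more in a verification-free design than in standard speculative decoding, because an overconfident score lets a wrong draft token through with nothing downstream to catch it.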

Why it matters
PIPO combines latent compression and MTP, two inference optimizations that were previously studied separately, and applies them to reasoning-heavy workloads where decoding cost is highest. If the reported latency gains hold up and the verification pass can safely be dropped, the approach could make LLMs cheaper to deploy in real-time settings such as conversational agents and interactive systems. The work also points toward further research on hybrid architectures that optimize the input and output sides together.

Authors: Wenhui Tan, Minghao Li, Xiaoqian Ma, Siqi Fan, Xiusheng Huang, Liujie Zhang, Ruihua Song, Weihang Chen
Source: arXiv:2605.27255
URL: https://arxiv.org/abs/2605.27255v1

By Callan Zhang · May 26, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI