Notable efficiency inference

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang, Tiancai Wang, Xing Hu

Published: May 13, 2026 — 16:57 UTC

Problem
This paper addresses the latency limitations of diffusion-based vision-language-action models (dVLAs) in real-time applications, particularly in embodied intelligence tasks. The authors identify a significant gap in the ability to perform high-frequency replanning without incurring the high computational costs associated with full inference calls. The work is presented as a preprint and has not yet undergone peer review.

Method
The authors propose Realtime-VLA FLASH, a speculative inference framework designed to enhance the efficiency of dVLAs. The core technical contribution involves a two-tiered inference approach: a lightweight draft model that generates speculative actions and a verification mechanism that utilizes the main model’s Action Expert to confirm these actions. The framework incorporates a phase-aware fallback mechanism that allows the system to revert to full inference when the speculative actions are deemed unreliable. This architecture enables the system to maintain low-latency performance while ensuring reliability in decision-making. The training compute requirements are not explicitly disclosed, but the emphasis is on reducing the number of full inference calls during the replanning process.

Results
The experimental evaluation is conducted on the LIBERO benchmark, where Realtime-VLA FLASH demonstrates a significant reduction in task-level average inference latency from 58.0 ms (full inference) to 19.1 ms, achieving a speedup factor of 3.04x. The framework replaces many full-inference rounds with speculative rounds as fast as 7.8 ms, effectively maintaining task performance. Additionally, the authors validate the practical applicability of their approach through real-world experiments on conveyor-belt sorting tasks, showcasing its effectiveness in latency-critical scenarios.

Limitations
The authors acknowledge that while FLASH significantly reduces latency, the reliance on speculative inference may introduce risks of incorrect actions if the draft model’s predictions are inaccurate. The fallback mechanism mitigates this risk but does not eliminate it entirely. Furthermore, the paper does not discuss the scalability of the approach to more complex environments or tasks beyond those tested. The potential computational overhead of maintaining the Action Expert during verification is also not addressed.

Why it matters
The implications of this work are substantial for the field of embodied AI, particularly in applications requiring real-time decision-making. By enabling high-frequency replanning with reduced latency, Realtime-VLA FLASH paves the way for more responsive and efficient systems in dynamic environments. This framework could influence future research on optimizing inference in dVLAs and similar architectures, potentially leading to advancements in robotics, autonomous systems, and interactive AI applications.

Authors: Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang, Tiancai Wang, Xing Hu, Ying Wang, Huawei Li
Source: arXiv:2605.13778
URL: https://arxiv.org/abs/2605.13778v1

By Callan Zhang · May 13, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV