Notable agents robotics

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng

Published: May 21, 2026 — 16:14 UTC

Problem
This paper addresses the limitations of existing Vision-Language-Action (VLA) models, particularly their performance degradation under distribution shifts. Current methods often rely on action-centric latent variables to construct behavior representations, which are hindered by short-horizon temporal fragmentation and static execution alignment. These issues lead to inconsistent behaviors in complex scenarios. The authors propose a novel framework, BehaviorVLA, to learn temporally coherent behavioral representations, thereby improving generalization across varying environments. This work is presented as a preprint and has not yet undergone peer review.

Method
The core technical contribution of this paper is the BehaviorVLA framework, which consists of two symmetric components: the Visuomotor Behavior Encoder (VBE) and the Phase-conditioned Behavior Decoder (PBD). The VBE employs a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation, effectively capturing temporal coherence. The PBD decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. The training process leverages a combination of supervised learning and reinforcement learning techniques, although specific details regarding the loss functions and training compute are not disclosed.

Results
BehaviorVLA achieves state-of-the-art performance on several benchmarks: 58% success rate on RoboTwin 2.0, 98% on LIBERO, and an average length of 4.36 on CALVIN. Notably, in real-world sim-to-real transfer scenarios, BehaviorVLA matches the performance of the OpenVLA-OFT model while utilizing only 50% of the demonstration data, indicating a significant improvement in data efficiency and generalization capabilities. These results suggest that the proposed framework effectively addresses the limitations of previous VLA models.

Limitations
The authors acknowledge that while BehaviorVLA improves upon existing methods, it may still face challenges in highly dynamic environments where rapid adaptation is required. Additionally, the reliance on a causal Mamba-based architecture may limit the framework’s applicability to other types of architectures or tasks. The paper does not discuss the computational overhead introduced by the dual-component structure, which could impact scalability in real-world applications.

Why it matters
The implications of this work are significant for the development of robust VLA models capable of generalizing across diverse environments. By addressing the shortcomings of temporal coherence and execution alignment, BehaviorVLA sets a new standard for behavior representation learning in VLA systems. This advancement could facilitate more effective human-robot interaction, autonomous navigation, and other applications where understanding and executing complex tasks in varying contexts is crucial. Future research may build upon this framework to explore further enhancements in data efficiency and adaptability.

Authors: Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng, Liqiang Nie
Source: arXiv:2605.22671
URL: https://arxiv.org/abs/2605.22671v1

By Callan Zhang · May 21, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV