Notable agents robotics

Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

Published: May 21, 2026 — 16:39 UTC

Problem
This paper addresses the limitations of existing Zero-Shot Action Recognition (ZSAR) models, particularly their inability to generalize to novel action categories and variations in human body orientation and camera viewpoints during inference. Most ZSAR methods assume that actions are observed under similar geometric conditions as during training, which is often not the case in real-world applications. The authors propose a solution to enhance the robustness of ZSAR systems against these domain shifts, thereby improving their applicability in practical scenarios. This work is presented as a preprint and has not yet undergone peer review.

Method
The authors introduce an orientation-aware action recognition framework that integrates motion cues from multiple camera viewpoints with textual descriptions of actions. The core technical contribution is the development of an orientation-aware motion encoding network, which is designed to extract distinct motion features based on the orientation of the human subject. Additionally, the authors adapt a specific orientation-aware text prompt that aligns with the learned motion features during inference. The training process leverages a diverse dataset that includes various orientations and viewpoints, although specific details regarding the dataset size and training compute are not disclosed.

Results
The proposed method demonstrates significant improvements in ZSAR performance across multiple benchmarks. On the NTU-RGB+D dataset, the model achieves a recognition accuracy of X%, surpassing the previous state-of-the-art by Y%. Similarly, on the BABEL and NW-UCLA datasets, the method outperforms existing approaches, with effect sizes indicating a Z% improvement in recognition rates. The authors also report competitive performance on two surveillance datasets, showcasing the model’s strong transfer learning capabilities for both cross-domain and same-domain action recognition tasks.

Limitations
The authors acknowledge that their approach may still struggle with highly complex actions or those that are not well-represented in the training data. They also note that the reliance on multiple camera viewpoints may limit the applicability of the method in scenarios where such data is not available. Additionally, the paper does not provide detailed information on the computational resources required for training or the scalability of the model to larger datasets.

Why it matters
This work has significant implications for the development of robust human action recognition systems that can operate effectively in diverse real-world environments. By addressing the challenges posed by domain shifts and unseen actions, the proposed orientation-aware framework enhances the potential for deploying ZSAR models in applications such as surveillance, human-computer interaction, and autonomous systems. The integration of motion and textual information also opens avenues for future research in multimodal learning and action recognition.

Authors: Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux
Source: arXiv:2605.22697
URL: https://arxiv.org/abs/2605.22697v1

By Callan Zhang · May 21, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV