Notable: agents, robotics

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

Genki Kinoshita, Shu Nakamura, Ryo Kawahara, Shohei Nobuhara, Yasutomo Kawanishi, Ko Nishino

Original source

arXiv cs.CV

https://arxiv.org/abs/2604.28173v1

Problem
This preprint targets a gap in human behavior modeling: existing methods rarely exploit the compositionality of human actions, yielding representations that generalize poorly across different actions. The authors propose a framework that captures atomic joint movements and their temporal compositions, which they argue is essential for understanding complex human behaviors.

Method
The core technical contribution is A4Mer, a nested latent Transformer that learns a hierarchical representation of human movements from 3D pose data in a fully self-supervised manner. A4Mer segments 3D pose sequences into variable-length segments, each represented as a latent token called an Action Atom; these Action Atoms are in turn composed into Action Motifs, which capture meaningful temporal patterns of body movement. Both levels are trained with a single pretext task: masked token prediction within the latent spaces of Action Atoms and Action Motifs. Training uses the authors' large-scale Action Motif Dataset (AMD), which comprises multi-view human behavior videos with comprehensive SMPL annotations, captured with novel camera placements on the feet to mitigate occlusions.
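The segment-then-mask pipeline described above can be illustrated with a toy sketch. Everything here is a hypothetical simplification, not the authors' A4Mer code: the segment boundaries are hand-picked rather than learned, and mean-pooling stands in for the Transformer encoder that produces Action Atom tokens.

```python
import numpy as np

def segment_sequence(poses, boundaries):
    """Split a (T, J, 3) pose sequence into variable-length segments
    at the given frame indices (toy stand-in for learned segmentation)."""
    return np.split(poses, boundaries, axis=0)

def encode_atoms(segments):
    """Toy 'Action Atom' encoder: flatten the joints of each segment
    and mean-pool over time into one latent token per segment."""
    return np.stack([s.reshape(s.shape[0], -1).mean(axis=0) for s in segments])

def mask_tokens(tokens, mask_ratio=0.4, seed=0):
    """Zero out a random subset of tokens, mimicking the masked
    token prediction pretext task over the latent token sequence."""
    rng = np.random.default_rng(seed)
    n_mask = max(1, int(round(len(tokens) * mask_ratio)))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = 0.0
    return masked, idx

# Example: 30 frames of 24 SMPL joints in 3D, split into 3 segments.
poses = np.random.default_rng(1).normal(size=(30, 24, 3))
atoms = encode_atoms(segment_sequence(poses, boundaries=[8, 19]))
masked, idx = mask_tokens(atoms)
print(atoms.shape)  # (3, 72): three atom tokens of dimension 24 * 3
```

In the actual method, a model would be trained to reconstruct the masked tokens from the unmasked context; here the masking step alone shows the shape of the pretext task.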

Results
A4Mer outperforms baseline methods across a range of human behavior modeling tasks, reportedly achieving state-of-the-art results in action recognition, motion prediction, and motion interpolation. The abstract does not give specific numbers or name the baselines, so the size of the gains cannot be verified from the summary alone. The authors attribute the improvements to the hierarchical representation's ability to extract meaningful Action Motifs that transfer to downstream tasks.
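One common way frozen self-supervised representations are evaluated on downstream action recognition is a simple probe over the learned embeddings. The sketch below uses a nearest-centroid classifier on synthetic, well-separated "motif embedding" clusters; the function names and data are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def nearest_centroid_fit(embeddings, labels):
    """Compute one centroid per class from frozen sequence embeddings."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(embeddings, classes, centroids):
    """Assign each embedding to the class of its closest centroid."""
    d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]

# Toy data: two well-separated embedding clusters for two actions.
rng = np.random.default_rng(0)
walk = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
jump = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
X = np.concatenate([walk, jump])
y = np.array([0] * 20 + [1] * 20)

classes, centroids = nearest_centroid_fit(X, y)
pred = nearest_centroid_predict(X, classes, centroids)
accuracy = (pred == y).mean()
print(accuracy)  # 1.0 on these well-separated toy clusters
```

The point of such a probe is that classification quality then reflects the representation itself rather than a heavily trained task head.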

Limitations
The authors acknowledge that the reliance on self-supervised learning may limit the model’s performance in scenarios where labeled data is available. Additionally, the complexity of the nested Transformer architecture may pose challenges in terms of computational efficiency and scalability. The dataset, while large, may still be limited in diversity, potentially affecting the generalizability of the learned representations. The authors do not address the potential impact of noise in pose estimation or the limitations of the SMPL model in capturing all human body variations.

Why it matters
This work has significant implications for advancing human behavior modeling, particularly in applications such as robotics, animation, and human-computer interaction. By providing a robust hierarchical representation that captures the compositional nature of human movements, A4Mer can facilitate more accurate and flexible models for understanding and predicting human actions. The introduction of the AMD dataset also contributes valuable resources for future research in this domain, potentially leading to further innovations in self-supervised learning techniques for motion analysis.


Published
Apr 30, 2026 — 17:55 UTC