Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video
Arslan Artykov, Tom Ravaud, Nicolás Violante-Grezzi, Vincent Lepetit
- Published
- May 18, 2026 — 16:52 UTC
Problem
This paper addresses the challenge of retrieving 3D kinematics of articulated objects from monocular video, a fundamental issue in computer vision. Existing methods often require complex setups or rely on cues like long-term point tracking and wide-baseline matching, which can fail under conditions of severe occlusion, rapid camera motion, or weak local features. Additionally, learning-based approaches typically struggle with generalization beyond their training categories. This work presents a category-agnostic optimization framework, which is particularly relevant as it is a preprint and has not yet undergone peer review.
Method
The authors propose a novel approach that frames articulated object understanding as a primitive-fitting problem. The core technical contribution involves using geometric primitives as a proxy representation, which mitigates the issues associated with unstable point tracks. The method organizes these primitives into coherent parts constrained by revolute and prismatic joints. The optimization framework jointly addresses part segmentation and joint parameter recovery, enabling the extraction of complex kinematics from a single casually captured video. A visibility-aware procedure is introduced to manage partial observations and occlusions, which are common in real-world scenarios. The authors also introduce two new benchmarks, AiP-synth and AiP-real, designed to evaluate performance under significant camera motion and heavy occlusions.
Results
The proposed method demonstrates superior performance compared to existing state-of-the-art techniques on the newly introduced AiP-synth and AiP-real benchmarks. While specific numerical results are not detailed in the abstract, the authors claim to outperform existing methods, indicating a significant improvement in handling occlusions and camera motion. The effectiveness of the visibility-aware procedure is particularly highlighted, suggesting that it contributes to the robustness of the kinematic recovery process.
Limitations
The authors acknowledge that their approach may still struggle with extreme occlusions or highly complex articulated structures that do not conform to the assumed joint constraints. Additionally, the reliance on geometric primitives may limit the method’s applicability to certain types of articulated objects that do not fit well within the defined primitive categories. The paper does not discuss the computational efficiency or scalability of the proposed method, which could be a concern for real-time applications.
Why it matters
This work has significant implications for downstream applications in robotics, augmented reality, and animation, where understanding articulated object motion from casual video input is crucial. By providing a robust framework for kinematic recovery that is less dependent on complex setups and more resilient to occlusions, this research could facilitate advancements in autonomous systems that require real-time understanding of dynamic environments. The introduction of the AiP-synth and AiP-real benchmarks also sets a new standard for evaluating articulated object understanding, potentially guiding future research in this domain.
Authors: Arslan Artykov, Tom Ravaud, Nicolás Violante-Grezzi, Vincent Lepetit
Source: arXiv:2605.18645v1
URL: https://arxiv.org/abs/2605.18645v1
By Turing Wire editorial staff · May 18, 2026 · Editorial standards →
Source: arXiv cs.CV