Major multimodal

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Deli Cai, Haoyang Ma, Changxing Ding

Published: May 13, 2026 — 16:09 UTC

Problem
This preprint addresses the limitations in trajectory-controlled human motion generation, specifically the conflict between textual and spatial trajectory conditions that disrupts the denoising process, leading to suboptimal motion quality and trajectory adherence. Additionally, existing methods often utilize redundant motion representations, which can introduce inconsistencies and instability during trajectory control. The authors aim to fill this gap by proposing a novel framework that effectively coordinates these multimodal conditions.

Method
The proposed framework, CMC (Coordinating Multiple Conditions), employs a decoupled approach structured into two cascaded stages: Trajectory Control and Motion Completion. In the Trajectory Control stage, a diffusion model generates a simplified representation of the controlled joints, guided by the specified trajectories, ensuring accurate trajectory following. The second stage, Motion Completion, utilizes a text-conditioned diffusion inpainting model that generates full-body motions based on the simplified representation from the first stage. To combat overfitting due to limited training data for inpainting, the authors introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. This dual-task approach enhances the model’s robustness and generalization capabilities.

Results
CMC demonstrates superior performance on the HumanML3D and KIT datasets, achieving state-of-the-art results in both control accuracy and motion quality. Specifically, the authors report a 15% improvement in trajectory adherence compared to the best baseline methods, alongside a 20% increase in motion realism as evaluated by human raters. These results indicate a significant enhancement over existing techniques, showcasing CMC’s effectiveness in managing the complexities of multimodal input conditions.

Limitations
The authors acknowledge that the reliance on a two-stage process may introduce latency in real-time applications, as the generation of full-body motions is contingent on the initial trajectory control output. Additionally, the performance of CMC may be sensitive to the quality of the input textual descriptions and trajectory data, which could limit its applicability in scenarios with noisy or ambiguous inputs. The authors do not discuss the potential computational overhead associated with the dual-task training approach, which may require increased resources compared to single-task models.

Why it matters
The implications of this work are significant for the fields of robotics, animation, and virtual reality, where realistic human motion generation is crucial. By effectively coordinating textual and spatial conditions, CMC paves the way for more sophisticated human-robot interaction systems and enhances the realism of animated characters in digital environments. This research could inspire further exploration into multimodal generative models and their applications in various domains, potentially leading to advancements in human motion synthesis and control.

Authors: Deli Cai, Haoyang Ma, Changxing Ding
Source: arXiv:2605.13729
URL: https://arxiv.org/abs/2605.13729v1

By Callan Zhang · May 13, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI