Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Ashkan Taghipour1, Morteza Ghahremani2, Zinuo Li1, Hamid Laga3, Farid Boussaid1, Mohammed Bennamoun1
1The University of Western Australia 2Technical University of Munich 3Murdoch University

Abstract

Generating videos of complex human motions—such as flips, cartwheels, and martial arts—remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to supply complete skeleton sequences that are costly to produce for long, dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions, predicting each joint conditioned on previously generated poses to capture long-range temporal dependencies and inter-joint coordination in complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence, employing DiNO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset of 2,000 videos featuring diverse characters performing acrobatic and stunt-like motions, providing full control over appearance, motion, and environment. This dataset fills a critical gap, as existing benchmarks severely under-represent acrobatic and stunt-like motions, while also avoiding the copyright and privacy concerns of web-collected data. Experiments on our synthetic dataset and the Motion-X Fitness benchmark demonstrate that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity, while our pose-to-video model achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.

Method

Text-to-Skeleton Generation (Training)
Text-to-Skeleton Generation (Inference)
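The abstract describes the first stage as an autoregressive model that predicts each pose conditioned on all previously generated poses, with teacher forcing at training time and free-running generation at inference. The paper's exact architecture is not given here, so the following is only a minimal sketch of that idea in PyTorch: a hypothetical `SkeletonAR` transformer decoder attends causally over past poses, uses a text embedding as cross-attention memory, and at inference feeds each predicted pose back into the sequence. All class and parameter names (`num_joints`, `text_dim`, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SkeletonAR(nn.Module):
    """Hypothetical sketch of autoregressive text-to-skeleton generation.

    Each 2D pose (num_joints x 2 coordinates, flattened) is predicted from
    the text embedding and all previously generated poses via a causally
    masked transformer decoder. Not the paper's actual architecture.
    """

    def __init__(self, num_joints=17, text_dim=64, hidden=128, layers=2, heads=4):
        super().__init__()
        self.pose_dim = num_joints * 2
        self.in_proj = nn.Linear(self.pose_dim, hidden)     # embed past poses
        self.text_proj = nn.Linear(text_dim, hidden)        # embed text condition
        dec_layer = nn.TransformerDecoderLayer(hidden, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.out_proj = nn.Linear(hidden, self.pose_dim)    # predict next pose

    def forward(self, poses, text_emb):
        # poses: (B, T, pose_dim); text_emb: (B, text_dim)
        tgt = self.in_proj(poses)
        mem = self.text_proj(text_emb).unsqueeze(1)         # text as memory tokens
        T = poses.shape[1]
        # Causal mask: position t may only attend to poses <= t (training
        # uses teacher forcing over the ground-truth sequence).
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(tgt, mem, tgt_mask=mask)
        return self.out_proj(h)                             # (B, T, pose_dim)

    @torch.no_grad()
    def generate(self, text_emb, start_pose, steps):
        # Inference: feed each predicted pose back in autoregressively.
        seq = start_pose.unsqueeze(1)                       # (B, 1, pose_dim)
        for _ in range(steps):
            next_pose = self.forward(seq, text_emb)[:, -1:]
            seq = torch.cat([seq, next_pose], dim=1)
        return seq                                          # (B, steps + 1, pose_dim)
```

Conditioning every step on the full generated prefix is what lets the model keep long motions (a flip, a full cartwheel) temporally coherent, at the cost of sequential inference.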
Pose-Conditioned Video Generation
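The second stage relies on DiNO-ALF, described in the abstract as a multi-level reference encoder with Adaptive Layer Fusion that preserves appearance under large pose changes. The fusion mechanism itself is not detailed here, so the sketch below only illustrates one common reading of "adaptive layer fusion": token features from several encoder levels of a DINO-style backbone are blended with learned softmax weights, so the video model can draw on both low-level texture and high-level identity cues from the reference image. The class `AdaptiveLayerFusion` and its interface are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveLayerFusion(nn.Module):
    """Hypothetical sketch of adaptive fusion over multi-level features.

    Given token maps from several encoder levels of a reference-image
    backbone, learn per-level softmax weights and blend the levels into a
    single feature map. Not the paper's actual DiNO-ALF module.
    """

    def __init__(self, num_levels, dim):
        super().__init__()
        # One learnable logit per encoder level; softmax keeps the
        # blend a convex combination of the levels.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))
        self.proj = nn.Linear(dim, dim)  # project fused tokens for the video model

    def forward(self, level_feats):
        # level_feats: list of (B, N, D) token maps, one per encoder level
        stacked = torch.stack(level_feats, dim=0)       # (L, B, N, D)
        w = torch.softmax(self.level_logits, dim=0)     # (L,)
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, N, D)
        return self.proj(fused)
```

Learning the level weights (rather than fixing them) lets the model shift toward deeper, more pose-invariant features when the subject is heavily occluded or far from the reference pose.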

Result Gallery