Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Ashkan Taghipour1, Morteza Ghahremani2, Zinuo Li1, Hamid Laga3, Farid Boussaid1, Mohammed Bennamoun1
1The University of Western Australia 2Technical University of Munich 3Murdoch University

Abstract

Generating videos of complex human motions—such as flips, cartwheels, and martial arts—remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to supply complete skeleton sequences that are costly to produce for long, dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions, predicting each joint conditioned on previously generated poses to capture long-range temporal dependencies and inter-joint coordination in complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence, employing DiNO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset of 2,000 videos featuring diverse characters performing acrobatic and stunt-like motions, providing full control over appearance, motion, and environment. This dataset fills a critical gap, as existing benchmarks severely under-represent acrobatic and stunt-like motions, while also avoiding the copyright and privacy concerns of web-collected data. Experiments on our synthetic dataset and the Motion-X Fitness benchmark demonstrate that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity, while our pose-to-video model achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.

Method

Text-to-Skeleton Generation (Training)
Text-to-Skeleton Generation (Inference)
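The abstract describes the first stage as an autoregressive model that predicts each pose conditioned on all previously generated poses, with teacher forcing at training time and free-running generation at inference. The paper's exact architecture is not given here, so the following is only a minimal sketch of that idea in PyTorch: a hypothetical `SkeletonAR` transformer decoder attends causally over past poses, uses a text embedding as cross-attention memory, and at inference feeds each predicted pose back into the sequence. All class and parameter names (`num_joints`, `text_dim`, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SkeletonAR(nn.Module):
    """Hypothetical sketch of autoregressive text-to-skeleton generation.

    Each 2D pose (num_joints x 2 coordinates, flattened) is predicted from
    the text embedding and all previously generated poses via a causally
    masked transformer decoder. Not the paper's actual architecture.
    """

    def __init__(self, num_joints=17, text_dim=64, hidden=128, layers=2, heads=4):
        super().__init__()
        self.pose_dim = num_joints * 2
        self.in_proj = nn.Linear(self.pose_dim, hidden)     # embed past poses
        self.text_proj = nn.Linear(text_dim, hidden)        # embed text condition
        dec_layer = nn.TransformerDecoderLayer(hidden, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.out_proj = nn.Linear(hidden, self.pose_dim)    # predict next pose

    def forward(self, poses, text_emb):
        # poses: (B, T, pose_dim); text_emb: (B, text_dim)
        tgt = self.in_proj(poses)
        mem = self.text_proj(text_emb).unsqueeze(1)         # text as memory tokens
        T = poses.shape[1]
        # Causal mask: position t may only attend to poses <= t (training
        # uses teacher forcing over the ground-truth sequence).
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(tgt, mem, tgt_mask=mask)
        return self.out_proj(h)                             # (B, T, pose_dim)

    @torch.no_grad()
    def generate(self, text_emb, start_pose, steps):
        # Inference: feed each predicted pose back in autoregressively.
        seq = start_pose.unsqueeze(1)                       # (B, 1, pose_dim)
        for _ in range(steps):
            next_pose = self.forward(seq, text_emb)[:, -1:]
            seq = torch.cat([seq, next_pose], dim=1)
        return seq                                          # (B, steps + 1, pose_dim)
```

Conditioning every step on the full generated prefix is what lets the model keep long motions (a flip, a full cartwheel) temporally coherent, at the cost of sequential inference.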
Pose-Conditioned Video Generation
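The second stage relies on DiNO-ALF, described in the abstract as a multi-level reference encoder with Adaptive Layer Fusion that preserves appearance under large pose changes. The fusion mechanism itself is not detailed here, so the sketch below only illustrates one common reading of "adaptive layer fusion": token features from several encoder levels of a DINO-style backbone are blended with learned softmax weights, so the video model can draw on both low-level texture and high-level identity cues from the reference image. The class `AdaptiveLayerFusion` and its interface are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveLayerFusion(nn.Module):
    """Hypothetical sketch of adaptive fusion over multi-level features.

    Given token maps from several encoder levels of a reference-image
    backbone, learn per-level softmax weights and blend the levels into a
    single feature map. Not the paper's actual DiNO-ALF module.
    """

    def __init__(self, num_levels, dim):
        super().__init__()
        # One learnable logit per encoder level; softmax keeps the
        # blend a convex combination of the levels.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))
        self.proj = nn.Linear(dim, dim)  # project fused tokens for the video model

    def forward(self, level_feats):
        # level_feats: list of (B, N, D) token maps, one per encoder level
        stacked = torch.stack(level_feats, dim=0)       # (L, B, N, D)
        w = torch.softmax(self.level_logits, dim=0)     # (L,)
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, N, D)
        return self.proj(fused)
```

Learning the level weights (rather than fixing them) lets the model shift toward deeper, more pose-invariant features when the subject is heavily occluded or far from the reference pose.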

Result Gallery