DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun^*, Shuo Chen^*, Fangfu Liu^*, Zilong Chen, Yueqi Duan, Jun Zhu^†, Jun Zhang^†, Yikai Wang^†

^* Equal Contribution
^† Corresponding author


ICCV 2025

TL;DR: Create 3D and 4D scenes from a single image with controllable video diffusion.

Any Camera Control Video Generation

Spatial-Temporal Fused Controllable Video Generation

Camera Static

Camera Orbit Right

Camera Orbit Left

Camera Zoom In

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

Single View 3D Generation (360 Degree Orbit)

Prompt: In the mesmerizing nightscape, a colossal whale glides gracefully through the star-studded sky, its vast, textured body illuminated by the soft, ethereal glow of the moon. The city below, a sprawling metropolis of towering skyscrapers, twinkles with countless lights, creating a captivating contrast between the urban jungle and the serene marine giant. The sky, painted in deep shades of blue and adorned with twinkling stars, adds a dreamlike quality to the scene. The whale, seemingly in motion, appears to be swimming through the clouds, its majestic form a surreal and awe-inspiring sight against the backdrop of the illuminated cityscape.

Sparse View 3D Scene Generation

Two Input Views.

Front Video

Novel View Video 1

Novel View Video 2

Novel View Video 3

Prompt: In a cozy, well-lit kitchen, a man in a black apron and blue cap is meticulously crafting a cocktail. He stands behind a white countertop, expertly pouring a rich, amber liquid from a shaker into a martini glass. The scene is filled with various bottles of alcohol, a juicer, and other bar tools, indicating a well-equipped home bar. The window behind him reveals a serene suburban view, adding a touch of calm to the focused atmosphere. His precise movements and the array of ingredients suggest a passion for mixology, creating a moment of artistry in an everyday setting.

Citation

@article{sun2024dimensionx,
  title={DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion},
  author={Sun, Wenqiang and Chen, Shuo and Liu, Fangfu and Chen, Zilong and Duan, Yueqi and Zhang, Jun and Wang, Yikai},
  journal={arXiv preprint arXiv:2411.04928},
  year={2024}
}
