DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

Wenqiang Sun *, Shuo Chen *, Fangfu Liu*, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang

* Equal Contribution
Corresponding author

arXiv

TL;DR: Create 3D and 4D scenes from a single image with controllable video diffusion.

Video Demo

Any Camera Control Video Generation

Spatial-Temporal Fused Controllable Video Generation

Camera Static
Camera Orbit Right
Camera Orbit Left
Camera Zoom In

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

Single View 3D Generation (360 Degree Orbit)

Prompt: In the mesmerizing nightscape, a colossal whale glides gracefully through the star-studded sky, its vast, textured body illuminated by the soft, ethereal glow of the moon. The city below, a sprawling metropolis of towering skyscrapers, twinkles with countless lights, creating a captivating contrast between the urban jungle and the serene marine giant. The sky, painted in deep shades of blue and adorned with twinkling stars, adds a dreamlike quality to the scene. The whale, seemingly in motion, appears to be swimming through the clouds, its majestic form a surreal and awe-inspiring sight against the backdrop of the illuminated cityscape.

Sparse View 3D Scene Generation

Two Input Views.

4D Scene Generation

Front Video
Novel View Video 1
Novel View Video 2
Novel View Video 3

Prompt: In a cozy, well-lit kitchen, a man in a black apron and blue cap is meticulously crafting a cocktail. He stands behind a white countertop, expertly pouring a rich, amber liquid from a shaker into a martini glass. The scene is filled with various bottles of alcohol, a juicer, and other bar tools, indicating a well-equipped home bar. The window behind him reveals a serene suburban view, adding a touch of calm to the focused atmosphere. His precise movements and the array of ingredients suggest a passion for mixology, creating a moment of artistry in an everyday setting.

Pipeline

Pipeline of DimensionX. The framework has three parts.

(a) ST-Director for Controllable Video Generation. ST-Director decouples the spatial and temporal factors in a video diffusion model by learning dimension-aware LoRAs on our collected dimension-variant datasets.

(b) 3D Scene Generation with S-Director. Given a single view, a high-quality 3D scene is recovered from the video frames generated by S-Director.

(c) 4D Scene Generation with ST-Director. Given a single image, T-Director produces a temporally variant video, from which a key frame is selected to generate a spatially variant reference video. Guided by this reference video, S-Director generates a spatially variant video for each frame, and these per-frame videos are combined into multi-view videos. After multi-loop refinement by T-Director, the resulting consistent multi-view videos are used to optimize the 4D scene.
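The control flow of part (c) can be illustrated with a small Python sketch. It is not the authors' implementation: the function names (`t_director`, `s_director`, `refine_with_t_director`, `multi_view_videos`), the `Frame` placeholder, the choice of the first frame as key frame, and the number of refinement loops are all assumptions made for illustration. The stand-in functions only track which (time, view) index each frame occupies; real video diffusion models would take their place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    """Placeholder for an image, identified only by its time and view index."""
    t: int  # time index
    v: int  # camera-view index


def t_director(image: Frame, num_frames: int) -> List[Frame]:
    """Hypothetical stand-in for T-Director: camera fixed, time varies."""
    return [Frame(t=image.t + i, v=image.v) for i in range(num_frames)]


def s_director(image: Frame, num_views: int,
               reference: Optional[List[Frame]] = None) -> List[Frame]:
    """Hypothetical stand-in for S-Director: time frozen, camera varies.

    `reference` marks where the reference video would guide generation;
    this stub ignores it.
    """
    return [Frame(t=image.t, v=image.v + j) for j in range(num_views)]


def refine_with_t_director(video: List[Frame]) -> List[Frame]:
    """Hypothetical refinement pass; a real model would improve consistency."""
    return list(video)


def multi_view_videos(image: Frame, num_frames: int, num_views: int,
                      key_index: int = 0, num_loops: int = 2) -> List[List[Frame]]:
    # 1. Temporally variant video from the single input image.
    temporal_video = t_director(image, num_frames)
    # 2. Select a key frame (assumed: by index) and build the reference video.
    key_frame = temporal_video[key_index]
    reference_video = s_director(key_frame, num_views)
    # 3. One spatially variant video per frame, guided by the reference.
    per_frame = [s_director(f, num_views, reference=reference_video)
                 for f in temporal_video]
    # 4. Regroup frames-by-time into videos-by-view (a transpose).
    videos = [list(column) for column in zip(*per_frame)]
    # 5. Multi-loop refinement of each view's video.
    for _ in range(num_loops):
        videos = [refine_with_t_director(video) for video in videos]
    # The result would then be passed to 4D scene optimization (not shown).
    return videos


videos = multi_view_videos(Frame(t=0, v=0), num_frames=4, num_views=3)
```

In this sketch `videos[v][t]` is the frame for view `v` at time `t`, so each inner list is one fixed-camera video across time, which is the layout a 4D optimization stage would consume.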

X Family

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

More X Family coming soon...

Citation

@misc{sun2024dimensionxcreate3d4d,
    title={DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion}, 
    author={Wenqiang Sun and Shuo Chen and Fangfu Liu and Zilong Chen and Yueqi Duan and Jun Zhang and Yikai Wang},
    year={2024},
    eprint={2411.04928},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.04928}, 
}