TL;DR: We propose Context Forcing, a framework that enables consistent long-video generation by aligning student and teacher context lengths. Paired with a Slow-Fast Memory system, it supports effective context lengths $2\text{--}10\times$ longer than those of current state-of-the-art methods.
Existing methods face an unavoidable trade-off:
Restricting the model to a short memory window minimizes error accumulation, but causes the model to lose track of previous subjects and scenes during long rollouts.
Maintaining a long context preserves more previous information, but exposes the model to more accumulated errors, so the video distribution progressively drifts away from the real data manifold.
Prompt: A fairy tale-style illustration in soft pastel colors of a princess with long golden hair gently brushing it in a garden. She wears a flowing white gown with intricate floral patterns and a delicate crown adorned with gemstones. Her fair skin and expressive eyes reflect a mix of serenity and concentration. The garden is filled with blooming flowers and lush greenery, with a small pond in the background. A slight breeze rustles the leaves, adding a sense of natural movement. The princess stands in a medium shot, with a close-up of her face and hands.
Training paradigms for AR video diffusion models. (a) Self-forcing: A student matches a teacher capable of generating only 5s videos, using a 5s self-rollout. (b) LongLive: The student performs long rollouts supervised by a memoryless 5s teacher on random chunks. The teacher's inability to see beyond its 5s window creates a student-teacher mismatch. (c) Context Forcing (Ours): The student is supervised by a long-context teacher aware of the full generation history, resolving the mismatch in (b).
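The mismatch described in this caption can be made concrete with a toy calculation. The sketch below is illustrative only and is not the training code: the function names, the `Span` type, and the idea of measuring "unseen history" in seconds are assumptions introduced here. It also treats the Context Forcing teacher as seeing the whole history directly, whereas in practice that history is held in a managed context memory.

```python
from dataclasses import dataclass

# From the caption: the base teacher can only cover a 5 s window.
TEACHER_WINDOW_S = 5.0


@dataclass(frozen=True)
class Span:
    """A time interval [start, end) in seconds (hypothetical helper type)."""
    start: float
    end: float


def teacher_view(paradigm: str, chunk_start: float) -> Span:
    """Interval of video the teacher can see while scoring one 5 s chunk.

    `chunk_start` is where the supervised chunk begins in the student's rollout.
    """
    chunk_end = chunk_start + TEACHER_WINDOW_S
    if paradigm == "self_forcing":
        # The rollout itself is only 5 s long, so the chunk is the whole video.
        return Span(0.0, TEACHER_WINDOW_S)
    if paradigm == "longlive":
        # Memoryless teacher: sees the sampled chunk and nothing before it.
        return Span(chunk_start, chunk_end)
    if paradigm == "context_forcing":
        # Long-context teacher: sees the history plus the chunk.
        return Span(0.0, chunk_end)
    raise ValueError(f"unknown paradigm: {paradigm}")


def unseen_history(paradigm: str, chunk_start: float) -> float:
    """Seconds of already-generated video that the teacher cannot see."""
    return teacher_view(paradigm, chunk_start).start


if __name__ == "__main__":
    for name in ("longlive", "context_forcing"):
        print(name, unseen_history(name, chunk_start=30.0))
```

Under these assumptions, a chunk sampled 30 s into a rollout leaves 30 s of history invisible to a memoryless teacher and none invisible to a long-context teacher.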
Our method enables stable minute-level video generation with high subject and background consistency across diverse scenarios. In contrast, LongLive exhibits flashback artifacts (e.g., at 50s) and fails to preserve consistent subject and background details throughout long sequences.
Prompt: A bustling downtown street at dusk, filled with cars and pedestrians moving through the scene. The street is lined with skyscrapers, their illuminated windows casting reflections on the pavement below. The camera captures a dynamic medium shot, showing the intersection of the street where people walk and vehicles pass, creating a lively and energetic atmosphere. The light from the buildings creates a warm glow, with the contrast between the bright lights and the fading daylight adding depth to the scene.
Prompt: A vibrant cartoon-style illustration depicting a kangaroo performing a lively disco dance. The kangaroo has a joyful expression, with large, expressive eyes and a mischievous grin. It wears a colorful sequined outfit with sparkles, including a glittery top and matching pants. Its tail is fluffed out and swaying rhythmically. The kangaroo moves with natural fluidity, one foot lifted and the other stepping forward. The background features a blurred dance floor with colorful lights and dancing figures, creating a festive atmosphere. The illustration has a smooth, hand-drawn style with exaggerated proportions. A dynamic close-up shot from a slightly elevated angle.
Our method enables minute-level video generation with minimal drifting and high subject and background consistency across diverse scenarios.
Prompt: A vibrant illustration in the style of a comic book, depicting a toy robot wearing purple overalls and cowboy boots taking a pleasant stroll in Johannesburg, South Africa, during a winter storm. The robot has a friendly, curious expression, its arms swinging gently as it walks down a bustling street. The background shows a cityscape with tall buildings, some partially obscured by heavy snowflakes and swirling winds. The streets are lined with cars and pedestrians sheltering under umbrellas, adding to the lively scene. The sky is dark and stormy, with lightning flashing intermittently. A dynamic medium shot from a slightly elevated angle, capturing the robot's movement and the bustling urban environment.
Prompt: A dynamic and lively moment captured in a vibrant pop art style, showing a young woman jumping up and down with joy, her movements full of energy and excitement. She dances energetically, her arms flailing and legs kicking in the air. Her face is filled with happiness and a wide smile. She wears a colorful floral dress that flows with her movements. The background features a blurred cityscape with hints of tall buildings and bright lights, giving the scene a bustling urban feel. A mid-shot from a slightly low angle, capturing the full range of her joyful dance.
Context Forcing and Context Management System. We use the KV cache as the context memory and organize it into three parts: sink, slow memory, and fast memory. During contextual DMD training, the long teacher supervises the long student while using the same context memory mechanism.
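A minimal sketch of a three-part context memory of this kind is given below. Only the sink / slow / fast split comes from the caption; everything else is an assumption made for illustration: the class name `SlowFastCache`, the sizes, the FIFO fast memory, and the fixed-stride rule for promoting evicted entries into slow memory are hypothetical and may differ from the actual policy. Entries are opaque placeholders standing in for per-chunk key/value tensors.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List


@dataclass
class SlowFastCache:
    """Toy context memory with sink, slow, and fast parts (illustrative only)."""
    sink_size: int = 2    # earliest entries, kept permanently
    slow_size: int = 4    # sparse long-range entries
    fast_size: int = 6    # most recent entries
    slow_stride: int = 3  # hypothetical rule: keep every 3rd evicted entry
    sink: List[Any] = field(default_factory=list)
    slow: Deque[Any] = field(default_factory=deque)
    fast: Deque[Any] = field(default_factory=deque)
    _evicted: int = 0

    def append(self, kv: Any) -> None:
        # Fill the sink first; it is never evicted.
        if len(self.sink) < self.sink_size:
            self.sink.append(kv)
            return
        # New entries enter fast memory, which behaves as a FIFO window.
        self.fast.append(kv)
        if len(self.fast) > self.fast_size:
            old = self.fast.popleft()
            # Assumed promotion policy: subsample evicted entries into slow memory.
            if self._evicted % self.slow_stride == 0:
                self.slow.append(old)
                if len(self.slow) > self.slow_size:
                    self.slow.popleft()
            self._evicted += 1

    def context(self) -> List[Any]:
        """Entries visible to attention, ordered sink, slow, fast."""
        return self.sink + list(self.slow) + list(self.fast)


if __name__ == "__main__":
    cache = SlowFastCache()
    for chunk_id in range(20):
        cache.append(chunk_id)
    print(cache.context())
```

With this layout the attention context stays bounded (at most `sink_size + slow_size + fast_size` entries) while still spanning far more of the rollout than a fast window alone would cover.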
While Context Forcing improves consistency and reduces drifting in long-context scenarios, it is not immune to errors. Residual drifting persists in complex cases, and fidelity is imperfect: fine-grained details are occasionally omitted. Future research can investigate more advanced context compression techniques to improve information retention and robustness against drift.
@misc{chen2026contextforcingconsistentautoregressive,
title={Context Forcing: Consistent Autoregressive Video Generation with Long Context},
author={Shuo Chen and Cong Wei and Sun Sun and Ping Nie and Kai Zhou and Ge Zhang and Ming-Hsuan Yang and Wenhu Chen},
year={2026},
eprint={2602.06028},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.06028},
}