

Unified video model for understanding, high-fidelity generation, and precise free-form editing via a dual-stream architecture.

UniVideo is a unified multimodal video framework that combines video understanding, generation, and editing in a single model. It uses a dual-stream design: a Multimodal Large Language Model (MLLM) interprets the instruction, and a Multimodal DiT (MMDiT) backbone generates the video. This pairing lets the model interpret complex multimodal instructions accurately while preserving temporal and visual consistency. UniVideo is trained jointly across diverse video and image tasks and supports text/image-to-video generation, in-context video generation, visual-prompt-based generation, and free-form editing, including composed tasks such as editing combined with style transfer. The unified instruction paradigm and joint training give it strong transfer and task-composition capabilities, so it can handle novel editing requests (e.g., green-screening or material replacement) even without explicit supervision for those video-editing tasks.
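The dual-stream data flow described above can be pictured with a toy sketch. This is not UniVideo's code or API: the names `Instruction`, `understanding_stream`, `generation_stream`, and `run` are hypothetical, and plain string handling stands in for the MLLM and MMDiT models. It only illustrates the idea that one stream turns a multimodal instruction into conditioning, and a second stream produces every frame from that same conditioning.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instruction:
    """Hypothetical container for a multimodal instruction."""
    text: str
    images: List[str] = field(default_factory=list)  # placeholder image names
    video: Optional[str] = None                      # placeholder source-video name


def understanding_stream(instruction: Instruction) -> List[str]:
    """Stand-in for the MLLM: map the instruction to conditioning tokens."""
    tokens = instruction.text.lower().split()
    tokens += [f"<img:{name}>" for name in instruction.images]
    if instruction.video is not None:
        tokens.append(f"<vid:{instruction.video}>")
    return tokens


def generation_stream(conditioning: List[str], num_frames: int) -> List[str]:
    """Stand-in for the MMDiT: emit one placeholder frame per time step.

    Every frame is derived from the same conditioning, which is a crude
    analogue of keeping the output consistent across time.
    """
    summary = " ".join(conditioning)
    return [f"frame {t}: {summary}" for t in range(num_frames)]


def run(instruction: Instruction, num_frames: int = 4) -> List[str]:
    """Chain the two streams: understand first, then generate."""
    return generation_stream(understanding_stream(instruction), num_frames)


if __name__ == "__main__":
    request = Instruction("Replace the sky", images=["ref.png"], video="clip.mp4")
    for frame in run(request, num_frames=3):
        print(frame)
```

In this sketch, generation and editing differ only in what the instruction carries (with or without a source video), which loosely mirrors the unified instruction paradigm the description mentions.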

