
Mira: A Mini-step Towards Sora-like Long Video Generation

Zhaoyang Zhang1*, Ziyang Yuan1*, Xuan Ju1, Yiming Gao1, Xintao Wang1#, Chun Yuan, Ying Shan1
1ARC Lab, Tencent PCG *Equal contribution #Project lead

Project Page MiraData Page

We introduce Mira (Mini-Sora), an initial foray into the realm of high-quality, long-duration video generation in the style of Sora. Mira stands out from existing text-to-video (T2V) generation frameworks in several key ways:

  • Extended sequence length: While most frameworks are limited to generating short videos (2 seconds / 16 frames), Mira is designed to produce significantly longer sequences, potentially lasting 10 seconds, 20 seconds, or more.

  • Enhanced dynamics: Mira has the capability to create videos with rich dynamics and intricate motions, setting it apart from the more static outputs of current video generation technologies.

  • Strong 3D consistency: Despite the intricate dynamics and object interactions, Mira ensures the 3D integrity of objects is preserved throughout the video, avoiding noticeable distortions.

Please note that our work on Mira is in an experimental phase. There are several areas where Sora still significantly outperforms Mira and other open-source T2V frameworks, including:

  • Interactive objects and environments: Sora supports the generation of videos where objects and surroundings engage in dynamic interactions, adding a layer of complexity and realism.

  • Sustained object consistency: Sora maintains consistent object shapes, even when they temporarily exit and re-enter the frame, ensuring continuity and coherence.

The Mira project is our endeavor to investigate and refine the entire data-model-training pipeline for Sora-like, lightweight T2V frameworks, and to preliminarily demonstrate the aforementioned Sora characteristics. Our goal is to foster innovation and democratize the field of content creation, paving the way for more accessible and advanced video generation tools.

Results

10s 384×240

mira-384-v0.mp4

Each individual video can be downloaded from here.

20s 128×80

mira-128-v0.mp4

📰 Updates

Stay tuned! We are actively working on this project: expect a steady stream of updates as we expand our dataset, enhance our annotation processes, and refine our model checkpoints.

[2024.04.01] 🔥 We're delighted to announce the release of Mira and MiraData-v0. This release offers a comprehensive open-source suite of data-annotation and training pipelines, specifically tailored for the creation of long-duration videos with dynamic content and consistent quality. The provided code and checkpoints let users generate videos of up to 20 seconds at 128x80 resolution and 10 seconds at 384x240 resolution. Dive into the future of video generation with Mira!

Installation

## create a conda environment
conda update -n base -c defaults conda
conda create -y -n mira python=3.8.5
source activate mira

## install dependencies
pip install torch==2.0 torchvision torchaudio decord==0.6.0 \
einops==0.3.0 imageio==2.9.0 \
numpy omegaconf==2.1.1 opencv_python pandas \
Pillow==9.5.0 pytorch_lightning==1.9.0 PyYAML==6.0 setuptools==65.6.3 \
tqdm==4.65.0 transformers==4.25.1 moviepy av tensorboardx \
&& pip install timm scikit-learn open_clip_torch==2.22.0 kornia simplejson easydict pynvml rotary_embedding_torch==0.3.1 triton cached_property \
&& pip install xformers==0.0.18 \
&& pip install taming-transformers fairscale deepspeed diffusers
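
After installing, a quick sanity check that the pinned packages import cleanly can save a failed run later. This optional Python snippet is a sketch, not part of the repo:

# optional import sanity check
import torch
import pytorch_lightning as pl
import xformers

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"pytorch_lightning {pl.__version__}, xformers {xformers.__version__}")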

Training

Checkpoints

Name       Model Size   Data                              Resolution
128-v0.pt  1.1B         WebVid (pretrain) + MiraData-v0   128x80, 120 frames
384-v0.pt  1.1B         WebVid (pretrain) + MiraData-v0   384x240, 60 frames

Please download the above checkpoints from our Hugging Face page.
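
Both checkpoints work out to roughly 6 fps (120 frames over 20 seconds, 60 frames over 10 seconds). To fetch a checkpoint programmatically, a minimal sketch using huggingface_hub follows; the repo id below is an assumption, so substitute the actual repository from our Hugging Face page:

# hypothetical download helper -- repo_id is an assumption
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="mira-space/Mira",  # assumption: check the project's Hugging Face page
    filename="384-v0.pt",       # checkpoint names from the table above
)
print(f"checkpoint saved to {ckpt_path}")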

Finetuning the Mira-v0 model at 128x80 resolution.

  • Add the paths to your datasets and the pretrained models in config_128_mira.yaml.
  • Then run the following commands:
## activate environment
conda activate mira

## Run training
bash configs/Mira/run_128_mira.sh 0
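
If you prefer to set these paths programmatically rather than editing the YAML by hand, a minimal sketch using OmegaConf (already a pinned dependency) could look like the following. The key names data.path and model.pretrained_checkpoint are illustrative assumptions, not the repo's actual schema:

# hypothetical config patch -- key names below are assumptions
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/Mira/config_128_mira.yaml")
overrides = OmegaConf.from_dotlist([
    "data.path=/path/to/MiraData-v0",                  # assumed dataset-root key
    "model.pretrained_checkpoint=/path/to/128-v0.pt",  # assumed checkpoint key
])
OmegaConf.save(OmegaConf.merge(cfg, overrides), "configs/Mira/config_128_mira.yaml")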

Finetuning the Mira-v0 model at 384x240 resolution.

  • Add the paths to your datasets and the pretrained models in config_384_mira.yaml.
  • Then run the following commands:
## activate environment
conda activate mira

## Run training
bash configs/Mira/run_384_mira.sh 0

Inference

Evaluate the Mira-v0 model at 128x80 resolution.

## activate environment
conda activate mira

## Run inference
bash configs/inference/run_text2video.sh

Evaluate the Mira-v0 model at 384x240 resolution.

## activate environment
conda activate mira

## Run inference
bash configs/inference/run_text2video_384.sh
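
To verify that a generated clip has the expected length and frame rate, a quick check with decord (a pinned dependency) might look like this; the output path is an assumption, so point it at wherever your run saved its results:

# hypothetical output path -- adjust to your inference results location
from decord import VideoReader

vr = VideoReader("results/mira-384-v0.mp4")  # assumed output path
fps = vr.get_avg_fps()
print(f"{len(vr)} frames at {fps:.1f} fps -> {len(vr) / fps:.1f} s")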

Current Limitations

Mira-v0 represents our initial exploration into developing a Sora-like Text-to-Video (T2V) pipeline. Through this process, we have identified several areas for improvement in the current version:

  • Enhanced motion dynamics and scene intricacy at the expense of generic object generation. The Mira-v0 model, fine-tuned on the potentially limited MiraData-v0, shows a reduced capability in generating a diverse range of objects compared with the WebVid-pretrained MiraDiT. It has, however, shown notable advances in motion dynamics, scene detail, and three-dimensional consistency.
[Example clips, 10s at 384x240, with prompts: "A cute dog sniffing around the sandy coast."; "A serene underwater scene featuring a sea turtle swimming through a coral reef."; "A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle is with its greenish-brown shell."]
  • Architecture design. The current ST-DiT-based model architecture lacks sophisticated spatial-temporal interactions.

  • Reconstruction artifacts. We are dedicated to further tuning the video VAE to mitigate reconstruction artifacts.

  • Sustained object consistency. Due to resource limitations, our present MiraDiT employs distinct modules for spatial and temporal processing, which may affect the stability of object representation in longer, dynamic video sequences (see the illustrative sketch after this list).

  • At this stage, aspects such as image quality (resolution, clarity) and text alignment have not been our focus, but they remain important considerations for future updates.
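
For context on the architecture and consistency points above, a factorized design applies attention separately along the spatial and temporal axes. The sketch below illustrates that general pattern; it is an assumption-laden illustration, not MiraDiT's actual implementation:

# illustrative factorized spatial/temporal attention block -- not the repo's code
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, t, s, d = x.shape
        # spatial attention: tokens within each frame attend to each other
        xs = x.reshape(b * t, s, d)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        # temporal attention: each spatial token attends across frames
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

Because the two attentions never mix space and time jointly, information propagates between a token's spatial and temporal neighborhoods only indirectly, which is one way to read the consistency limitation noted above.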
