VidToMe: Video Token Merging for Zero-Shot Video Editing (CVPR 2024)
Official Pytorch Implementation

Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang

Project Page | Paper | Summary Video

Also check VISION-SJTU/VidToMe

VidToMe merges similar self-attention tokens across frames, improving temporal consistency while reducing memory consumption.
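To make the idea of merging similar tokens across frames concrete, here is a minimal sketch in plain Python. It is not the code in this repository: it assumes tokens are small lists of floats, matches each token of one frame to its most similar token in another frame by cosine similarity, and averages the `r` best-matching pairs. The names `cosine`, `merge_across_frames`, and `r` are hypothetical and chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_across_frames(src_tokens, dst_tokens, r):
    """Merge the r source-frame tokens most similar to a destination-frame token.

    Returns (merged_dst, kept_src): the destination tokens after averaging in
    their matched source tokens, and the source tokens that were left unmerged.
    """
    # For every source token, find its best match in the destination frame.
    matches = []
    for i, s in enumerate(src_tokens):
        sims = [cosine(s, d) for d in dst_tokens]
        j = max(range(len(dst_tokens)), key=sims.__getitem__)
        matches.append((sims[j], i, j))

    # Only the r most similar pairs are merged; the rest stay untouched.
    to_merge = sorted(matches, reverse=True)[:r]
    merged_src = {i for _, i, _ in to_merge}

    # Average each destination token with the source tokens assigned to it.
    sums = [list(d) for d in dst_tokens]
    counts = [1] * len(dst_tokens)
    for _, i, j in to_merge:
        sums[j] = [a + b for a, b in zip(sums[j], src_tokens[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]
    kept_src = [s for i, s in enumerate(src_tokens) if i not in merged_src]
    return merged_dst, kept_src


# Two frames with two tokens each; one redundant pair is merged.
dst = [[1.0, 0.0], [0.0, 1.0]]
src = [[2.0, 0.0], [-1.0, -1.0]]
merged_dst, kept_src = merge_across_frames(src, dst, r=1)
print(len(merged_dst) + len(kept_src), "tokens remain out of", len(dst) + len(src))
```

Attention then runs over fewer tokens, which is where the memory saving comes from, and the merged tokens are shared between frames, which is what encourages consistency.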

Abstract

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.
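The chunking step described in the abstract can be pictured with a short sketch. This is an illustration under assumptions, not the repository's implementation: `split_into_chunks` and `chunk_size` are hypothetical names, and the sketch only shows how frame indices might be grouped. Local merging would then operate within each group, and global merging across groups.

```python
def split_into_chunks(num_frames, chunk_size):
    """Return lists of consecutive frame indices, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(range(start, min(start + chunk_size, num_frames)))
        for start in range(0, num_frames, chunk_size)
    ]


# A 10-frame video split into chunks of up to 4 frames.
for chunk in split_into_chunks(10, 4):
    print(chunk)
```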

Updates

  • [02/2024] Code is released.
  • [02/2024] Accepted to CVPR 2024!
  • [12/2023] Release paper and website.

TODO

  • Release evaluation dataset and more examples.
  • Release evaluation code.

Setup

  1. Clone the repository.
git clone git@github.com:lixirui142/VidToMe.git
cd VidToMe
  2. Create a new conda environment and install PyTorch following the official PyTorch site. Then install the required packages with pip.
conda create -n vidtome python=3.9
conda activate vidtome
# Install torch, torchvision (https://pytorch.org/get-started/locally/)
pip install -r requirements.txt

We recommend installing xformers for fast and memory-efficient attention.
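After setup, a quick way to confirm that the environment has the packages mentioned above is a small check like the following. It is a convenience sketch, not part of the repository. `check_packages` is a hypothetical helper, and it only tests whether each package can be found for import, not whether the versions are compatible.

```python
import importlib.util


def check_packages(names=("torch", "torchvision", "xformers")):
    """Map each top-level package name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Run it inside the activated `vidtome` environment. A missing `xformers` is not an error, since it is only recommended.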

Run

python run_vidtome.py --config configs/tea-pour.yaml

See 'configs' for more config examples. The default config values are specified in 'default.yaml', with explanations.
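One common way a default-plus-override config setup like this is resolved is a recursive dictionary merge, sketched below. How this repository actually loads its YAML files is not stated here, so treat this as an assumption. `merge_config` and every key in the example (`seed`, `generation`, `steps`, `prompt`) are hypothetical.

```python
def merge_config(defaults, overrides):
    """Return a new dict where overrides replace defaults, recursing into nested dicts."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical keys, for illustration only.
defaults = {"seed": 0, "generation": {"steps": 50, "prompt": None}}
overrides = {"generation": {"prompt": "a hypothetical edit prompt"}}
print(merge_config(defaults, overrides))
```

Under this scheme, a per-video config only needs to list the values that differ from the defaults.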

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{li2024vidtome,
    title={VidToMe: Video Token Merging for Zero-Shot Video Editing},
    author={Li, Xirui and Ma, Chao and Yang, Xiaokang and Yang, Ming-Hsuan},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2024}
    }

Acknowledgments

The code is mainly based on ToMeSD, PnP, and Diffusers.

vidtome's Issues

def make_tome_block

Hi, thank you for your great work! I'd like to ask about the make_tome_block function in the case where the diffuser pipeline is not used. Is it fully functional in the current version, or is it necessary for me to complete it myself?

Missing License

Dear VidToMe authors,

thank you for sharing your VidToMe implementation.
I'd like to use it in my research, and it would be much easier for me if the repo had a license :)
Could you please consider adding one?

Thanks,
Best,
D
