
trackdiffusion's Introduction

TrackDiffusion

TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models

PyTorch implementation of TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models

Abstract

Despite remarkable achievements in video synthesis, fine-grained control over complex dynamics, such as nuanced movement among multiple interacting objects, remains a significant hurdle for dynamic world modeling. The difficulty is compounded by the need to handle object appearance and disappearance, drastic scale changes, and cross-frame instance consistency. These challenges hinder the development of video generation that can faithfully mimic real-world complexity, limiting its utility for applications requiring high-level realism and controllability, including advanced scene simulation and training of perception systems. To address this, we propose TrackDiffusion, a novel video generation framework affording fine-grained, trajectory-conditioned motion control via diffusion models, which enables precise manipulation of object trajectories and interactions while overcoming the prevalent problems of scale change and continuity disruption. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that generated frames can improve the performance of object trackers.

Method

The framework generates video frames conditioned on the provided tracklets and employs the Instance Enhancer to reinforce the temporal consistency of foreground instances. A new gated cross-attention layer is inserted to take in the instance information.

framework
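As a rough illustration of this design, the sketch below shows a minimal GLIGEN-style gated cross-attention block in PyTorch, assuming the instance/tracklet embeddings arrive as an extra token sequence. The class and argument names (InstanceGatedCrossAttention, instance_tokens) are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class InstanceGatedCrossAttention(nn.Module):
    """Gated cross-attention over instance tokens (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: the pretrained backbone's behavior is
        # preserved at the start of fine-tuning, as in GLIGEN-style layers.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, instance_tokens: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, N, C) latent tokens of one frame
        # instance_tokens: (B, M, C) embeddings of the tracklet instances
        attn_out, _ = self.attn(self.norm(hidden_states), instance_tokens, instance_tokens)
        return hidden_states + torch.tanh(self.gate) * attn_out
```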

Getting Started

Environment Setup

The code is tested with PyTorch==2.0.1 and CUDA 11.8 on A800 servers. To set up the Python environment, run:

cd ${ROOT}
pip install -r requirements.txt

Then install the third-party requirements as follows:

Install MMTracking

pip install https://download.openmmlab.com/mmcv/dist/cu117/torch2.0.0/mmcv-2.0.0-cp310-cp310-manylinux1_x86_64.whl

git clone https://github.com/open-mmlab/mmtracking.git -b dev-1.x

cd mmtracking
pip install -e .

Install Diffusers

cd third_party/diffusers
pip install -e .

Dataset

Please download the datasets from the official websites: YouTube-VIS 2021

We also provide the text caption files for the YTVIS dataset; please download them from Google Drive.

Pretrained Weights

T2V Version

Our training is based on pengxiang/GLIGEN_1_4. You can access the following link to obtain the trained weights:

weight

Stable Video Diffusion Version

Our training is based on stabilityai/stable-video-diffusion-img2vid. You can access the following links to obtain the weights for stage 1 and stage 2:

Stage1
Stage2
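If convenient, the base checkpoints can also be fetched programmatically. The snippet below is a minimal sketch using the standard huggingface_hub snapshot_download call; the local directory paths are illustrative, and the TrackDiffusion fine-tuned weights linked above still need to be downloaded separately.

```python
from huggingface_hub import snapshot_download

# T2V base (GLIGEN) weights; local_dir paths are illustrative.
snapshot_download("pengxiang/GLIGEN_1_4", local_dir="./checkpoints/GLIGEN_1_4")

# SVD base weights for the Stable Video Diffusion version.
snapshot_download(
    "stabilityai/stable-video-diffusion-img2vid",
    local_dir="./checkpoints/svd-img2vid",
)
```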

Training

1. Convert Annotations

We use CocoVID to maintain all datasets in this codebase, so you need to convert the official annotations to this style. We provide scripts for this; usage is as follows:

cd ./third_party/mmtracking
python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021

The folder structure will be as follows after you run these scripts:

├── data
│   ├── youtube_vis_2021
│   │   │── train
│   │   │   │── JPEGImages
│   │   │   │── instances.json (the official annotation files)
│   │   │   │── ......
│   │   │── valid
│   │   │   │── JPEGImages
│   │   │   │── instances.json (the official annotation files)
│   │   │   │── ......
│   │   │── test
│   │   │   │── JPEGImages
│   │   │   │── instances.json (the official annotation files)
│   │   │   │── ......
│   │   │── annotations (the converted annotation file)
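As an optional sanity check (an assumption, not an official script), you can confirm that the converted file follows the CocoVID layout, i.e. a COCO-style JSON with an extra videos list and per-image video_id fields. The annotation filename below is illustrative; point it at the file actually produced by the converter.

```python
import json

# Illustrative path; adjust to the file produced by youtubevis2coco.py.
ann_path = "./data/youtube_vis_2021/annotations/youtube_vis_2021_train.json"

with open(ann_path) as f:
    coco_vid = json.load(f)

# CocoVID keeps the usual COCO keys plus a "videos" list.
for key in ("videos", "images", "annotations", "categories"):
    print(f"{key}: {len(coco_vid[key])}")

# Every frame should reference its parent video.
assert all("video_id" in img for img in coco_vid["images"])
```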

2. For T2V Training

Launch training (with 8×A800):

If you encounter an error similar to AssertionError: MMEngine==0.10.3 is used but incompatible. Please install mmengine>=0.0.0, <0.2.0., locate that assertion in the code and comment it out.

bash ./scripts/t2v.sh

3. For I2V Training (WIP)

If you want the SVD version, please find the code in the SVD branch.

Inference

Check demo.ipynb for more details.
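demo.ipynb contains the full tracklet-conditioned video inference. As a rough, hedged illustration of the per-frame grounding primitive that the T2V version builds on, the sketch below uses the upstream diffusers StableDiffusionGLIGENPipeline with an off-the-shelf GLIGEN checkpoint; the model id, prompt, and box values are illustrative, and this is not the TrackDiffusion video pipeline itself.

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

# Upstream GLIGEN image pipeline, used here only to illustrate
# box-grounded generation for a single frame of a tracklet.
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

# One frame of a tracklet: a phrase plus a normalized [x_min, y_min, x_max, y_max] box.
frame = pipe(
    prompt="a zebra walking across a grassland",
    gligen_phrases=["a zebra"],
    gligen_boxes=[[0.1, 0.4, 0.6, 0.9]],
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
frame.save("frame_0.png")
```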

Results

  • Comparison of TrackDiffusion with other methods on generation quality:

main_results

  • Training perception models with frames generated by TrackDiffusion:
train

More results can be found in the main paper.

The GeoDiffusion Family

We aim to construct a controllable and flexible pipeline for perception data corner case generation and visual world modeling! Check our latest works:

  • GeoDiffusion: text-prompted geometric controls for 2D object detection.
  • MagicDrive: multi-view street scene generation for 3D object detection.
  • TrackDiffusion: multi-object video generation for multi-object tracking (MOT).
  • DetDiffusion: customized corner case generation.
  • Geom-Erasing: geometric controls for implicit concept removal.

Cite Us

@misc{li2024trackdiffusion,
      title={TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models}, 
      author={Pengxiang Li and Kai Chen and Zhili Liu and Ruiyuan Gao and Lanqing Hong and Guo Zhou and Hua Yao and Dit-Yan Yeung and Huchuan Lu and Xu Jia},
      year={2024},
      eprint={2312.00651},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

trackdiffusion's People

Contributors

kaichen1998, pixeli99


trackdiffusion's Issues

Code Release

Excellent work! Could you tell me when you will release the code?

Inquiry about the release of web_demo.py

I'd like to ask when web_demo.py will be released.
Also, requiring an OpenAI key to run the demo is a bit of a tall order for many people; are there any plans for a simplified version?

Confusion about the pipeline

Hello authors, I'm very interested in this work, but after reading the paper I'm still confused about the TrackDiffusion pipeline and how it goes from input to output. Fig. 2 in the paper is also unclear to me. Could you please clarify when you have time?
