
MoDiTalker

Official PyTorch implementation of "MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation".

Seyeon Kim∗1, Siyoon Jin∗1, Jihye Park∗1, Kihong Kim2, Jiyoung Kim1, Jisu Nam1 and Seungryong Kim†1.
∗Equal contribution, †Corresponding author
1Korea University, 2VIVE STUDIO
paper | project page

1. Environment setup

conda create -n MoDiTalker python=3.8 -y
conda activate MoDiTalker
python -m pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install natsort tqdm gdown omegaconf einops lpips pyspng tensorboard imageio av moviepy numba p_tqdm soundfile face_alignment
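
As a quick sanity check (a minimal sketch, assuming a CUDA 11.6-capable GPU and driver), you can verify that PyTorch sees the GPU:

import torch, torchvision, torchaudio

# Expect torch 1.12.1+cu116 and a working CUDA runtime.
print(torch.__version__, torchvision.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))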

2. Get ready to train models

2.1. Dataset

We used two datasets, one for each training stage. Please refer to and follow the dataset preparation instructions here

2.2. Download auxiliary models

Get BFM_model_front.mat, similarity_Lm3D_all.mat, and Exp_Pca.bin, and place them in the MoDiTalker/data/data_utils/deep_3drecon/BFM directory. Obtain BaselFaceModel.tgz, extract the file named 01_MorphableModel.mat, and place it in the same directory.
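
Before preprocessing, you can confirm the auxiliary assets are in place (a minimal sketch; the file list mirrors the instructions above):

import os

# BFM assets required by the 3D face reconstruction module.
bfm_dir = "MoDiTalker/data/data_utils/deep_3drecon/BFM"
required = [
    "BFM_model_front.mat",
    "similarity_Lm3D_all.mat",
    "Exp_Pca.bin",
    "01_MorphableModel.mat",  # extracted from BaselFaceModel.tgz
]
for name in required:
    path = os.path.join(bfm_dir, name)
    print(("OK      " if os.path.isfile(path) else "MISSING ") + path)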

(Optional)

We had to modify a single line in the accelerate package due to version conflicts. If conflicts occur while loading data, change the following line in accelerate/dataloader.py:

from

batch_size = dataloader.batch_size if dataloader.batch_size is not None else dataloader.batch_sampler.batch_size

to

batch_size = dataloader.batch_size if dataloader.batch_size is not None else len(dataloader.batch_sampler[0])
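
If you would rather not hard-code either expression, a version-agnostic variant (a sketch, not part of the official patch) can fall back between the two cases:

# Sketch of a backwards-compatible replacement for the line above:
# use the sampler's batch_size attribute when it exists, otherwise
# measure the first batch of indices it yields.
if dataloader.batch_size is not None:
    batch_size = dataloader.batch_size
elif hasattr(dataloader.batch_sampler, "batch_size"):
    batch_size = dataloader.batch_sampler.batch_size
else:
    batch_size = len(next(iter(dataloader.batch_sampler)))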

3. Training

3.1. AToM

cd AToM
bash scripts/train.sh

The checkpoints of AToM will be saved in ./runs

3.2. MToV


Autoencoder

First, execute the following script:

cd MToV
bash scripts/train/first_stg.sh 

The script will automatically create a folder under ./log_dir to save logs and checkpoints.

Second, execute the following script:

cd MToV
bash scripts/train/first_stg_ldmk.sh 

You may change the model configuration by modifying the files in configs/autoencoder. Note that early stopping is needed before further training the model with the GAN loss (typically 8k-14k iterations with a batch size of 8).
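
If you prefer overriding hyper-parameters without editing the YAML files, omegaconf (installed above) supports dot-list overrides. This is only a sketch; the config file name and keys below are hypothetical, so check configs/autoencoder for the actual ones:

from omegaconf import OmegaConf

# Load an autoencoder config and merge command-line style overrides.
# "base.yaml", "model.embed_dim" and "data.batch_size" are placeholder
# names; inspect configs/autoencoder for the real file and keys.
cfg = OmegaConf.load("configs/autoencoder/base.yaml")
overrides = OmegaConf.from_dotlist(["model.embed_dim=4", "data.batch_size=8"])
cfg = OmegaConf.merge(cfg, overrides)
print(OmegaConf.to_yaml(cfg))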

Diffusion model

cd MToV
bash scripts/train/second_stg.sh

4. Getting the Weights

We provide the corresponding checkpoints below. Download them and place them in the ./checkpoints/ directory.

Full checkpoints will be released later, ETA July 2024.

5. Inference

5.1. Generating Motions from Audio

Before generating motions from audio, the audio must be preprocessed, since our pipeline consumes audio as HuBERT features. To produce the HuBERT features for your audio, run the script below:

cd data
python data_utils/preprocess/process_audio.py \
    --audio {path to audio} \
    --ref_dir {path to directory of reference images}

The processed HuBERT features (.npy) will be saved in data/inference/hubert/{sampling rate}.
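
To confirm the preprocessing succeeded, you can load the saved features (a minimal sketch; the exact subdirectory and file names depend on your input audio):

import glob
import numpy as np

# Inspect the HuBERT features produced by process_audio.py.
for path in sorted(glob.glob("data/inference/hubert/*/*.npy")):
    feats = np.load(path)
    print(path, feats.shape, feats.dtype)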

Note that you need to specify the paths to (1) the reference images, (2) the processed HuBERT features, and (3) the checkpoint in the following bash script.

cd AToM
bash scripts/inference.sh

The results of AToM will be saved in AToM/results/frontalized_npy; this path should match the ldmk_path used in the following step.
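
A quick way to inspect the generated motions before alignment (a sketch; the array layout is whatever AToM writes out):

import glob
import numpy as np

# Inspect the landmark sequences generated by AToM before they are
# passed to the alignment step via ldmk_path.
for path in sorted(glob.glob("AToM/results/frontalized_npy/*.npy")):
    ldmk = np.load(path)
    print(path, ldmk.shape)  # e.g. (num_frames, num_landmarks, coords)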

5.2. Align Motions

Note that you need to specify the paths to (1) the reference images and (2) the generated landmarks.

cd data/data_utils
python motion_align/align_face_recon.py \
    --ldmk_path {path to directory of generated landmarks} \
    --driv_video_path {path to directory of reference images}

The final landmarks will be saved in AToM/results/aligned_npy.

5.3. Generating Video from Aligned Motions

cd MToV
bash scripts/inference/sample.sh

The final videos will be saved in MToV/results.
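
You can verify the rendered outputs with moviepy (installed above). This is a sketch that assumes the results are written as .mp4 files:

import glob
from moviepy.editor import VideoFileClip

# Print duration, fps and resolution of each generated video.
for path in sorted(glob.glob("MToV/results/*.mp4")):
    clip = VideoFileClip(path)
    print(path, round(clip.duration, 2), "s,", clip.fps, "fps,", clip.size)
    clip.close()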

Citation

@misc{kim2024moditalker,
      title={MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation}, 
      author={Seyeon Kim and Siyoon Jin and Jihye Park and Kihong Kim and Jiyoung Kim and Jisu Nam and Seungryong Kim},
      year={2024},
      eprint={2403.19144},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Reference

This code is mainly built upon EDGE and PVDM.
We also used code from the following repository: GeneFace.
