
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren

PyTorch Implementation of ProDiff (ACM Multimedia'22): a conditional diffusion probabilistic model capable of generating high fidelity speech efficiently.


We provide our implementation and pretrained models as open source in this repository.

Visit our demo page for audio samples.

News

  • April 2022: Our previous work FastDiff (IJCAI 2022) was released on GitHub.
  • September 2022: ProDiff (ACM Multimedia 2022) was released on GitHub.

Key Features

  • Extremely fast diffusion text-to-speech synthesis pipeline for potential industrial deployment.
  • Tutorial and code base for speech diffusion models.
  • More diffusion mechanisms (e.g., guided diffusion) will be supported.

Quick Start

We provide an example of how you can generate high-fidelity samples using ProDiff.

To try it on your own dataset, clone this repo to a local machine equipped with an NVIDIA GPU and CUDA/cuDNN, then follow the instructions below.

Supported Datasets and Pretrained Models

You can also use the pretrained models we provide here. Details of each checkpoint are as follows:

Model           | Dataset  | Config
ProDiff Teacher | LJSpeech | modules/ProDiff/config/prodiff_teacher.yaml
ProDiff         | LJSpeech | modules/ProDiff/config/prodiff.yaml

More supported datasets are coming soon.

Put the checkpoints in checkpoints/$Model/model_ckpt_steps_*.ckpt
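If it helps, the sketch below shows where each downloaded checkpoint is expected to live (the folder names match the pipeline described later; the step count in each filename depends on the checkpoint you download):

# create the checkpoint folders and drop each downloaded file inside
mkdir -p checkpoints/ProDiff_Teacher checkpoints/ProDiff checkpoints/FastDiff
# e.g. checkpoints/ProDiff_Teacher/model_ckpt_steps_*.ckpt
#      checkpoints/ProDiff/model_ckpt_steps_*.ckpt
#      checkpoints/FastDiff/model_ckpt_steps_*.ckpt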

Dependencies

See the requirements in requirement.txt.
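A minimal setup sketch, assuming Python 3 with pip (the virtual environment is optional and its name is only an example):

# optional: create and activate an isolated environment
python3 -m venv prodiff-env
source prodiff-env/bin/activate
# install the pinned dependencies
pip install -r requirement.txt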

Multi-GPU

By default, this implementation uses all GPUs returned by torch.cuda.device_count() in parallel. You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
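For example, to expose only the first two GPUs to the training module (the GPU indices are illustrative):

# train ProDiff on GPUs 0 and 1 only
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --reset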

Extremely-Fast Text-to-Speech with diffusion probabilistic models

Here we provide a speech synthesis pipeline using diffusion probabilistic models: ProDiff (acoustic model) + FastDiff (neural vocoder).

  1. Prepare acoustic model (ProDiff or ProDiff Teacher): Download LJSpeech checkpoint and put it in checkpoints/ProDiff or checkpoints/ProDiff_Teacher

  2. Prepare neural vocoder (FastDiff): Download LJSpeech checkpoint and put it in checkpoints/FastDiff

  3. Specify the input $text and set N, the number of reverse sampling steps in the neural vocoder, which trades off quality against speed; a filled-in example follows the commands below.

  4. Run the following command for extremely fast synthesis (2-iter ProDiff + 4-iter FastDiff):

CUDA_VISIBLE_DEVICES=$GPU python inference/ProDiff.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --hparams="N=4,text='$txt'" --reset

Generated .wav files are saved in infer_out by default.
Note: for better quality, it is recommended to fine-tune the FastDiff neural vocoder here.

  5. Enjoy the speed-quality trade-off (4-iter ProDiff Teacher + 6-iter FastDiff):
CUDA_VISIBLE_DEVICES=$GPU python inference/ProDiff_teacher.py --config modules/ProDiff/config/prodiff_teacher.yaml --exp_name ProDiff_Teacher --hparams="N=6,text='$txt'" --reset
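As a filled-in example of step 4, assuming a single GPU (index 0) and a placeholder sentence:

# fast setting: 2-iter ProDiff + 4-iter FastDiff
txt="the quick brown fox jumps over the lazy dog"
CUDA_VISIBLE_DEVICES=0 python inference/ProDiff.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --hparams="N=4,text='$txt'" --reset
# generated .wav files land in infer_out/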

Train your own model

Data Preparation and Configuration

  1. Set raw_data_dir, processed_data_dir, binary_data_dir in the config file
  2. Download the dataset to raw_data_dir. Note: the dataset structure needs to follow egs/datasets/audio/*/pre_align.py, or you can rewrite pre_align.py for your dataset.
  3. Preprocess the dataset (a filled-in LJSpeech example follows these commands):
# Preprocess step: unify the file structure.
python data_gen/tts/bin/pre_align.py --config $path/to/config
# Align step: MFA alignment.
python data_gen/tts/runs/train_mfa_align.py --config $CONFIG_NAME
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
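As a concrete illustration for LJSpeech, assuming the teacher config from the table above and a single GPU (index 0):

CONFIG=modules/ProDiff/config/prodiff_teacher.yaml
python data_gen/tts/bin/pre_align.py --config $CONFIG          # unify the file structure
python data_gen/tts/runs/train_mfa_align.py --config $CONFIG   # MFA alignment
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config $CONFIG   # binarize for fast IO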

We also provide our processed LJSpeech dataset here.

Training Teacher of ProDiff

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff_teacher.yaml  --exp_name ProDiff_Teacher --reset

Training ProDiff

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff.yaml  --exp_name ProDiff --reset

Inference using ProDiff Teacher

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff_teacher.yaml  --exp_name ProDiff_Teacher --infer

Inference using ProDiff

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff.yaml  --exp_name ProDiff --infer

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: FastDiff, DiffSinger, and NATSpeech, as described in our code.

Citations

If you find this code useful in your research, please cite our work:

@inproceedings{huang2022prodiff,
  title={ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech},
  author={Huang, Rongjie and Zhao, Zhou and Liu, Huadai and Liu, Jinglin and Cui, Chenye and Ren, Yi},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}

@inproceedings{huang2022fastdiff,
  title={FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis},
  author={Huang, Rongjie and Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong and Ren, Yi and Zhao, Zhou},
  booktitle={Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, {IJCAI-22}},
  publisher={International Joint Conferences on Artificial Intelligence Organization},
  year={2022}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his or her consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may constitute a violation of copyright law.
