VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Demo Paper

TL;DR

VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

TODO

  • gradio port
  • windows version maybe
  • support multiple gpus
  • faster-distil-whisper-large-v3

Environment setup

conda create -n voicecraft python=3.9.19
conda activate voicecraft
pip install torch==2.0.1 # this assumes your system is compatible with CUDA 11.7; otherwise check out https://pytorch.org/get-started/previous-versions/#v201
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install torchaudio==2.0.2 # this tries to install xformers 0.25; ignore the error
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install xformers==0.0.20 # this is new
pip install openai-whisper
apt-get install ffmpeg # if you don't already have ffmpeg installed
apt-get install espeak-ng # backend for the phonemizer installed above
# install MFA for forced alignment; this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# conda install pocl # the command above gives a warning about installing pocl; not sure if this is really needed
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa 

cd ./VoiceCraft
python start_gradio.py

If you encounter version issues when running the code, check environment.yml for the exact package versions.
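
Before digging into environment.yml, it can help to see which installed packages differ from the pins in the commands above. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are made up for illustration, and the pin list is copied by hand from the install commands rather than read from environment.yml.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied by hand from the pip commands above (an assumption of this
# sketch; environment.yml remains the authoritative list).
PINNED = {
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "tensorboard": "2.16.2",
    "phonemizer": "3.2.1",
    "datasets": "2.16.0",
    "torchmetrics": "0.11.1",
    "xformers": "0.0.20",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package that deviates from its pin to (pinned, found)."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        # Drop local build tags such as "+cu117" before comparing.
        if found is None or found.split("+")[0] != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, found) in find_mismatches(PINNED).items():
        print(f"{name}: pinned {pinned}, found {found or 'not installed'}")
```

Run it inside the activated voicecraft environment; no output means every listed package matches its pin.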

Training

To train a VoiceCraft model, you need to prepare the following parts:

  1. utterances and their transcripts
  2. encode the utterances into codes using e.g. Encodec
  3. convert transcripts into phoneme sequence, and a phoneme set (we named it vocab.txt)
  4. manifest (i.e. metadata)
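
To make part 3 concrete, here is a minimal sketch of how a phoneme set can be collected from phoneme sequences and used to turn each sequence into token ids. It is an illustration only: the function names, the placeholder phoneme symbols, and the `<id> <phoneme>` line layout are assumptions, not the format the repository's script is guaranteed to write.

```python
def build_vocab(phoneme_sequences):
    """Collect the distinct phonemes and give each an integer id in sorted order."""
    symbols = sorted({p for seq in phoneme_sequences for p in seq})
    return {p: i for i, p in enumerate(symbols)}


def vocab_lines(vocab):
    """Render the vocab as '<id> <phoneme>' lines (hypothetical layout)."""
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    return [f"{i} {p}" for p, i in ordered]


def encode(sequence, vocab):
    """Map a phoneme sequence to its token ids."""
    return [vocab[p] for p in sequence]


if __name__ == "__main__":
    # Placeholder symbols for illustration; real sequences come from the phonemizer.
    sequences = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]
    vocab = build_vocab(sequences)
    print("\n".join(vocab_lines(vocab)))
    print(encode(sequences[0], vocab))
```

Because the ids depend on which phonemes are in the set and how they are ordered, a model only decodes correctly with the same vocab.txt it was trained with.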

Steps 1, 2, and 3 are handled in ./data/phonemize_encodec_encode_hf.py, where

  1. Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token).
  2. Phoneme sequences and Encodec codes are also extracted by the script.

An example run:

conda activate voicecraft
export CUDA_VISIBLE_DEVICES=0
cd ./data
python phonemize_encodec_encode_hf.py \
--dataset_size xs \
--download_to path/to/store_huggingface_downloads \
--save_dir path/to/store_extracted_codes_and_phonemes \
--encodec_model_path path/to/encodec_model \
--mega_batch_size 120 \
--batch_size 32 \
--max_len 30000

where encodec_model_path is available here. This model is trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our paper. If you encounter OOM during extraction, try decreasing batch_size and/or max_len. The extracted codes, phonemes, and vocab.txt will be stored at path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}.

As for the manifest, please download train.txt and validation.txt from here, and put them under path/to/store_extracted_codes_and_phonemes/manifest/. Please also download vocab.txt from here if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).
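
A quick pre-flight check can confirm that the extracted files and the manifest are where the two paragraphs above say they should be. The sketch below is not part of the repository: the helper names are hypothetical, and the layout is taken literally from the paths quoted above, so adjust it if your training script expects the manifest elsewhere.

```python
from pathlib import Path


def expected_paths(save_dir, dataset_size):
    """Build the locations named in this README for a given save_dir and dataset size."""
    root = Path(save_dir)
    split = root / dataset_size
    return {
        "codes": split / "encodec_16khz_4codebooks",
        "phonemes": split / "phonemes",
        "vocab": split / "vocab.txt",
        # Manifest location as written above, directly under save_dir.
        "train_manifest": root / "manifest" / "train.txt",
        "validation_manifest": root / "manifest" / "validation.txt",
    }


def missing_paths(save_dir, dataset_size):
    """Return the sorted names of expected entries that do not exist on disk."""
    paths = expected_paths(save_dir, dataset_size)
    return sorted(name for name, path in paths.items() if not path.exists())


if __name__ == "__main__":
    missing = missing_paths("path/to/store_extracted_codes_and_phonemes", "xs")
    print("missing:", ", ".join(missing) if missing else "nothing")
```

Pass the same values you gave to --save_dir and --dataset_size in the extraction command.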

Now, you are good to start training!

conda activate voicecraft
cd ./z_scripts
bash e830M.sh

License

The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories that are under different licenses: ./models/codebooks_patterns.py is under the MIT license; ./models/modules, ./steps/optim.py, and data/tokenizer.py are under the Apache License, Version 2.0; the phonemizer we used is under the GNU 3.0 License. For a drop-in replacement of the phonemizer (i.e. text to IPA phoneme mapping), try g2p (MIT License) or OpenPhonemizer (BSD-3-Clause Clear), although these are not tested.

Acknowledgement

We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing encodec.

Citation

@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
