adaspeech's Introduction

AdaSpeech - PyTorch Implementation

This is an unofficial PyTorch implementation of AdaSpeech: Adaptive Text to Speech for Custom Voice.

This project is based on ming024's implementation of FastSpeech 2.

Note:

  • Supports multilingual training; the default phoneme set covers Vietnamese and English and can be customized for other languages
  • Utterance-level and phoneme-level encoders to improve acoustic generalization

  • Conditional layer normalization, which is the core idea of the AdaSpeech paper
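
As a rough sketch of the idea (not the exact module in this repository; dimension names here are assumptions), conditional layer normalization replaces the fixed scale and bias of a standard LayerNorm with values predicted from a speaker embedding:

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale and bias are predicted from a speaker embedding.

    Hypothetical sketch of the AdaSpeech idea; hidden_dim/speaker_dim are
    illustrative, not the exact names used in this repo.
    """
    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Two small linear layers predict per-channel scale (gamma) and bias (beta).
        self.scale = nn.Linear(speaker_dim, hidden_dim)
        self.bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normed = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.scale(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias(speaker_emb).unsqueeze(1)
        return gamma * normed + beta
```

Because only these two small linear layers depend on the speaker, fine-tuning to a new voice can update very few parameters, which is what makes the adaptation in the paper cheap.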

Requirements:

  • Install PyTorch. Before installing PyTorch, check your CUDA version by running: nvcc --version
  • Install the remaining dependencies: pip install -r requirements.txt

Training

Preprocessing

  • First, align the corpus with the MFA (Montreal Forced Aligner) tool to obtain TextGrid files (note that you have to run each language separately, then move all speakers' TextGrid files into a single folder named "textgrid")
  • Copy the textgrid folder into the preprocessed path
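
The steps above can be sketched with a short helper script. The directory names here are assumptions for illustration; adjust them to your corpus and to the paths in your preprocess.yaml:

```python
import shutil
from pathlib import Path

def collect_textgrids(mfa_output: str, preprocessed_path: str) -> int:
    """Copy every speaker's TextGrid files from per-language MFA output
    into <preprocessed_path>/textgrid, keeping per-speaker subfolders.

    Hypothetical helper; the exact layout expected by preprocess.py may
    differ, so treat this as a sketch of the manual step described above.
    """
    textgrid_dir = Path(preprocessed_path) / "textgrid"
    textgrid_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for tg in Path(mfa_output).rglob("*.TextGrid"):
        # MFA writes one folder per speaker; keep that grouping.
        speaker_dir = textgrid_dir / tg.parent.name
        speaker_dir.mkdir(exist_ok=True)
        shutil.copy(tg, speaker_dir / tg.name)
        count += 1
    return count
```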

Run the preprocessing script:

python preprocess.py config/pretrain/preprocess.yaml

Training

Train the baseline model with:

python train.py [-h] [-p PREPROCESS_CONFIG_PATH] [-m MODEL_CONFIG_PATH] [-t TRAIN_CONFIG_PATH] [--vocoder_checkpoint VOCODER_CHECKPOINT_PATH] [--vocoder_config VOCODER_CONFIG_PATH]

Finetune

Preprocessing

First, align the corpus with the MFA tool to obtain TextGrid files (note: fine-tune on a single speaker for best quality).

Run the preprocessing script:

python preprocess.py config/finetune/preprocess.yaml

Finetune

Fine-tune on a speaker's voice with:

python finetune.py [-h] [--pretrain_dir BASE_LINE_MODEL_PATH] [-p PREPROCESS_CONFIG_PATH] [-m MODEL_CONFIG_PATH] [-t TRAIN_CONFIG_PATH] [--vocoder_checkpoint VOCODER_CHECKPOINT_PATH] [--vocoder_config VOCODER_CONFIG_PATH]

TensorBoard

Run TensorBoard with:

tensorboard [--logdir LOG_PATH]
  • TensorBoard for the pretrained model

  • TensorBoard for fine-tuning with only 5 sentences

References

Citation

@misc{chen2021adaspeech,
      title={AdaSpeech: Adaptive Text to Speech for Custom Voice}, 
      author={Mingjian Chen and Xu Tan and Bohan Li and Yanqing Liu and Tao Qin and Sheng Zhao and Tie-Yan Liu},
      year={2021},
      eprint={2103.00993},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

adaspeech's People

Contributors

tuanh123789


adaspeech's Issues

Can you provide speaker embedding samples and inference outputs?

First of all, I really appreciate this repo. It helped me a lot in learning about TTS.

But I think I ran into some problems at the inference stage.

I trained the model on LibriTTS with adjusted configs from the FastSpeech2 repo, just removing the language options.
(If you wish, I will make a pull request for it. It would be helpful for others training the model.)

While the training loss matched what you showed, I cannot get proper duration predictions at inference.

I checked the training stage, where the synth_one_sample function operates, by saving wavs, and I saw that the predicted and reconstructed speech were of fairly good quality (with a bit of error in the mel prediction, though).

So I guess there could be some issue with the mel embedding for the conditional normalization layer, or with the speaker embedding.

Maybe there is some conflict between them?

In this sense, it would be helpful for me and other people to get some inference examples, such as speaker embedding samples and synthesized outputs.

I attach some samples, configs, and commands here.
tested_data.zip

Errors during training, please help

Training: 0%| | 6/900000 [00:02<122:44:49, 2.04it/s]
Traceback (most recent call last):
File "train.py", line 234, in
main(args, configs)
File "train.py", line 108, in main
output = model(*(exe_batch[2:]))
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
return self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/model/adaspeech.py", line 75, in forward
output = self.encoder(texts, speaker_embedding, src_masks)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/transformer/Models.py", line 95, in forward
enc_output, speaker_embedding, mask=mask, slf_attn_mask=slf_attn_mask
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/transformer/Layers.py", line 27, in forward
enc_output = self.pos_ffn(enc_output, speaker_embedding)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/transformer/SubLayers.py", line 106, in forward
output = self.w_2(F.relu(self.w_1(output)))
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 259, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
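
This particular device-side assert (indexSelectSmallIndex with srcIndex < srcSelectDimSize failing) usually means an embedding lookup received an id outside its table, e.g. a phoneme or speaker id greater than or equal to the embedding's vocabulary size. A quick way to confirm is to validate the ids on CPU before training; this is a hedged debugging sketch, not code from the repo, and the names are assumptions:

```python
import torch
import torch.nn as nn

def check_indices(ids: torch.Tensor, embedding: nn.Embedding) -> bool:
    """Return True if every id is a valid row of the embedding table.

    Running this check on CPU gives a readable error location instead of
    the opaque CUDA device-side assert shown in the traceback above.
    """
    vocab_size = embedding.num_embeddings
    bad = (ids < 0) | (ids >= vocab_size)
    if bad.any():
        print(f"{bad.sum().item()} out-of-range ids; max id is "
              f"{ids.max().item()} but the table has only {vocab_size} rows")
        return False
    return True
```

Running the script with the environment variable CUDA_LAUNCH_BLOCKING=1 also makes the traceback point at the actual failing kernel launch rather than a later call.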

Has anybody got this to work on google colab?

I've been trying to test this repo on Google Colab, but a lot of dependency errors come up. After resolving those, this error appears (screenshot attached).

Edit: Resolved by downgrading the Python version to 3.8 and also downgrading the resampy version to 0.3.1.

Loss rises after 6k steps

Hello, I used the AIShell dataset and the synthesis is very poor: it is difficult to make out the whole sentence, and only a few syllables are uttered. (However, I previously used the Vivos Vietnamese training set and got a fairly good result. Although I don't understand Vietnamese very well, it is at least fluent, so I think the code should be okay.) I observed that your total_loss began to rise after 8k steps, and my model had similar problems at 6k steps. Besides, my phone_level_loss has been oscillating. Do you know the probable cause?
(screenshots attached)

The performance of the new voice (finetune) is bad

Thanks for your nice work.
The code works well in the pretrain stage. However, when I fine-tune toward an unseen voice with 10 sentences, the results are bad. The speech quality is poor, and the voice is significantly different. What went wrong?
(screenshot attached)

Could you please provide models.hifigan?

Hi, you did a great job. I am trying to run your project, but there is a problem.

In train.py file (line 23)

from models.hifigan import Generator

but there is no folder named models; could you please kindly update it?

Thanks

nan loss during training

Hi, I'm trying to train the AdaSpeech model from your project. However, when I train the pretrained model I get NaN loss during training.
Do you have any way to fix it in your code?
(screenshot attached)

Need Help with source Model training

Hi Folks,

I am at the first step of AdaSpeech training as per the paper: source model training. I used the LibriTTS dataset, but reduced it to half to expedite the experiment; it has 1140 speakers for training. There was a slight mismatch between the preprocessing parameters in the AdaSpeech paper and the default values provided in the code, and we went with the values in the code. We trained the model for 300k steps on Colab. I am providing a screenshot of my loss profile from TensorBoard.
(screenshots attached)

Please don't mind the multiple colors in the graphs; while training on Colab I had to restore training multiple times, leading to separate log files. The more fluctuating curve is the train loss, while the smoother line is the validation loss. I am also attaching outputs I took from inference.py with speaker ID 107 on an out-of-sample test sentence at 160k, 170k, and 210k steps. Since I cannot attach .wav/.mp3 files here (or maybe I don't know how to do that), I am attaching a drive link where they are hosted. The reference audio for 107 will give you an idea of how the speaker sounds.
https://drive.google.com/drive/folders/19Og2t4h2quygmrJ87xEMPoTQ7yTz9Q_e?usp=sharing

My output is a little metallic and grainy, has slight reverberation, and the pitch needs to improve. I want to understand along which dimensions it needs to improve, and what I can do better in training to achieve that.

Data Storage Requirements and Format

Thank you very much for your code, but I encountered an error while running it.

File "preprocess.py", line 15, in <module>
preprocessor.build_from_path()
File "./adaspeech/preprocessor/preprocessor.py", line 75, in build_from_path
for wav_name in os.listdir(os.path.join(self.in_dir, language, speaker)):
NotADirectoryError: [Errno 20] Not a directory: './raw_data/19/19-198-0024.wav'

Does this mean I need to place a directory containing the audio under './raw_data'? Even when I use your original code, I encounter this issue. Can you please guide me on how to structure the data for running this program?
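
Judging from the traceback, the preprocessor's os.listdir loop expects a raw_data/<language>/<speaker>/*.wav nesting, so a .wav file sitting directly under raw_data/19/ breaks it. A small checker like the following can flag misplaced entries before running preprocess.py; this is a hypothetical sketch inferred from the traceback, not code from the repo:

```python
import os

def check_raw_data_layout(in_dir: str = "raw_data") -> list:
    """Return paths that break the raw_data/<language>/<speaker>/ nesting.

    Hypothetical checker: the expected layout is inferred from the
    NotADirectoryError raised inside preprocessor.build_from_path.
    """
    problems = []
    for language in os.listdir(in_dir):
        lang_path = os.path.join(in_dir, language)
        if not os.path.isdir(lang_path):
            problems.append(lang_path)
            continue
        for speaker in os.listdir(lang_path):
            spk_path = os.path.join(lang_path, speaker)
            if not os.path.isdir(spk_path):
                # A file here (e.g. raw_data/19/19-198-0024.wav) is what
                # triggers the NotADirectoryError in the traceback above.
                problems.append(spk_path)
    return problems
```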

