adaspeech's Introduction

AdaSpeech - PyTorch Implementation

This is an unofficial PyTorch implementation of AdaSpeech: Adaptive Text to Speech for Custom Voice.

This project is based on ming024's implementation of FastSpeech 2.

Note:

  • Supports multilingual training; the default phoneme set covers Vietnamese and English and can be customized for other languages
  • Utterance-level and phoneme-level encoders to improve acoustic generalization

  • Conditional layer normalization, which is the core idea of the AdaSpeech paper
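
As a rough sketch of the idea (not the exact module in this repository; dimension names here are assumptions), conditional layer normalization replaces the fixed scale and bias of a standard LayerNorm with values predicted from a speaker embedding:

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale and bias are predicted from a speaker embedding.

    Hypothetical sketch of the AdaSpeech idea; hidden_dim/speaker_dim are
    illustrative, not the exact names used in this repo.
    """
    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Two small linear layers predict per-channel scale (gamma) and bias (beta).
        self.scale = nn.Linear(speaker_dim, hidden_dim)
        self.bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normed = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.scale(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias(speaker_emb).unsqueeze(1)
        return gamma * normed + beta
```

Because only these two small linear layers depend on the speaker, fine-tuning to a new voice can update very few parameters, which is what makes the adaptation in the paper cheap.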

Requirements:

  • Install PyTorch. Before installing PyTorch, check your CUDA version by running: nvcc --version
  • Install the remaining dependencies: pip install -r requirements.txt

Training

Preprocessing

  • First, align the corpus with the MFA (Montreal Forced Aligner) tool to obtain TextGrid files (note that you have to run each language separately, then move all speakers' TextGrid files into a single folder named "textgrid")
  • Copy the textgrid folder into the preprocessed path
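
The steps above can be sketched with a short helper script. The directory names here are assumptions for illustration; adjust them to your corpus and to the paths in your preprocess.yaml:

```python
import shutil
from pathlib import Path

def collect_textgrids(mfa_output: str, preprocessed_path: str) -> int:
    """Copy every speaker's TextGrid files from per-language MFA output
    into <preprocessed_path>/textgrid, keeping per-speaker subfolders.

    Hypothetical helper; the exact layout expected by preprocess.py may
    differ, so treat this as a sketch of the manual step described above.
    """
    textgrid_dir = Path(preprocessed_path) / "textgrid"
    textgrid_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for tg in Path(mfa_output).rglob("*.TextGrid"):
        # MFA writes one folder per speaker; keep that grouping.
        speaker_dir = textgrid_dir / tg.parent.name
        speaker_dir.mkdir(exist_ok=True)
        shutil.copy(tg, speaker_dir / tg.name)
        count += 1
    return count
```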

Run the preprocessing script:

python preprocess.py config/pretrain/preprocess.yaml

Training

Train the baseline model with:

python train.py [-h] [-p PREPROCESS_CONFIG_PATH] [-m MODEL_CONFIG_PATH] [-t TRAIN_CONFIG_PATH] [--vocoder_checkpoint VOCODER_CHECKPOINT_PATH] [--vocoder_config VOCODER_CONFIG_PATH]

Finetune

Preprocessing

First, align the corpus with the MFA tool to obtain TextGrid files (note: fine-tune on a single speaker for best quality).

Run the preprocessing script:

python preprocess.py config/finetune/preprocess.yaml

Finetune

Fine-tune on a speaker's voice with:

python finetune.py [-h] [--pretrain_dir BASE_LINE_MODEL_PATH] [-p PREPROCESS_CONFIG_PATH] [-m MODEL_CONFIG_PATH] [-t TRAIN_CONFIG_PATH] [--vocoder_checkpoint VOCODER_CHECKPOINT_PATH] [--vocoder_config VOCODER_CONFIG_PATH]

TensorBoard

Run TensorBoard with:

tensorboard [--logdir LOG_PATH]
  • TensorBoard for the pretrained model

  • TensorBoard for fine-tuning with only 5 sentences

References

Citation

@misc{chen2021adaspeech,
      title={AdaSpeech: Adaptive Text to Speech for Custom Voice}, 
      author={Mingjian Chen and Xu Tan and Bohan Li and Yanqing Liu and Tao Qin and Sheng Zhao and Tie-Yan Liu},
      year={2021},
      eprint={2103.00993},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

adaspeech's People

Contributors

tuanh123789


adaspeech's Issues

Can you provide speaker embedding samples and inference outputs?

First of all, I really appreciate this repo. It helped me a lot in learning about TTS.

But I think I ran into some problems at the inference stage.

I trained the model on LibriTTS with adjusted configs from the FastSpeech2 repo, just removing the language options.
(If you wish, I will make a pull request for it. It would be helpful for others training the model.)

While the training loss matched what you showed, I cannot get proper duration predictions at inference.

I checked the training stage, where the synth_one_sample function operates, by saving wavs, and I saw that the predicted and reconstructed speech were of fairly good quality (with a bit of error in the mel prediction, though).

So I guess there could be some issue with the mel embedding for the conditional normalization layer, or with the speaker embedding.

Maybe there is some conflict between them?

In this sense, it would be helpful for me and other people to get some inference examples, such as speaker embedding samples and synthesized outputs.

I attach some samples, configs, and commands here.
tested_data.zip

Errors during training, please help

Training: 0%| | 6/900000 [00:02<122:44:49, 2.04it/s]
Traceback (most recent call last):
File "train.py", line 234, in
main(args, configs)
File "train.py", line 108, in main
output = model(*(exe_batch[2:]))
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
return self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/model/adaspeech.py", line 75, in forward
output = self.encoder(texts, speaker_embedding, src_masks)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/transformer/Models.py", line 95, in forward
enc_output, speaker_embedding, mask=mask, slf_attn_mask=slf_attn_mask
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/transformer/Layers.py", line 27, in forward
enc_output = self.pos_ffn(enc_output, speaker_embedding)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/AdaSpeech-main/transformer/SubLayers.py", line 106, in forward
output = self.w_2(F.relu(self.w_1(output)))
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/Adaspeech/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 259, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [1,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
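
This particular device-side assert (indexSelectSmallIndex with srcIndex < srcSelectDimSize failing) usually means an embedding lookup received an id outside its table, e.g. a phoneme or speaker id greater than or equal to the embedding's vocabulary size. A quick way to confirm is to validate the ids on CPU before training; this is a hedged debugging sketch, not code from the repo, and the names are assumptions:

```python
import torch
import torch.nn as nn

def check_indices(ids: torch.Tensor, embedding: nn.Embedding) -> bool:
    """Return True if every id is a valid row of the embedding table.

    Running this check on CPU gives a readable error location instead of
    the opaque CUDA device-side assert shown in the traceback above.
    """
    vocab_size = embedding.num_embeddings
    bad = (ids < 0) | (ids >= vocab_size)
    if bad.any():
        print(f"{bad.sum().item()} out-of-range ids; max id is "
              f"{ids.max().item()} but the table has only {vocab_size} rows")
        return False
    return True
```

Running the script with the environment variable CUDA_LAUNCH_BLOCKING=1 also makes the traceback point at the actual failing kernel launch rather than a later call.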

Has anybody got this to work on google colab?

I've been trying to test this repo on Google Colab, but a lot of dependency errors come up. After resolving those, this error appears (screenshot attached).

Edit: Resolved by downgrading the Python version to 3.8 and also downgrading the resampy version to 0.3.1.

Loss rises after 6k steps

Hello, I used the AIShell dataset and the synthesis is very poor: it is difficult to make out the whole sentence, and only a few syllables are uttered. (However, I previously used the Vivos Vietnamese training set and got a fairly good result. Although I don't understand Vietnamese very well, it is at least fluent, so I think the code should be okay.) I observed that your total_loss began to rise after 8k steps, and my model had similar problems at 6k steps. Besides, my phone_level_loss has been oscillating. Do you know the probable cause?
(screenshots attached)

The performance of the new voice (finetune) is bad

Thanks for your nice work.
The code works well in the pretrain stage. However, when I fine-tune toward an unseen voice with 10 sentences, the results are bad. The speech quality is poor, and the voice is significantly different. What went wrong?
(screenshot attached)

Could you please provide models.hifigan?

Hi, you did a great job. I am trying to run your project, but there is a problem.

In train.py file (line 23)

from models.hifigan import Generator

but there is no folder named models; could you please kindly update it?

Thanks

nan loss during training

Hi, I'm trying to train the AdaSpeech model from your project. However, when I train the pretrained model I get NaN loss during training.
Do you have any way to fix it in your code?
(screenshot attached)

Need Help with source Model training

Hi Folks,

I am at the first step of AdaSpeech training as per the paper: source model training. I used the LibriTTS dataset, but reduced it to half to expedite the experiment; it has 1140 speakers for training. There was a slight mismatch between the preprocessing parameters in the AdaSpeech paper and the default values provided in the code, and we went with the values in the code. We trained the model for 300k steps on Colab. I am providing a screenshot of my loss profile from TensorBoard.
(screenshots attached)

Please don't mind the multiple colors in the graphs; while training on Colab I had to restore training multiple times, leading to separate log files. The more fluctuating curve is the train loss, while the smoother line is the validation loss. I am also attaching outputs I took from inference.py with speaker ID 107 on an out-of-sample test sentence at 160k, 170k, and 210k steps. Since I cannot attach .wav/.mp3 files here (or maybe I don't know how to do that), I am attaching a drive link where they are hosted. The reference audio for 107 will give you an idea of how the speaker sounds.
https://drive.google.com/drive/folders/19Og2t4h2quygmrJ87xEMPoTQ7yTz9Q_e?usp=sharing

My output is a little metallic and grainy, has slight reverberation, and the pitch needs to improve. I want to understand along which dimensions it needs to improve, and what I can do better in training to achieve that.

Data Storage Requirements and Format

Thank you very much for your code, but I encountered an error while running it.

File "preprocess.py", line 15, in <module>
preprocessor.build_from_path()
File "./adaspeech/preprocessor/preprocessor.py", line 75, in build_from_path
for wav_name in os.listdir(os.path.join(self.in_dir, language, speaker)):
NotADirectoryError: [Errno 20] Not a directory: './raw_data/19/19-198-0024.wav'

Does this mean I need to place a directory containing the audio under './raw_data'? Even when I use your original code, I encounter this issue. Can you please guide me on how to structure the data for running this program?
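
Judging from the traceback, the preprocessor's os.listdir loop expects a raw_data/<language>/<speaker>/*.wav nesting, so a .wav file sitting directly under raw_data/19/ breaks it. A small checker like the following can flag misplaced entries before running preprocess.py; this is a hypothetical sketch inferred from the traceback, not code from the repo:

```python
import os

def check_raw_data_layout(in_dir: str = "raw_data") -> list:
    """Return paths that break the raw_data/<language>/<speaker>/ nesting.

    Hypothetical checker: the expected layout is inferred from the
    NotADirectoryError raised inside preprocessor.build_from_path.
    """
    problems = []
    for language in os.listdir(in_dir):
        lang_path = os.path.join(in_dir, language)
        if not os.path.isdir(lang_path):
            problems.append(lang_path)
            continue
        for speaker in os.listdir(lang_path):
            spk_path = os.path.join(lang_path, speaker)
            if not os.path.isdir(spk_path):
                # A file here (e.g. raw_data/19/19-198-0024.wav) is what
                # triggers the NotADirectoryError in the traceback above.
                problems.append(spk_path)
    return problems
```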

