kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) implementation with PyTorch

Home Page: https://kan-bayashi.github.io/ParallelWaveGAN/

License: MIT License

Python 7.98% Makefile 0.02% Shell 7.52% Perl 1.58% Jupyter Notebook 82.90%
speech-synthesis neural-vocoder text-to-speech pytorch wavenet parallel-wavenet realtime tts melgan vocoder hifigan style-melgan

parallelwavegan's Introduction

Parallel WaveGAN implementation with Pytorch


This repository provides UNOFFICIAL PyTorch implementations of the following models:

  • Parallel WaveGAN
  • MelGAN
  • Multi-band MelGAN
  • HiFi-GAN
  • StyleMelGAN

You can combine these state-of-the-art non-autoregressive models to build your own great vocoder!

Please check the samples on our demo homepage.

Source of the figure: https://arxiv.org/pdf/1910.11480.pdf

The goal of this repository is to provide a real-time neural vocoder that is compatible with ESPnet-TTS.
This repository can also be combined with the NVIDIA/tacotron2-based implementation (see this comment).

You can try the real-time end-to-end text-to-speech and singing voice synthesis demonstration in Google Colab!

  • Real-time demonstration with ESPnet2 Open In Colab
  • Real-time demonstration with ESPnet1 Open In Colab
  • Real-time demonstration with Muskits Open In Colab

What's new

Requirements

This repository is tested on Ubuntu 20.04 with a Titan V GPU.

  • Python 3.8+
  • CUDA 11.0+
  • cuDNN 8+
  • NCCL 2+ (for distributed multi-GPU training)
  • libsndfile (you can install via sudo apt install libsndfile-dev in ubuntu)
  • jq (you can install via sudo apt install jq in ubuntu)
  • sox (you can install via sudo apt install sox in ubuntu)

Different CUDA versions should also work, but they have not been explicitly tested.
All of the code has been tested with PyTorch 1.8.1, 1.9, 1.10.2, 1.11.0, 1.12.1, 1.13.1, 2.0.1, and 2.1.0.

Setup

You can choose between two installation methods.

A. Use pip

$ git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
$ cd ParallelWaveGAN
$ pip install -e .
# If you want to use distributed training, please install
# apex manually by following https://github.com/NVIDIA/apex
$ ...

Note that your CUDA version must exactly match the version used to build the PyTorch binary in order to install apex.
To install PyTorch compiled with a different CUDA version, see tools/Makefile.

B. Make virtualenv

$ git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
$ cd ParallelWaveGAN/tools
$ make
# If you want to use distributed training, please run following
# command to install apex.
$ make apex

Note that we specify the CUDA version used to compile the PyTorch wheel.
If you want to use a different CUDA version, please check tools/Makefile and change the PyTorch wheel to be installed.

Recipe

This repository provides Kaldi-style recipes, in the same manner as ESPnet.
Currently, the following recipes are supported.

  • LJSpeech: English female speaker
  • JSUT: Japanese female speaker
  • JSSS: Japanese female speaker
  • CSMSC: Mandarin female speaker
  • CMU Arctic: English speakers
  • JNAS: Japanese multi-speaker
  • VCTK: English multi-speaker
  • LibriTTS: English multi-speaker
  • LibriTTS-R: English multi-speaker enhanced by speech restoration.
  • YesNo: English speaker (For debugging)
  • KSS: Single Korean female speaker
  • Oniku_kurumi_utagoe_db: Single Japanese female singer (singing voice)
  • Kiritan: Single Japanese female singer (singing voice)
  • Ofuton_p_utagoe_db: Single Japanese female singer (singing voice)
  • Opencpop: Single Mandarin female singer (singing voice)
  • CSD: Single Korean/English female singer (singing voice)
  • KiSing: Single Mandarin female singer (singing voice)

To run a recipe, please follow the instructions below.

# Let us move to the recipe directory
$ cd egs/ljspeech/voc1

# Run the recipe from scratch
$ ./run.sh

# You can change config via command line
$ ./run.sh --conf <your_customized_yaml_config>

# You can select the stage to start and stop
$ ./run.sh --stage 2 --stop_stage 2

# If you want to specify the gpu
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

# If you want to resume training from 10000 steps checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl

See more info about the recipes in this README.

Speed

The decoding speed is RTF = 0.016 on a TITAN V, which is much faster than real-time.

[decode]: 100%|██████████| 250/250 [00:30<00:00,  8.31it/s, RTF=0.0156]
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).

Even on a CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00 GHz, 16 threads), it can generate the waveform faster than real-time (RTF < 1.0).

[decode]: 100%|██████████| 250/250 [22:16<00:00,  5.35s/it, RTF=0.841]
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).

If you use MelGAN's generator, decoding is even faster.

# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [04:00<00:00,  1.04it/s, RTF=0.0882]
2020-02-08 10:45:14,111 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.137).

# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:06<00:00, 36.38it/s, RTF=0.00189]
2020-02-08 05:44:42,231 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.002).

If you use Multi-band MelGAN's generator, decoding is faster still.

# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [01:47<00:00,  2.95it/s, RTF=0.048]
2020-05-22 15:37:19,771 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.059).

# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:05<00:00, 43.67it/s, RTF=0.000928]
2020-05-22 15:35:13,302 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.001).

If you want to accelerate inference further, it is worthwhile to try converting the model from PyTorch to TensorFlow.
An example of the conversion is available in the notebook (provided by @dathudeptrai).
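For reference, RTF (real-time factor) is the ratio of generation time to the duration of the generated audio; values below 1.0 mean faster than real-time. A minimal sketch of how it could be measured is shown below; the helper is illustrative, not part of this repository, and it assumes the generator exposes an inference method that maps a mel-spectrogram to a waveform, as the models here do.

import time

import torch


def measure_rtf(vocoder, mel, sampling_rate, hop_size):
    # Time a single synthesis call.
    start = time.time()
    with torch.no_grad():
        _ = vocoder.inference(mel)
    elapsed = time.time() - start
    # Duration of the generated audio: one hop per mel frame.
    audio_seconds = len(mel) * hop_size / sampling_rate
    return elapsed / audio_seconds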

Results

The results are summarized in the table below.
You can listen to the samples and download pretrained models via the links to our Google Drive.

Model Conf Lang Fs [Hz] Mel range [Hz] FFT / Hop / Win [pt] # iters
ljspeech_parallel_wavegan.v1 link EN 22.05k 80-7600 1024 / 256 / None 400k
ljspeech_parallel_wavegan.v1.long link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_parallel_wavegan.v1.no_limit link EN 22.05k None 1024 / 256 / None 400k
ljspeech_parallel_wavegan.v3 link EN 22.05k 80-7600 1024 / 256 / None 3M
ljspeech_melgan.v1 link EN 22.05k 80-7600 1024 / 256 / None 400k
ljspeech_melgan.v1.long link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_melgan_large.v1 link EN 22.05k 80-7600 1024 / 256 / None 400k
ljspeech_melgan_large.v1.long link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_melgan.v3 link EN 22.05k 80-7600 1024 / 256 / None 2M
ljspeech_melgan.v3.long link EN 22.05k 80-7600 1024 / 256 / None 4M
ljspeech_full_band_melgan.v1 link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_full_band_melgan.v2 link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_multi_band_melgan.v1 link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_multi_band_melgan.v2 link EN 22.05k 80-7600 1024 / 256 / None 1M
ljspeech_hifigan.v1 link EN 22.05k 80-7600 1024 / 256 / None 2.5M
ljspeech_style_melgan.v1 link EN 22.05k 80-7600 1024 / 256 / None 1.5M
jsut_parallel_wavegan.v1 link JP 24k 80-7600 2048 / 300 / 1200 400k
jsut_multi_band_melgan.v2 link JP 24k 80-7600 2048 / 300 / 1200 1M
jsut_hifigan.v1 link JP 24k 80-7600 2048 / 300 / 1200 2.5M
jsut_style_melgan.v1 link JP 24k 80-7600 2048 / 300 / 1200 1.5M
csmsc_parallel_wavegan.v1 link ZH 24k 80-7600 2048 / 300 / 1200 400k
csmsc_multi_band_melgan.v2 link ZH 24k 80-7600 2048 / 300 / 1200 1M
csmsc_hifigan.v1 link ZH 24k 80-7600 2048 / 300 / 1200 2.5M
csmsc_style_melgan.v1 link ZH 24k 80-7600 2048 / 300 / 1200 1.5M
arctic_slt_parallel_wavegan.v1 link EN 16k 80-7600 1024 / 256 / None 400k
jnas_parallel_wavegan.v1 link JP 16k 80-7600 1024 / 256 / None 400k
vctk_parallel_wavegan.v1 link EN 24k 80-7600 2048 / 300 / 1200 400k
vctk_parallel_wavegan.v1.long link EN 24k 80-7600 2048 / 300 / 1200 1M
vctk_multi_band_melgan.v2 link EN 24k 80-7600 2048 / 300 / 1200 1M
vctk_hifigan.v1 link EN 24k 80-7600 2048 / 300 / 1200 2.5M
vctk_style_melgan.v1 link EN 24k 80-7600 2048 / 300 / 1200 1.5M
libritts_parallel_wavegan.v1 link EN 24k 80-7600 2048 / 300 / 1200 400k
libritts_parallel_wavegan.v1.long link EN 24k 80-7600 2048 / 300 / 1200 1M
libritts_multi_band_melgan.v2 link EN 24k 80-7600 2048 / 300 / 1200 1M
libritts_hifigan.v1 link EN 24k 80-7600 2048 / 300 / 1200 2.5M
libritts_style_melgan.v1 link EN 24k 80-7600 2048 / 300 / 1200 1.5M
kss_parallel_wavegan.v1 link KO 24k 80-7600 2048 / 300 / 1200 400k
hui_acg_hokuspokus_parallel_wavegan.v1 link DE 24k 80-7600 2048 / 300 / 1200 400k
ruslan_parallel_wavegan.v1 link RU 24k 80-7600 2048 / 300 / 1200 400k
oniku_hifigan.v1 link JP 24k 80-7600 2048 / 300 / 1200 250k
kiritan_hifigan.v1 link JP 24k 80-7600 2048 / 300 / 1200 300k
ofuton_hifigan.v1 link JP 24k 80-7600 2048 / 300 / 1200 300k
opencpop_hifigan.v1 link ZH 24k 80-7600 2048 / 300 / 1200 250k
csd_english_hifigan.v1 link EN 24k 80-7600 2048 / 300 / 1200 300k
csd_korean_hifigan.v1 link EN 24k 80-7600 2048 / 300 / 1200 250k
kising_hifigan.v1 link ZH 24k 80-7600 2048 / 300 / 1200 300k
m4singer_hifigan.v1 link ZH 24k 80-7600 2048 / 300 / 1200 1M

Please access our Google Drive to check more results.

Please check the license of the database (e.g., whether it permits commercial usage) before using a pre-trained model.
The authors are not responsible for any loss caused by the use of the models or for legal disputes regarding the use of the datasets.

How-to-use pretrained models

Analysis-synthesis

Here is the minimal procedure to perform analysis-synthesis using a pretrained model.

# Please make sure you installed `parallel_wavegan`
# If not, please install via pip
$ pip install parallel_wavegan

# You can download the pretrained model from terminal
$ python << EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrained_model_tag>", "pretrained_model")
EOF

# You can list all of the available pretrained models as follows:
$ python << EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(PRETRAINED_MODEL_LIST.keys())
EOF

# Now you can find downloaded pretrained model in `pretrained_model/<pretrain_model_tag>/`
$ ls pretrain_model/<pretrain_model_tag>
  checkpoint-400000steps.pkl    config.yml    stats.h5

# These files can also be downloaded manually from the above results

# Please put an audio file in `sample` directory to perform analysis-synthesis
$ ls sample/
  sample.wav

# Then perform feature extraction -> feature normalization -> synthesis
$ parallel-wavegan-preprocess \
    --config pretrain_model/<pretrain_model_tag>/config.yml \
    --rootdir sample \
    --dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-normalize \
    --config pretrain_model/<pretrain_model_tag>/config.yml \
    --rootdir dump/sample/raw \
    --dumpdir dump/sample/norm \
    --stats pretrain_model/<pretrain_model_tag>/stats.h5
2019-11-13 13:44:29,574 (normalize:87) INFO: the number of files = 1.
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 513.13it/s]
$ parallel-wavegan-decode \
    --checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
    --dumpdir dump/sample/norm \
    --outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).

# You can skip normalization step (on-the-fly normalization, feature extraction -> synthesis)
$ parallel-wavegan-preprocess \
    --config pretrain_model/<pretrain_model_tag>/config.yml \
    --rootdir sample \
    --dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-decode \
    --checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
    --dumpdir dump/sample/raw \
    --normalize-before \
    --outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).

# you can find the generated speech in `sample` directory
$ ls sample
  sample.wav    sample_gen.wav

Decoding with ESPnet-TTS model's features

Here, I show the procedure to generate waveforms with features generated by ESPnet-TTS models.

# Make sure you already finished running the recipe of ESPnet-TTS.
# You must use the same feature settings for both Text2Mel and Mel2Wav models.
# Let us move to the "ESPnet" recipe directory
$ cd /path/to/espnet/egs/<recipe_name>/tts1
$ pwd
/path/to/espnet/egs/<recipe_name>/tts1

# If you use ESPnet2, move to `egs2/`
$ cd /path/to/espnet/egs2/<recipe_name>/tts1
$ pwd
/path/to/espnet/egs2/<recipe_name>/tts1

# Please install this repository in ESPnet conda (or virtualenv) environment
$ . ./path.sh && pip install -U parallel_wavegan

# You can download the pretrained model from terminal
$ python << EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrained_model_tag>", "pretrained_model")
EOF

# You can list all of the available pretrained models as follows:
$ python << EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(PRETRAINED_MODEL_LIST.keys())
EOF

# You can find downloaded pretrained model in `pretrained_model/<pretrain_model_tag>/`
$ ls pretrain_model/<pretrain_model_tag>
  checkpoint-400000steps.pkl    config.yml    stats.h5

# These files can also be downloaded manually from the above results

Case 1: If you use the same dataset for both Text2Mel and Mel2Wav

# In this case, you can directly use generated features for decoding.
# Please specify `feats.scp` path for `--feats-scp`, which is located in
# exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp.
# Note that you should not use outputs_*_decode_denorm/<set_name>/feats.scp since
# it contains de-normalized features (the input for PWG must be normalized features).
$ parallel-wavegan-decode \
    --checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp \
    --outdir <path_to_outdir>

# In the case of ESPnet2, the generated feature can be found in
# exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp.
$ parallel-wavegan-decode \
    --checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp \
    --outdir <path_to_outdir>

# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
  utt_id_1_gen.wav    utt_id_2_gen.wav  ...    utt_id_N_gen.wav

Case 2: If you use different datasets for Text2Mel and Mel2Wav models

# In this case, you must additionally provide the `--normalize-before` option
# and use the `feats.scp` of the de-normalized generated features.

# ESPnet1 case
$ parallel-wavegan-decode \
    --checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/outputs_*_decode_denorm/<set_name>/feats.scp \
    --outdir <path_to_outdir> \
    --normalize-before

# ESPnet2 case
$ parallel-wavegan-decode \
    --checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/decode_*/<set_name>/denorm/feats.scp \
    --outdir <path_to_outdir> \
    --normalize-before

# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
  utt_id_1_gen.wav    utt_id_2_gen.wav  ...    utt_id_N_gen.wav

If you want to combine these models in Python, you can try the real-time demonstrations in Google Colab (see also the sketch after the links below)!

  • Real-time demonstration with ESPnet2 Open In Colab
  • Real-time demonstration with ESPnet1 Open In Colab
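For a rough idea of how such a combination looks in plain Python, here is a minimal sketch. It assumes that the ESPnet2 Text2Speech interface and this package's load_model/download_pretrained_model helpers behave as in the Colab demos, that the returned dictionary exposes the generated mel features under "feat_gen", and that the Text2Mel and Mel2Wav models share the same feature settings; the model tags are placeholders.

import torch
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import download_pretrained_model, load_model

# Placeholders: replace with the model tags you actually use.
text2speech = Text2Speech.from_pretrained("<espnet2_tts_model_tag>")
vocoder = load_model(download_pretrained_model("<pretrained_model_tag>")).eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    # Assumption: the output dict exposes the generated mel features as "feat_gen".
    output = text2speech("Hello world.")
    # Feature normalization must match the vocoder's training statistics
    # (see the --normalize-before discussion above).
    wav = vocoder.inference(output["feat_gen"]).view(-1)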

Decoding with dumped npy files

Sometimes we want to decode with dumped npy files, which are mel-spectrograms generated by TTS models. Please make sure you use the same feature extraction settings as the pretrained vocoder (fs, fft_size, hop_size, win_length, fmin, and fmax).
Only a difference in log_base can be compensated for with some post-processing (we use log base 10 instead of the natural log by default). See the details in the comment.
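If your TTS model outputs natural-log mel-spectrograms, the change of base is a simple element-wise rescaling; a minimal sketch before the command-line example below (the file names are just examples):

import numpy as np

# Convert a natural-log mel-spectrogram to log base 10 (the default here):
# log10(x) = ln(x) / ln(10).
mel_ln = np.load("mel_natural_log.npy")  # (#frames, #mels), natural-log scale
mel_log10 = mel_ln / np.log(10.0)
np.save("mel_log10.npy", mel_log10)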

# Generate dummy npy file of mel-spectrogram
$ ipython
[ins] In [1]: import numpy as np
[ins] In [2]: x = np.random.randn(512, 80)  # (#frames, #mels)
[ins] In [3]: np.save("dummy_1.npy", x)
[ins] In [4]: y = np.random.randn(256, 80)  # (#frames, #mels)
[ins] In [5]: np.save("dummy_2.npy", y)
[ins] In [6]: exit

# Make scp file (key-path format)
$ find -name "*.npy" | awk '{print "dummy_" NR " " $1}' > feats.scp

# Check (<utt_id> <path>)
$ cat feats.scp
dummy_1 ./dummy_1.npy
dummy_2 ./dummy_2.npy

# Decode without feature normalization
# This case assumes that the input mel-spectrogram is normalized with the same statistics of the pretrained model.
$ parallel-wavegan-decode \
    --checkpoint /path/to/checkpoint-400000steps.pkl \
    --feats-scp ./feats.scp \
    --outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).

# Decode with feature normalization
# This case assumes that the input mel-spectrogram is not normalized.
$ parallel-wavegan-decode \
    --checkpoint /path/to/checkpoint-400000steps.pkl \
    --feats-scp ./feats.scp \
    --normalize-before \
    --outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).
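The same decoding can also be done directly from Python. A minimal sketch, assuming load_model and the generator's inference method behave as in the Colab demos (load_model reads the config.yml placed next to the checkpoint by default; the paths and sampling rate are placeholders):

import numpy as np
import soundfile as sf
import torch
from parallel_wavegan.utils import load_model

vocoder = load_model("/path/to/checkpoint-400000steps.pkl")
vocoder.remove_weight_norm()
vocoder = vocoder.eval()

# Assumes the mel-spectrogram is already normalized with the model's statistics.
mel = np.load("dummy_1.npy")  # (#frames, #mels)
with torch.no_grad():
    wav = vocoder.inference(torch.from_numpy(mel).float()).view(-1)
sf.write("dummy_1_gen.wav", wav.cpu().numpy(), 22050)  # use your model's sampling rate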

Notes

  • The terms of use of the pretrained models follow those of each corpus used for training. Please check them carefully by yourself.
  • Some code is derived from ESPnet or Kaldi, which are based on the Apache-2.0 license.

References

Acknowledgement

The author would like to thank Ryuichi Yamamoto (@r9y9) for his great repository, paper, and valuable discussions.

Author

Tomoki Hayashi (@kan-bayashi)
E-mail: hayashi.tomoki<at>g.sp.m.is.nagoya-u.ac.jp

parallelwavegan's People

Contributors

a-quarter-mile, c-bata, chomeyama, dathudeptrai, drwelles, frankxu2004, ftshijt, g-thor, jayaneetha, kan-bayashi, peterguoruc, r9y9, rayhane-mamah, roholazandie, shigekikarita, windtoker


parallelwavegan's Issues

Is Text2Mel based on deep convolutional neural networks (CNN) compatible with ParallelWaveGAN?

Dear @kan-bayashi,

Thank you for your great project!

Is there a chance to use Text2Mel based on deep convolutional neural networks (CNN) with your ParallelWaveGAN?
It is a standard PyTorch .pth file which was trained on a platform like https://github.com/tugstugi/pytorch-dc-tts.
My target language is too specific to reach good results with Tacotron2. Therefore I am mainly using pytorch-dc-tts for my TTS development.

Thank you in advance

How to calculate batch_max_steps

Thanks for your effort.
I need to know how I can calculate batch_max_steps for my dataset if the max audio length is 6 seconds and the sampling rate is 22050 Hz.
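For context, batch_max_steps is the length in waveform samples of each randomly cropped training segment; it must be a multiple of hop_size and should not exceed your shortest clip, so the maximum clip length mostly acts as an upper bound. A small arithmetic sketch under those assumptions (hop_size = 256 is just the LJSpeech default):

sampling_rate = 22050
hop_size = 256  # assumed; use the hop_size from your config

segment_seconds = 1.0  # desired training segment length in seconds
batch_max_steps = int(segment_seconds * sampling_rate) // hop_size * hop_size
print(batch_max_steps)  # 22016 samples, i.e. 86 mel frames per segment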

Multi-band MelGAN

Hi,

just found https://arxiv.org/pdf/2005.05106.pdf

It seems to provide significantly better quality than regular MelGAN, and is also stunningly fast (0.03 RTF on CPU). The authors will be publishing the code shortly.

Any chance we will see an implementation in this great repo? =)

With an NVIDIA 2070 (8 GB): RuntimeError: CUDA out of memory

I use LJSpeech with the default config.
I see this in the config:

This configuration requires 12 GB GPU memory and takes ~3 days on TITAN V.

and I want to know how I can change the config file for my 8 GB GPU.

[train]: 0%| | 0/400000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/usr/local/bin/parallel-wavegan-train", line 11, in <module>
load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-train')()
File "/home/ai/ParallelWaveGAN/parallel_wavegan/bin/train.py", line 671, in main
trainer.run()
File "/home/ai/ParallelWaveGAN/parallel_wavegan/bin/train.py", line 87, in run
self._train_epoch()
File "/home/ai/ParallelWaveGAN/parallel_wavegan/bin/train.py", line 212, in _train_epoch
self._train_step(batch)
File "/home/ai/ParallelWaveGAN/parallel_wavegan/bin/train.py", line 159, in _train_step
y_ = self.model["generator"](...)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/ai/ParallelWaveGAN/parallel_wavegan/models/parallel_wavegan.py", line 151, in forward
x, h = f(x, c)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/ai/ParallelWaveGAN/parallel_wavegan/layers/residual_block.py", line 119, in forward
xa, xb = xa + ca, xb + cb

RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 7.79 GiB total capacity; 6.65 GiB already allocated; 37.31 MiB free; 6.67 GiB reserved in total by PyTorch)

Resume training info

Hi,
great repo, very intuitive to use.

Just wanted to suggest to maybe include some information regarding how to resume training, which as far as I can search, it is not documented yet. Given how simple and well managed it is, it's a pity one has to look into the code:)

For reference:
just paste the path to the desired checkpoint in the variable resume in run.sh

Also, I think it would be great if both resume and the dataset name were command-line variables for run.sh; it would make it even easier to train on your own data.

How do I convert the TensorFlow model to TensorFlow Lite?

Hello,

Thank you for providing the conversion notebook for TensorFlow. I am new to ML and was wondering how I would go about converting the TensorFlow .pb file to a TensorFlow Lite .tflite file?

Regards,
Manny

I tried the following:

import tensorflow
converter = tensorflow.lite.TFLiteConverter.from_saved_model("./checkpoint/tensorflow_generator/")
test = converter.convert()

Getting this error:

None is only supported in the 1st dimension. Tensor 'serving_default_input_1' has invalid shape '[None, None, 80]'.

> convert_melgan_from_pytorch_to_tensorflow (tf-lite conversion error)

what error ?

When I run the code, "audio = TFMelGANGenerator(**config["generator_params"])(inputs)" line raises an InaccessibleTensorError.

InaccessibleTensorError: The tensor 'Tensor("conv2d_346/dilation_rate:0", shape=(2,), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=call, id=140430667155440); accessed from: FuncGraph(name=call, id=140430364789896).

Do you not meet any error on your side?

Originally posted by @John-K92 in #61 (comment)

Any news on this colab??

How is the runtime on CPU?

Hi! Thx for the repo. I was curious about the performance on CPU. AFAIK, it is 8x real-time on GPU but could you also share some values about CPU performance?

Use pretrained models with Mozilla TTS

Hi all,

I gave the LibriTTS long vocoder a try and I was impressed at how good it can resample. I tried to use it with Mozilla's TTS, but didn't have any luck. Has anyone managed to plug any of the pretrained models in it?

Stage 1 broken w/ custom dataset

https://github.com/Iamgoofball/ParallelWaveGAN/tree/mlp_dataset
My dataset is located on this branch at egs/mlp, PR number #147. Branch is a derivative of feature/mb-melgan.
train_nodev processes properly, as shown by the screenshot attached to the issue.
However, dev and eval don't.
The preprocessing log reveals the following:
https://pastebin.com/GSQmXqxT
Is this an issue with my data or an issue with ParallelWaveGAN? It's all 48khz 16-bit PCM wave files located at https://drive.google.com/file/d/1IgTkZiYOk8MGq1ZQ-P2NskAF8ttkIcYp/view?usp=sharing

Spectrum not continuous

Hi, I just found that when I use Parallel WaveGAN to synthesize, the spectrum is not continuous. My config file is from ljspeech/*v1.yaml. Are there any suggestions for my problem? Thank you.

[HELP] Could you provide your pretrained models?

Hey guys, recently I implemented many recipes and added new configs such as melgan.
I want to try these new configs for each recipe and upload pretrained models, but I do not have enough GPUs.
Therefore, if you have trained models which have not yet been uploaded as pretrained models, could you provide them to me?

Thank you for your cooperation in advance.

TODO

parallel_wavegan.v1

melgan.v3

  • JSUT
  • CSMSC
  • CMU Arctic
  • JNAS
  • VCTK
  • LibriTTS

parallel_wavegan.v3

This configuration requires too much time, so low priority.

  • JSUT
  • CSMSC
  • CMU Arctic
  • JNAS
  • VCTK
  • LibriTTS

With my dataset, the ./run.sh script runs without error but it doesn't create any folders

I created my csv file with 229 lines (and 229 corresponding wavs); I know it's not enough for good quality.
I replaced the downloaded LJSpeech data with my own data.
After I run ./run.sh, the script runs without any error, but it doesn't create any directories like data, dump, or exp.
I checked my GPU and it is working very hard! But where are the directories?

Note 1: my text and wavs are in the Persian language (the csv is UTF-8).
Note 2: when I run the script with the original LJSpeech data, it works perfectly without any problem.

(base) ai@ai-Z390-GAMING-X:~/ParallelWaveGAN/egs/ljspeech/voc1$ ./run.sh
Stage -1: Data download
Already exists. Skipped.
Stage 0: Data preparation
Successfully split data directory.
Successfully split data directory.
Successfully prepared data.
Stage 1: Feature extraction
Feature extraction start. See the progress via dump/train_nodev/raw/preprocessing.*.log.
Feature extraction start. See the progress via dump/eval/raw/preprocessing.*.log.
Feature extraction start. See the progress via dump/dev/raw/preprocessing.*.log.
Successfully make subsets.
Successfully make subsets.
Successfully make subsets.
Successfully finished feature extraction of dev set.
Successfully finished feature extraction of eval set.
Successfully finished feature extraction of train_nodev set.
Successfully finished feature extraction.
Statistics computation start. See the progress via dump/train_nodev/compute_statistics.log.
Successfully finished calculation of statistics.
Nomalization start. See the progress via dump/train_nodev/norm/normalize.*.log.
Nomalization start. See the progress via dump/dev/norm/normalize.*.log.
Nomalization start. See the progress via dump/eval/norm/normalize.*.log.
Successfully finished normalization of eval set.
Successfully finished normalization of train_nodev set.
Successfully finished normalization of dev set.
Successfully finished normalization.
Stage 2: Network training
Training start. See the progress via exp/train_nodev_ljspeech_parallel_wavegan.v1/train.log.

RuntimeError: Format is invalid: csmsc_009901 csmsc_009901 0.285

Traceback (most recent call last):
File "/home/data/xfding/miniconda3/envs/env_tts/bin/parallel-wavegan-preprocess", line 11, in
load_entry_point('parallel-wavegan==0.3.4', 'console_scripts', 'parallel-wavegan-preprocess')()
File "/home/data/xfding/miniconda3/envs/env_tts/lib/python3.7/site-packages/parallel_wavegan-0.3.4-py3.7.egg/parallel_wavegan/bin/preprocess.py", line 119, in main
File "/home/data/xfding/miniconda3/envs/env_tts/lib/python3.7/site-packages/parallel_wavegan-0.3.4-py3.7.egg/parallel_wavegan/datasets/scp_dataset.py", line 155, in init
File "/home/data/xfding/miniconda3/envs/env_tts/lib/python3.7/site-packages/kaldiio-2.15.1-py3.7.egg/kaldiio/matio.py", line 62, in load_scp
File "/home/data/xfding/miniconda3/envs/env_tts/lib/python3.7/site-packages/kaldiio-2.15.1-py3.7.egg/kaldiio/matio.py", line 145, in init

I have this problem, anyone can help me ? thx

Strange noise in long continuous voice

Hi Kan, thank you for your great work.
I have tried your work with singing voice, but I found that it can't perform as well as it does on speech for long continuous singing voice generation.
There are some bad cases; can you give me some ideas about this problem?

data_download script in CSMSC

I'd like to report that the shell script csmsc/voc1/local/data_download.sh breaks at line 24 with an error
find: ‘CSMSC/PhoneLabeling’: No such file or directory

I removed CSMSC/, and the script works.

AssertionError

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/data/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/data/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/data/ParallelWaveGAN/parallel_wavegan/bin/train.py", line 513, in call
self._check_length(x, c, self.hop_size, 0)
File "/data/ParallelWaveGAN/parallel_wavegan/bin/train.py", line 556, in _check_length
assert len(x) == (len(c) - 2 * context_window) * hop_size
AssertionError

I changed the config as follows:
hop_size: 200
upsample_scales: [5, 4, 2, 5]
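For context, that assertion checks that the waveform and mel lengths are consistent with hop_size and the auxiliary context window, so it usually fires when features were extracted with settings different from the config. A quick per-utterance sanity check, sketched under the assumption that you have the waveform and features as arrays (the file names are examples and aux_context_window = 2 is only the common default):

import numpy as np

hop_size = 200
aux_context_window = 2  # assumed; check aux_context_window in your config

audio = np.load("utt1_wave.npy")  # raw waveform samples
mel = np.load("utt1_feats.npy")   # (#frames, #mels)

# Mirrors the failing assertion from train.py: the two numbers should match.
print(len(audio), (len(mel) - 2 * aux_context_window) * hop_size)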

Mismatch between sample rate in preprocess and what librosa says

I am trying to use ParallelWaveGAN with the v3 weights (https://drive.google.com/drive/folders/1a5Q2KiJfUQkVFo5Bd1IoYPVicJGnm7EL), and in preprocessing the assert fails because it says the sample rate is 16k, but when opening the file with librosa it says it is 22.05k. Why is this happening? If I comment out the line and execute, everything works, but the output audio sounds sped up (I guess that is expected).

This is the sample file:
https://drive.google.com/open?id=1M-Bv84EpEMTibMS-HumekhIcgE3TJ99k
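If the audio and the config disagree like this, one common fix is to resample the audio to the config's sampling_rate before preprocessing (or to pick the config that matches your data); a minimal sketch (the paths and target rate are examples):

import librosa
import soundfile as sf

target_sr = 22050  # must match sampling_rate in the vocoder's config.yml
audio, _ = librosa.load("sample.wav", sr=target_sr)  # librosa resamples on load
sf.write("sample_22050.wav", audio, target_sr)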

Colab Notebook taking too long?

Thanks to the author for the Google Colab Notebook.
I managed to install all prerequisites without problem.
Yet, executing the cell "Synthesis" hasn't finished after 1h 20m.
This seems to be too long for the short example sentence,
so I would be grateful for any time/GPU!
As a side note, I'm using Colab Pro, which offers an Nvidia P100 GPU.

Generator exploded after ~138K iters.

I observed an interesting behaviour after 138K iters where the discriminator dominated the training and the generator exploded in both train and validation losses. Do you have any idea why, and how to prevent it?

I am training on LJSpeech and I basically use the same learning schedule you released with the v2 config for LJSpeech. (Train generator until 100K and enable the discriminator)

Here is the tensorboard screenshot.


Possible BUG may make your code not achieve the best performance. (background noise, etc.)

Hi @kan-bayashi, I think there is a bug that prevents your code from achieving the best performance for both MelGAN and PWG. After training the generator for one step, you should re-compute y_ and use this re-computed y_ for the discriminator, but your code doesn't seem to do that. In my experiments, re-computing y_ is crucial for obtaining the best quality. With my TensorFlow MelGAN code I can get the same performance as your code in only around 2M steps from scratch (without the PWG auxiliary loss to help convergence), but if I don't re-compute y_, my TF code at 2M steps is not as good as it is at 2M steps when y_ is re-computed.
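A schematic sketch of the ordering being discussed, in illustrative PyTorch (this is not the trainer code in this repository): Option B re-computes the fake waveform with the already-updated generator before the discriminator step.

import torch


def train_step(generator, discriminator, z, c, x, opt_g, opt_d, adv_criterion):
    # 1) Generator update.
    y_hat = generator(z, c)
    pred_fake = discriminator(y_hat)
    g_loss = adv_criterion(pred_fake, torch.ones_like(pred_fake))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    # 2) Discriminator update.
    # Option A: reuse the y_hat computed before the generator update (stale sample).
    # Option B (recommended in this issue): re-compute y_hat with the updated generator.
    with torch.no_grad():
        y_hat = generator(z, c)
    pred_real = discriminator(x)
    pred_fake = discriminator(y_hat)
    d_loss = adv_criterion(pred_real, torch.ones_like(pred_real)) + \
        adv_criterion(pred_fake, torch.zeros_like(pred_fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()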

How to train with low quality data?

Any suggestions for training the model with low-quality data, such as audio recorded with an iPhone? I have tried to train with such a dataset, and the discriminator loss decreases to almost zero (0.006) after 110K steps, which means it's hard to get a good result.

How to re-train on music?

Hello, the examples you give for music sound amazing, the best I've heard from any GAN so far. How can I train this on my own set of music?

increase LJSpeech data rows

Hi, I want to create my own dataset for the Persian language.
Before creating the dataset, I want to shrink the LJSpeech data to test what happens if my dataset is smaller than it.
I kept the first 1000 rows, deleted the data after the 1000th row, and after running run.sh I get this error:

Nomalization start. See the progress via dump/dev/norm/normalize.log.
Nomalization start. See the progress via dump/eval/norm/normalize.log.
Successfully finished normalization of dev set.
Successfully finished normalization of eval set.
run.pl: job failed, log is in dump/train_nodev/norm/normalize.log
./run.sh: 1 background jobs are failed.

normalize.log:
99%|█████████▉| 12484/12600 [00:21<00:00, 861.87it/s]
100%|█████████▉| 12571/12600 [00:21<00:00, 860.90it/s]
100%|██████████| 12600/12600 [00:21<00:00, 577.12it/s]
Traceback (most recent call last):
File "/usr/local/bin/parallel-wavegan-normalize", line 11, in
load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-normalize')()
File "/home/ai/ParallelWaveGAN/parallel_wavegan/bin/normalize.py", line 123, in main
[delayed(_process_single_file)(data) for data in tqdm(dataset)])
File "/usr/local/lib/python3.6/dist-packages/joblib/parallel.py", line 950, in call
n_jobs = self._initialize_backend()
File "/usr/local/lib/python3.6/dist-packages/joblib/parallel.py", line 711, in _initialize_backend
**self._backend_args)
File "/usr/local/lib/python3.6/dist-packages/joblib/_parallel_backends.py", line 517, in configure
**memmappingexecutor_args)
File "/usr/local/lib/python3.6/dist-packages/joblib/executor.py", line 42, in get_memmapping_executor
initargs=initargs, env=env)
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/reusable_executor.py", line 116, in get_reusable_executor
executor_id=executor_id, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/reusable_executor.py", line 153, in init
initializer=initializer, initargs=initargs, env=env)
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py", line 915, in init
self._processes_management_lock = self._context.Lock()
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/context.py", line 225, in Lock
return Lock()
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/synchronize.py", line 174, in init
super(Lock, self).init(SEMAPHORE, 1, 1)
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/synchronize.py", line 90, in init
resource_tracker.register(self._semlock.name, "semlock")
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/resource_tracker.py", line 171, in register
self.ensure_running()
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/resource_tracker.py", line 143, in ensure_running
pid = spawnv_passfds(exe, args, fds_to_pass)
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/resource_tracker.py", line 301, in spawnv_passfds
return fork_exec(args, _pass)
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/backend/fork_exec.py", line 43, in fork_exec
pid = os.fork()
#OSError: [Errno 12] Cannot allocate memory
Accounting: time=23 threads=1
Ended (code 1) at Sat Feb 22 09:50:33 +0330 2020, elapsed time 23 seconds

and my RAM and GPU memory are free (more than 80% free).

Spectral convergence loss, what does it measure?

Hi There!

I'd like to better understand spectral convergence loss. In the literature, these are the mentions I have found so far:

SC loss emphasizes highly on large spectral components, which helps especially in the early phases of training.
"Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks" https://arxiv.org/pdf/1808.06719.pdf

Because the spectral convergence loss emphasizes spectral peaks and the log STFT magnitude loss accurately fits spectral valleys
"Probability Density Distillation with Generative Adversarial Networks for High-Quality Parallel Waveform Generation" https://arxiv.org/pdf/1904.04472.pdf

The above explanations for including the loss function are fairly vague and short. Furthermore, I am unable to find any mentions of a similar loss elsewhere in the literature. To better describe the loss, I searched for "relative spectral power" (f.y.i., since the spectrogram to the power of two is the "power spectrogram", and the sum of the power spectrogram is a "power spectral density").

Lastly, this paper from Google Brain "DDSP: DIFFERENTIABLE DIGITAL SIGNAL PROCESSING" https://arxiv.org/pdf/2001.04643.pdf just uses a spectrogram magnitude loss. It doesn't train with a spectral convergence loss and their results are pretty good.

Have you tried training without spectral convergence loss? Is there some more literature I am missing that validates it as a perceptual loss?
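For reference, the spectral convergence loss used in this repository (as defined in the Parallel WaveGAN paper) is the Frobenius norm of the magnitude-spectrogram error normalized by the Frobenius norm of the reference magnitude spectrogram:

L_{\mathrm{sc}}(x, \hat{x}) = \frac{\big\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \big\|_F}{\big\| \, |\mathrm{STFT}(x)| \, \big\|_F}

Because the error is normalized by the reference spectral energy, the largest spectral components dominate the ratio, which matches the "emphasizes spectral peaks / large spectral components" descriptions quoted above; the log STFT magnitude loss complements it on low-energy regions.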

InaccessibleTensorError: The tensor 'Tensor conv2d_80/dilation_rate:

hi @dathudeptrai @kan-bayashi
I get this error when the line “audio = TFMelGANGenerator(**config["generator_params"])(inputs)” runs:

InaccessibleTensorError: The tensor 'Tensor("conv2d_80/dilation_rate:0", shape=(2,), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=call, id=139867161584976); accessed from: FuncGraph(name=call, id=139867559519064).'

Originally posted by @xzm2004260 in #112 (comment)

cannot preprocess csmsc dataset

Hello Kan-bayashi,

Thank you for your awesome work. I am trying to run the code for the Mandarin dataset, but I met the following problem during stage 0; can you help here? Thank you so much.
.....
Extracting PhoneLabeling/009997.interval OK
Extracting PhoneLabeling/009998.interval OK
Extracting PhoneLabeling/009999.interval OK
Extracting PhoneLabeling/010000.interval OK
All OK
Successfully finished download.
Stage 0: Data preparation
Successfully split data directory.
Successfully split data directory.
Successfully prepared data.
Stage 1: Feature extraction
Feature extraction start. See the progress via dump/train_nodev/raw/preprocessing.*.log.
Feature extraction start. See the progress via dump/dev/raw/preprocessing.*.log.
Feature extraction start. See the progress via dump/eval/raw/preprocessing.*.log.
Successfully make subsets.
Successfully make subsets.
Successfully make subsets.
run.pl: 16 / 16 failed, log is in dump/eval/raw/preprocessing.*.log
run.pl: 16 / 16 failed, log is in dump/train_nodev/raw/preprocessing.*.log
run.pl: 16 / 16 failed, log is in dump/dev/raw/preprocessing.*.log
./run.sh: 3 background jobs are failed.

TTS + ParallelWaveGAN progress

If you don't mind, I like to share my progress with PWGAN with TTS.

Here is the first try results:
https://soundcloud.com/user-565970875/sets/ljspeech_tacotron_5233_paralle

The results are not better than what we have with WaveRNN, but I should say it is much faster.

There is a hissing noise in the background. If you have any ideas on how to get rid of it, please let me know.

The only difference in training (I guess) is that I don't apply mean normalization to the mel-spectrograms and instead normalize them to the [-4, 4] range.

How about combining ParallelWaveGAN with MelGAN?

Hi @kan-bayashi, First I would like to thank you for this implementation.

After reading the ParallelWaveGAN paper, I realized there was another paper out at the same time, MelGAN (https://arxiv.org/abs/1910.06711). These two papers have very similar ideas but approach the problem from completely different directions and can complement each other. Here is my summary of the two models to give you an overview:

  • ParallelWavegan:
  1. Use noise vector for generator.
  2. Use upsampling + conv2d for generator.
  3. MultiScale STFT Loss for generator.
  4. Radam optimizer.
  • MelGan:
  1. Don't use noise vector.
  2. use transpose-2D with carefully chosen Kernel-size and stride to avoid checkerboard artifact
  3. MultiScale Discriminator.
  4. Feature matching for Discriminator.
  5. Adam

In terms of speed on CPU (MelGAN consists of 6 layers, ParallelWaveGAN of 30 layers), MelGAN is around 7 times faster than ParallelWaveGAN.

As you can see, we can combine these two models into one :))). Tell me what you think about that :)). @kan-bayashi

Many iterations of discriminator training causes strange noise

I compared the following two models:

  • (Red) The model which trains the discriminator from 200k iters
  • (Blue) The model which trains the discriminator from the first iter
    Here is the training curve.


From the curve, the blue one is better than the red in terms of log STFT magnitude loss.

However, the blue model causes strange noise.

You can listen to the samples.
https://drive.google.com/open?id=1LL_A4ysUqKJ13YQBdQwzNBvGp8m8BhqY

I think this is caused by the discriminator (v1 is red and v2 is blue).
If you have any idea or suggestion to avoid this issue, please share with me.

memory when inference

I want to know how much memory is used during inference. When I test, the result is 1.5 GB while the model size is only 1.44M; I think the memory usage is too large.

Training with 16kHz audio

What configuration parameters should I modify in order to train with 16kHz audio?

I intend to work with:
sampling_rate: 16000
fft_size: 512
hop_size: 128

I believe I should also change upsample_scales and stft_loss_params, but I am not sure which values to use....
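One constraint worth keeping in mind when choosing these values: the product of upsample_scales must equal hop_size, because the generator upsamples the auxiliary mel features back to the waveform resolution. A tiny sketch of the check, with [4, 4, 4, 2] as one possible (assumed, not officially recommended) factorization of 128:

import numpy as np

hop_size = 128
upsample_scales = [4, 4, 4, 2]  # one possible factorization: 4 * 4 * 4 * 2 = 128

# The generator upsamples mel frames by prod(upsample_scales), so this must hold.
assert np.prod(upsample_scales) == hop_size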

My Persian dataset doesn't work correctly after training

Hi,
I created my csv file with 229 lines (and 229 corresponding wavs); I know it's not enough for good quality.
I replaced the downloaded LJSpeech data with my own data.

After training, I use this code for text-to-speech:

https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb#scrollTo=9gGRzrjyudWF

It works very well for English, but with my Persian dataset it creates a weird voice that doesn't sound like Persian:

Input your favorite sentencne in English!
result:
که چقدر از اینکه پیششون نبود ولشون کرده بوده ناراحت بوده
Cleaned text: KHH CHQDR Z YNKHH PYSHSHWN NBWD WLSHWN KHRDH BWDH NRHT BWDH
RTF = 0.220197

Must I make any change in the ParallelWaveGAN or LJSpeech config file for the Persian language, or any change to this script:
https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb#scrollTo=9gGRzrjyudWF

(I ask because it says "Input your favorite sentencne in English!")

Note: the cleaned text (KHH CHQDR Z YNKHH PYSHSHWN NBWD WLSHWN KHRDH BWDH NRHT BWDH) is almost correct, but the voice doesn't read it!
Note 2: this sentence exists in my dataset (که چقدر از اینکه پیششون نبود ولشون کرده بوده ناراحت بوده).

error in ./run.sh

By running ./run.sh I get this error:

[Parallel(n_jobs=16)]: Done 6018 tasks      | elapsed:  2.4min
[Parallel(n_jobs=16)]: Done 7168 tasks      | elapsed:  2.7min
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/backend/utils.py:55: UserWarning: Failed to kill subprocesses on this platform. Pleaseinstall psutil: https://github.com/giampaolo/psutil
  warnings.warn("Failed to kill subprocesses on this platform. Please"
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 608, in __call__
    return self.func(*args, **kwargs)
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/bahram/ParallelWaveGAN/parallel_wavegan/bin/preprocess.py", line 178, in _process_single_file
    write_hdf5(os.path.join(args.dumpdir, f"{utt_id}.h5"), "wave", audio.astype(np.float32))
  File "/home/bahram/ParallelWaveGAN/parallel_wavegan/utils/utils.py", line 87, in write_hdf5
    hdf5_file = h5py.File(hdf5_name, "r+")
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/h5py/_hl/files.py", line 175, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (bad object header version number)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bahram/ParallelWaveGAN/tools/venv/bin/parallel-wavegan-preprocess", line 11, in <module>
    load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-preprocess')()
  File "/home/bahram/ParallelWaveGAN/parallel_wavegan/bin/preprocess.py", line 190, in main
    [delayed(_process_single_file)(data) for data in tqdm(dataset)])
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/bahram/ParallelWaveGAN/tools/venv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
OSError: Unable to open file (bad object header version number)
# Accounting: time=404 threads=1

Questions on E2E-TTS demo

Dear Kan-Bayashi,

Got two questions on your E2E-TTS mandarin demo on colab. Can help here?

Download pretrained models

You can select Transformer or FastSpeech.

  1. For the Mandarin demo, you only provide Transformer and FastSpeech options but not Tacotron2. Is there any recipe to train Tacotron2 using ESPnet and to combine it with your Parallel WaveGAN?
  2. How did you train your Transformer and FastSpeech models? Is there any recipe I can follow to reproduce them?

In other words, my main concern is how to combine Tacotron2/Transformer/FastSpeech with Parallel WaveGAN, and how to make the mel-spectrogram settings common to both models.

Thank you very much.

Setting audio parameters in the ljspeech conf for a personal dataset caused errors

Great work!!!
I copied ljspeech/voc1/* into my own directory. When I set the audio parameters for my own dataset, I met errors, for example:
sampling_rate: 48000
fft_size: 4096
hop_size: 600
win_length: 2400
parallel_wavegan/models/parallel_wavegan.py line 145
assert c.size(-1) == x.size(-1)
How should I set the right parameters?

decoding cannot distribute on all GPUs

I have 2 GPUs; the stage 2 training process distributed over the 2 cards went well. However, at stage 3 (network decoding), one dataset always succeeds but another fails. The failed decode.log gives RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable. It seems the decoding uses only a single GPU (device 0) although I have 2 cards. I used the following command line for decoding:
./run.sh --stage 3 --n_gpus 2

How to use pre-trained model?

Hello, I just found "checkpoint-400000steps.pkl" in your Google Drive, and I want to synthesize Chinese voice with your pretrained CSMSC model, but I don't know how to use it.
Can you tell me whether the pre-trained model can synthesize Mandarin voice, and if so, how?
Thanks as always!
