kan-bayashi / pytorchwavenetvocoder

WaveNet-Vocoder implementation with pytorch.

Home Page: https://kan-bayashi.github.io/WaveNetVocoderSamples/

License: Apache License 2.0

Python 33.98% Shell 55.05% Perl 10.84% Makefile 0.13%

neural-network wavenet pytorch wavenet-vocoder speech-synthesis neural-vocoder

pytorchwavenetvocoder's Issues

could not find wav folder

Hey, thanks for the amazing repo on the WaveNet vocoder. I tried running the stages of https://github.com/kan-bayashi/PytorchWaveNetVocoder/blob/master/egs/arctic/si-open-melspc/run.sh. When I ran stage 1, an error occurred.

The relevant code:

# make scp files
    if [ ${highpass_cutoff} -eq 0 ];then
        cp "data/${train}/wav.scp" "data/${train}/wav_hpf.scp"
    else
        find "wav/${train}" -name "*.wav" | sort > "data/${train}/wav_hpf.scp"
    fi

This executes the else branch, which looks for files in the wav folder. However, no wav folder has been created beforehand, so the lookup fails. Could you please help with this?
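
For illustration, here is a minimal Python sketch of the same branch, assuming the directory layout from the snippet above (data/<train>/wav.scp and wav/<train>/*.wav). The function name make_hpf_scp, the root parameter, and the error message are hypothetical and not part of the repository. It only makes explicit that the else branch works when wav/<train> has already been populated.

```python
import shutil
from pathlib import Path


def make_hpf_scp(train: str, highpass_cutoff: int, root: Path = Path(".")) -> Path:
    """Write data/<train>/wav_hpf.scp, mirroring the shell branch above."""
    data_dir = root / "data" / train
    out = data_dir / "wav_hpf.scp"
    if highpass_cutoff == 0:
        # No filtering requested: reuse the original list unchanged.
        shutil.copy(data_dir / "wav.scp", out)
        return out
    wav_dir = root / "wav" / train
    if not wav_dir.is_dir():
        # Fail early with a clear message instead of an opaque find error.
        raise FileNotFoundError(f"{wav_dir} does not exist; nothing to list")
    paths = sorted(str(p) for p in wav_dir.rglob("*.wav"))
    out.write_text("".join(p + "\n" for p in paths))
    return out
```

Calling make_hpf_scp("tr_slt", 70) in a directory without wav/tr_slt raises FileNotFoundError, which corresponds to the situation described in this issue.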

ValueError: operands could not be broadcast together with shapes

Tried following the example in the README and got this error. Any ideas?

Process Process-1:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "./../../../src/bin/decode.py", line 276, in gpu_decode
    for feat_ids, (batch_x, batch_h, n_samples_list) in generator:
  File "./../../../src/bin/decode.py", line 140, in decode_generator
    h = feat_transform(h)
  File "/Users/michaelp/Code/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 42, in __call__
    img = t(img)
  File "./../../../src/bin/decode.py", line 245, in <lambda>
    lambda x: scaler.transform(x)])
  File "/Users/michaelp/Code/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 692, in transform
    X -= self.mean_
ValueError: operands could not be broadcast together with shapes (633,30) (28,) (633,30)

WAV File I used: https://drive.google.com/file/d/1-2aMp0gyxn0Km25him8C_PRV2Y8V7WMk/view?usp=sharing
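
The shapes in the error message say that each frame of the extracted features has 30 values while the stored mean has 28. A small guard run before scaling can report this more readably. This is only a sketch: check_feature_dim is a hypothetical helper, not a function in the repository, and the hint about differing extraction settings is an assumption rather than a diagnosis.

```python
def check_feature_dim(feat_shape, stats_dim):
    """Raise a readable error when feature and statistics dimensions differ."""
    n_frames, feat_dim = feat_shape
    if feat_dim != stats_dim:
        raise ValueError(
            f"features have {feat_dim} values per frame but the statistics "
            f"were computed for {stats_dim}; the two may come from "
            f"different feature extraction settings"
        )
    return n_frames, feat_dim
```

With the shapes from the traceback, check_feature_dim((633, 30), 28) raises immediately, before the scaler is applied.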

Possible bug?

In feature_extract.py lines 241-242 /feat holds the extended features (upsampled features), and /feat_org holds the original features.

On the other hand, decode.py lines 74-77, /feat is loaded from the features file when upsampling_factor == 0, and /feat_org is loaded otherwise.

Shouldn't it be the other way around?

Thanks

Failed building wheel for pyworld and dtw-c

Hi, I followed these instructions to set up:

$ git clone https://github.com/kan-bayashi/PytorchWaveNetVocoder.git
$ cd PytorchWaveNetVocoder/tools
$ make

However, this resulted in the following errors:

x86_64-linux-gnu-gcc: error: pyworld/pyworld.cpp: No such file or directory
x86_64-linux-gnu-gcc: fatal error: no input files
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Failed building wheel for pyworld
Running setup.py clean for pyworld
Building wheel for dtw-c (setup.py) ... error
Complete output from command /home/ubuntu/Projects/NLP/ajalaSpeech/TTS/experiments/wavenet_vocoder/PytorchWaveNetVocoder/tools/venv/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-mnveb40h/dtw-c/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-b0fie57u --python-tag cp36:
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/dtw_c
copying dtw_c/__init__.py -> build/lib.linux-x86_64-3.6/dtw_c
running build_ext
building 'dtw_c.dtw_c' extension
error: unknown file type '.pyx' (from 'dtw_c/dtw_c.pyx')

Failed building wheel for dtw-c
Running setup.py clean for dtw-c
Failed to build pyworld dtw-c

Do these final messages in stdout indicate that the earlier build/compilation failures were rectified by a later sanity check?

Running setup.py install for pyworld ... done
Running setup.py install for dtw-c ... done
Successfully installed PyWavelets-1.0.2 audioread-2.1.6 cffi-1.12.2 cycler-0.10.0 cython-0.29.6 decorator-4.4.0 dtw-1.3.3 dtw-c-0.6.0 fastdtw-0.3.2 h5py-2.9.0 imageio-2.5.0 joblib-0.13.2 kiwisolver-1.0.1 librosa-0.6.3 llvmlite-0.28.0 matplotlib-3.0.3 networkx-2.2 numba-0.43.1 numpy-1.13.3 pillow-6.0.0 pycparser-2.19 pyparsing-2.4.0 pysptk-0.1.16 python-dateutil-2.8.0 pyworld-0.2.8 pyyaml-5.1 resampy-0.2.1 scikit-image-0.15.0 scikit-learn-0.20.3 scipy-1.2.1 six-1.12.0 soundfile-0.10.2 sprocket-vc-0.18.2 torch-1.0.1.post2 torchvision-0.2.2.post3
touch venv/bin/activate

Or do I have a faulty build here?

Any help would be appreciated!
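
One direct way to check is to try importing the two packages inside the virtual environment. The sketch below assumes the importable module names are pyworld and dtw_c, as suggested by the paths in the log (pyworld/pyworld.cpp and dtw_c/dtw_c.pyx). The helper try_imports is hypothetical.

```python
import importlib


def try_imports(module_names):
    """Return {name: error message or None} after attempting each import."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # compiled extensions can fail in many ways
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    for name, error in try_imports(["pyworld", "dtw_c"]).items():
        print(name, "OK" if error is None else error)
```

If both lines print OK when run with tools/venv/bin/python, the later "Running setup.py install ... done" step produced usable modules even though the wheel build failed.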

Integration with Merlin

Has anyone looked at integrating the WaveNet vocoder with Merlin (https://github.com/CSTR-Edinburgh/merlin/)? Merlin is an open-source TTS system for acoustic and duration modelling that uses Ossian or Festival as a front-end. By default it uses the WORLD vocoder and therefore extracts WORLD vocoder features, so an integration with Merlin should be possible. I'm interested to hear whether anyone has tried this and can offer some guidance.

How to generate speech from features with WORLD vocoder

Hi,

I'm trying to debug my system, which uses your WaveNet vocoder. Is there any way to create a WAV file from the features your code generates?

Thanks

Some questions about the subjective evaluation (MOS chart)

If my understanding is correct, the vocoders on the MOS chart were evaluated under the condition that the vocoder inputs were features extracted with STRAIGHT and the outputs were raw waveforms. If so, why did STRAIGHT get such a low score? Shouldn't it score as high as the raw waveform does?

Speech to Speech

Does it make sense to use the WaveNet vocoder as-is for speech-to-speech? For example, can I record my voice, generate a mel-spectrogram, and then use a model pre-trained on the LJSpeech dataset to re-speak it?

I've been trying this and the results don't sound good!

Trying to train on nancy corpus subset

Hi,

I'm trying to train a model on 200 wav files (100 train/val) from the nancy corpus (Blizzard 2011 dataset). I modified the egs/arctic/sd/run.sh script to process my own files.

I get an error related to the size of the batch tensor, perhaps something to do with upsampling?

Here is the full error log:

# train.py --n_gpus 1 --waveforms data/train/wav_ns.scp --feats data/train/feats.scp --stats data/train/stats.h5 --expdir exp/tr_nancy_16k_sd_nancy_lr1e-4_wd0.0_bl20000_bs1_ns_up --n_quantize 256 --n_aux 28 --n_resch 512 --n_skipch 256 --dilation_depth 10 --dilation_repeat 3 --lr 1e-4 --weight_decay 0.0 --iters 200000 --batch_length 20000 --batch_size 1 --checkpoints 10000 --use_speaker_code false --upsampling_factor 80 --resume
# Started at Wed Mar 14 20:48:13 UTC 2018
#
/home/ubuntu/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WaveNet(
  (onehot): OneHot(
  )
  (causal): CausalConv1d(
    (conv): Conv1d(256, 512, kernel_size=(2,), stride=(1,), padding=(1,))
  )
  (upsampling): UpSampling(
    (conv): ConvTranspose2d(1, 1, kernel_size=(1, 80), stride=(1, 80))
  )
  (dil_sigmoid): ModuleList(
    (0): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(1,))
    )
    (1): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
    )
    (2): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(4,), dilation=(4,))
    )
    (3): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(8,), dilation=(8,))
    )
    (4): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(16,), dilation=(16,))
    )
    (5): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(32,), dilation=(32,))
    )
    (6): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(64,), dilation=(64,))
    )
    (7): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(128,), dilation=(128,))
    )
    (8): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(256,), dilation=(256,))
    )
    (9): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(512,), dilation=(512,))
    )
    (10): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(1,))
    )
    (11): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
    )
    (12): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(4,), dilation=(4,))
    )
    (13): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(8,), dilation=(8,))
    )
    (14): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(16,), dilation=(16,))
    )
    (15): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(32,), dilation=(32,))
    )
    (16): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(64,), dilation=(64,))
    )
    (17): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(128,), dilation=(128,))
    )
    (18): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(256,), dilation=(256,))
    )
    (19): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(512,), dilation=(512,))
    )
    (20): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(1,))
    )
    (21): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
    )
    (22): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(4,), dilation=(4,))
    )
    (23): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(8,), dilation=(8,))
    )
    (24): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(16,), dilation=(16,))
    )
    (25): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(32,), dilation=(32,))
    )
    (26): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(64,), dilation=(64,))
    )
    (27): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(128,), dilation=(128,))
    )
    (28): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(256,), dilation=(256,))
    )
    (29): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(512,), dilation=(512,))
    )
  )
  (dil_tanh): ModuleList(
    (0): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(1,))
    )
    (1): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
    )
    (2): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(4,), dilation=(4,))
    )
    (3): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(8,), dilation=(8,))
    )
    (4): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(16,), dilation=(16,))
    )
    (5): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(32,), dilation=(32,))
    )
    (6): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(64,), dilation=(64,))
    )
    (7): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(128,), dilation=(128,))
    )
    (8): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(256,), dilation=(256,))
    )
    (9): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(512,), dilation=(512,))
    )
    (10): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(1,))
    )
    (11): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
    )
    (12): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(4,), dilation=(4,))
    )
    (13): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(8,), dilation=(8,))
    )
    (14): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(16,), dilation=(16,))
    )
    (15): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(32,), dilation=(32,))
    )
    (16): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(64,), dilation=(64,))
    )
    (17): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(128,), dilation=(128,))
    )
    (18): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(256,), dilation=(256,))
    )
    (19): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(512,), dilation=(512,))
    )
    (20): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(1,))
    )
    (21): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
    )
    (22): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(4,), dilation=(4,))
    )
    (23): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(8,), dilation=(8,))
    )
    (24): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(16,), dilation=(16,))
    )
    (25): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(32,), dilation=(32,))
    )
    (26): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(64,), dilation=(64,))
    )
    (27): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(128,), dilation=(128,))
    )
    (28): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(256,), dilation=(256,))
    )
    (29): CausalConv1d(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(1,), padding=(512,), dilation=(512,))
    )
  )
  (aux_1x1_sigmoid): ModuleList(
    (0): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (1): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (2): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (3): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (4): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (5): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (6): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (7): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (8): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (9): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (10): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (11): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (12): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (13): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (14): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (15): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (16): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (17): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (18): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (19): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (20): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (21): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (22): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (23): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (24): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (25): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (26): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (27): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (28): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (29): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
  )
  (aux_1x1_tanh): ModuleList(
    (0): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (1): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (2): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (3): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (4): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (5): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (6): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (7): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (8): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (9): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (10): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (11): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (12): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (13): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (14): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (15): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (16): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (17): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (18): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (19): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (20): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (21): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (22): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (23): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (24): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (25): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (26): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (27): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (28): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
    (29): Conv1d(28, 512, kernel_size=(1,), stride=(1,))
  )
  (skip_1x1): ModuleList(
    (0): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (1): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (2): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (3): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (4): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (5): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (6): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (7): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (8): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (9): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (10): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (11): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (12): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (13): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (14): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (15): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (16): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (17): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (18): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (19): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (20): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (21): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (22): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (23): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (24): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (25): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (26): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (27): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (28): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
    (29): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
  )
  (res_1x1): ModuleList(
    (0): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (1): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (3): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (4): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (5): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (6): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (7): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (8): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (9): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (10): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (11): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (12): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (13): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (14): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (15): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (16): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (17): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (18): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (19): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (20): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (21): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (22): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (23): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (24): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (25): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (26): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (27): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (28): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
    (29): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
  )
  (conv_post_1): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
  (conv_post_2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
)
number of training data = 100.
batch length is decreased due to upsampling (20000 -> 19970)
Traceback (most recent call last):
  File "../../../src/bin/train.py", line 513, in <module>
    main()
  File "../../../src/bin/train.py", line 474, in main
    batch_output = model(batch_x, batch_h)
  File "/home/ubuntu/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/PytorchWaveNetVocoder/src/nets/wavenet.py", line 237, in forward
    self.skip_1x1[l], self.res_1x1[l])
  File "/home/ubuntu/PytorchWaveNetVocoder/src/nets/wavenet.py", line 511, in _residual_forward
    aux_output_sigmoid = aux_1x1_sigmoid(h)
  File "/home/ubuntu/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 168, in forward
    self.padding, self.dilation, self.groups)
  File "/home/ubuntu/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 54, in conv1d
    return f(input, weight, bias)
RuntimeError: Given groups=1, weight[512, 28, 1], so expected input[1, 64, 23040] to have 28 channels, but got 64 channels instead
# Accounting: time=7 threads=1
# Ended (code 1) at Wed Mar 14 20:48:20 UTC 2018, elapsed time 7 seconds

What could be the problem?
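
The lengths in this log can be reproduced with a little arithmetic, assuming standard stacked dilated causal convolutions in which a layer with kernel size k and dilation d adds (k - 1) * d samples of context. The function names below are hypothetical, and the trimming rule is inferred from the log line "20000 -> 19970" rather than taken from the source code.

```python
def receptive_field(dilation_depth, dilation_repeat, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    dilations = [2 ** i for i in range(dilation_depth)] * dilation_repeat
    return (kernel_size - 1) * sum(dilations) + 1


def adjusted_batch_length(batch_length, rf, upsampling_factor):
    """Trim batch_length so batch_length + rf is a multiple of the factor."""
    total = batch_length + rf
    total -= total % upsampling_factor
    return total - rf


rf = receptive_field(dilation_depth=10, dilation_repeat=3)
new_len = adjusted_batch_length(20000, rf, upsampling_factor=80)
print(rf, new_len, new_len + rf)  # 3070 19970 23040
```

The total of 23040 matches the time dimension of input[1, 64, 23040] in the error, so the lengths are consistent. What does not match is the channel count: the features have 64 dimensions while the model was built with --n_aux 28.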

auxiliary input

Hi, I'm wondering if you could help me. I'm trying to build your speaker-dependent vocoder in TensorFlow, but I'm struggling to understand how the auxiliary input is fed to the network. Is it processed in parallel with the sample values (two parallel layers), with the outputs combined at a later layer? If you could point me to a textbook or article on auxiliary input/conditioning networks, I would be eternally grateful. I've looked many times and can't find anything that gives a general understanding of this.
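
As a rough orientation, the model printed in the training log of the nancy corpus issue lists per-layer dil_tanh, dil_sigmoid, aux_1x1_tanh and aux_1x1_sigmoid modules, and its traceback contains aux_output_sigmoid = aux_1x1_sigmoid(h). That is consistent with the conditional gated activation described in the WaveNet paper, where the auxiliary features pass through their own 1x1 convolutions and are added to the dilated convolution outputs inside both the tanh and the sigmoid branch. The scalar sketch below shows only that combination for one time step and one channel. The function names are hypothetical and the real layers operate on tensors.

```python
import math


def sigmoid(v):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_unit(x_tanh, x_sigmoid, h_tanh, h_sigmoid):
    """Conditional gated activation for a single time step and channel.

    x_* are dilated-convolution outputs of the waveform branch and
    h_* are 1x1-convolution outputs of the auxiliary features.
    """
    return math.tanh(x_tanh + h_tanh) * sigmoid(x_sigmoid + h_sigmoid)
```

The auxiliary branch therefore acts as a learned, time-varying bias inside every residual block rather than as a separate network whose output is merged once at the end.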

about speaker

I would like to ask: if I train the network on speaker A's data and then give it speaker B's voice as input, will the result still be good? Or do I need to train it again on B's data?

Mirror for m-ailabs dataset

Hi @kan-bayashi
Could you please provide a temporary mirror link for the m-ailabs dataset:

http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/

Thank you in advance

RuntimeError: zero(s) are found in periodogram, use eps option to floor

Hi, I modified one of the arctic egs to train on a different dataset, and I get this error from the mcep extraction. I looked into it, and it seems that this happens when there is a long enough period of silence in the audio. Changing the mcep calculation line to the following seems to fix the problem:

    mcep = [pysptk.mcep(x[shiftl * i: shiftl * i + fftl] * win, dim, alpha, eps=EPS, etype=1) for i in range(n_frame)]

However, this might not be a complete fix. In particular, it looks like there are places where you use WORLD to calculate mcep. A colleague recalls that WORLD will actually dump core because of this problem and has no eps option, so fixing it might require adding a tiny bit of noise to the audio.
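
The "tiny bit of noise" idea could look roughly like the sketch below. It assumes floating-point samples in the range [-1, 1]. The helper add_dither and its scale and seed parameters are hypothetical, and plain lists are used only to keep the example self-contained; in practice this would more likely be done on a NumPy array.

```python
import random


def add_dither(samples, scale=1e-6, seed=0):
    """Return a copy of samples with small uniform noise added to each value."""
    rng = random.Random(seed)
    return [s + rng.uniform(-scale, scale) for s in samples]


silence = [0.0] * 8
dithered = add_dither(silence)
```

After dithering, a frame of digital silence no longer consists of exact zeros, which is the condition that triggers the periodogram error. For 16-bit integer audio the scale would need to be chosen relative to the integer range instead.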

typo in ljspeech/sd/run.sh

91 # stop when error occurred
92 set -euo pipfail

should be 'pipefail' shouldn't it?

To do

ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216

Hi, hope you can help.

I've been trying for a good part of the day but cannot get this to work. The issue is shown here:

########################################################### 
#                 DATA PREPARATION STEP                   #
###########################################################
###########################################################
#               FEATURE EXTRACTION STEP                   #
###########################################################
run.pl: job failed, log is in exp/feature_extract/feature_extract_tr_slt.log

And the matching error from the log is this:

# feature_extract.py --waveforms data/tr_slt/wav.scp --wavdir wav/tr_slt --hdf5dir hdf5/tr_slt --feature_type world --fs 16000 --shiftms 5 --minf0 120 --maxf0 275 --mcep_dim 24 --mcep_alpha 0.410 --highpass_cutoff 70 --fftl 1024 --n_jobs 10 
# Started at Mon  1 Apr 13:13:54 AEDT 2019
#
Traceback (most recent call last):
  File "../../../src/bin/feature_extract.py", line 19, in <module>
    import pysptk
  File "/home/roel/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/pysptk/__init__.py", line 41, in <module>
    from .sptk import *  # pylint: disable=wildcard-import
  File "/home/roel/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/pysptk/sptk.py", line 147, in <module>
    from . import _sptk
  File "__init__.pxd", line 872, in init pysptk._sptk
ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216
# Accounting: time=2 threads=1
# Ended (code 1) at Mon  1 Apr 13:13:56 AEDT 2019, elapsed time 2 seconds

This does NOT seem related to the numpy version on the host or in the virtual environment. I tried both, in numerous ways and numerous times, with both versions of numpy: the originally required one (1.13.3) and the latest.

The "ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216" bug can be found all over the net, but there are no good write-ups except for this one:
http://codebase.site/index.php/question/show_question_details/26004

I remain hopeful that manually building each dependency is not required.

Hope you have ideas/insights/"seen-this-before" comments.

Abnormal noise occurs in the silence region

Hi, I'm building a speaker-dependent WaveNet vocoder.

When I train the WaveNet vocoder, the generated waveform sometimes contains very loud noise in the silent regions of the original speech.

Once the waveform reaches a region where speech is present, correct speech samples are generated.

Could you tell me why this problem happens and how to solve it?

Development plan

This is a reminder of my action items.

Yaml format network configuration
Limit the range of F0 in Mel-spectrogram
Rich upsampling layer (stack of conv and deconv)
Lightweight WaveRNN
Single Gaussian WaveNet
Parallel WaveNet with Single Gaussian WaveNet

If I have free time, I will implement the above items.

Update information

2018/05/01

Updated to be compatible with pytorch v0.4
Updated to be able to use melspectrogram as auxiliary feature

Due to the above update, some parts have changed (see below).

# -------------------- #
# feature path in hdf5 #
# -------------------- #
old -> new
/feat_org -> /world or /melspc
/feat -> no longer saved (the extended feature is replicated when loading)

# ----------------------- #
# statistics path in hdf5 #
# ----------------------- #
old -> new
/mean -> /world/mean or /melspc/mean
/scale -> /world/scale or /melspc/scale

# ----------------------- #
# new options in training #
# ----------------------- #
--feature_type: Auxiliary feature type (world or melspc)
--use_upsampling_layer: Flag to decide whether to use upsampling layer in WaveNet
--upsampling_factor: Now always required because feature extension is performed when loading
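
The hdf5 path renames listed above can be written down as a small lookup, which may help when adapting scripts that still read the old paths. This is only a sketch of the table in this note. The helper new_hdf5_path is hypothetical and does not exist in the repository.

```python
def new_hdf5_path(old_path, feature_type):
    """Map an old hdf5 dataset path to its new location, or None if dropped."""
    if feature_type not in ("world", "melspc"):
        raise ValueError(f"unknown feature type: {feature_type}")
    mapping = {
        "/feat_org": f"/{feature_type}",
        "/feat": None,  # extended features are no longer saved
        "/mean": f"/{feature_type}/mean",
        "/scale": f"/{feature_type}/scale",
    }
    if old_path not in mapping:
        raise KeyError(old_path)
    return mapping[old_path]
```

For example, new_hdf5_path("/mean", "melspc") gives "/melspc/mean", and new_hdf5_path("/feat", "world") gives None because that dataset is now rebuilt at load time.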

Note that an old model file checkpoint-*.pkl can still be used, but the model.conf file must be modified as follows.

# how-to-convert to new config file
import torch
args = torch.load("old_model.conf")
args.use_upsampling_layer = True
args.feature_type = "world"
torch.save(args, "new_model.conf")

Some questions

Hi, I have some questions concerning your code:

1 - In the train_generator() function of the train.py script (lines 69 to 269), you create your batches using the buffers x_buffer and h_buffer. You initialize them at the beginning of your code and then fill them with new audio/feature data. My question refers to lines 135-136 and 178-179:

x_buffer = np.concatenate([x_buffer, x], axis=0)
h_buffer = np.concatenate([h_buffer, h], axis=0)

Initially, x_buffer and h_buffer are empty. However, when iterating over files, x_buffer and h_buffer may still contain data from the previous wav file. In that case, you concatenate data from two different audio files, which could affect training quality. Is this intentional?

2 - In the same script, at lines 472-474, your loss doesn't take the receptive field of the WaveNet into account. This may be because your shift size when creating batches is equal to batch_length and not batch_length + receptive_field, but I wanted to be sure about this choice for the loss calculation.

3 - Finally, an open question. Have you thought about training WaveNet on mel-spectrograms, as in the Tacotron 2 paper? Apparently, training on mel-spectrograms gives better audio quality during synthesis. This may require replacing your loss with a mixture of logistic distributions (MoL), as in the WaveNet 2 paper.
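
To make the buffer behaviour from question 1 concrete, here is a simplified sketch using plain lists. It is not the repository's train_generator(). The name batches_from_files is hypothetical, and the auxiliary-feature buffer is omitted.

```python
def batches_from_files(files, batch_length):
    """Yield fixed-length batches from a running buffer shared across files."""
    buffer = []
    for samples in files:
        # Whatever is left over from the previous file stays in the buffer.
        buffer = buffer + list(samples)
        while len(buffer) >= batch_length:
            yield buffer[:batch_length]
            buffer = buffer[batch_length:]


file_a = ["a"] * 5
file_b = ["b"] * 5
batches = list(batches_from_files([file_a, file_b], batch_length=4))
print(batches)
```

With these toy inputs the second batch is ['a', 'b', 'b', 'b'], so it starts with the leftover sample of the first file and continues with the second file, which is the cross-file concatenation the question asks about.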

I hope this post doesn't bother you!

Cheers,

Julian

Parallel training

Dear Tomoki,

Is it possible to run parallel training/conversion using more than one machine at the same time?

In our working environment, there are 4 machines, each with 2 GPUs, and Slurm is installed and working. However, it seems that only one machine is allocated for stages 4 and 5. For example:

sinfo

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
P1* up infinite 1 alloc gccn01
P1* up infinite 4 idle gccn[02-04],gchead

squeue

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
46 P1 tr.sh liao R 8:55 1 gccn01

Thanks for your help and have a nice day!

Best Regards,
Yuanfu

PS: Here are the environment settings:

#run.sh

n_gpus=2
n_quantize=256
n_aux=80
n_resch=512
n_skipch=256
dilation_depth=10
dilation_repeat=3
kernel_size=2
lr=1e-4
weight_decay=0.0
iters=200000
batch_length=20000
batch_size=8
checkpoints=1000
use_upsampling=true
use_noise_shaping=true
resume=

#cmd.sh

export train_cmd="slurm.pl --config conf/slurm.conf"
export cuda_cmd="slurm.pl --gpu 1 --config conf/slurm.conf"
#export max_jobs=-1

#slurm.conf

# COMPUTE NODES

GresTypes=gpu
NodeName=gchead Gres=gpu:0 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128815 State=UNKNOWN
NodeName=gccn0[1-4] Gres=gpu:2 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128815 State=UNKNOWN
PartitionName=P1 Nodes=gchead,gccn0[1-4] Default=YES MaxTime=INFINITE State=UP

#gres.conf

Name=gpu Type=tesla File=/dev/nvidia0 Cores=0,1
Name=gpu Type=tesla File=/dev/nvidia1 Cores=0,1

scontrol show node

NodeName=gccn01 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.08 Features=(null)
Gres=gpu:2
NodeAddr=gccn01 NodeHostName=gccn01 Version=15.08
OS=Linux RealMemory=128815 AllocMem=0 FreeMem=123504 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2019-03-05T17:01:46 SlurmdStartTime=2019-03-10T18:32:19
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gccn02 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.20 Features=(null)
Gres=gpu:2
NodeAddr=gccn02 NodeHostName=gccn02 Version=15.08
OS=Linux RealMemory=128815 AllocMem=0 FreeMem=1974 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2018-02-04T18:11:19 SlurmdStartTime=2019-03-10T18:32:24
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gccn03 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.15 Features=(null)
Gres=gpu:2
NodeAddr=gccn03 NodeHostName=gccn03 Version=15.08
OS=Linux RealMemory=128815 AllocMem=0 FreeMem=113919 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2019-03-05T17:24:43 SlurmdStartTime=2019-03-10T18:32:28
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gccn04 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.28 Features=(null)
Gres=gpu:2
NodeAddr=gccn04 NodeHostName=gccn04 Version=15.08
OS=Linux RealMemory=128815 AllocMem=0 FreeMem=126584 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2019-03-05T17:25:50 SlurmdStartTime=2019-03-10T18:32:32
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gchead Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.00 Features=(null)
Gres=(null)
NodeAddr=gchead NodeHostName=gchead Version=15.08
OS=Linux RealMemory=128815 AllocMem=0 FreeMem=114584 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2019-02-28T18:49:40 SlurmdStartTime=2019-03-10T18:32:47
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Potential bug in feature_extract.py

I think there is a bug in the method convert_continuos_f0(f0) in feature_extract.py (line 122 - 124):

if f0.all() == 0:
    print("WARNING: all of the f0 values are 0.")
    return uv, f0

If I understand correctly, you want to skip the conversion to continuous F0 when all the F0 values are equal to 0. However, f0.all() checks whether ALL the values in the array are True (i.e. nonzero). If even one value is 0, it returns False, and f0.all() == 0 then evaluates to True.

I don't know whether this is intended, but since most utterances contain at least one unvoiced frame, the function returns early in most cases and the F0 curve is not made continuous.
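The behaviour described above can be checked without the repository. The sketch below uses Python's built-in `all()` on a list, which follows the same truthiness rule as `ndarray.all()`. The helper names and the F0 values are made up for illustration. With a NumPy array, the intended condition could be written as `(f0 == 0).all()` or `not f0.any()`.

```python
def original_check(f0):
    """Mirrors `f0.all() == 0`: True as soon as ANY value is zero."""
    return all(f0) == 0


def intended_check(f0):
    """True only when EVERY value is zero (a fully unvoiced contour)."""
    return all(v == 0 for v in f0)


# Toy F0 contours in Hz; 0.0 marks an unvoiced frame.
partly_voiced = [0.0, 120.0, 125.0, 0.0]
fully_voiced = [118.0, 120.0, 125.0]
all_unvoiced = [0.0, 0.0, 0.0]

for name, f0 in [("partly voiced", partly_voiced),
                 ("fully voiced", fully_voiced),
                 ("all unvoiced", all_unvoiced)]:
    print(name, original_check(f0), intended_check(f0))
```

The two checks disagree only on the partly voiced contour, which is the common case for real speech.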

ModuleNotFoundError: No module named 'pysptk'

Hey,
I'm trying to run your project but ran into a problem when running ./run.sh for 'Build SD model' (the same problem occurs for 'Build SI-CLOSE model' and 'Build SI-OPEN model').
This is the message in my log file:

feature_extract.py --waveforms data/tr_slt/wav.scp --wavdir wav/tr_slt --hdf5dir hdf5/tr_slt --fs 16000 --shiftms 5 --minf0 120 --maxf0 275 --mcep_dim 24 --mcep_alpha 0.410 --highpass_cutoff 70 --fftl 1024 --n_jobs 10
Started at 22 10:05:24 IST 2018

Traceback (most recent call last):
  File "../../../src/bin/feature_extract.py", line 22, in <module>
    from sprocket.speech.feature_extractor import FeatureExtractor
  File "/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/sprocket_vc-0.18-py3.6.egg/sprocket/speech/__init__.py", line 1, in <module>
    from .feature_extractor import FeatureExtractor
  File "/PytorchWaveNetVocoder/tools/venv/lib/python3.6/site-packages/sprocket_vc-0.18-py3.6.egg/sprocket/speech/feature_extractor.py", line 3, in <module>
    import pysptk
ModuleNotFoundError: No module named 'pysptk'
Accounting: time=1 threads=1
Ended (code 1) at 22 10:05:25 IST 2018, elapsed time 1 seconds

Does somebody know how to solve this?

thanks in advance.
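The traceback shows that `sprocket` imports `pysptk`, which is not installed in the virtual environment under `tools/venv`. A minimal sketch for checking which modules are missing from the active interpreter follows. `missing_modules` is a hypothetical helper, and the module list is taken only from the traceback above. Anything it reports would then need to be installed into that same environment, for example with `pip install pysptk`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules named in the traceback; extend the list as needed.
required = ["sprocket", "pysptk"]
print("Missing:", missing_modules(required) or "none")
```

Run it with the same Python interpreter that run.sh uses, otherwise the result reflects a different environment.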

Is adaptation of wavenet vocoder possible?

Thank you for your work.
Is it possible to train the WaveNet vocoder on multi-speaker data and then adapt it to a target speaker with limited data? If yes, how can I do that?