
autovocoder's Introduction

Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

Unofficial PyTorch implementation of Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing. This repository is based on the iSTFTNet GitHub implementation (paper).

Disclaimer: This repo is built for testing purposes.

Training:

python train.py --config config.json

In train.py, change --input_wavs_dir to the directory of LJSpeech-1.1/wavs.
In config.json, set latent_dim to choose between AV128, AV192, and AV256 (default).
Following Section 3.3 of the paper, you can set dec_istft_input to cartesian (default), polar, or both; a sketch of editing these fields follows.
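
For example, a minimal sketch (not from the repo's scripts) of editing these two fields to train an AV128 model with a polar decoder input; the latent_dim and dec_istft_input keys match the config printed in the inference log further down, everything else is assumed:

import json

# Load the training config (path assumed to be the repo's config.json).
with open("config.json") as f:
    cfg = json.load(f)

# AV128 / AV192 / AV256 differ in the latent dimensionality; 256 is the default.
cfg["latent_dim"] = 128
# Decoder iSTFT input, per Section 3.3: "cartesian" (default), "polar", or "both".
cfg["dec_istft_input"] = "polar"

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=4)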

Note:

  • Validation loss of AV256 during training.

  • In our test, it converges almost 3× faster than HiFi-GAN V1 (referring to the official repo).

Citation:

@article{Webber2022AutovocoderFW,
  title={Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing},
  author={Jacob J. Webber and Cassia Valentini-Botinhao and Evelyn Williams and Gustav Eje Henter and Simon King},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.06989}
}


autovocoder's People

Contributors

  • hcy71o


autovocoder's Issues

The paper claims that it can be used in TTS, but lacks TTS experiments.

An acoustic model has difficulty reconstructing phase information, but the autovocoder's 256-dimensional representation contains phase information, which is why the 256 version outperforms HiFi-GAN V1.
If it is used in a TTS pipeline, that means the acoustic model has to learn to reconstruct the phase information.
I don't think a TTS pipeline using the autovocoder will outperform one using a conventional vocoder (e.g. HiFi-GAN V1), unless the authors provide TTS experimental data.

Questions about the sample rate and parameter size

Hi!

I want to reproduce this work at 16 kHz. I kept the frame length and frame shift similar to the 22.05 kHz setup: the frame shift is 11.6 ms (256 samples) at 22.05 kHz and 10 ms (160 samples) at 16 kHz, and the frame length is four times the frame shift. The FFT size is also 1024.
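
(For concreteness, a sketch of the 16 kHz analysis settings described above, expressed with the key names from the config printed in the inference log further down; the 16 kHz values are the asker's numbers, and applying them this way is an assumption, not something from the repo.)

# Sketch (assumption): 16 kHz STFT settings implied by the question above.
stft_16k = {
    "sampling_rate": 16000,
    "hop_size": 160,   # 10 ms frame shift at 16 kHz
    "win_size": 640,   # frame length = 4 x frame shift
    "n_fft": 1024,     # FFT size kept at 1024
}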

Currently, my model has been trained for 200k steps with the 256-dimensional representation, but there are still noticeable phase artifacts. I would like to know whether this model is sensitive to the sampling rate or whether my current number of training steps is insufficient.

The paper does not seem to provide information on how many steps the model was trained for, and I'm curious about how many steps are generally sufficient for acceptable results (do I really need to run all 3100 epochs?).

I would appreciate it if you could share your pretrained model weights! Thanks a lot!

Also, I noticed that the trained models are very small: the encoder checkpoint is only 543 kB, and the generator checkpoint is only 545 kB. Is this normal? It's amazing that such a small number of parameters can achieve this task!

Also, my mel loss on the validation set at 200k steps is about 0.3, which is much higher than the curve you showed. I trained with the 'both' mode; could this also be a sampling-rate issue?

ResBlock

Hi @hcy71o, I know the paper says:

Each such block consists of two 2D convolutional layers of width 3, followed by a 2D batch norm and a ReLU nonlinearity. This basic block also applies a residual, by summing the input with the output, but only if the number of input channels is the same as the number of output channels.

I want to know why the residual connection in the ResBlock is applied at every layer, even after BatchNorm2d and ReLU. As far as I know, we usually add the residual connection after the whole block. For example:

def forward(self, x):
    res = x                        # keep the input for the residual connection
    for c in self.convs:           # run the whole stack (conv -> norm -> activation) first
        x = c(x)
    if self.out_ch == self.in_ch:  # add the residual once, after the block,
        x = res + x                #   and only when channel counts match
    return x
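
For comparison, here is a minimal self-contained sketch (my own, with hypothetical naming, not the repo's code) of the block as the paper quote describes it, with a single residual connection around the whole conv–norm–ReLU stack:

import torch.nn as nn

class ResBlock2D(nn.Module):
    """Hypothetical sketch of the block described in the paper quote:
    two 3x3 2D convolutions, followed by BatchNorm2d and ReLU, with one
    residual sum around the whole block when channel counts match."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x):
        y = self.body(x)
        if self.in_ch == self.out_ch:
            y = y + x  # residual applied once, after the whole block
        return y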

An error during inference: AttributeError: 'Generator' object has no attribute 'remove_weight_norm'

I am getting an error during inference:

Initializing Inference Process..
{'num_gpus': 0, 'batch_size': 32, 'learning_rate': 0.0002, 'adam_b1': 0.8, 'adam_b2': 0.99, 'lr_decay': 0.999, 'seed': 1234, 'n_blocks': 11, 'latent_dim': 256, 'latent_dropout': 0.1, 'dec_istft_input': 'cartesian', 'segment_size': 8192, 'num_mels': 80, 'n_fft': 1024, 'hop_size': 256, 'win_size': 1024, 'sampling_rate': 22050, 'fmin': 0, 'fmax': 8000, 'fmax_for_loss': None, 'num_workers': 4, 'dist_config': {'dist_backend': 'nccl', 'dist_url': 'tcp://localhost:54321', 'world_size': 1}}
Loading 'cp_autovocoder/g_00010000'
Complete.
Traceback (most recent call last):
  File "inference_file.py", line 88, in <module>
    main()
  File "inference_file.py", line 84, in main
    inference(a)
  File "inference_file.py", line 44, in inference
    generator.remove_weight_norm()
  File "/home/yehor/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Generator' object has no attribute 'remove_weight_norm'
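
A minimal workaround sketch (my assumption, not an official fix): this repo's Generator does not seem to define remove_weight_norm(), so the call in inference_file.py can be guarded or dropped, for example:

# Guard the failing call in inference_file.py (assumption: the Generator here
# does not use weight norm, so the call can safely be skipped).
if hasattr(generator, "remove_weight_norm"):
    generator.remove_weight_norm()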
