
autovocoder's Introduction

Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

Unofficial PyTorch implementation of Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing. This repository is based on the iSTFTNet GitHub implementation (paper).

Disclaimer: This repo is built for testing purposes.

Training:

python train.py --config config.json

In train.py, change --input_wavs_dir to the directory of LJSpeech-1.1/wavs.
In config.json, set latent_dim to choose between AV128, AV192, and AV256 (default).
Following Section 3.3 of the paper, you can set dec_istft_input to cartesian (default), polar, or both; a sketch of editing these fields follows.
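
For example, a minimal sketch (not from the repo's scripts) of editing these two fields to train an AV128 model with a polar decoder input; the latent_dim and dec_istft_input keys match the config printed in the inference log further down, everything else is assumed:

import json

# Load the training config (path assumed to be the repo's config.json).
with open("config.json") as f:
    cfg = json.load(f)

# AV128 / AV192 / AV256 differ in the latent dimensionality; 256 is the default.
cfg["latent_dim"] = 128
# Decoder iSTFT input, per Section 3.3: "cartesian" (default), "polar", or "both".
cfg["dec_istft_input"] = "polar"

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=4)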

Note:

  • Validation loss of AV256 during training.

  • In our test, it converges almost 3× faster than HiFi-GAN V1 (referring to the official repo).

Citation:

@article{Webber2022AutovocoderFW,
  title={Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing},
  author={Jacob J. Webber and Cassia Valentini-Botinhao and Evelyn Williams and Gustav Eje Henter and Simon King},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.06989}
}


autovocoder's People

Contributors

  • hcy71o


autovocoder's Issues

The paper claims that it can be used in TTS, but lacks TTS experiments.

An acoustic model has difficulty reconstructing phase information, but the autovocoder's 256-dimensional representation contains phase information, which is why the 256 version outperforms HiFi-GAN V1.
If it is used in a TTS pipeline, that means the acoustic model has to learn to reconstruct the phase information.
I don't think a TTS pipeline using the autovocoder will outperform one using a conventional vocoder (e.g. HiFi-GAN V1), unless the authors provide TTS experimental data.

Questions about the sample rate and parameter size

Hi!

I want to reproduce this work at 16 kHz. I kept the frame length and frame shift similar to the 22.05 kHz setup: the frame shift is 11.6 ms (256 samples) at 22.05 kHz and 10 ms (160 samples) at 16 kHz, and the frame length is four times the frame shift. The FFT size is also 1024.
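
(For concreteness, a sketch of the 16 kHz analysis settings described above, expressed with the key names from the config printed in the inference log further down; the 16 kHz values are the asker's numbers, and applying them this way is an assumption, not something from the repo.)

# Sketch (assumption): 16 kHz STFT settings implied by the question above.
stft_16k = {
    "sampling_rate": 16000,
    "hop_size": 160,   # 10 ms frame shift at 16 kHz
    "win_size": 640,   # frame length = 4 x frame shift
    "n_fft": 1024,     # FFT size kept at 1024
}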

Currently, my model has been trained for 200k steps with the 256-dimensional representation, but there are still noticeable phase artifacts. I would like to know whether this model is sensitive to the sampling rate or whether my current number of training steps is insufficient.

The paper does not seem to provide information on how many steps the model was trained for, and I'm curious about how many steps are generally sufficient for acceptable results (do I really need to run all 3100 epochs?).

I would appreciate it if you could share your pretrained model weights! Thanks a lot!

Also, I noticed that the trained models are very small: the encoder checkpoint is only 543 kB, and the generator checkpoint is only 545 kB. Is this normal? It's amazing that such a small number of parameters can achieve this task!

Also, my mel loss on the validation set at 200k steps is about 0.3, which is much higher than the curve you showed. I trained with the 'both' mode; could this also be a sampling-rate issue?

ResBlock

Hi @hcy71o, I know the paper says:

Each such block consists of two 2D convolutional layers of width 3, followed by a 2D batch norm and a ReLU nonlinearity. This basic block also applies a residual, by summing the input with the output, but only if the number of input channels is the same as the number of output channels.

I want to know why the residual connection in the ResBlock is applied at every layer, even after BatchNorm2d and ReLU. As far as I know, we usually add the residual connection after the whole block. For example:

def forward(self, x):
    res = x                        # keep the input for the residual connection
    for c in self.convs:           # run the whole stack (conv -> norm -> activation) first
        x = c(x)
    if self.out_ch == self.in_ch:  # add the residual once, after the block,
        x = res + x                #   and only when channel counts match
    return x
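
For comparison, here is a minimal self-contained sketch (my own, with hypothetical naming, not the repo's code) of the block as the paper quote describes it, with a single residual connection around the whole conv–norm–ReLU stack:

import torch.nn as nn

class ResBlock2D(nn.Module):
    """Hypothetical sketch of the block described in the paper quote:
    two 3x3 2D convolutions, followed by BatchNorm2d and ReLU, with one
    residual sum around the whole block when channel counts match."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x):
        y = self.body(x)
        if self.in_ch == self.out_ch:
            y = y + x  # residual applied once, after the whole block
        return y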

An error during inference: AttributeError: 'Generator' object has no attribute 'remove_weight_norm'

I am getting an error during inference:

Initializing Inference Process..
{'num_gpus': 0, 'batch_size': 32, 'learning_rate': 0.0002, 'adam_b1': 0.8, 'adam_b2': 0.99, 'lr_decay': 0.999, 'seed': 1234, 'n_blocks': 11, 'latent_dim': 256, 'latent_dropout': 0.1, 'dec_istft_input': 'cartesian', 'segment_size': 8192, 'num_mels': 80, 'n_fft': 1024, 'hop_size': 256, 'win_size': 1024, 'sampling_rate': 22050, 'fmin': 0, 'fmax': 8000, 'fmax_for_loss': None, 'num_workers': 4, 'dist_config': {'dist_backend': 'nccl', 'dist_url': 'tcp://localhost:54321', 'world_size': 1}}
Loading 'cp_autovocoder/g_00010000'
Complete.
Traceback (most recent call last):
  File "inference_file.py", line 88, in <module>
    main()
  File "inference_file.py", line 84, in main
    inference(a)
  File "inference_file.py", line 44, in inference
    generator.remove_weight_norm()
  File "/home/yehor/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Generator' object has no attribute 'remove_weight_norm'
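
A minimal workaround sketch (my assumption, not an official fix): this repo's Generator does not seem to define remove_weight_norm(), so the call in inference_file.py can be guarded or dropped, for example:

# Guard the failing call in inference_file.py (assumption: the Generator here
# does not use weight norm, so the call can safely be skipped).
if hasattr(generator, "remove_weight_norm"):
    generator.remove_weight_norm()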
