
neural-hmm's Introduction

Neural HMMs are all you need (for high-quality attention-free TTS)


This is the official code repository for the paper "Neural HMMs are all you need (for high-quality attention-free TTS)". For audio examples, visit our demo page. A pre-trained model (female) and a pre-trained model (male) are also available.

Synthesising from Neural-HMM

Setup and training using LJ Speech

  1. Download and extract the LJ Speech dataset. Place it in the data folder such that the directory becomes data/LJSpeech-1.1. Otherwise update the filelists in data/filelists accordingly.
  2. Clone this repository: git clone https://github.com/shivammehta25/Neural-HMM.git
    • If you are using a single GPU, check out the gradient_checkpointing branch; it helps fit a bigger batch size during training.
    • To do so, use git clone --single-branch -b gradient_checkpointing https://github.com/shivammehta25/Neural-HMM.git
  3. Initialise the submodules: git submodule init; git submodule update
  4. Make sure you have Docker installed and running.
    • Using Docker is recommended (it manages the CUDA runtime libraries and the Python dependencies specified in the Dockerfile).
    • Alternatively, if you do not intend to use Docker, install the dependencies with pip install -r requirements.txt
  5. Run bash start.sh; it will install all the dependencies and run the container.
  6. Check src/hparams.py for hyperparameters and set GPUs.
    1. For multi-GPU training, set GPUs to [0, 1, ...]
    2. For CPU training (not recommended), set GPUs to an empty list []
    3. Check the location of transcriptions
  7. Once your filelists and hparams are updated, run python generate_data_properties.py to generate data_parameters.pt for your dataset (a default data_parameters.pt for LJ Speech is available in the repository).
  8. Run python train.py to train the model.
    1. Checkpoints will be saved in the hparams.checkpoint_dir.
    2. Tensorboard logs will be saved in the hparams.tensorboard_log_dir.
  9. To resume training, run python train.py -c <CHECKPOINT_PATH>
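
Before generating data properties, it can help to confirm that every audio path in a filelist resolves on disk. The sketch below assumes the Tacotron 2 style `path|transcript` line format and uses a hypothetical filelist name; neither is confirmed here, so adjust both to match the files in data/filelists.

```python
from pathlib import Path


def missing_audio_files(filelist_path, root="."):
    """Return audio paths listed in a filelist that do not exist on disk.

    Assumes each non-empty line looks like `<audio path>|<transcript>`.
    """
    missing = []
    with open(filelist_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            audio_path = line.split("|", 1)[0]
            if not (Path(root) / audio_path).is_file():
                missing.append(audio_path)
    return missing


if __name__ == "__main__":
    # Hypothetical filelist name; use the actual file from data/filelists.
    filelist = Path("data/filelists/ljs_audio_text_train_filelist.txt")
    if filelist.is_file():
        absent = missing_audio_files(filelist)
        print(f"{len(absent)} listed audio files are missing")
    else:
        print(f"{filelist} not found; point this at your own filelist")
```

An empty result means every listed file was found relative to `root`.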

Synthesis

  1. Download our pre-trained LJ Speech model. (This is the exact same model as system NH2 in the paper, but with training continued until reaching 200k updates total.)
  2. Download a pre-trained HiFi-GAN model.
    • We recommend the version fine-tuned on Tacotron 2 if you cannot fine-tune on Neural-HMM.
  3. Run jupyter notebook and open synthesis.ipynb.

Miscellaneous

Mixed-precision training or full-precision training

  • In src/hparams.py, change hparams.precision to 16 for mixed precision or 32 for full precision.

Multi-GPU training or single-GPU training

  • Since the code uses PyTorch Lightning, providing more than one element in the list of GPUs enables multi-GPU training. Change hparams.gpus to, for example, [0, 1, 2] for multi-GPU training, or to a single element, [0], for single-GPU training.
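
A minimal sketch of how the `gpus` and `precision` values described in this README could be sanity-checked before training. The `describe_run` helper is hypothetical and not part of the repository; it only mirrors the rules stated here (an empty list for CPU, one index for single-GPU, several for multi-GPU; 16 for mixed and 32 for full precision).

```python
def describe_run(gpus, precision):
    """Summarise the training mode implied by hparams.gpus and hparams.precision.

    Hypothetical helper, for illustration only.
    """
    if precision not in (16, 32):
        raise ValueError("precision should be 16 (mixed) or 32 (full)")
    if len(set(gpus)) != len(gpus):
        raise ValueError("GPU indices should not repeat")

    if not gpus:
        device = "CPU"
    elif len(gpus) == 1:
        device = f"single GPU {gpus[0]}"
    else:
        device = "multi-GPU " + ", ".join(str(g) for g in gpus)

    mode = "mixed" if precision == 16 else "full"
    return f"{device}, {mode} precision"


print(describe_run([0], 16))
print(describe_run([0, 1, 2], 32))
```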

Known issues/warnings

PyTorch dataloader

  • If you encounter the message [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool), this is a known issue in the PyTorch DataLoader.
  • It will be fixed when PyTorch releases a new Docker container image with an updated version of Torch. If you are not using Docker, the warning can be removed by installing torch > 1.9.1.
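
To check whether an installed build is newer than 1.9.1, the version string can be compared numerically. This sketch parses the string by hand so that it also copes with pre-release or local suffixes such as `1.11.0a0+b6df043`; `release_tuple` and `is_newer_than_1_9_1` are hypothetical helpers, and `torch.__version__` is read only if PyTorch is importable.

```python
import re


def release_tuple(version):
    """Extract the numeric (major, minor, patch) part of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_newer_than_1_9_1(version):
    """Return True if the numeric release is strictly greater than 1.9.1."""
    return release_tuple(version) > (1, 9, 1)


try:
    import torch  # optional: only needed to inspect the local install

    print(torch.__version__, is_newer_than_1_9_1(torch.__version__))
except ImportError:
    print("PyTorch is not installed in this environment")
```

Comparing integer tuples avoids the pitfall of plain string comparison, where "1.10.0" would sort before "1.9.1".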

Torchmetric error on RTX 3090

  • If you encounter this error message: ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
  • Update the requirements.txt file with these requirements:
torch==1.11.0a0+b6df043
--extra-index-url https://download.pytorch.org/whl/cu113
torchmetrics==0.6.0

Support

If you have any questions or comments, please open an issue on our GitHub repository.

Citation information

If you use or build on our method or code for your research, please cite our paper:

@inproceedings{mehta2022neural,
  title={Neural {HMM}s are all you need (for high-quality attention-free {TTS})},
  author={Mehta, Shivam and Sz{\'e}kely, {\'E}va and Beskow, Jonas and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2022}
}

Acknowledgements

The code implementation is based on Nvidia's implementation of Tacotron 2 and uses PyTorch Lightning for boilerplate-free code.

neural-hmm's People

Contributors

birgermoell, deepsourcebot, jimregan, pre-commit-ci[bot], shivammehta25

neural-hmm's Issues

The speed of Neural-HMM TTS

Thanks for the good work, but there is no description of this TTS model's inference speed. Is there any data on its inference RTF? Is it faster than non-autoregressive TTS models like FastSpeech?
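
For reference, the real-time factor (RTF) is the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below shows one way to measure it. `synthesise` is a stand-in callable and `dummy_synthesise` is a placeholder, not the model's API; 22050 Hz is the LJ Speech sample rate. Replace both with the actual synthesis call and the sample rate of your data. No measured RTF for Neural-HMM is implied.

```python
import time


def real_time_factor(synthesise, text, sample_rate=22050):
    """Return (RTF, audio seconds) for one call to `synthesise`.

    `synthesise` is any callable that maps text to a sequence of audio samples.
    """
    start = time.perf_counter()
    samples = synthesise(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds, audio_seconds


def dummy_synthesise(text):
    # Stand-in for a real TTS call: returns one second of silence.
    return [0.0] * 22050


rtf, seconds = real_time_factor(dummy_synthesise, "Hello world.")
print(f"RTF = {rtf:.4f} for {seconds:.2f} s of audio")
```

For a fair comparison between systems, the timing should cover the same stages (acoustic model and vocoder) on the same hardware.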

Making a SAPI5 implementation to use with Windows programs

Hello. I'm not a dev, but I suggest adding a Windows version that is compatible with SAPI5 (Speech Application Programming Interface), for use with different types of programs such as text readers, screen readers for blind people, etc. The voice has to be optimized for responsiveness, if you can do that. If that can't be achieved with this neural network approach, would making a high-quality HTS TTS clone out of it do the job correctly? I don't know; we could discuss this. Thanks, and sorry, I'm kind of in a hurry, which is why the post is brief. Thanks again.

Style / Prosody guidance?

This is really impressive work!

Do you have any ideas or code changes to guide the generated speech style? For example, having the appropriate emotion if we are reading a news story about a tragedy.

I understand there are a few Tacotron projects that achieve this, but their methods often lead to degraded voice quality (in my opinion).

One crazy idea that is easy to try, but probably won't work, is to train on a new dataset and embed the emotion into the generated sequence encoding.

Use as an aligner

Great job!
I wonder if it can be used as an aligner. If so, would its performance be better than that of HMM-GMM?

val_dataloader error message

Hi, I ran train.py and it shows the following error message. May I have your help? Thanks:

pytorch_lightning.utilities.exceptions.MisconfigurationException: val_dataloader must be implemented to be used with the Lightning Trainer

How to train a new model with a different language?

I would like to know if it is possible to train a Neural-HMM for another language.
What is needed for this? Is a cmudict in the new language required, or can it be bypassed?
Is there any tutorial for doing so?

Male voice

Hey,
Thanks for sharing the code. Is there any example of male voice synthesis? I want to train on a custom male voice dataset.

oom

GPU memory: 24 GB

File "src/model/HMMComponents/EmissionModel.py", line 45, in forward
out = emission_dists.log_prob(x_t)
File "/lib/python3.7/site-packages/torch/distributions/normal.py", line 77, in log_prob
return -((value - self.loc) ** 2) / (2 * var) - log_scale - math.log(math.sqrt(2 * math.pi))

Variance floored

When I train (in my language, Czech), "variance floored" is sometimes displayed, but training usually continues. Is this a mistake, and how do I fix it? (My batch size is only 1 on a GTX 1080 with 8 GB, so it cannot be reduced any further.) Could you describe in the hparams what each line means (at least the most important ones)?

Finetuning from an existing model for a small dataset?

Hello,
Is it possible to finetune an existing model for a small dataset?

I have a small dataset of ~5 hours, ~2900 samples.
I suppose it's not enough for training from scratch.
So can I finetune a model with it?
If yes, how could I do so?

Thanks!

Version Issues

Can you provide some information about the python version you have been using?
librosa cannot be installed on my system

RuntimeError: shape '[1, 1, 65864]' is invalid for input of size 131728

Hi, I got the following error when running on CUDA 11.2. Do you have any hints on how to solve it? Thanks.

Traceback (most recent call last):
File "generate_data_properties.py", line 179, in <module>
main(args)
File "generate_data_properties.py", line 147, in main
data_mean, data_std, go_token_init_value, init_transition_prob = get_data_parameters_for_flat_start(
File "generate_data_properties.py", line 78, in get_data_parameters_for_flat_start
for i, batch in enumerate(tqdm(train_loader)):
File "/usr/local/lib/python3.8/dist-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/Neural-HMM/src/utilities/data.py", line 166, in __getitem__
return self.get_mel_text_pair(self.audiopaths_and_text[index])
File "/root/Neural-HMM/src/utilities/data.py", line 126, in get_mel_text_pair
mel = self.get_mel(audiopath)
File "/root/Neural-HMM/src/utilities/data.py", line 148, in get_mel
melspec = self.stft.mel_spectrogram(audio_norm)
File "/root/Neural-HMM/src/model/layers.py", line 122, in mel_spectrogram
magnitudes, phases = self.stft_fn.transform(y)
File "/root/Neural-HMM/src/utilities/stft.py", line 112, in transform
input_data = input_data.view(num_batches, 1, num_samples)
RuntimeError: shape '[1, 1, 65864]' is invalid for input of size 131728
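
One thing worth checking for this kind of error: 131728 is exactly 2 × 65864, which is what a two-channel (stereo) file would produce if the loader expects mono audio. This is only a guess from the numbers, not a confirmed diagnosis. The sketch below uses Python's standard `wave` module to list the files in a directory that are not mono; the helper names and the `data/wavs` path are placeholders.

```python
import wave
from pathlib import Path


def channel_count(wav_path):
    """Return the number of channels in a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnchannels()


def non_mono_files(directory):
    """Return (path, channels) for every .wav under `directory` that is not mono."""
    result = []
    for wav_path in sorted(Path(directory).rglob("*.wav")):
        channels = channel_count(wav_path)
        if channels != 1:
            result.append((wav_path, channels))
    return result


if __name__ == "__main__":
    # Placeholder directory; point this at your own audio folder.
    audio_dir = Path("data/wavs")
    if audio_dir.is_dir():
        for path, channels in non_mono_files(audio_dir):
            print(f"{path}: {channels} channels")
    else:
        print(f"{audio_dir} not found; point this at your own audio folder")
```

If any files are reported, converting them to mono before running generate_data_properties.py would be a reasonable next step to try.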

Segmentation Fault

I am able to build the Docker environment successfully, but when I try to run
python train.py
I get the following error:
Segmentation fault (core dumped)

Gibberish output

Hello,
I'm training a model on a dataset of ~5 hrs, ~2800 samples.
It is now at 22k steps, and when I use synthesis.ipynb the model generates gibberish output.

[Attachment: output.mp4]

The text is "The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent."

The training prints "Variance Floored" many times during the process.
And in TensorBoard, some charts look strange, with some NaNs in them.
For example:
[Image: TensorBoard screenshot, 20220718172543]

What am I missing?
Thanks!
