forwardtacotron's People

Contributors

alexteua, boltomli, cfrancesco, cschaefer26, datitran, eugene-ingerman, fatchord, mazzzystar, oliveiracwb, sih4sing5hong5, thebutlah

forwardtacotron's Issues

For anyone trying to preprocess on Windows and running into multiprocessing issues

Hi,
I am running this on Windows, which does not have fork(). As a result we are supposed to add the if __name__ == '__main__': guard, but that did not work for me, so I hacked together a solution that just preprocesses on a single core.

import glob
from random import Random

from utils.display import *
from utils.dsp import *
from utils import hparams as hp
from multiprocessing import cpu_count
from utils.paths import Paths
import pickle
import argparse

from utils.text import clean_text
from utils.text.recipes import ljspeech
from utils.files import get_files, pickle_binary
from pathlib import Path


# Helper functions for argument types
def valid_n_workers(num):
    n = int(num)
    if n < 1:
        raise argparse.ArgumentTypeError('%r must be an integer greater than 0' % num)
    return n

parser = argparse.ArgumentParser(description='Preprocessing for WaveRNN and Tacotron')
parser.add_argument('--path', '-p', help='directly point to dataset path (overrides hparams.wav_path)')
parser.add_argument('--extension', '-e', metavar='EXT', default='.wav', help='file extension to search for in dataset folder')
parser.add_argument('--num_workers', '-w', metavar='N', type=valid_n_workers, default=cpu_count()-1, help='The number of worker threads to use for preprocessing')
parser.add_argument('--hp_file', metavar='FILE', default='hparams.py', help='The file to use for the hyperparameters')
args = parser.parse_args()

hp.configure(args.hp_file)  # Load hparams from file
if args.path is None:
    args.path = hp.wav_path

extension = args.extension
path = args.path

def convert_file(path: Path):
    y = load_wav(path)
    peak = np.abs(y).max()
    if hp.peak_norm or peak > 1.0:
        y /= peak
    mel = melspectrogram(y)
    if hp.voc_mode == 'RAW':
        quant = encode_mu_law(y, mu=2**hp.bits) if hp.mu_law else float_2_label(y, bits=hp.bits)
    elif hp.voc_mode == 'MOL':
        quant = float_2_label(y, bits=16)

    return mel.astype(np.float32), quant.astype(np.int64)

def process_wav(path: Path):
    # Runs in the main process, so no fork()/spawn is needed on Windows
    wav_id = path.stem
    m, x = convert_file(path)
    np.save(paths.mel/f'{wav_id}.npy', m, allow_pickle=False)
    np.save(paths.quant/f'{wav_id}.npy', x, allow_pickle=False)
    text = text_dict[wav_id]
    text = clean_text(text)
    return wav_id, m.shape[-1], text

if __name__ == '__main__':        
    wav_files = get_files(path, extension)
    paths = Paths(hp.data_path, hp.voc_model_id, hp.tts_model_id)

    print(f'\n{len(wav_files)} {extension[1:]} files found in "{path}"\n')

    if len(wav_files) == 0:

        print('Please point wav_path in hparams.py to your dataset,')
        print('or use the --path option.\n')

    else:
        text_dict = ljspeech(path)

        n_workers = max(1, args.num_workers)

        simple_table([
            ('Sample Rate', hp.sample_rate),
            ('Bit Depth', hp.bits),
            ('Mu Law', hp.mu_law),
            ('Hop Length', hp.hop_length),
            ('CPU Usage', f'{n_workers}/{cpu_count()}'),
            ('Num Validation', hp.n_val)
        ])
        dataset = []
        cleaned_texts = []

        for i, fn in enumerate(wav_files, 1):
            item_id, length, cleaned_text = process_wav(fn)
            if item_id in text_dict:
                dataset += [(item_id, length)]
                cleaned_texts += [(item_id, cleaned_text)]
            bar = progbar(i, len(wav_files))
            message = f'{bar} {i}/{len(wav_files)} '
            stream(message)

        random = Random(hp.seed)
        random.shuffle(dataset)
        train_dataset = dataset[hp.n_val:]
        val_dataset = dataset[:hp.n_val]
        # sort val dataset longest to shortest
        val_dataset.sort(key=lambda d: -d[1])

        for id, text in cleaned_texts:
            text_dict[id] = text

        pickle_binary(text_dict, paths.data/'text_dict.pkl')
        pickle_binary(train_dataset, paths.data/'train_dataset.pkl')
        pickle_binary(val_dataset, paths.data/'val_dataset.pkl')

        print('\n\nCompleted. Ready to run "python train_tacotron.py" or "python train_wavernn.py". \n')

Hopefully this can help anyone in the same situation. Maybe a flag could be added to preprocess.py to run in a single-core compatibility mode for Windows users. I don't care if preprocessing is not multiprocessed; it is a very short step compared to training.
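For reference, here is a rough sketch of what such a flag could look like against the script above (the --single_core flag is my own suggestion, not existing code; it assumes the process_wav, wav_files, text_dict, progbar and stream names defined earlier):

parser.add_argument('--single_core', action='store_true',
                    help='process files in the main process (workaround for Windows)')

# ... later, where the worker pool is normally created:
if args.single_core:
    # plain generator in the main process, no fork()/spawn required
    results = map(process_wav, wav_files)
else:
    pool = Pool(processes=n_workers)
    results = pool.imap_unordered(process_wav, wav_files)

for i, (item_id, length, cleaned_text) in enumerate(results, 1):
    if item_id in text_dict:
        dataset += [(item_id, length)]
        cleaned_texts += [(item_id, cleaned_text)]
    stream(f'{progbar(i, len(wav_files))} {i}/{len(wav_files)} ')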

Transfer learning and tips for internationalization

First of all, thanks for this repo.
Can you talk about transfer learning and/or share tips on using it with other languages?
I am training Portuguese with an optimized version of https://github.com/Edresson/TTS-Portuguese-Corpus and I am happy with the results.
I realized that I will have to fix my "transliteration_cleaners" (for non-English text that can be transliterated to ASCII), which are not perfect, although they are clearly documented.

forward_100K model

Hi. Do you have a commit with a pretrained forward_100K model that was trained without phonemes?

Experiments / Discussion

Hi,

Great work!
I saw your autoregression branch and wanted to ask if it worked out.
I always wondered how big the effect of the autoregression really is (apart from the formal aspect that the model then is an autoregressive, generative model P(x_i | x_{<i})), considering there are RNNs in the network anyway.

Also, wanted to point you to this paper in case you don't know it yet: https://tencent-ailab.github.io/durian/

They use, similarly to older models like https://github.com/CSTR-Edinburgh/merlin, an additional value in the expanded vectors to indicate the position within the current input symbol. I wonder if that would help a bit with prosody.
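For what it's worth, here is my own rough sketch of such a positional feature when expanding encoder outputs by durations (an illustration only, not code from either repo):

import torch

def expand_with_position(enc_out: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # enc_out: [N, D], durations: [N] integer frame counts -> [sum(durations), D + 1]
    # Each encoder step is repeated by its duration, and a 0..1 feature is appended
    # that marks where each frame sits inside its input symbol.
    frames = []
    for vec, d in zip(enc_out, durations.tolist()):
        if d == 0:
            continue
        pos = torch.arange(d, dtype=enc_out.dtype) / max(d - 1, 1)
        frames.append(torch.cat([vec.expand(d, -1), pos.unsqueeze(1)], dim=-1))
    return torch.cat(frames, dim=0)

enc = torch.randn(5, 8)               # 5 input symbols, 8-dim encoder outputs
dur = torch.tensor([3, 1, 0, 4, 2])   # durations in frames
expanded = expand_with_position(enc, dur)   # -> [10, 9]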

Alignments during Inference

Is there a way I can extract the alignments during inference? I have a ForwardTacotron model and a MelGAN model.

Very fast convergence (and overfitting?), and useless PostNet

Hi again,

I would like to ask about your experience with this model. In my case, I found the model converges very fast, even within 10K training steps (batch size 64). I tried significantly reducing the model size (from 20M to 10M params), but it made no difference. I am training on a subset of the VCTK multi-speaker corpus. After that point, the validation loss keeps increasing (even though I see no degradation in inference quality).

(TensorBoard screenshot)

Another thing I noticed is that the PostNet seems to be doing nothing relevant (b/a means before and after the PostNet). Is that also your case?

Thanks for your advice.

Advice on prepping datasets other than LJspeech?

Hi, I'm trying to prep my own dataset to train on the ForwardTacotron model. Could you give any insight into what train_tacotron.py or train_forward.py expects in terms of training data organization? For example, the old NVIDIA TT2 repo expects two text files formatted in a certain way and a path to the WAV files in the arguments. Is there something similar for this repo?
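For what it's worth, my reading of preprocess.py is that it expects an LJSpeech-style layout: a dataset folder containing a pipe-separated metadata.csv plus the wav files (found recursively by extension). Treat this as an assumption rather than documentation; the names below are just an illustration:

dataset_folder/
    metadata.csv        (lines like: my_utt_001|This is the transcript for my_utt_001.wav)
    wavs/
        my_utt_001.wav
        my_utt_002.wav

The preprocess script then writes the mels, quantized audio and the text_dict pickle that the two training scripts consume.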

Problem training for new language.

I am trying to train my model on a Marathi dataset. It is strange that espeak doesn't seem to support it, although it is mentioned on their Supported Languages page.

In [3]: ph = to_phonemes("प्रदर्शनों के दौरान पुलिस की हिंसा और चुनावों में कथित धोखाधड़ी के ख़िलाफ़ बेलारूस में लोगों का गुस्सा बढ़ता ही जा रहा है")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-944424e45249> in <module>
----> 1 ph = to_phonemes("प्रदर्शनों के दौरान पुलिस की हिंसा और चुनावों में कथित धोखाधड़ी के ख़िलाफ़ बेलारूस में लोगों का गुस्सा बढ़ता ही जा रहा है")

<ipython-input-2-a226f9833051> in to_phonemes(text)
      9                          njobs=1,
     10                          punctuation_marks=';:,.!?¡¿—…"«»“”()',
---> 11                          language_switch='remove-flags')
     12     phonemes = phonemes.replace('—', '-')
     13     return phonemes

~/.virtualenvs/forwardtacoenv/lib/python3.6/site-packages/phonemizer/phonemize.py in phonemize(text, language, backend, separator, strip, preserve_punctuation, punctuation_marks, with_stress, language_switch, njobs, logger)
    158             with_stress=with_stress,
    159             language_switch=language_switch,
--> 160             logger=logger)
    161     elif backend == 'espeak-mbrola':
    162         phonemizer = backends[backend](

~/.virtualenvs/forwardtacoenv/lib/python3.6/site-packages/phonemizer/backend/espeak.py in __init__(self, language, punctuation_marks, preserve_punctuation, language_switch, with_stress, logger)
    145         super().__init__(
    146             language, punctuation_marks=punctuation_marks,
--> 147             preserve_punctuation=preserve_punctuation, logger=logger)
    148         self.logger.debug('espeak is %s', self.espeak_path())
    149 

~/.virtualenvs/forwardtacoenv/lib/python3.6/site-packages/phonemizer/backend/base.py in __init__(self, language, punctuation_marks, preserve_punctuation, logger)
     52             raise RuntimeError(
     53                 'language "{}" is not supported by the {} backend'
---> 54                 .format(language, self.name()))
     55         self.language = language
     56 

RuntimeError: language "mr" is not supported by the espeak backend

Now my only solution is to run it directly on graphemes. There is very little difference between graphemes and phonemes for Indic scripts, so I made the following changes.

cleaners.py

def to_phonemes(text):
    # text = text.replace('-', '—')
    # phonemes = phonemize(text,
    #                      language=hp.language,
    #                      backend='espeak',
    #                      strip=True,
    #                      preserve_punctuation=True,
    #                      with_stress=False,
    #                      njobs=1,
    #                      punctuation_marks=';:,.!?¡¿—…"«»“”()',
    #                      language_switch='remove-flags')
    # phonemes = phonemes.replace('—', '-')
    phonemes = text
    return phonemes

symbols.py

""" from https://github.com/keithito/tacotron """

"""
Defines the set of symbols used in text input to the model.

The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. """
from utils.text import cmudict

_pad = "_"
_punctuation = "!'(),.:;? "
_special = "-"

# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ["@" + s for s in cmudict.valid_symbols]

# Phonemes
# _vowels = 'iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ'
# _non_pulmonic_consonants = 'ʘɓǀɗǃʄǂɠǁʛ'
# _pulmonic_consonants = 'pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ'
# _suprasegmentals = 'ˈˌːˑ'
# _other_symbols = 'ʍwɥʜʢʡɕʑɺɧ'
# _diacrilics = 'ɚ˞ɫ'

_phones = "ँंःअआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढणतथदधनऩपफबभमयरऱलळऴवशषसह़ऽािीुूृॄॅॆेैॉॊोौ्ॐक़ख़ग़ज़ड़ढ़फ़य़ॠ॰ॲ"

phonemes = sorted(
    list(
        _pad
        + _punctuation
        + _special
        + _phones
        # + _non_pulmonic_consonants
        # + _pulmonic_consonants
        # + _suprasegmentals
        # + _other_symbols
        # + _diacrilics
    )
)

When I run python preprocess.py --path /home/ubuntu/datasets/Marathi_trim/ I get this error:

/home/ubuntu/.virtualenvs/forwardtacoenv/lib/python3.6/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location.
Import of 'jit' requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0.
  from numba.decorators import jit as optional_jit

35999 wav files found in "/home/ubuntu/datasets/Marathi_trim/"

+-------------+-----------+--------+------------+-----------+----------------+
| Sample Rate | Bit Depth | Mu Law | Hop Length | CPU Usage | Num Validation |
+-------------+-----------+--------+------------+-----------+----------------+
|    22050    |     9     |  True  |    256     |  103/104  |      200       |
+-------------+-----------+--------+------------+-----------+----------------+

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "preprocess.py", line 56, in process_wav
    m, x = convert_file(path)
  File "preprocess.py", line 42, in convert_file
    peak = np.abs(y).max()
  File "/home/ubuntu/.virtualenvs/forwardtacoenv/lib/python3.6/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 91, in <module>
    for i, (item_id, length, cleaned_text) in enumerate(pool.imap_unordered(process_wav, wav_files), 1):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
ValueError: zero-size array to reduction operation maximum which has no identity

Am I making a mistake somewhere?
Thanks in advance!
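One thing worth checking (my own guess based on that ValueError, not a confirmed cause): whether some wavs in the dataset decode to zero-length arrays, e.g. empty or corrupted files, which would make np.abs(y).max() fail exactly like this.

from pathlib import Path
import librosa

# scan the dataset for wavs that decode to zero samples
bad = []
for wav in Path('/home/ubuntu/datasets/Marathi_trim/').rglob('*.wav'):
    y, _ = librosa.load(str(wav), sr=None)
    if y.size == 0:
        bad.append(wav)
print(f'{len(bad)} empty wav files')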

License question

Hi @cschaefer26, I am confused because your repo's license appears to be an MIT license, but it also states:

Copyright (c) 2020 Axel Springer AI. All rights reserved.

Can I use your code in my project? Are redistributions of code allowed?

stack expects a non-empty TensorList

import torch
import torchtext
from torchtext.data import TabularDataset,BucketIterator,Field

import pandas as pd
import numpy as np

df_train = pd.read_csv(r'V:\pythonproject\hii\train.csv')
df_test = pd.read_csv(r'V:\pythonproject\hii\test.csv')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenize = lambda x:x.split()

text = Field(tokenize=tokenize,lower=True,batch_first=True)
label = Field(sequential=False,use_vocab=False)

fields = {'text':('t',text),'label':('l',label)}

train_data, test_data = TabularDataset.splits(path=r'V:\pythonproject\hii', train='train.csv', validation='test.csv', format='CSV', fields=fields)

text.build_vocab(train_data,max_size = 50000,min_freq=1)

train_iterator, test_iterator = BucketIterator.splits(
(train_data, test_data), batch_size=4, device=device,sort = False,shuffle=False
)
import torch.nn as nn

class RNN_LSTM(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size, num_layers):
        super(RNN_LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(input_size, embed_size)
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # Set initial hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)

        embedded = self.embedding(x)
        outputs, _ = self.rnn(embedded, (h0, c0))
        prediction = self.fc_out(outputs[:, -1, :])

        return prediction

#hyper parameters
input_size = len(text.vocab)
hidden_size = 64
num_layers = 1
embedding_size = 100
learning_rate = 0.005
num_epochs = 10

model = RNN_LSTM(input_size, embedding_size, hidden_size, num_layers).to(device)

# Loss and optimizer

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

import time
start = time.time()
trn_loss = []
tst_loss = []
for epoch in range(num_epochs):

    for batch_idx, batch in enumerate(train_iterator):
        # Get data to cuda if possible
        data = batch.t.to(device=device)
        targets = batch.l.to(device=device)

        # forward
        scores = model(data)
        loss = criterion(scores, targets.type(torch.FloatTensor).reshape(targets.shape[0], 1).to(device))

        # backward
        optimizer.zero_grad()
        loss.backward()

        # gradient descent
        optimizer.step()

    trn_loss.append(loss)
    print(f'training epoch:{epoch}--loss:{loss}')

    with torch.no_grad():

        for batch_idx, batch in enumerate(test_iterator):
            # Get data to cuda if possible
            data = batch.t.to(device=device)
            targets = batch.l.to(device=device)

            # forward
            scores = model(data)

            loss = criterion(scores, targets.type(torch.FloatTensor).reshape(targets.shape[0], 1).to(device))

        tst_loss.append(loss)
        print(f'test epoch:{epoch}--loss:{loss}')

import matplotlib.pyplot as plt

plt.plot(trn_loss,c = 'r',label='train_loss')
plt.plot(tst_loss,c = 'b',label = 'test_loss')
plt.legend()

torch.save(model.state_dict(), 'fake_news.pt')

y_pred = []
with torch.no_grad():

    for batch_idx, batch in enumerate(test_iterator):
        # Get data to cuda if possible
        data = batch.t.to(device=device)
        targets = batch.l.to(device=device)

        # forward
        scores = model(data)
        y_pred.extend(scores)

y_pred_np = []

for i in y_pred:
    y_pred_np.append(np.array([i.to('cpu')]))

y_pred = np.array(y_pred_np)
pred = []
for i in y_pred:
    if i <= 0.5:
        pred.append(0)
    else:
        pred.append(1)

y_true = np.array(df_test['label'])

from sklearn.metrics import confusion_matrix,accuracy_score

cm = confusion_matrix(y_true, pred)

acc = accuracy_score(y_true, pred)

The error is:

v:\pythonproject\venv\lib\site-packages\torch\nn\modules\rnn.py in forward(self, input, hx)
    580         if batch_sizes is None:
    581             result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
--> 582                               self.dropout, self.training, self.bidirectional, self.batch_first)
    583         else:
    584             result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,

RuntimeError: stack expects a non-empty TensorList

Changes on duration model

Hi,

I just found that the current duration model suffers when synthesizing very long sentences (probably because of the recurrent part of the duration network).

I managed to fix it by feeding the prenet's outputs instead of the raw text embeddings, and by replacing the current duration model with FastSpeech's (fully convolutional) one. I also converted the durations to the log domain.

I found that these changes not only fix the problems with very long sentences, but also make prosody more natural in multi-speaker settings (especially for silence/pause modelling).

Hope this is useful somehow.

Best regards.
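For anyone curious, here is my own rough sketch of what a FastSpeech-style, fully convolutional duration predictor with log-domain outputs could look like (an illustration of the idea above, not the actual code change):

import torch
import torch.nn as nn

class ConvDurationPredictor(nn.Module):
    # Two Conv1d + ReLU + LayerNorm + dropout blocks followed by a linear layer
    # that predicts one log-duration per input step (no recurrence involved).
    def __init__(self, in_dim: int, hidden: int = 256, kernel: int = 3, dropout: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.lin = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, in_dim] (e.g. prenet outputs), returns log-durations [B, N]
        y = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm1(y))
        y = torch.relu(self.conv2(y.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm2(y))
        return self.lin(y).squeeze(-1)

# train against log(duration + 1); at inference: durations = exp(pred) - 1, rounded to ints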

Loss masking bug?

Hi,

I just noticed that the length dimension of the mel spectra returned from the model is dim 2, as here:

x_post = self.pad(x_post, mel.size(2))

And in the loss function the max len is calculated as following:

max_len = target.size(1)

from dim 1.

I never actually tested it with your codebase but with mine I'm getting max_len 80 and therefore a mask with [32, 80] that is then expanded to [32, 80, T].

Sounds wrong at first glance?
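To illustrate the concern with a minimal, self-contained example of my own (not the repo's code): if the mels are shaped [B, n_mels, T], then building the mask from dim 1 masks the mel-bin axis instead of the time axis.

import torch

def sequence_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    # [B, max_len] mask, True where the frame index is below the sequence length
    return torch.arange(max_len)[None, :] < lengths[:, None]

B, n_mels, T = 32, 80, 500
target = torch.randn(B, n_mels, T)          # mel target, length dimension is dim 2
lengths = torch.randint(100, T, (B,))

mask_dim1 = sequence_mask(lengths, target.size(1))   # [32, 80]  -> masks mel bins
mask_dim2 = sequence_mask(lengths, target.size(2))   # [32, 500] -> masks time steps
print(mask_dim1.shape, mask_dim2.shape)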

training on custom language

Thanks for the great work.

Can you point me to which parts of the code need to change to work with a language other than English (e.g. Hindi)?

A couple of questions

Hi!

I have tried the latest version and I am quite pleased with the results; there is some great progress happening on this repository!

I am using 7000 samples of my own voice at 48 kHz.

I am very happy with pronunciation.

I had a couple of questions:
When I have many sentences together, it does not seem to take a pause and sounds like it is rushing through the sentences. Is this normal, and is there a workaround? My current one is to add '...' instead of '.'.

My other question is: are there plans for tokenizable pitch, to be able to do things like emphasize a specific word or give a particular word a specific tone (in the text input, not automatically)?

Thanks!

Training multiple models

I've trained a model based on the LJSpeech dataset and found the results quite satisfactory after 25,000 steps in ForwardTacotron. Now I'm preparing several other datasets on which new models will be based.

  1. How do I switch training between different models? I know that I could specify the path of my target dataset when running the preprocess script, but could I do the same with the training scripts?
  2. When generating sentences, how could I use a specific model when generating?
  3. Do the results of previous trainings in previous models affect training new models?
  4. If I add new audio samples to one of my datasets and preprocess it again, would the training for the model start from the beginning again or could it pick up from where it left before new samples were added?

Multi-GPU training

Hi, thanks for sharing your work!

Do you have any idea how to get multi-GPU training working? I looked at how it is implemented in fatchord's original repo, but it doesn't seem to work well:

            # Parallelize model onto GPUs using workaround due to python bug
            if device.type == 'cuda' and torch.cuda.device_count() > 1:
                m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m)
            else:
                m1_hat, m2_hat, attention = model(x, m)

Thanks in advance!
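As a data point, here is a minimal sketch of the plain nn.DataParallel route (my own toy example, not tested against this repo; the stand-in module below replaces the actual ForwardTacotron/WaveRNN instance):

import torch
import torch.nn as nn

# stand-in module; swap in the real model instance here
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # replicate the module across GPUs; the batch dimension is split automatically
    model = nn.DataParallel(model)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

x = torch.randn(32, 80).to(device)
y = model(x)   # the call site stays the same as in single-GPU training

Whether the masking, batching and loss computation in this repo survive that split is exactly the part that tends to break, so treat this only as a starting point.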

Fine-tuning pre-trained models

Hi, Great project you've got here.
Just wondering if it's possible to perform fine-tuning on the pre-trained models? If so, what have the results been like, and what is the minimum amount of speech required to get a reasonable likeness?

Thanks

output of python train_tacotron.py --force_align

Hi,

Thanks for this work. Can you please explain what the output of python train_tacotron.py --force_align is? I would like to try training with durations from a forced alignment procedure produced by an ASR model.

Questions: Male Voice / Multispeaker

Is there any male-voice pretrained model, or maybe a dataset to train one (fine-tuning seems like a good approach here)? Is there a way to train using a multi-speaker dataset like LibriTTS? And what about other languages? Thank you!

Cannot generate using a pre-trained model

Hi,

I am trying to generate the example from the README page and get the error "RuntimeError: language "en" is not supported by the espeak backend".

In Colab it works fine, but I cannot make it work on my machine. My eSpeak is 1.48.03, which comes by default on Ubuntu 20.04.

Traceback (most recent call last):
  File "gen_forward.py", line 118, in <module>
    text = clean_text(input_text.strip())
  File "/home/jadmin/kaalam.github/ForwardTacotron/utils/text/__init__.py", line 59, in clean_text
    text = cleaner(text)
  File "/home/jadmin/kaalam.github/ForwardTacotron/utils/text/cleaners.py", line 82, in english_cleaners
    text = to_phonemes(text)
  File "/home/jadmin/kaalam.github/ForwardTacotron/utils/text/cleaners.py", line 97, in to_phonemes
    language_switch='remove-flags')
  File "/home/jadmin/anaconda3/envs/torch/lib/python3.7/site-packages/phonemizer/phonemize.py", line 160, in phonemize
    logger=logger)
  File "/home/jadmin/anaconda3/envs/torch/lib/python3.7/site-packages/phonemizer/backend/espeak.py", line 147, in __init__
    preserve_punctuation=preserve_punctuation, logger=logger)
  File "/home/jadmin/anaconda3/envs/torch/lib/python3.7/site-packages/phonemizer/backend/base.py", line 54, in __init__
    .format(language, self.name()))
RuntimeError: language "en" is not supported by the espeak backend
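In case it helps with debugging, this is how I would check what the installed espeak actually exposes to phonemizer (assuming phonemizer 2.x, where supported_languages() is the check behind that RuntimeError; treat the exact API as an assumption):

from phonemizer.backend import EspeakBackend

langs = EspeakBackend.supported_languages()   # dict: language code -> language name
print('en' in langs)
print(sorted(langs)[:20])

If 'en' is missing, the espeak that ships with Ubuntu may be too old or not the binary phonemizer is picking up; installing espeak-ng is a common fix.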

*.npy files in alg and gta folders for training forward tacotron

Hi!

If I understand everything correctly, it's impossible to run train_forward.py without first running train_tacotron.py with the --force_align flag set. Otherwise, train_forward.py fails:

Traceback (most recent call last):
  File "train_forward.py", line 103, in <module>
    trainer.train(model, optimizer)
  File "/mnt/sdb/ForwardTacotron/trainer/forward_trainer.py", line 36, in train
    bs=bs, train_set=train_set, val_set=val_set)
  File "/mnt/sdb/ForwardTacotron/trainer/common.py", line 23, in __init__
    self.val_sample = next(iter(val_set))
...
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/sdb/data/RUSLAN/tiny_version/alg/<some wavname>.npy'

Please add a --force_align flag to train_forward.py, or at least mention this in README.md. And thank you for this repo!

Is it possible to pretrain on a different speaker?

Hi,

ForwardTacotron works very well on long articles, which is both surprising and welcome. With a good dataset and a lot of training, it is incredible how robust it can get.

Two questions:

  • I am interested in making a TTS for a specific voice; however, I have a smaller corpus from a different speaker, which is very good in terms of vocabulary and quite even. In other implementations (Mozilla etc.) it is easy to pretrain the TTS using one corpus and then restore that model when kickstarting training for a new speaker. In ForwardTacotron, I am not really sure how to go about this, if it is even possible. Should I train Tacotron on the first speaker, or ForwardTacotron? I would like to make use of the first, smaller corpus, because I think it would greatly help with the encoder.

  • Regarding WaveRNN, is it true that RAW yields better sound quality than MOL? I tried training it on a studio dataset and got good results using RAW after training for 900k steps; however, one can still hear some parts where the voice is "shakier". Do you think I may achieve better results using MOL? The dataset has no background noise, is 22050 Hz and monophonic. :)

Thank you for all this work, it is just incredible how well it performs in whole pages of books!

Adding pauses to the input text

Hi. I was wondering if there is any way to add (longer) pauses between the sentences of the input text?

I've also seen in some other issues the idea of controlling this with a factor, but I was unable to find it in the code. Could you give a few tips on this?

Thanks

Error while executing (training for Hindi dataset): !python train_tacotron.py

I have changed language = 'fr' and tts_cleaner_name = 'basic_cleaners' in hparams.py.
The command below gives me no error:
py preprocess.py --path /content/tacotron/dataset_folder

output

71 wav files found in "/content/tacotron/dataset_folder."

Using 68 wav files that are indexed in metafile.

+-------------+-----------+--------+------------+-----------+----------------+
| Sample Rate | Bit Depth | Mu Law | Hop Length | CPU Usage | Num Validation |
+-------------+-----------+--------+------------+-----------+----------------+
|    22050    |     9     |  True  |    256     |    1/2    |      200       |
+-------------+-----------+--------+------------+-----------+----------------+
 
████████████████ 68/68 First val sample: 1_1_60


Completed. Ready to run "python train_tacotron.py" or "python train_wavernn.py". 

But when I run

python train_tacotron.py

I get the following error message:

Using device: cuda

Initialising Model...

Creating latest checkpoint...
Saving latest weights: /content/tacotron/checkpoints/ljspeech_raw.wavernn/latest_weights.pyt
Saving latest optimizer state: /content/tacotron/checkpoints/ljspeech_raw.wavernn/latest_optim.pyt
2020-12-30 07:09:50.274334: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "train_wavernn.py", line 62, in <module>
    voc_trainer.train(voc_model, optimizer, train_gta=args.gta)
  File "/content/tacotron/trainer/voc_trainer.py", line 46, in train
    path=self.paths.data, batch_size=bs, train_gta=train_gta)
  File "/content/tacotron/utils/dataset.py", line 38, in get_vocoder_datasets
    train_ids, train_lens = zip(*filter_max_len(train_data))
ValueError: not enough values to unpack (expected 2, got 0)

Non-attentive Tacotron by Google

Hi, Google just published a model similar to yours. I thought it might be interesting for you, in case you haven't read it yet :p

https://arxiv.org/abs/2010.04301

I think the Gaussian way they upsample the encoder outputs is particularly interesting, and maybe also the positional encoding.

Best regards!
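For reference, here is my own rough sketch of the Gaussian upsampling idea from that paper (with a fixed sigma for simplicity; the paper predicts per-token ranges):

import torch

def gaussian_upsample(enc: torch.Tensor, durations: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # enc: [N, D] encoder outputs, durations: [N] frame counts.
    # Every output frame is a Gaussian-weighted mixture over all encoder steps,
    # centred at the cumulative-duration midpoints, instead of a hard repeat.
    d = durations.float()
    centres = torch.cumsum(d, dim=0) - 0.5 * d            # [N]
    total = int(d.sum().item())
    t = torch.arange(total, dtype=torch.float32) + 0.5    # [T] output frame centres
    logits = -((t[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2)   # [T, N]
    weights = torch.softmax(logits, dim=1)                # normalise over encoder steps
    return weights @ enc                                  # [T, D]

enc = torch.randn(5, 8)
dur = torch.tensor([3, 1, 2, 4, 2])
up = gaussian_upsample(enc, dur)   # -> [12, 8]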

non-empty TensorList?

Hi

I am a super noob when it comes to this, so bear with me.

I get this error when trying to synthesize a sentence:
"RuntimeError: stack expects a non-empty TensorList"

I just wanted to check the progress on a custom dataset I made, but I am not able to do so.

I am training on Colab.

Another stupid question: how do I back up files after some training in order to resume later? Which files should I make sure to save, and what is the command for resuming training from a specific checkpoint?
I find that Colab is sometimes unreliable and I need to make sure I have all the data I need on a drive in order to load it back into the notebook.

Error when using with the HiFiGAN vocoder

I was trying to synthesize speech using the pretrained model with the HiFiGAN vocoder.

First I generated the mel file with
python gen_forward.py --input_text 'this is whatever you want it to be' melgan

Then, in the HiFiGAN vocoder repo, as stated in its README, I put the mel file in test_mel_files and ran
python inference_e2e.py --checkpoint_file "./model/generator_v3"

And I got the following error:

Removing weight norm...
Traceback (most recent call last):
  File "inference_e2e.py", line 89, in <module>
    main()
  File "inference_e2e.py", line 85, in main
    inference(a)
  File "inference_e2e.py", line 49, in inference
    x = torch.FloatTensor(x).to(device)
ValueError: could not determine the shape of object type 'NpzFile'

It seems to have something to do with the mel file.

What am I missing? What should I do?
Thanks.
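In case it helps, here is a small sketch of how I would inspect and re-save the file (filenames are hypothetical; this assumes the problem is simply that np.load returns an NpzFile archive instead of a plain array, which is what the error suggests):

import numpy as np

data = np.load('forward_output_mel.npy', allow_pickle=True)

# if it is actually an .npz-style archive, unwrap the single array inside it
if isinstance(data, np.lib.npyio.NpzFile):
    data = data[data.files[0]]

# re-save as a plain float32 array, which torch.FloatTensor(x) in inference_e2e.py can consume
np.save('mel_for_hifigan.npy', data.astype(np.float32), allow_pickle=False)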

Permanent Reference for the repository

Hi there,

I would also like to reference ForwardTacotron in a scientific publication. Since there is no paper for your repository, could you share how to reference your work scientifically?

It was requested here #43 but was not revealed publicly.

Thank you in advance

RuntimeError: Expected object of scalar type Float but got scalar type Int for argument #2 'target' When training forward network

Hi

I am using Windows 10, PyTorch 1.2, Python 3.7 and all other required libs.
I was able to generate the preprocessed data, fully train the tacotron model and generate the GTAs.
But now that I come to train the forward network, it looks like the dur parameter in for i, (x, m, ids, lens, dur) in enumerate(session.train_set, 1): is an int tensor while a float tensor is expected.

  File "train_forward.py", line 98, in <module>
    trainer.train(model, optimizer)
  File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 37, in train
    self.train_session(model, optimizer, session)
  File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 67, in train_session
    dur_loss = F.l1_loss(dur_hat, dur)
  File "C:\Users\Josh\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\functional.py", line 2165, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected object of scalar type Float but got scalar type Int for argument #2 'target'

dur tensor looks like:

tensor([[ 0,  6, 14,  ...,  0,  0,  0],
        [ 0,  8,  9,  ...,  0,  0,  0],
        [ 0,  5,  8,  ...,  0,  0,  0],
        ...,
        [ 8,  9, 16,  ...,  0,  0,  0],
        [ 0,  8,  8,  ...,  0,  0,  0],
        [ 0,  6, 12,  ...,  0,  0,  0]], dtype=torch.int32)

Would you have any ideas what could cause this and how to address it?

Thank you
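A minimal workaround sketch of my own (a guess, not an official fix) is to cast the integer duration tensor to float before the loss:

import torch
import torch.nn.functional as F

dur_hat = torch.randn(2, 5)                               # predicted durations (float)
dur = torch.randint(0, 20, (2, 5), dtype=torch.int32)     # target durations (int32, as in the error)

# F.l1_loss(dur_hat, dur) raises the dtype mismatch above on older PyTorch versions;
# casting the target fixes it
dur_loss = F.l1_loss(dur_hat, dur.float())
print(dur_loss.item())

In the repo this would mean changing the line in forward_trainer.py to something like dur_loss = F.l1_loss(dur_hat, dur.float()).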

Pytorch: DLL load failed: The operating system cannot run %1

Hello,
as the title says, I am experiencing this problem when I preprocess the LJSpeech dataset.

python preprocess.py --path /dataset_folder/

Traceback (most recent call last):
  File "preprocess.py", line 4, in <module>
    from utils.display import *
  File "F:\Downloads\ForwardTacotron-master\ForwardTacotron-master\utils\__init__.py", line 6, in <module>
    import torch
  File "C:\Users\****\Miniconda3\envs\tacotron\lib\site-packages\torch\__init__.py", line 81, in <module>
    from torch._C import *
ImportError: DLL load failed: The operating system cannot run %1.

Can someone help? Thanks.

Curious what causes inconsistent pronunciation of certain words

This is not an issue per se, but I noticed that both with my model and with LJ, certain words like "can't" vary between the British and American pronunciations, e.g. C ANT vs C AUNT, and sometimes I get both in one sentence. I thought that by using phonemes it would be consistent? Do you know what causes this sort of issue? Is it fixed by simply having more training data? LJ is 24 hours, mine is 8.5.

Thank you

Vocoder ParallelWaveGAN

Hi,
I have trained a model in our language using your repository. MelGAN as a vocoder was training very slowly, so I used https://github.com/kan-bayashi/ParallelWaveGAN

Have you tried using it with this vocoder?

I trained a new model using your default configuration and this configuration from ParallelWaveGAN. Did I miss something?

The final audio is here: "Hello how are you". When using Griffin-Lim as the vocoder, it works fine.

I don't see fft_bins or bits in the ParallelWaveGAN configuration; is this the problem?

Thanks a lot for your work!
Tom

Citation

Hi Christian,

I am preparing a scientific publication involving a model that, although it introduces many changes with respect to your current model, is based on it. I would like, if you agree, to cite and reference your work. What would be the best way to do it? You can write me at [email protected].

Thanks.

Paper of this repository.

Do you have a paper describing this implementation? I would like to see the comparison with other models and a baseline (Tacotron 2 + WaveGlow).

Do the published weights fit the model?

I'm trying to synthesize using gen_forward.py and pretrained weights and getting an error in WaveRNN initialization:


Traceback (most recent call last):
  File "gen_forward.py", line 94, in <module>
    voc_model.load(voc_load_path)
  File "/mnt/sdb/ForwardTacotron_latest/models/fatchord_version.py", line 420, in load
    self.load_state_dict(torch.load(path, map_location=device), strict=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 845, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for WaveRNN:
	size mismatch for fc3.weight: copying a param with shape torch.Size([30, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
	size mismatch for fc3.bias: copying a param with shape torch.Size([30]) from checkpoint, the shape in current model is torch.Size([512]).

And in ForwardTacotron initialization


Traceback (most recent call last):
  File "gen_forward.py", line 112, in <module>
    tts_model.load(tts_load_path)
  File "/mnt/sdb/ForwardTacotron_latest/models/forward_tacotron.py", line 195, in load
    self.load_state_dict(state_dict, strict=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 845, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ForwardTacotron:
	size mismatch for embedding.weight: copying a param with shape torch.Size([148, 256]) from checkpoint, the shape in current model is torch.Size([127, 256]).

Am I missing something?

Confusion about duration extraction

Hi,
Thank you for the great repo. I am trying to implement a multi-speaker version, since I already have a multi-speaker Tacotron 2 for alignment extraction. But when I extract alignments and durations with python train_tacotron.py --force_align, I am confused: mel_len is the length of the ground-truth mel spectrogram, but the alignment matrix is the relation between the input text and the mel spectrogram predicted by Tacotron, so this mel_len will mismatch the alignment attention matrix. I saw that you use this line to bring the attention matrix to the same length as mel_len: path_probs = 1.-att[:mel_len, :]. This confuses me; please explain why this mel_len is not the length of the predicted mel spectrogram.

Thanks.

Any chance of merging master into the multispeaker branch?

Hi @cschaefer26

I have had some good success for my project with the multispeaker branch, except it is quite far behind master and most importantly, the pitch conditioning is not present.

I would love to do it myself, but there are just too many merge conflicts and I'm not sure I can get something working...

I figured I might as well try my luck at asking if there's a way master could be merged into the multispeaker branch, or at least the pitch conditioning stuff. It would be a really great help.

Thanks!

About aligner model and GAN discriminator branches

Hi,

I am closely following the developments in the different branches, as I am very interested in all your progress with this model. I was wondering particularly about the forward_gan and the aligner branches.

I am curious: what has been your experience with those? Is the aligner model any better than the vanilla Tacotron for recovering the alignments? Did you manage to successfully implement GAN training to enhance spectrogram reconstruction details?

Thanks again!

Extracting Alignment from Tacotron - Cherry Pick?

If I'm following along correctly, it looks to me like the model in train_tacotron is only used to extract the alignments, which are then saved and used by train_forward's model.

When using train_tacotron on a single speaker dataset of ~100k English utterances, I'm seeing a divergence between Val/Loss and Val/Attention_Score around step 15,000 (batch size 22). Val/Loss keeps decreasing, but Val/Attention_Score starts to drop as well. This continues down through my modified training schedule (which I created after seeing this in the original schedule).

It doesn't look to me like the alignments are cherry picked from the model with the best Val/Attention_Score? I can't think of a downside to implementing that? Or am I missing something?

Was the original schedule with changes at distances of 10k steps based on the ~10k utterances in ljspeech, and would you suggest I dramatically increase the steps for my data? Or was the original schedule just the result of tuning/experimentation?

Any ideas what might be causing the divergence around step 15k? Thinking it was simple overfitting I've tried increasing dropout significantly but still see the same overall phenomenon.

gen_forward empty TensorList

Hi,
I've carried out the steps in the "Training your own model" section of the README but can't run gen_forward.py:

python gen_forward.py --alpha 1 --input_text "this is whatever you want it to be" griffinlim
Using device: cuda

Initialising Forward TTS Model...

+----------+--------------+----------+
| Tacotron | Vocoder Type | GL Iters |
+----------+--------------+----------+
|   10k    | Griffin-Lim  |    32    |
+----------+--------------+----------+

| Generating 1/1
Traceback (most recent call last):
  File "gen_forward.py", line 142, in <module>
    _, m, _ = tts_model.generate(x, alpha=args.alpha)
  File "/home/user/Documents/vocal_synthesis/models/forward_tacotron.py", line 165, in generate
    x, _ = self.lstm(x)
  File "/home/user/Documents/vocal_synthesis/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/vocal_synthesis/venv/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 570, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: stack expects a non-empty TensorList

I tried running gen_tacotron.py and it ran without error, but the file it produced seemed too long and sounded nothing like speech.

If it's relevant, I didn't get very far into the 289,000-step section of train_forward, but the loss wasn't decreasing much anyway.

Thanks :)

Results don't match

The results from the samples don't match the results generated by the LJ_FT_T2_V3 HiFiGAN model, especially for the sentence "In a statement announcing his resignation, Mr Ross..". In the samples the female voice says "Mr Ross" well, but in my generated file it is distorted. The sample results are very good; I don't know how to reproduce them.
