Comments (69)

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Hi, thanks for the hint - I actually skimmed through the paper after I finished the implementation - there is a lot of overlap with ForwardTacotron - it could definitely be worth a try and could help for sure. As for the autoregressive ForwardTacotron, it worked, but I found that it exhibits lower mel quality (I didn't do an exhaustive test though) - the main problem was (probably) that I trained with teacher forcing and thus got a very low loss very quickly. With additional pre-nets and dropout as in the Tacotron the quality improved slightly, but was still lower than the non-autoregressive model.

m-toman avatar m-toman commented on June 23, 2024

Thanks, having read this paper recently https://arxiv.org/abs/1909.01145 I've come to think that autoregression hurts more than it helps ;).
Furthermore, considering that in the Tencent paper above they found that the power of Taco does not seem to come from attention but from the pre/post-nets, it's not surprising you ended up with this model.

I think I'll try the length regulator from your repo in a Taco2 setting (as I see it you use the Taco1 CBHG etc. layers) and see how it goes. It might also make sense to use a classical forced aligner instead of training a vanilla Taco first just for the alignments.
I'll keep you posted when I find something interesting ;)

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Hey, that paper is actually something I want to try soon for the ForwardTacotron as well - although I am not sure if it would be beneficial for a non-autoregressive model. Trying Taco2 definitely makes sense; I had some success also with conv-only models similar to what they use in MelGAN, so there is probably room for improvement! I also thought about using an STT model for extracting the durations. Keep us posted if you find anything interesting!

m-toman avatar m-toman commented on June 23, 2024

Oh right, I actually found your repo there
NVIDIA/tacotron2#280

I'm currently just using DFR 0.2 without MMI because of some reports there and would also first have to adapt the code to the phone set instead of characters.
But this should be obsolete with an explicit duration model.

It's interesting that this duration model is trained together with the rest instead of separately.

I'm quite eager to get rid of attention as it's really the #1 source of issues I encounter.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Yes I found that the model without attention is really robust. It seems to be the general trend to get rid of it. Also worth a look: https://arxiv.org/abs/2006.04558

m-toman avatar m-toman commented on June 23, 2024

Oh thanks, didn't see that one yet.

I got the impression that the two lines of research are either to use explicit durations (IBM model, the new Facebook model, Tencent etc) or try to improve on the attention mechanism, like
Monotonic attention or https://google.github.io/tacotron/publications/location_relative_attention/index.html

But you really wonder how much of an attention model this actually is if you just use it to attend to a single input at a time.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Yeah, that's true. In my experience the duration models perform well enough and they are much faster. The next thing I will try is a different approach for extracting durations from the data, probably with a simple STT model with CTC loss.

m-toman avatar m-toman commented on June 23, 2024

Just integrated your model into my training framework and preprocessing, with MelGAN, and it already works quite well after just 20k steps. Audible but noisy, very smooth spectra. Let's see how it evolves.

Also prepared the integration of Taco2 layers but first want a baseline.

I wonder if training the duration model separately would be beneficial but I guess it won't make a big difference.

Any reason why you picked L1 loss?

I mostly plan to try forced alignment next, perhaps try subphone units like in HMM systems, and a couple of smaller things similar to DurIAN, like the skip encoder and the positional index.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Cool, keep me updated - the spectra get much better up until 200k steps in my experience. No special reason for L1 over L2, I would not think it makes a big difference. I am now trying to extract the durations with a simple conv-lstm STT model. I use the output log-probabilities to align the mels to the input text with a graph search algorithm. Works pretty well, but so far I don't see it performing better than the alignments extracted from the taco.
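
For illustration, a monotonic graph search over per-frame phoneme log-probabilities along those lines could be sketched like this (a rough sketch under assumed shapes and names, not the actual implementation):

    import numpy as np

    def align_durations(log_probs: np.ndarray, phoneme_ids: list) -> list:
        # log_probs: (n_frames, n_classes) per-frame log-probabilities from the STT model
        # phoneme_ids: target phoneme index sequence; returns one duration (in frames) per phoneme
        n_frames, n_phonemes = log_probs.shape[0], len(phoneme_ids)
        score = np.full((n_frames, n_phonemes), -np.inf)
        moved = np.zeros((n_frames, n_phonemes), dtype=bool)   # True: advanced from the previous phoneme
        score[0, 0] = log_probs[0, phoneme_ids[0]]
        for t in range(1, n_frames):
            for p in range(n_phonemes):
                stay = score[t - 1, p]
                move = score[t - 1, p - 1] if p > 0 else -np.inf
                moved[t, p] = move > stay
                score[t, p] = max(stay, move) + log_probs[t, phoneme_ids[p]]
        durations = [0] * n_phonemes
        p = n_phonemes - 1
        for t in range(n_frames - 1, -1, -1):   # backtrack and count frames per phoneme
            durations[p] += 1
            if moved[t, p]:
                p -= 1
        return durations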

m-toman avatar m-toman commented on June 23, 2024

It started to converge a bit at around 100k steps (bs32). Stopped for now and trying the suggestions from alexdemartos (prenet output into duration model, duration model from fastspeech).

Implemented positional index here https://github.com/vocalid/tacotron2/blob/b958c7d889b7b6161f56f36b2d525650ff55df3c/model.py#L41
But have to see how to improve that. The torch.gather solution trains with 0.4 s/iteration while this one takes about 4 s (actually 40 s if you don't move 'expanded' to the GPU first, as in the link).
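
For reference, the gather-based expansion being compared against could look roughly like this (a sketch with illustrative names, not the code from either repo):

    import torch

    def expand_by_durations(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # x: encoder outputs (batch, t_in, channels); durations: (batch, t_in) integer frame counts
        bsz, t_in, channels = x.size()
        max_len = int(durations.sum(dim=1).max().item())
        index = torch.zeros(bsz, max_len, dtype=torch.long, device=x.device)
        for b in range(bsz):
            pos = torch.repeat_interleave(torch.arange(t_in, device=x.device),
                                          durations[b].long())
            index[b, :pos.size(0)] = pos        # padded tail keeps index 0
        index = index.unsqueeze(-1).expand(-1, -1, channels)
        return torch.gather(x, 1, index)        # (batch, max_len, channels)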

But that's next.
Do you plan to release the alignment model?

I planned to use something like https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/alignment/state_align/forced_alignment.py
Which worked quite well on small datasets in my experience

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Yeah, if the model works well I'm surely going to open source it. If you're interested, check out the (researchy) branch 'aligner' and run train_aligner.py and then force_alignment.py. The HMM models seem to be standard for extracting alignments, though I want something independent of third parties. I'd be really interested in how the HMM works though!

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Hi, just to share with you - I did a couple of tests extracting the durations with an STT model (a simple conv-lstm such as a standard OCR model). I overfitted the STT model on the train set, extracted the prediction probs and used a graph-search method to align phonemes and mel steps (based on maximising the prediction prob for the current phoneme at each mel step). It works pretty well, the results are intelligible, but the prosody is slightly worse and more robotic than with the taco-extracted durations. Any luck yet for you with the forced alignment?

m-toman avatar m-toman commented on June 23, 2024

Hey. Haven't tried it yet. I ran lots of variations with taco1 vs taco2 postnet, prenet vs no prenet.
I found the prenet before the CBHG didn't really make a difference, neither the postnet choice.

Generally I often see differences in the training loss but none in validation loss.

Biggest difference was exchanging the duration predictor with the fastspeech style one and putting it after CBHG instead of before. Unfortunately I didn't test yet which of the two is the more important modification.

Generally I see a lot more artefacts than with our taco2 model. I train melgan by generating Mel spectra using ground truth durations and it reconstructs them very well. Once I feed Mel spectra generated using predicted durations things get ugly.

Also training with positional indices at the moment but no significant difference either.

I'm also interested in trying a generative model for durations as in https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/

Another interesting aspect: I see training loss decreasing after 500k+ steps but validation loss is pretty much stable after about 50k or so. Seems a bit early for overfitting to me.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Thanks for the update. As for the duration predictor - I've had the problem of overfitting when I put it after the prenet, plus the mels looked worse. As for the increasing validation loss - I think this is kind of normal, as the model is not teacher forced and the predicted patterns differ from the ground truth; the audio quality still improves up until 200k steps or so in my experience. I normally do not even look at the validation loss, to be honest, and judge more by the audio. Also, I have seen quite some artifacts with a pretrained LJSpeech MelGAN + forward taco, but less with WaveRNN. On our male custom dataset there are quite few artifacts with MelGAN - I increased MelGAN's receptive field, maybe that helps...

cschaefer26 avatar cschaefer26 commented on June 23, 2024

One more thing, when I train the MelGAN, I usually mix ground truth mels with predicted ones as I think it makes training more stable - this could also be worth a try. If you only train on predicted spectra the problem could be that they differ too much from the GT as they are not really teacher forced (e.g. the pitch could be different etc.).
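
A minimal sketch of that mixing idea for the vocoder data loader could look like this (the file layout and names are assumptions, not the actual training code):

    import random
    from pathlib import Path

    import numpy as np
    from torch.utils.data import Dataset

    class MixedMelDataset(Dataset):
        # Picks either the ground-truth mel or the TTS-generated (GTA) mel
        # for the same utterance, so the vocoder sees both distributions.
        def __init__(self, utt_ids, gt_dir, gta_dir, wav_dir, gta_prob=0.5):
            self.utt_ids = utt_ids
            self.gt_dir, self.gta_dir = Path(gt_dir), Path(gta_dir)
            self.wav_dir = Path(wav_dir)
            self.gta_prob = gta_prob

        def __len__(self):
            return len(self.utt_ids)

        def __getitem__(self, idx):
            utt = self.utt_ids[idx]
            mel_dir = self.gta_dir if random.random() < self.gta_prob else self.gt_dir
            mel = np.load(mel_dir / f'{utt}.npy')      # (n_mels, frames)
            wav = np.load(self.wav_dir / f'{utt}.npy')
            return mel, wav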

m-toman avatar m-toman commented on June 23, 2024

Yeah, I've also increased the receptive field as proposed in a paper I forgot. Didn't see the huge improvement they saw but well... Regarding mixing GT with GTA - I could have sworn I did that but strangely only find it in my WaveRNN codebase.
And yes, it seemed to make it more robust.

I'll also try multispeaker. With vanilla taco I never could get it to learn attention well for all speakers but the spectra were pretty good, so I guess it should work well with this model. Considering that the models I trained using Merlin (mostly just 3 LSTMs on HTS labels) were very well able to produce and mix more than 1000 voices easily in combination with WORLD.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

I would also assume that the model works well with multispeaker, that's quite some work though. For this it probably makes sense to first find a quicker way of extracting durations. I am running another LJSpeech training now with MelGAN. I see improvements in audio quality up to 400k steps on the forward model if I test it with the standard pretrained MelGAN (fewer squeezy artefacts).

m-toman avatar m-toman commented on June 23, 2024

Seems it was an issue of patience again. MelGAN loss is hard to interpret and just letting it run often helps. So I let it run over the weekend: acoustic model to 500k steps, MelGAN just another day or so more, and it's definitely better now.
https://drive.google.com/file/d/1YBsS7sxus_tw9PQdr0HVtScGo8Ccuolw/view?usp=sharing

Prosody not yet at the level of the Taco2 model I trained but we're getting closer.

And yes, definitely have to work on the aligner first before tackling multispeaker.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

That's not too bad for MelGAN and the inference speed you get with both models. I would assume that the durations could be overfitted (did you check the duration val loss?). I am also testing some model variations and I found that it helps to concat the LSTM output with the prenet output:

    x = self.prenet(x)                     # pre-net over the input sequence
    x = self.lr(x, dur)                    # length regulator: expand by durations

    x_p, _ = self.lstm(x)
    x_p = F.dropout(x_p,
                    p=self.dropout,
                    training=self.training)
    x_p = torch.cat([x, x_p], dim=-1)      # concat the LSTM input with its output
    x_p = self.lin(x_p)                    # linear projection

This is closer to the Taco architecture, where the attention is also concatenated with the lstm output.

m-toman avatar m-toman commented on June 23, 2024

[screenshot: validation loss]

Well, validation loss is strange for me ;)

EDIT: zooming in doesn't really help
[image: zoomed in]

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Seems like instant overfitting...

cschaefer26 avatar cschaefer26 commented on June 23, 2024

https://drive.google.com/file/d/1S__-0_3N2swYCsWu4TciZ6O9owkxX7eK/view?usp=sharing
https://drive.google.com/file/d/1fIU8SfijwsUg_vEOSykgs7hXb3lO1Fh0/view?usp=sharing

Here are some results with the updated model trained for 320k steps, together with the pretrained MelGAN from the repo.

m-toman avatar m-toman commented on June 23, 2024

Currently investigating this "overfitting" issue. Been plotting pre and post mel validation error now, and it is at its lowest point at around 10k steps and then gradually increases.

Looking at mel spectra from the validation set, this is after 12k steps
[image: validation mel spectrogram, 12k steps]

After 81k steps
[image: validation mel spectrogram, 81k steps]

definitely more detail.

Then I've also plotted the error here:
12k steps
[image: error plot, 12k steps]

81k steps
[image: error plot, 81k steps]

Seems the error in the formant structure is really higher.
I would assume that there might just be some ... shift in the frequency axis that messes up the loss but obviously still sounds fine.

Ground truth
[image: ground-truth mel spectrogram]

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Very cool. That's what I expected too, the structure gets more pronounced but may vary from the ground truth (e.g. different pitch or voice going a bit up instead of down) - as the model is not teacher forced.

BTW, I found that with the MelGAN preprocessing it is necessary to do some cherry-picking with the TTS model, but training to 400k steps is definitely worth it.

m-toman avatar m-toman commented on June 23, 2024

Oh, I got the alignment with HTK to work, and while it generally works fine, I'm currently getting more "raspy" voices and I'm not completely sure if it's because of the alignment.
My main issue is that I'm not completely sure how to handle the word boundaries best. Tacotron usually works fine with spaces as word boundary symbols, but it messes up the aligner in most cases, except when there's really a pause between words.

I think it might be the best solution to not have them in alignment and then use some skip encoder like in DurIAN. Where they keep the word boundaries as separate symbols until the state expansion.
If I just drop them completely it strings the words together without any pause ever and it sounds awful.

Well, having diacritics as separate symbols is not really helpful either...

cschaefer26 avatar cschaefer26 commented on June 23, 2024

That's also my experience with durations from an STT model. I tried 1. generating phoneme boundaries (and word boundaries) from the output probabilities and 2. extracting the exact phoneme positions in time and splitting right between them. Both resulted in lower mel quality.

m-toman avatar m-toman commented on June 23, 2024

Should change the name of this issue ;)

Still seeing generalization issues. Implemented multispeaker, injecting speaker codes after the CBHG, but in general it always defaults to one voice (or, I am not fully sure, maybe it's actually an average voice), except if I pick a sentence from the training set with the respective speaker code. Strange, as it's fed directly into the LSTM.
Pretty strange considering I previously used similar 3-layer LSTM networks where it worked without issues.
Currently adding residual connections, like the concat you suggested above around the LSTM, and also additive after the postnet like in Taco2, but it still seems to do the same thing. Even more interesting - if I pick a sentence from the training set with a specific speaker and just change a word, it sort of interpolates the whole sentence.

Hmm, gotta try synthesizing from pre-postnet mel spectra to see if it makes a difference. - update: nope, sounds a bit different but the speaker information is already lost.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Renamed - good research. Did you use durations extracted by a respective Tacotron trained separately on each dataset? As for the overfitting - I see the same issues; you mentioned that having a pre-net with heavy dropout did not help, did it? I conducted a lot of unsuccessful experiments, mainly trying various forms of duration prediction, e.g. a separate autoregressive duration predictor (heavily overfitted). Currently I am experimenting with GANs again and got them to work quite well, although the voice quality is not yet better than with the standard L1 loss.

m-toman avatar m-toman commented on June 23, 2024

Didn't see much difference with the prenet, more dropout etc.
I aligned them all using HTK.
Tried one-hot encoding and a separate speaker encoder; both resulted in the same.

I'm considering differences to the Merlin models I trained previously
https://github.com/CSTR-Edinburgh/merlin/blob/33fa6e65ddb903ed5633ccb66c74d3e7c128667f/src/keras_lib/model.py#L132
where it worked without any issues to just concat a one-hot vector to the input sequence.
Those also have separate duration models, but:
No convolutions etc., just 3 LSTMs - perhaps the simpler model is the reason.
Instead, more linguistic and contextual features like position in word, sentence etc. (more or less the HTS label format + an additional position index).
Also a 5-state subphone model and a 5 ms hop size.

I've considered 3 subphone units but my durations are sometimes just 2-4 frames with the Taco style hop size. Doesn't make much sense to split those.
Actually I'm building a model with a smaller hop size atm as well but still in progress.

Just to verify that overfitting issue, it might be interesting to throw out the CBHG and see how it behaves.

m-toman avatar m-toman commented on June 23, 2024

Hi,

meanwhile I tried injecting speaker codes at nearly all possible points - initializing LSTMs, projecting to similar dimension and then adding, concatenating at all time steps etc.
But the model still seems to ignore them and rather memorize the speaker identity as a function of the phonetic input.
When generating GTA features it therefore gets them all right but once you feed it in synthesis it seems to fully ignore it and pick a voice at random (or most likely - the voice with the most similar input context?).

Since it's overfitting so much that it memorizes this, I tried to radically reduce the network - similar to https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/ where they mention layer sizes of 64 and similar.
Both training loss and validation loss were worse, but it still ignored the speaker embeddings.

https://www.semion.io/doc/can-speaker-augmentation-improve-multi-speaker-end-to-end-tts investigated the effect of injecting at different points, usually using a projection to 64 dimensions and then concatenating. No luck with that either.

Cost me some hair the last two weeks ;)

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Thanks for sharing! I am idle at the moment (parental leave for 4 weeks), but I am going back to investigating multispeaker for sure after that. Is the data roughly equally distributed? I could imagine that the model tends to default to the voice with the largest dataset. Plus, it could make sense to look into the contribution of the duration predictor, i.e. try to use separate predictors for separate datasets (or first try to feed the target durations of the respective voices) - the model probably fits on the duration distributions as well.

m-toman avatar m-toman commented on June 23, 2024

Congratulations, I'm also trying to work between a 4-year-old and an almost 1-year-old. Challenging ;).
Yeah I tried reducing LJ to the 2-3k sentences most other speakers have and also mixed in VCTK to get lots of speakers. Tried one-hot encoding as well as a speaker embedding from an encoder model.

With lots of LJ data it definitely produced lots of LJ. With a more balanced set it seems pretty much every sentence uses a different voice. But always the same voice no matter which input embedding I use. So I guess it learns which text context is spoken by which voice.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Challenging indeed, but congrats! Seems like you tried the same approaches that I would have gone for. To verify, one could look at the activations from the speaker embedding input (or correlations with the output voice identity, which probably takes some time to implement). As for a solution - did you try to use large input dropout again? Also, it could make sense to try to regularize using variable input lengths (e.g. just feed random parts of the input sentences). Probably the lowest-hanging fruit would be to increase the dataset using many different speakers to reduce overfitting. I am just starting to investigate multispeaker and will keep you posted!

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Hi, so I am finally back and did some multispeaker research. The starting point was this repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning and the corresponding paper https://arxiv.org/pdf/1806.04558.pdf. I implemented a multispeaker Tacotron with the speaker embedding from https://github.com/resemble-ai/Resemblyzer. Started out using the LibriTTS corpus but found it pretty noisy and had problems getting stable attention, then moved to the VCTK corpus. The Tacotron trained OK and also reproduced the voices in inference (with r=3) given a specific speaker embedding (extracted from a sample wav). Attention was a bit shaky but good enough to extract durations for a first ForwardTacotron training. A first implementation is here: https://github.com/as-ideas/ForwardTacotron/blob/multispeaker/models/forward_tacotron.py. It seems to work pretty well already actually, see the samples: https://drive.google.com/drive/folders/1qr2nFbmdjJX-ZI8LU7CpghxjSXoJUGiS?usp=sharing
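
The conditioning itself is essentially "repeat and concat"; a minimal sketch (illustrative names, not the exact code from the branch) would be:

    import torch

    def condition_on_speaker(x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: encoder outputs (batch, t, channels); spk_emb: speaker embedding (batch, emb_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)   # repeat over time
        return torch.cat([x, spk], dim=-1)                     # concat along the channel axis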

m-toman avatar m-toman commented on June 23, 2024

Hi,
I've been working with another taco implementation originally and also tried CorentinJ's encoder. It also worked quite well regarding speaker identity; the main issue was getting good alignment for all speakers.

But with forward taco I get the described issues. At first glance my implementation seems quite similar to yours (no wonder, just repeat and concat ;)). I'll dig deeper if I find any differences. Did you train on a larger batch size or so?

Main differences are that I put the duration model after CBHG and concat embeddings before LR (but they should just be repeated with the rest).
EDIT: Wonder if putting the duration model after the CBHG might really affect this, as it's much simpler and more likely to overfit. Although I multiplied the duration loss by 0.1 so it got a less strong effect (I've later seen that they did the same in FastPitch, also with 0.1 ;)). Generally I wonder if having it as a separately trained model (e.g. like the prosody model in https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/ ) might make sense, or at least freeze it after some time.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Yeah, I found that the duration model is hugely overfitting when applied after the CBHG, that could be a problem. You could try to just replicate my implementation and go from there. The stratification by speaker really seems to work well for me. It could also be the dataset; VCTK has about 100 speakers with a roughly equal number of files. I also thought about an independent prosody model, I'll definitely look into it.

m-toman avatar m-toman commented on June 23, 2024

Well, while my model is pretty similar, the whole repo is based on the nvidia taco2 repo and overall vastly different.
I've also added positional index and other stuff that might result in differences but intuitively I would suspect that the duration model after CBHG might still have a strong effect on the CBHG...

cschaefer26 avatar cschaefer26 commented on June 23, 2024

I'm for sure experimenting with different architectures and will give you an update!

m-toman avatar m-toman commented on June 23, 2024

Thanks. I am now running with the duration model before the CBHG (I am using the FastSpeech one, so this is also a bit different). I also got a pitch model in there, so that might also interfere, but I think I saw this issue before I had it. To make sure, I also do that one before the CBHG now.
Before that it was quite similar to https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/FastPitch/img/fastpitch_model.png with CBHG instead of the transformer blocks.
Subjectively I don't hear a significant difference to that much larger model. In the IBM paper above they used even smaller layers and it sounds pretty good (although just 16kHz).

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Ah cool. I'm also planning on messing with a pitch predictor, the fastpitch samples seem quite convincing. Let me know how it goes!

m-toman avatar m-toman commented on June 23, 2024

Oh, it worked quite well without any real issues. Which made me wonder all the more why it ignores my speaker IDs while those work nicely.
I'm just at 20k steps with the modifications above, but it also seems to output some random voice; checking the embeddings again, weird.

Update 2:
OMG I think I got it. Such a stupid bug. I've checked the embeddings on inference if they match the speaker, I've checked them in the data loader if they match the filename etc. I've checked in forward if they vary in each batch and the repeat does its work correctly.
But I did NOT check the collate function 😱

Such a simple stupid bug and took me longer than that X11 forwarding syscall issue here mozilla/TTS#417 (comment) ;)

😠

Retraining now....

Update 3:
Working, 3 samples after just 2k steps (like 10 minutes of training)
ms.zip

blue-fish avatar blue-fish commented on June 23, 2024

Hi there @cschaefer26, nice project! Regarding #7 (comment), while integrating fatchord's tacotron model with https://github.com/CorentinJ/Real-Time-Voice-Cloning (my work is in #472), I've also encountered the same problems you had with LibriTTS, which are mainly caused by the highly inconsistent prosody between speakers. You can get much better results by preprocessing or curating the dataset (either trimming mid-sentence pauses or discarding utterances when that occurs). VCTK works a lot better if you trim the silence at the beginning and end of each file. I can go into more detail if it is helpful.

The baseline tacotron requires very clean data for multispeaker, and even then I'm having trouble producing a decent model. Which is what leads me to your repo. :) I will be trying it out. Keep up the great work!

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Hi @m-toman I totally missed your update. Sounds really good, I assume it's MelGAN? I got some OK results using VCTK; as @blue-fish states, the datasets require some good trimming etc. I found this to be really helpful: https://github.com/resemble-ai/Resemblyzer/blob/cf57923d50c9faa7b5f7ea1740f288aa279edbd6/resemblyzer/audio.py#L57

Any updates? We are also looking into adding GST. The main problem I have right now is that I would need really clean German datasets to benefit from transfer learning for our use case. I also looked into other open source repos and tested Taco2 etc. but found they don't really perform much better.

Currently, I am looking into some different preprocessings, e.g. mean-var scaling to improve the voice quality.
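
One simple form of that mean-var scaling would be per-bin normalisation with dataset-level statistics, roughly like this (an assumption about what is meant, not the actual preprocessing code):

    import numpy as np

    def mean_var_scale(mels):
        # mels: list of (n_mels, frames) spectrograms; returns scaled copies plus the stats
        stacked = np.concatenate(mels, axis=1)
        mean = stacked.mean(axis=1, keepdims=True)
        std = stacked.std(axis=1, keepdims=True) + 1e-8
        return [(m - mean) / std for m in mels], mean, std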

m-toman avatar m-toman commented on June 23, 2024

Hi.
I have mixed in VCTK (also with this trimming ;)) but I felt that the larger speakers I added lost a bit of prosody/felt flatter than when trained individually. Wanted to investigate further but did not get to it yet. Yeah, it's MelGAN.

Yeah, it's interesting. As I can't really believe it, I regularly compare to Taco2 and other more complex methods out there (https://github.com/tugstugi/dl-colab-notebooks), but neither attention nor autoregression really seems to make a significant difference.

Regarding styles I would have considered something like the simple method presented in DurIAN, which is mostly just a style embedding. Also read the Flowtron paper again and thought about wrapping the whole model in such a flow formulation, but after listening to the samples again I felt it's probably not worth it vs just playing with the pitch predictor I got (it might also be possible to predict and sample F0 from a Gaussian where you could then play with the variance).

I would have to read the GST paper again, but I felt the control is a bit too random, if I remember correctly? In the sense that the tokens are hard to interpret and probably different with each run?

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Yeah exactly, although they show some impressive results absorbing background noise into the tokens. I would probably think that pitch prediction is the lowest hanging fruit of them all...

m-toman avatar m-toman commented on June 23, 2024

I feel we're getting to a similar state of saturation like we had it before deep learning entered the speech synthesis field. The HMM-based methods became so loaded with more and more tricks and features, the complexity was insane. The training script I used during my PhD consisted of I think 120 separate steps in the end, each calling HTS tools with dozens of command line parameters and additional script files ;).
Recently there have been so many approaches to make attention work better for this use case, like the monotonic methods that force it to either take a step or stay in the current state and only attend to a single input. With lots of weird tricks to make it differentiable etc.
At that point it's so far from the origins that it seems awkward to even use attention.
The seq2seq AR approach also means dependence on the stop token prediction (whoever did not end up with 30 seconds of garbled speech, please raise their hand ;)).
https://arxiv.org/abs/1909.01145 was an interesting paper, but it's yet again another rather complicated workaround for the issues introduced by AR, besides scheduled sampling/curriculum learning (which introduced new robustness issues) and gradually decreasing r and stopping at r=2 (although that works quite well) to keep it from predicting from the previous samples while ignoring the conditioning information.

I admit I wasn't brave enough to just try what you did and throw all that stuff out. The thousand people at Google would have certainly done that, right? :)
Enough ranting, curious what you will achieve with GST, I'll further play with multispeaker soon.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Good rant though! The more I test the autoregressive stuff the less impressed I am. It's basically not usable for us in production (we are trying to synth long German politics articles). The forward models are pretty robust though. I wish I could get rid of the AR model to extract durations; we experimented with the aligner module from Google's EATS, didn't work. Extracting with an STT model worked but the quality was worse. Today I spent the whole day debugging why my forward model all of a sudden sucked badly and found that the tacotron alignments were shifted - somehow I got unlucky, increasing the batch size solved this. When I started with TTS I was wondering why people got so interested in these attention plots, now I know - watching a tacotron giving birth to attention is one of my good moments :D .. Honestly the forward models seem to be SOTA now and are probably used in production by Microsoft, AWS, Google...

m-toman avatar m-toman commented on June 23, 2024

My samples above used alignment via HTK and I didn't notice a difference to the taco attention ones. Using those scripts: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/scripts/alignment/state_align
Just a bit annoying to set up.

Think it's mostly Google that still clings to it. https://arxiv.org/abs/1910.10288

Yeah was astonished to see that Springer does TTS ;).

My ex-colleagues recorded this corpus https://speech.kfs.oeaw.ac.at/mmascs/
Unfortunately too small for deep learning stuff (was fine for HMM based synthesis) but it's good quality and in different speaking rates might be useful for the duration model.
We did lots of mocap recordings back then, was fun ;)

Edit: and obviously it's Austrian German (Vorarlberg in this case ;)).
Here some fun dialect interpolation samples http://mtoman.neuratec.com/thesis/interpolation/

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Cool, I actually just found a glitch in the duration extraction of my STT models and it seems fine now. Probably going to release that, as it's cumbersome to train a Tacotron just for extracting durations. Good stuff! I'll keep you updated on how it goes with pitch prediction, multispeaker etc.

m-toman avatar m-toman commented on June 23, 2024

I've struggled for a week now with suddenly getting a burp sound at the end of many sentences. Honestly, still no idea why; it seems to happen sometimes. I now force it to silence... I assumed it was that it usually aligned the final punctuation symbol to silence, but if the silence trimming trims too aggressively it has to align that symbol with voice. So I added a little bit of silence myself (it's actually quite common in older systems to have a silence symbol at the beginning and end and to prepend and append artificial silence). But that didn't help. Now I force it to silence after synthesis, but no idea where it's coming from...

Anyway, did you ever get the validation loss to make sense? For me it still gradually increases, although at probably 1/10th of the rate at which the training loss decreases. Tried really small model sizes, more dropout, but still. Even the multispeaker model I currently train on 100k sentences does it, but admittedly less pronounced.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Did you check the padding? I had a similar problem once and found that padding values were at zero (and not at -11.5 for silence in the log space). Validation loss in this case does not mean anything imo since the model has too much freedom in predicting intonation, pitch etc without teacher forcing. I don't even look at it (for durations it still makes sense though imo).
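
In other words, pad the mels with the log-space silence value instead of zero; a minimal sketch (the -11.5 is the value mentioned above and depends on the mel normalisation used):

    import numpy as np

    def pad_mel(mel: np.ndarray, max_len: int, silence_value: float = -11.5) -> np.ndarray:
        # mel: (n_mels, frames) log-mel spectrogram; right-pad the time axis with 'silence'
        pad_frames = max_len - mel.shape[1]
        return np.pad(mel, ((0, 0), (0, pad_frames)), constant_values=silence_value)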

m-toman avatar m-toman commented on June 23, 2024

Hmm, good idea, thanks.
You mean the mel padding here, right?

mel = [pad2d(x[1], max_spec_len) for x in batch]

But have to check how my mel representation differs from yours.

The loss masking should work now, so it should mostly be about the context from the LSTMs and convolutions "leaking" into the actual speech.
Strangely I never had any issues until recently but checked all commits I think a dozen times, no model/dataset changes. I've retrained at different commit times but the answer was never clear. Sometimes there were slight issues at the end of the sentence, 3 of 4 trainings on the original commit were fine, but one had slight issues. Sometimes it occurs at some point during the training, then is gone again, sometimes gradually worsens.

m-toman avatar m-toman commented on June 23, 2024

OK, checked it - for me silence is -4 and I modified the padding.
But then I noticed I made a mistake in my posting above - the convs/RNNs apply to the input text, not to the mel spectra. So with the loss masking fixed this should not have any effect, or am I missing something?

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Any improvements with the padding? I agree that the glitch can't be from the loss directly due to masking, but I found something else - in my case the length regulator is slightly over-expanding, i.e. attaching some extra repeats of the last input vecs due to the padding within batches during training. I added this to put the repeated inputs to zero:

for i in range(x.size(0)):

Imo this could be one of the causes of a 'leak' into the RNNs.
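
The snippet above is cut off in the thread; the idea (zeroing expanded frames beyond each utterance's true total duration so padding can't leak into the RNNs) could be sketched as follows, with illustrative names:

    import torch

    def zero_padded_frames(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # x: expanded encoder outputs (batch, t_out, channels); durations: (batch, t_in)
        total = durations.sum(dim=1).long()      # true expanded length per utterance
        mask = torch.arange(x.size(1), device=x.device)[None, :] < total[:, None]
        return x * mask.unsqueeze(-1)            # zero everything past the true length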

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Oh, btw, I am also now adding a pitch module and the first results seem very promising. I might be adding an 'energy' vec as in FastSpeech2 as well, although in their ablation study the gain was pretty small. I wondered if one could, instead of calculating F0, just use the mean of the frequency distribution along the mel axis? Imo this should be pretty similar and wouldn't require an external lib to do the calculation.
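
A rough version of that idea, i.e. the per-frame centre of mass over mel bins as a cheap pitch-like feature (assuming log-mel input; this is not equivalent to real F0):

    import numpy as np

    def mel_centroid(mel: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        # mel: (n_mels, frames) in log space; returns one centroid value per frame
        energy = np.exp(mel)                     # back to linear magnitudes
        bins = np.arange(energy.shape[0])[:, None]
        return (bins * energy).sum(axis=0) / (energy.sum(axis=0) + eps)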

m-toman avatar m-toman commented on June 23, 2024

I'll check the LR thing above. Not sure now if I used your solution or wrote something myself, because I added the positional index (didn't see a huge difference though, if at all).
Strange that I never noticed it for the first months and then it suddenly appeared. Resetting to the commit from my last OK training really reproduced a non-burpy version... So I checked and checked again, but no modifications to the model or dataloader or training procedure. But run 4 then also produced burps. It feels so semi-random. For now I just overwrite the last symbol with silence (in combination with forcing the last symbol to be punctuation and making sure there is silence in the audio at the end) and that fixes the symptoms, but it still bugs me ;).

I'll post a sample when back home.

For pitch I used the approach from FastPitch (the repo is out there), which works fine with the proposed mean pitch per symbol, but I am thinking about a more complex parameterization that also allows controlling some delta features (perhaps just categories of falling/rising/flat pitch or so).
EDIT: here it is https://drive.google.com/file/d/1p9dJjLzJ0p0R3v0XLz-Z6hr1Xwhj-5Gd/view?usp=sharing
after changing the mel padding

m-toman avatar m-toman commented on June 23, 2024

Until now it seems to be better - https://drive.google.com/file/d/1LkusT0VO8cKw3nI5jJ1GBmDLCu4jP_vv/view?usp=sharing
Just 7k steps, it often started to happen after 60k steps.
Generally I often feel there is not really much improvement after 10k steps of training, which is quite cool considering how long Taco takes to get the attention right (if at all).

cschaefer26 avatar cschaefer26 commented on June 23, 2024

That's only 7k steps? Impressive. I found no real improvements with a pitch module, although it's fun to play around with the pitch. Might be a limitation of our dataset though (only 8 hrs). I also tested the Nvidia FastPitch implementation, not better. I thought about looking more into the data, to be honest, e.g. cleaning inconsistent pronunciations with an STT model.

m-toman avatar m-toman commented on June 23, 2024

Yeah, I didn't see improvement either, but we need it for implementing SSML tags and it definitely works better than synthesis-resynthesis methods, which introduce too much noise. And yeah, probably one can think about some generative model to produce/sample interesting pitch contours.

My results were a bit strange. With both padding modifications above I started to get weird pauses/prosody. Trained 3 times to verify it's not some random effect, and at different stages of training.

Integrating DiffWave might be interesting as well.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

No improvements as in with the pitch? Generally I have the same problems as you have, it seems that trainings can vary by large degrees, probably there is some randomness in what the model really fits...

m-toman avatar m-toman commented on June 23, 2024

Yeah sounds pretty much the same with and without pitch model.
Well, still much more robust than most Taco2 implementations. Quite a few smaller datasets that did not work at all (in the sense that the output was cut off, had broken words etc.) now work well. Sometimes the prosody is not as natural (rarely), but that's better than generating garbage.

Will try multispeaker again soon but it seemed to me as if it would average a bit too much.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Sounds good. I am regularly comparing the model to other architectures and I find that the LSTM produces a bit more fluent output but tends to mumble more compared to a transformer-based model a la FastSpeech. Multispeaker didn't really add any benefit so far, but that could be due to a lack of data yet.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Just a quick update: I merged all the pitch stuff to master; I really found a benefit using the pitch conditioning. I see the same as you, after 10-20k steps the model is almost done. Quick question: did you see any improvement with positional indexing? I found some generalization problems on smaller datasets, where the voice mumbles quite a bit, especially for shorter sentences, weirdly. Also, I tried to add an STT model to the training with a CTC loss in the hope that the model is forced to be clearer; the first results seem quite promising actually.

m-toman avatar m-toman commented on June 23, 2024

No new experiments from my side atm.
I am not fully sure about the positional index, I added it before all the other stuff and kept it in although I didn't hear a major difference.

    def expand(self, x, durations):
        # Gather-based expansion: repeat each input step according to its duration
        idx = self.build_index(durations, x)
        y = torch.gather(x, 1, idx)
        if self.posidx:
            # Append a 0..1 position ramp within each expanded symbol as an extra feature
            duration_sums = durations.sum(dim=1)
            max_target_len = duration_sums.max().int().item()
            batch_size = durations.shape[0]
            position_encodings = torch.zeros(
                batch_size, max_target_len, dtype=torch.float32).to(durations.device)
            for obj_idx in range(batch_size):
                positions = torch.cat([torch.linspace(0, 1, steps=int(dur), dtype=torch.float32)
                                       for dur in durations[obj_idx]]).to(durations.device)
                position_encodings[obj_idx, :positions.shape[0]] = positions
            y = torch.cat([y, position_encodings.unsqueeze(dim=2)], dim=2)
        return y

I've checked out the samples of Glow-TTS in Mozilla TTS but they did not really seem convincing.
My main issue atm is that the prosody could be better.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

Intuitively I wouldn't expect a world of difference from positional indexing with LSTMs though. As for prosody, did you try to use a smaller separate duration predictor as I do? I found that the model is hugely overfitting otherwise (e.g. when duration prediction is done after the encoder). Also for prosody I have an idea I want to try out soon - similar to the pitch frequency, I want to leak some duration stats from the target, e.g. some running mean of durations to condition the duration predictor on. My hope is that the model would pick up some rhythm / prosody swings in longer sentences (similar to the pitch swings).
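
A simple running mean over the target durations, as one possible version of that conditioning signal (a sketch; the window size is arbitrary):

    import torch
    import torch.nn.functional as F

    def running_duration_mean(durations: torch.Tensor, window: int = 5) -> torch.Tensor:
        # durations: (batch, t_symbols); returns a smoothed duration track of the same shape
        kernel = torch.ones(1, 1, window, device=durations.device) / window
        padded = F.pad(durations.unsqueeze(1).float(),
                       (window // 2, window - 1 - window // 2), mode='replicate')
        return F.conv1d(padded, kernel).squeeze(1)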

m-toman avatar m-toman commented on June 23, 2024

I tried running the duration predictor before the CBHG once but the results were a bit strange. Will try again. Also wondered if it's really a good idea to train it together with the rest or whether it should rather be separate (or at least stop training it at some earlier point).

So you already added some "global" pitch stats to the model? Have to check your code.
One could probably, instead of just expanding states according to the duration value, also feed the duration value itself to the mel predictor; don't know if that would help.

cschaefer26 avatar cschaefer26 commented on June 23, 2024

My best results were with a mere 64-dim GRU on duration prediction with lots of dropout before it. Yes, it probably makes sense to completely separate it (to be able to compare results at different stages, at the very least). Yeah, I reimplemented the FastPitch version (with minor differences) with pitch averaged over chars.
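
For reference, averaging a frame-level F0 track over each input symbol using the durations (the FastPitch-style step being described) could be sketched like this, with illustrative names:

    import torch

    def average_pitch_per_symbol(pitch: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # pitch: frame-level F0 (batch, frames); durations: (batch, t_symbols) frame counts
        bsz, t_sym = durations.size()
        out = torch.zeros(bsz, t_sym, device=pitch.device)
        for b in range(bsz):
            start = 0
            for s in range(t_sym):
                d = int(durations[b, s])
                frames = pitch[b, start:start + d]
                voiced = frames[frames > 0]          # skip unvoiced frames (F0 == 0)
                if voiced.numel() > 0:
                    out[b, s] = voiced.mean()
                start += d
        return out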

m-toman avatar m-toman commented on June 23, 2024

I'll try the different duration model as soon as I got the capacity.
Also wanted to try out https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
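
For what it's worth, the PyTorch SWA utilities from that post boil down to something like this toy loop (the tiny model and random data are placeholders, not the actual training code):

    import torch
    from torch import nn
    from torch.optim.swa_utils import AveragedModel, SWALR

    model = nn.Linear(80, 80)                      # stand-in for the acoustic model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    swa_model = AveragedModel(model)               # keeps a running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=5e-4)
    swa_start = 100

    for step in range(200):
        x = torch.randn(16, 80)
        loss = nn.functional.l1_loss(model(x), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step >= swa_start:
            swa_model.update_parameters(model)
            swa_scheduler.step()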
