
Comments (23)

Choons commented on May 18, 2024

An even lower-tech solution I use: insert "scat" words/syllables at the beginning and end of the sentence, and somehow that fixes the gaps. For instance, the sentence "I have something important to tell you" gaps terribly on its own, but "skee diddly bop I have something important to tell you action jackson" renders perfectly. Then I just trim the "scat" off in Audacity. Perhaps that provides a hint about what is wrong in the code.


commented on May 18, 2024

Would like to highlight this again:

The presence of gaps depends on the training data.

I trained a new synthesizer with a curated dataset, in #538 (tensorflow) and #472 (comment) (pytorch). This fixes the issue with gaps.


CorentinJ commented on May 18, 2024

Oh, I'm well aware the issue is present in this repo only. It's something I must have introduced while modifying Rayhane's Tacotron. Considering I hate working with that codebase, I have in mind to switch to fatchord's Tacotron and try to fix this bug at the same time. But as I said, I really don't have the time to work on that now, as I have work and university projects that take priority. If someone wants to work on that in a separate branch, I can definitely look it over from time to time.

As for long sentences, it's just a matter of the attention mechanism implemented. By splitting sentences on punctuation, you're fine with most sentences anyway.
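
A minimal sketch of that punctuation-splitting workaround; the helper name and word limit here are illustrative, not from the repo:

```python
import re

def split_on_punctuation(text, max_words=20):
    """Split text into shorter chunks at sentence punctuation so each
    chunk stays within the attention mechanism's comfort zone."""
    # Split after ., !, ? or ; while keeping the punctuation attached.
    chunks = re.split(r"(?<=[.!?;])\s+", text.strip())
    # Chunks that are still too long get split again at commas.
    out = []
    for chunk in chunks:
        if len(chunk.split()) <= max_words:
            out.append(chunk)
        else:
            out.extend(re.split(r"(?<=,)\s+", chunk))
    return [c for c in out if c]

# Each chunk is then synthesized separately and the audio concatenated:
# split_on_punctuation("I have something to say. Please listen!")
# -> ["I have something to say.", "Please listen!"]
```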


CorentinJ commented on May 18, 2024

I doubt it's because of the stop prediction. The stop prediction only occurs after the spectrogram is generated. Yes, this is an issue of the synthesizer. It would have to be replaced by a better one, such as fatchord's (which would eliminate other problems along the way), but I just don't have the time to do it.


TheButlah commented on May 18, 2024

I was referring to the stop prediction in Tacotron 2 (the synthesizer, not the vocoder). I wasn't aware that stop prediction was used in WaveRNN, as it can just stop outputting when it runs out of spectrogram frames to condition on.

What do you mean by "the stop prediction only occurs after the spectrogram is generated"?


CorentinJ commented on May 18, 2024

I wasn't talking about the vocoder. Tacotron's decoder being autoregressive, the first stop token above the threshold value will, by definition, be predicted when the spectrogram is done being generated. Thus it has no impact on previous frames; in fact, its output is not fed back to the model, IIRC. I don't see how the stop token could be the issue.
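
To make the argument concrete, here is a schematic of the autoregressive decode loop; `decoder_step`, the threshold, and the step limit are stand-ins, not the repo's actual names. The stop check happens strictly after each frame is emitted, so its first firing ends generation and cannot carve a gap into frames already produced:

```python
STOP_THRESHOLD = 0.5      # illustrative value
MAX_DECODER_STEPS = 1000  # safety limit

def decode(decoder_step, go_frame):
    """Schematic Tacotron decode loop. `decoder_step` stands in for
    one step of the attention + decoder stack."""
    frames, frame = [], go_frame
    for _ in range(MAX_DECODER_STEPS):
        # Autoregression: the previous frame conditions the next one.
        frame, stop_prob = decoder_step(frame)
        frames.append(frame)
        # The stop token is only consulted after the frame exists.
        if stop_prob > STOP_THRESHOLD:
            break
    return frames
```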


TheButlah commented on May 18, 2024

Ah yes, I see what you mean. That makes sense; I agree that it has to be another issue.

One idea that I had was annealing the level of teacher forcing that takes place during training. I suspect that, because the synthesizer is autoregressive, any errors (deviations from the true mel frame) compound on each other as they get fed into the predictions for the next mel frame. Teacher forcing accelerates training convergence because it removes the ability of these errors to propagate, but I would expect that the network never learns to account for its own errors, because it was always fed real data during training. Hence, annealing the probability that a spectrogram frame is teacher-forced might get the best of both worlds.

What do you think?
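
A minimal sketch of that annealing idea (often called scheduled sampling); the schedule values and function names are illustrative, and the frames are assumed to be torch tensors:

```python
import random

def teacher_forcing_prob(step, start=1.0, end=0.2, decay_steps=100_000):
    """Linearly anneal the teacher forcing probability from `start`
    to `end` over `decay_steps` training steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

def next_decoder_input(true_frame, predicted_frame, step):
    """Per-frame coin flip between the ground-truth mel frame (teacher
    forcing) and the model's own prediction (free running)."""
    if random.random() < teacher_forcing_prob(step):
        return true_frame            # ground truth: fast convergence
    return predicted_frame.detach()  # own output: learn to recover from errors
```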


CorentinJ commented on May 18, 2024

I think the issue is elsewhere, most likely a bug from my end or from Rayhane's work. I've talked with someone else whose work also stems from Rayhane's, and he's got the same problem. Meanwhile, other implementations of Tacotron/Tacotron 2 (Mozilla, NVIDIA, fatchord) do not have that issue.


TheButlah commented on May 18, 2024

Where is fatchord's implementation? I don't see it on his GitHub.


CorentinJ commented on May 18, 2024

It's included with his WaveRNN, the same one I use: https://github.com/fatchord/WaveRNN


TheButlah commented on May 18, 2024

Oh, I thought that was just a fork of keithito's. Regardless, I'll look into using a different implementation and/or try to figure out what's wrong with Rayhane's. Thanks for the help!


TheButlah commented on May 18, 2024

For what it's worth, I've been working extensively on @fatchord's repo, adding improvements to it. I've trained models on it and no longer experience the gaps in the audio we have observed using Rayhane's repo. However, the synthesizer is still somewhat sensitive to sentence length, particularly long sentences. Sentences four words or more in length are fine, but once sentences start to get really long, you get the same stammering you can observe in @CorentinJ's repo. So yes, switching to @fatchord's synthesizer would probably be a big improvement, but you would also have to add multi-speaker training to it, as right now it only has single-speaker capability.

I can also confirm that it's an issue with the attention mechanism, not the stop token or anything else. @fatchord's repo just stops generating when the spectrogram frame falls below a certain audio threshold; no stop tokens involved. You can also look at the attention graph and clearly see that the failure cases are due to the attention getting stuck on a particular time step and never progressing.
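
A rough sketch of that energy-based stopping rule; the threshold and minimum length are illustrative, and fatchord's actual criterion may differ in detail:

```python
import numpy as np

SILENCE_THRESHOLD = -3.4  # illustrative, in the model's normalized mel scale
MIN_FRAMES = 10           # never stop before a minimum length

def should_stop(mel_frames):
    """Stop generating once every bin of the latest mel frame is
    essentially silent, with no learned stop token involved."""
    if len(mel_frames) < MIN_FRAMES:
        return False
    return bool(np.max(mel_frames[-1]) < SILENCE_THRESHOLD)
```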


TheButlah commented on May 18, 2024

@CorentinJ actually, on going back through my synthesized recordings from @Rayhane-mamah's repo, I haven't been able to observe any of the gaps I observe in your repo. I think it's actually unique to this repository.


TheButlah commented on May 18, 2024

200K-logs-eval.zip (Rayhane Taco2, Griffin-Lim)
Archive.zip (Fatchord Taco1, Fatchord WaveRNN)

Neither @fatchord's nor @Rayhane-mamah's repo exhibits gaps in the middle of spectrograms like this repo does.

They both exhibit failures on especially long sentences, which is expected. Taco2 appears to fare much better in that case.


TheButlah commented on May 18, 2024

Makes sense! I agree that I like fatchord's synthesizer more, as it's easier to work with, although I think it would perform better qualitatively if it were Tacotron 2 instead of Taco1. Maybe someone will fork and upgrade it at some point.


commented on May 18, 2024

Thank you for referencing the issue, @macriluke. I am going to reopen this issue since I have some interest in fixing it. Another possibility is that it goes away in #370, when @dathudeptrai modifies the TensorFlowTTS/tacotron2 code to work with this repo.


commented on May 18, 2024

I found a very low-tech fix for this, which is to always run trim_long_silences on the vocoder output. The function uses webrtcvad and is found in encoder/audio.py. I will submit a PR when I get a chance.
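
A sketch of that fix, assuming the usual synthesizer/vocoder objects from this repo's demo scripts, and that the vocoder output is already at the sample rate trim_long_silences expects:

```python
from encoder.audio import trim_long_silences

def synthesize_trimmed(synthesizer, vocoder, text, embed):
    """Synthesize a waveform, then strip long pauses from it using the
    webrtcvad-based trimmer in encoder/audio.py."""
    specs = synthesizer.synthesize_spectrograms([text], [embed])
    wav = vocoder.infer_waveform(specs[0])
    # trim_long_silences expects audio at the encoder's sample rate;
    # resample first if the vocoder output rate differs.
    return trim_long_silences(wav)
```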


commented on May 18, 2024

I can confirm that the issue of gaps in spectrograms is resolved if we merge fatchord's Tacotron 1 in #472. The presence of gaps depends on the training data: I get no gaps when training with VCTK, and plenty of gaps with LibriTTS.


commented on May 18, 2024

The presence of gaps depends on the training data. I get no gaps when training with VCTK, and plenty of gaps with LibriTTS.

As mentioned in #472 (comment), the gaps in LibriSpeech/TTS can be resolved by using voice activity detection to trim silences. See #501 for the process.


Choons commented on May 18, 2024

Wow, bluefish, you have done some incredible work on this! Can you clarify: do we need to add BOTH the code from #538 and #472, or do we choose just one of them? I.e., a tensorflow solution versus a pytorch solution.

And if it's a choice between the two solutions, which one do you recommend as best performing?


commented on May 18, 2024

Can you clarify: do we need to add BOTH the code from #538 and #472, or do we choose just one of them? I.e., a tensorflow solution versus a pytorch solution.

Most users today will want #538, because we haven't formally switched to the pytorch synthesizer. Once #472 is merged, we will update the pretrained models wiki page to point to pytorch.

And if it's a choice between the two solutions, which one do you recommend as best performing?

They're about the same in performance. They have different quirks since the Tacotron version differs (Tacotron 1 vs. 2). In tensorflow (Rayhane Taco2), the stop token prediction sometimes fails and the model synthesizes a huge silence until the decoder limit is reached. In pytorch (fatchord Taco1), the attention may get stuck on a certain character, making inference quit suddenly. Pick your poison. The attention mechanism needs to be improved.
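
Until then, both failure modes can at least be detected at inference time; a crude sketch, with all thresholds and names being illustrative assumptions:

```python
import numpy as np

def check_output(wav, attention, silence_db=-40.0, min_advance=0.05):
    """Crude guards for the two failure modes above: trailing silence
    from a missed stop token, and attention stuck on one character.

    `attention` is the (decoder_steps, encoder_steps) alignment matrix.
    """
    # 1. Trim trailing silence (missed stop token case).
    amp = np.abs(wav)
    threshold = np.max(amp) * 10 ** (silence_db / 20)
    loud = np.nonzero(amp > threshold)[0]
    if loud.size:
        wav = wav[: loud[-1] + 1]

    # 2. Flag stuck attention: fraction of decoder steps where the
    #    attended input position actually advanced.
    focus = attention.argmax(axis=-1)
    advanced = np.diff(focus) > 0
    if advanced.size and np.mean(advanced) < min_advance:
        print("Warning: attention may be stuck; consider re-synthesizing.")
    return wav
```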


Choons commented on May 18, 2024

Understood. I'm glad you have taken on improving this voice project. I have tried other voice cloning implementations, but I could never get them working as well as this one, even with the gap problem. I will experiment with both of your solutions and report back in this thread on how well they work for me.


commented on May 18, 2024

Feedback is appreciated, @Choons; it's always helpful to hear from those who are using the software and models.

