Comments (17)
@AWAS666 It still works even with a single word, "wink.". It did generate noise if there is no punctuation after this. I think this is caused by the training data again, where all sentences end with some sort of punctuation.
https://vocaroo.com/1728QKrk6PSU
https://vocaroo.com/1iptqIqXNtRj
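The punctuation workaround described in this comment can be applied automatically before inference. The following is only a minimal sketch: the helper name `ensure_terminal_punctuation` and the set of accepted punctuation marks are assumptions for illustration, not part of StyleTTS2.

```python
def ensure_terminal_punctuation(text: str, default: str = ".") -> str:
    """Append `default` if `text` does not already end with punctuation.

    Hypothetical helper; the accepted marks below are an assumption.
    """
    stripped = text.rstrip()
    if not stripped:
        return stripped
    # Closing quotes or brackets may follow the final punctuation mark.
    core = stripped.rstrip("\"')]")
    if core and core[-1] in ".!?;:,…":
        return stripped
    return stripped + default
```

Calling `ensure_terminal_punctuation("wink")` gives `"wink."`, while text that already ends in punctuation is returned unchanged apart from trailing whitespace.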
You are correct, adding the full stop fixes it.
But only with the embedding scale at 1; as soon as you raise it, the noise comes back :)
from styletts2.
This is a very interesting issue. During training the guidance scale is 1, and for some reason, when the input is short, the model fails to generalize to a higher guidance scale. We may have to vary the guidance scale randomly from 1 to 2 during training. I will try this for the 2nd stage only and see if the problem disappears.
Never mind, it also seems to happen on single words. Setting both alpha and beta to zero makes it return to normal, though.
This is not supposed to occur. Setting alpha and beta to 0 means not using the diffusion model at all. What are your package versions?
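One way to see why alpha = beta = 0 bypasses the diffusion model is to treat each weight as a linear interpolation factor between the style computed from the reference audio and the style sampled by the diffusion model. This is a simplified sketch under that assumption, using plain lists rather than tensors; `blend_style` is a hypothetical name, not a StyleTTS2 function.

```python
from typing import List


def blend_style(reference: List[float], sampled: List[float], weight: float) -> List[float]:
    """Linearly interpolate between a reference and a sampled style vector.

    weight = 0 returns the reference style unchanged (the sampled style is
    ignored); weight = 1 returns the sampled style. Hypothetical helper.
    """
    if len(reference) != len(sampled):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [weight * s + (1.0 - weight) * r for r, s in zip(reference, sampled)]
```

Under this assumption, a weight of 0 means the sampled vector contributes nothing to the output, so any problem in the sampled style cannot reach the decoder.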
Windows 10
Python 3.11.4
Torch 2.1.0+cu118
I am also loading the model onto the GPU instead of the CPU, but I'll do some further testing to narrow it down.
This doesn't seem to make a difference...
Could you make a conda environment with Python 3.10 instead? You can run the Colab demo, check the package versions there, and make sure you install those versions locally.
I tried 3.10 locally, but exactly the same issue.
Tried it in Colab as well: if I put in just the word "wink" as inference text, it gives me bad white noise.
I think it could be because there were no such samples in the training data. The model has never seen a single word during training (because we removed speech shorter than one second).
Is there a way around it without retraining it, like dropping the diffusion model on short inputs?
Otherwise that likely means having to retrain, right?
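A fallback along these lines could be sketched as a small gate that zeroes alpha and beta for short inputs, relying on the report earlier in the thread that alpha = beta = 0 restores normal output. Everything here is an assumption for illustration: the function name, the character threshold, and the default weights are placeholders to tune, not values confirmed by the maintainers.

```python
from typing import Tuple


def diffusion_weights(
    text: str,
    min_chars: int = 40,   # placeholder threshold, tune for your data
    alpha: float = 0.3,    # placeholder default weight
    beta: float = 0.7,     # placeholder default weight
) -> Tuple[float, float]:
    """Return (alpha, beta), zeroed when the input is shorter than `min_chars`."""
    if len(text.strip()) < min_chars:
        return 0.0, 0.0
    return alpha, beta
```

The returned pair would then be passed to whatever inference routine accepts alpha and beta. Note that this trades away the diffusion-sampled style on short inputs, so prosody may be less varied there.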
You can add some filler words before or after the word you want to speak and cut the audio to only get the word you are interested in.
Not a pretty solution either :)
But at least it isn't just my setup.
I got the same problem. It's probably not related to the sentence length but to certain first words in the sentence, like "you" or "me". For example, "You can do that too." [58] produces white noise on Colab.
So, for example, if a longer text contains a sentence starting with "You will need ..", this will corrupt the audio afterwards.
But as mentioned, setting alpha = 0 and beta = 0 fixes that.
@easyrider I have tried "You can do that too." on Colab and was able to synthesize the speech in any voice.
https://vocaroo.com/1f8Rpq84L8H4
https://vocaroo.com/110LoHbYIP9Y
https://vocaroo.com/155vjtpiSYLO
https://vocaroo.com/19lqIdQEM9uJ (LJSpeech)
Great to hear
I have the same problem, specifically for short sentences/phrases (all with punctuation), running on macOS on an M2. It seems to be more likely when the sentence is shorter than about 40 characters. I was already doing TTS on longform audio, so I wrote a script that splits the text into sentences, and if any sentence is shorter than 40 characters it attaches it to the previous or next sentence. This way every block of text I process with StyleTTS2 is longer than 40 characters. That fixed the problem entirely for me. I didn't make any changes to punctuation and didn't change any of the words in the text.
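The script itself was not posted, but the approach described above might look roughly like this. It is a sketch under assumptions: the sentence splitter is a naive regex, and the function names and the 40-character default are illustrative rather than taken from the commenter's code.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def merge_short_sentences(sentences: List[str], min_chars: int = 40) -> List[str]:
    """Attach any sentence shorter than `min_chars` to a neighbouring block.

    A short sentence is appended to the previous block; if the previous block
    is itself still short (e.g. a short opening sentence), the next sentence
    is appended to it instead of starting a new block.
    """
    blocks: List[str] = []
    for sentence in sentences:
        if blocks and (len(sentence) < min_chars or len(blocks[-1]) < min_chars):
            blocks[-1] = f"{blocks[-1]} {sentence}"
        else:
            blocks.append(sentence)
    return blocks
```

Each returned block can then be synthesized separately. The only case where a block stays below the threshold is when the entire input is shorter than `min_chars`; the words and punctuation are never altered, only regrouped.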