I'm trying to fine-tune the LibriTTS checkpoint on ~1 hour of LJSpeech but get poor re

yl4579 commented on July 30, 2024

Poor audio quality after fine-tuning

Comments (3)

yl4579 commented on July 30, 2024

Yes, you can leave the multispeaker setting set to true. I used the same inference code as in the Colab notebook: https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb

I haven't really tested different max_len values, but try to increase it as much as you can while keeping the batch size at least 2, and also do the SLM adversarial training run if you can (this is very RAM-consuming, though). I know the code is currently not very friendly to low-RAM GPUs because of the DP implementation. You can wait for a fixed DDP implementation with mixed-precision training.
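
The advice above (largest max_len that still allows a batch size of at least 2) can be sketched as a small search. This is only an illustration, not code from the StyleTTS2 repository: `pick_max_len`, `toy_fits`, and the `budget` number are hypothetical names and values. In practice `fits` would be your own probe, for example one that runs a single training step and reports whether it ran out of memory.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_max_len(
    fits: Callable[[int, int], bool],
    max_len_candidates: Iterable[int],
    batch_sizes: Iterable[int],
    min_batch_size: int = 2,
) -> Optional[Tuple[int, int]]:
    """Return (max_len, batch_size) with the largest max_len that fits.

    Batch sizes below min_batch_size are never considered. Among batch
    sizes that fit for the chosen max_len, the largest one is returned.
    """
    allowed = sorted((b for b in batch_sizes if b >= min_batch_size), reverse=True)
    for max_len in sorted(max_len_candidates, reverse=True):
        for batch_size in allowed:
            if fits(max_len, batch_size):
                return max_len, batch_size
    return None  # nothing fits with the required minimum batch size


def toy_fits(max_len: int, batch_size: int, budget: int = 800) -> bool:
    # Hypothetical stand-in for a real memory probe: pretends memory use
    # grows with max_len * batch_size. Replace with an actual trial step.
    return max_len * batch_size <= budget


print(pick_max_len(toy_fits, [100, 200, 300, 400], [1, 2, 4, 8, 16]))
# prints (400, 2)
```

The search prefers a longer max_len over a larger batch size, which matches the suggestion above; whether that trade-off is right for a given dataset is something to verify by listening to the results.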

yl4579 commented on July 30, 2024

For 4, did you change multispeaker to true or false? The default is true, and the default settings do produce better results than yours. The only difference I can see is batch_size (from 16 to 4), but that shouldn't produce such a big difference. Reducing max_len from 400 to 100 is probably the cause. This is what I got by fine-tuning with one hour of data using the default settings: https://voca.ro/1aC4vr4jErDL
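
To get a feel for what the max_len change means, here is a rough conversion from max_len to clip duration. It assumes max_len is counted in mel frames and that preprocessing uses a hop size of 300 samples at a 24 kHz sample rate; both are assumptions to check against your own config, and `frames_to_seconds` is a hypothetical helper, not part of the repository.

```python
def frames_to_seconds(max_len: int, hop_length: int = 300, sample_rate: int = 24000) -> float:
    """Approximate audio duration covered by max_len mel frames.

    hop_length and sample_rate are assumed defaults; adjust them to
    match the preprocessing settings in your config.
    """
    return max_len * hop_length / sample_rate


for max_len in (100, 400):
    print(max_len, frames_to_seconds(max_len))
# prints:
# 100 1.25
# 400 5.0
```

Under those assumptions, max_len 100 means the model only sees clips of about 1.25 seconds during training, compared with about 5 seconds at 400.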

danielmsu commented on July 30, 2024

For 4, did you change multispeaker to true or false?

I fine-tuned the model with multispeaker: true and then tried inference with both true and false. It definitely works better with true; the example I attached was also generated with multispeaker: true. I didn't try fine-tuning with false, but I guess a model fine-tuned with true in the config should produce better results anyway. Is that correct?

max_len from 400 to 100 is probably the cause

Do you know the minimum value for decent results? Unfortunately, I cannot use 400, but maybe I could set it a bit higher than 100 if I reduce batch_size even more. Training speed is not a concern for me.

This is what I got by finetuning with one hour of data: https://voca.ro/1aC4vr4jErDL using the default setting.

Yes, that sounds much better. Could you please share inference parameters? Would be awesome if you still have alpha/beta values and the name of the reference clip, so I can compare my results using the same values.

Thanks!
