
Comments (20)

yl4579 commented on July 30, 2024

When did this happen? Is it before or after diffusion model training? Is it before or after SLM adversarial training? I have noticed this happen several times myself, which is why I put a set_trace there (a rough sketch of that kind of guard follows the list below). There are a few possible reasons:

  1. The prosodic style encoder is not initialized correctly (unlikely as long as you don't change the code and the stage-one checkpoint is valid).
  2. The discriminator kicks in too early (i.e., diff_epoch is too low).
  3. SLM adversarial training makes the model unstable (so you need to set the skip-update threshold higher and the clip scale lower, though this is unlikely because the current settings work for many different datasets).
  4. PL-BERT is not set up correctly (for example, you are not training an English model but used the PL-BERT trained on English).
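
For reference, the guard I mean is roughly the following. This is only a simplified sketch, not the exact code in train_second.py; check_mel_loss and its placement in the training loop are illustrative:

import torch
import pdb

def check_mel_loss(loss_mel: torch.Tensor) -> None:
    # Drop into the debugger the moment the mel loss turns NaN so the
    # offending batch and the individual loss terms can be inspected.
    if torch.isnan(loss_mel).any():
        pdb.set_trace()

# usage inside the training step, right after loss_mel is computed:
# check_mel_loss(loss_mel)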

haoqi commented on July 30, 2024

It happens within the first epoch of training with train_second.py. I am training on LJSpeech.

yl4579 commented on July 30, 2024

What is your config? With the settings in this repo I don't see this issue, so it's probably related to things like the learning rate and batch size.

yl4579 commented on July 30, 2024

Also, check whether your first-stage model has reasonable reconstruction quality in TensorBoard. It should be perceptually indistinguishable from the ground truth; otherwise something is wrong with your first stage too.

haoqi commented on July 30, 2024

I kept most of your config, except that I increased the batch size and learning rate since I use 8 GPUs with more memory: I set batch_size: 48 and increased the lr by 3x. By reconstruction you mean the audio, right? I checked the audio in the eval tab and it sounds good.
However, I found that gen_loss keeps increasing and d_loss does not decrease once the epoch exceeds TMA_epoch. Is that unexpected?

yl4579 commented on July 30, 2024

You should not triple the learning rate, especially for PL-BERT; I believe this is where the problem is. I suggest keeping the learning rate unchanged even though you have a higher batch size. The highest batch size I have tried is 32, and I used the same learning rate. The demo samples on styletts2.github.io were generated with the model trained at batch size 32 with the exact same learning rate (they are slightly different from the one trained at batch size 16, but the quality is pretty much the same).
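
Concretely, the point is that the PL-BERT parameters keep their own, much smaller learning rate regardless of batch size. A rough sketch of that separation (the two modules below are stand-ins, not the repo's actual optimizer builder):

import torch

plbert = torch.nn.Linear(768, 768)        # stand-in for the PL-BERT encoder
other_modules = torch.nn.Linear(512, 512) # stand-in for the remaining modules

optimizer = torch.optim.AdamW([
    {"params": plbert.parameters(), "lr": 1e-5},         # bert_lr: keep small
    {"params": other_modules.parameters(), "lr": 1e-4},  # general lr: unchanged
])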

The following is the learning curve I have for the first-stage model. If this is what you see in your TensorBoard too, it should be fine. The loss increase is mostly caused by the feature matching loss, as the features become harder and harder to match because the discriminator is overfitting. See Figure 3 of https://dl.acm.org/doi/pdf/10.1145/3573834.3574506; this is normal.
[Image: first-stage training loss curves]
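
For context, feature matching compares the discriminator's intermediate activations on real and generated audio, so as the discriminator overfits, those activations drift apart and the loss climbs even though the audio itself stays good. A simplified sketch of such a loss (not the repo's exact implementation):

import torch

def feature_matching_loss(real_feats, fake_feats):
    # real_feats / fake_feats: lists of intermediate discriminator activations
    # for ground-truth and generated audio; L1 distance summed over layers.
    loss = 0.0
    for r, f in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(r.detach() - f))
    return loss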

haoqi commented on July 30, 2024

Thank you for sharing this. I think my stage 1 training loss trajectory looks good based on the comparison. I am trying what you suggested, and so far no issues have appeared in the first several epochs. I will continue training and keep you posted. Thank you again.

haoqi commented on July 30, 2024

Hi, I found the same issue happening again in the 9th epoch of second-stage training: loss_mel is NaN. I use a batch size of 32 with 8 GPUs, and everything else is the same as your config.

yl4579 commented on July 30, 2024

This is so weird. Can you try lowering it to 16 instead? Does it still happen with a batch size of 16?

yl4579 commented on July 30, 2024

Any update on batch size 16? Or is it because you used a different learning rate for the first stage model?

haoqi commented on July 30, 2024

In the second stage of training I kept the batch size at 16, and the NaN issue has not appeared again with 8-GPU training.
However, once the epoch reaches 50, which is set as joint_epoch in the config, I run into this error:

Traceback (most recent call last):
  File "StyleTTS2/train_second.py", line 789, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "StyleTTS2/train_second.py", line 497, in main
    loss_gen_lm.backward()
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 1, 1]] is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The issue happens during the backward pass for loss_gen_lm. My PyTorch version is 2.1.0.

yl4579 commented on July 30, 2024

This is likely caused by having too many GPUs but too few samples in a batch. Can you change batch_percentage to 1 instead?

haoqi commented on July 30, 2024

Hi, the error still exists after setting batch_percentage to 1.
I have not dug deep into the code, but I had a quick look at lines 488-495. Is the error related to the issue mentioned in https://stackoverflow.com/questions/69163522/one-of-the-variables-modified-by-an-inplace-operation?

if d_loss_slm != 0:

yl4579 commented on July 30, 2024

These errors are really weird; everything works fine for me. Can you try 4 GPUs instead of 8? Or could it be related to the CUDA version?

yl4579 commented on July 30, 2024

Or I guess this codebase probably has some bugs with PyTorch, because it has several weird issues: predictor_encoder.train() makes the F0 loss higher, there is high-frequency background noise on old GPUs, it causes NaN with batch size 32, etc. I hope someone can reimplement everything, because there's probably something wrong in my code. The training pipeline was written entirely by myself rather than modified from an existing codebase (except for a few modules like iSTFTNet and the diffusion model), so weird glitches are very likely.

haoqi commented on July 30, 2024

Hi, thank you for sharing your concerns. I don't think this is GPU-related: after setting a breakpoint, I found that the error happens when both d_loss_slm and loss_gen_lm are non-zero. When d_loss_slm is 0, it runs without errors. I guess it is related to calling backward() twice.
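
For what it's worth, the general failure mode is easy to reproduce outside StyleTTS2: if a first backward pass is followed by an in-place parameter update, a second backward over the same graph trips the version-counter check. A toy example of that pattern (the tensors below are stand-ins, not the actual ones in train_second.py):

import torch

x = torch.randn(4, 8, requires_grad=True)
layer = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

out = layer(x)
d_loss = out.pow(2).mean()  # stands in for d_loss_slm
g_loss = out.abs().mean()   # stands in for loss_gen_lm, sharing the same graph

d_loss.backward(retain_graph=True)
opt.step()         # in-place update of layer.weight, which the graph still needs
g_loss.backward()  # RuntimeError: ... modified by an inplace operation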

yl4579 commented on July 30, 2024

Does it cause different behavior though?

akshatgarg99 commented on July 30, 2024

Hi, I am facing the same problem. I did not change the learning rate; I only changed the batch size and max_len. The NaN values start appearing in the very first step of training.
My config was:

log_dir: "Models/LJSpeech"
first_stage_path: "/home/ubuntu/projects/python/akshat/StyleTTS2/Models/LJSpeech/epoch_1st_00130.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 12
max_len: 100 # maximum number of frames
pretrained_model: ""
second_stage_load_pretrained: False # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: "LJSpeech-1.1/wavs"
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: false

  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder:
    type: 'istftnet' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates: [10, 6]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20, 12]
    gen_istft_n_fft: 20
    gen_istft_hop_size: 5

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss

  lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
  TMA_epoch: 48 # TMA starting epoch (1st stage)

  lambda_F0: 1. # F0 reconstruction loss (2nd stage)
  lambda_norm: 1. # norm reconstruction loss (2nd stage)
  lambda_dur: 1. # duration loss (2nd stage)
  lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
  lambda_sty: 1. # style reconstruction loss (2nd stage)
  lambda_diff: 1. # score matching loss (2nd stage)

  diff_epoch: 20 # style diffusion starting epoch (2nd stage)
  joint_epoch: 50 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling

[Screenshot from 2024-01-28 attached]

akshatgarg99 commented on July 30, 2024

Also, when I tried to train the second stage from the checkpoint on Hugging Face, it worked fine. One thing I noticed is that the checkpoint I trained from scratch is about 1.7 GB, but the one on Hugging Face is about 700 MB. Am I doing something wrong with training in stage 1, or are you not saving the discriminator in the Hugging Face checkpoint?

ethan-digi commented on July 30, 2024

@yl4579 Could you please share your loss charts for the diffusion and duration losses? My model's diffusion loss doesn't seem to be decreasing, and I'm curious what a successful run's diffusion loss looks like.
