Comments (20)
When did this happen? Is it before or after diffusion model training? Is this before or after SLM adversarial training? I have noticed it happens several times myself which is why I put a set_trace
here, and there are a few reasons:
- The prosodic style encoder is not initialized correctly (which is unlikely as long as you don’t change the code and stage one checkpoint is valid).
- The discriminator kicks in too early (i.e.,
diff_epoch
is too low). - SLM adversarial training makes the model unstable (so you need to make skip update higher and clip scale lower, though this is unlikely because the current setting works for many different datasets).
- PL-BERT is not set up correctly (for example, you aren’t training an English model but used PL-BERT trained on English).
from styletts2.
It is within the first epoch of training under train_second.py. I am doing it with LJspeech.
from styletts2.
What is your config? With the settings in this repo, I don't have this issue, so it's probably related to stuff like learning rate and batch size etc.
from styletts2.
Also, check if your first stage model has reasonable quality in reconstruction in tensorboard. It should be perceptually indistinguishable from the ground truth, otherwise something is wrong with your first stage too.
from styletts2.
I kept most of your config, except that I increased the batch size and learning rate since I use 8 GPUs with larger memory. I set batch_size: 48 and increased lr by 3 times. The reconstruction you mean the audio right, I checked the audio in eval tab, I feel those are good.
however, I found the gen_loss is increasing and d_loss does not decrease once the epoch > TMA_epoch. Is that something unexpected?
from styletts2.
You should not increase the learning rate by 3 times, especially for PL-BERT; I believe this is where the problem is. I suggest you keep the learning rate unchanged even though you have a higher batch size. The highest batch size I have tried was 32 but I used the same learning rate. The demo samples on styletts2.github.io were generated using the model trained with a batch size of 32 with the exact same learning rate (they are slightly different from the one trained with a batch size of 16, but the quality is pretty much the same).
The following is the learning curve I have for the first stage model. If this is what you see in your tensorboard too, it should be fine. The loss increase is mostly caused by feature matching loss, as the features are getting more and more hard to catch because the discriminator is overfitting. You can see figure 3 of https://dl.acm.org/doi/pdf/10.1145/3573834.3574506, this is normal.
from styletts2.
Thank you for the knowledge sharing, I think my stage 1 training loss trajectory plot looks good based on the comparison. Trying what you suggested, so far no issues are shown in the first several epochs. Will continue the model training, and keep you posted. Thank you again.
from styletts2.
Hi, I found the same issue happen again in the 9th epoch of second-stage training. The loss_mel is Nan. I use a batch size of 32 with 8 GPUs, and others are the same as your config.
from styletts2.
This is so weird. Can you try to lower it to 16 instead? Does it still happen if the batch size is 16?
from styletts2.
Any update on batch size 16? Or is it because you used a different learning rate for the first stage model?
from styletts2.
In the second stage of training, I kept the batch to 16, and the Nan issues are not shown again with 8 GPU training.
However, once the epoch reaches 50, which is set for joint_epoch in config, I have run into the errors:
Traceback (most recent call last): File "StyleTTS2/train_second.py", line 789, in <module> main() File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "StyleTTS2/train_second.py", line 497, in main loss_gen_lm.backward() File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 1, 1]] is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The issue happens during the backward for loss_gen_lm. my pytorch version is 2.1.0
from styletts2.
This is likely caused by having too many GPUs but too few samples in a batch. Can you change batch_percentage
to 1 instead?
from styletts2.
Hi, the error still exists after setting the batch_percentage to 1.
I have not dived deep into the code, but had a quick look at lines 488-495,Is the error related to this issue mentioned in https://stackoverflow.com/questions/69163522/one-of-the-variables-modified-by-an-inplace-operation
Line 488 in 21f7cb9
from styletts2.
Your errors are so weird. I think it works all fine for me. Can you use 4 GPUs instead of 8? Or is it related to the CUDA version?
from styletts2.
Or I guess this codebase probably has some bugs with PyTorch because it has several weird issues like predictor_encoder.train() makes the F0 loss higher, it has high frequency background noise for old GPUs, it causes NaN with batch size 32 etc. I hope someone can reimplement everything because there’s probably something wrong in my code. The training pipeline was all written by myself instead of modified from some existing codebase (except a few modules like iSTFTNet, diffusion models etc.), so weird glitches are very likely.
from styletts2.
Hi, thank you for sharing your concerns. I don't think this is related to GPU, since after I set a breakpoint, I found when the d_loss_slm is non-zero, and loss_gen_lm is non-zero, the error will happen. When the d_loss_slm is 0, it works without errors. I guess it should be related to the two times backward().
from styletts2.
Does it cause different behavior though?
from styletts2.
Hi, i am facing the same problem. I did not change the kearning rate. I changesd the batch size and mx_len. The nan values start coming in the very first step of training.
my configs were:
log_dir: "Models/LJSpeech"
first_stage_path: "/home/ubuntu/projects/python/akshat/StyleTTS2/Models/LJSpeech/epoch_1st_00130.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of peochs for second stage training (joint training)
batch_size: 12
max_len: 100 # maximum number of frames
pretrained_model: ""
second_stage_load_pretrained: False # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters
F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'
data_params:
train_data: "Data/train_list.txt"
val_data: "Data/val_list.txt"
root_path: "LJSpeech-1.1/wavs"
OOD_data: "Data/OOD_texts.txt"
min_length: 50 # sample until texts with this size are obtained for OOD texts
preprocess_params:
sr: 24000
spect_params:
n_fft: 2048
win_length: 1200
hop_length: 300
model_params:
multispeaker: false
dim_in: 64
hidden_dim: 512
max_conv_dim: 512
n_layer: 3
n_mels: 80
n_token: 178 # number of phoneme tokens
max_dur: 50 # maximum duration of a single phoneme
style_dim: 128 # style vector size
dropout: 0.2
config for decoder
decoder:
type: 'istftnet' # either hifigan or istftnet
resblock_kernel_sizes: [3,7,11]
upsample_rates : [10, 6]
upsample_initial_channel: 512
resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
upsample_kernel_sizes: [20, 12]
gen_istft_n_fft: 20
gen_istft_hop_size: 5
speech language model config
slm:
model: 'microsoft/wavlm-base-plus'
sr: 16000 # sampling rate of SLM
hidden: 768 # hidden size of SLM
nlayers: 13 # number of layers of SLM
initial_channel: 64 # initial channels of SLM discriminator head
style diffusion model config
diffusion:
embedding_mask_proba: 0.1
# transformer config
transformer:
num_layers: 3
num_heads: 8
head_features: 64
multiplier: 2
# diffusion distribution config
dist:
sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
mean: -3.0
std: 1.0
loss_params:
lambda_mel: 5. # mel reconstruction loss
lambda_gen: 1. # generator loss
lambda_slm: 1. # slm feature matching loss
lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
TMA_epoch: 48 # TMA starting epoch (1st stage)
lambda_F0: 1. # F0 reconstruction loss (2nd stage)
lambda_norm: 1. # norm reconstruction loss (2nd stage)
lambda_dur: 1. # duration loss (2nd stage)
lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
lambda_sty: 1. # style reconstruction loss (2nd stage)
lambda_diff: 1. # score matching loss (2nd stage)
diff_epoch: 20 # style diffusion starting epoch (2nd stage)
joint_epoch: 50 # joint training starting epoch (2nd stage)
optimizer_params:
lr: 0.0001 # general learning rate
bert_lr: 0.00001 # learning rate for PLBERT
ft_lr: 0.00001 # learning rate for acoustic modules
slmadv_params:
min_len: 400 # minimum length of samples
max_len: 500 # maximum length of samples
batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
iter: 10 # update the discriminator every this iterations of generator update
thresh: 5 # gradient norm above which the gradient is scaled
scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
sig: 1.5 # sigma for differentiable duration modeling
![Screenshot 2024-01-28 at 1 06 27 PM](https://private-user-images.githubusercontent.com/36241681/300245572-20d0b886-9b15-461a-b3f0-71ae422f55bd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIyNjIyNDIsIm5iZiI6MTcyMjI2MTk0MiwicGF0aCI6Ii8zNjI0MTY4MS8zMDAyNDU1NzItMjBkMGI4ODYtOWIxNS00NjFhLWIzZjAtNzFhZTQyMmY1NWJkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzI5VDE0MDU0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTU5Y2JiZGI1ZDk0YzQxNTVmMTAxNjcxZGRiZjQ5NjQwOGViNTIzY2U2N2I1MjZlMDA5NjVlYjQwYTg0M2IyYzkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.0yO86Hq07M_3FPsFdR1RapBy8FUU_4rWfC1FqBw84pE)
from styletts2.
Also wheni tried to train the second stage from the checkpoint on Hugginface it worked fine. One thing i noticed was that the checkpoint is trained from scrathed is about 1.7gb but the one on huggingface is about 700mb. Am I doing something wring with the training in the stage 1 or you are not saving the discriminator in the checkpoint on huggingface?
from styletts2.
@yl4579 Could you please share your loss chart for the diffusion and duration losses? My model's diffusion doesn't seem to be decreasing and I'm curious what a successful run's diffusion loss looks like.
from styletts2.
Related Issues (20)
- FP8 Fine Tuning Crashes HOT 1
- Error Message After Using a fine tuned ASR Model
- Stage 2 Training Fails with NaN Loss on Single GPU Due to Inconsistent Checkpoint Keys
- Getting CUDA Out of memory error in Stage2 training HOT 13
- Multi-lingual training HOT 17
- In training Stage1 after 49th epoch getting RuntimeError: you can only change requires_grad flags of leaf variables, g_loss.requires_grad = True
- First stage training after 49th epoch (i.e., when epoch >= TMA_epoch)
- Getting error in d_loss.backward() of first_stage training
- Can the model learn accents not supported by espeak-ng?
- Joint training is failing with Assertion error
- In 2nd stage training AttributeError: 'AudioDiffusionConditional' object has no attribute 'module'
- Questions about Differentiable Duration Modeling HOT 1
- weird chinese pronunciation HOT 3
- Training PL-BERT on styletts2-community/multilingual-pl-bert
- Can anyone please share checkpoints that we get after we complete both stages of training HOT 3
- Model Size of fine tuned Model
- Can StyleTTS2 use phonemization from different languages to finetune or train?
- StyleTTS Python API doesn't detect devanagari script
- After training 1 epoch, train_first.py crashes: RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 1, 800] HOT 1
- Do we need lr scheduler?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from styletts2.