
keonlee9420 / comprehensive-transformer-tts

317 stars · 14 watchers · 41 forks · 146.61 MB

A Non-Autoregressive Transformer based Text-to-Speech, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS.

License: MIT License

Languages: Python 99.70%, Dockerfile 0.30%
text-to-speech supervised unsupervised non-autoregressive non-ar multi-speaker ultimate-tts tts pytorch comprehensive

comprehensive-transformer-tts's People

Contributors

keonlee9420


comprehensive-transformer-tts's Issues

RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1

Hi,
Thanks for the great work.
I ran into an error when training on the RyanSpeech dataset:

Traceback (most recent call last):
  File "train.py", line 254, in <module>
    train(0, args, configs, batch_size, num_gpus)
  File "train.py", line 110, in train
    losses = Loss(batch, output, step=step)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Comprehensive-Transformer-TTS/model/loss.py", line 334, in forward
    pitch_loss = self.get_pitch_loss(pitch_predictions, pitch_targets)
  File "/root/Comprehensive-Transformer-TTS/model/loss.py", line 197, in get_pitch_loss
    losses["uv"] = (F.binary_cross_entropy_with_logits(uv_pred, uv, reduction="none") * nonpadding) \
RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1

I printed the shapes of both uv_pred and uv; both were [16, 1191].

My configuration is

 ---> Automatic Mixed Precision: True
 ---> Number of used GPU: 1
 ---> Batch size per GPU: 16
 ---> Batch size in total: 16
 ---> Type of Building Block: conformer
 ---> Type of Duration Modeling: supervised
 ---> Type of Prosody Modeling: liu2021

This happened at around 50k+ steps.
What am I missing? Thank you!
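For reference, a minimal standalone reproduction of this size mismatch, using the shapes from the report (an illustration of where the error can arise, not the repository's code): since uv_pred and uv are both [16, 1191], the BCE term itself succeeds, so the mismatch most likely comes from the nonpadding mask having only 1000 frames.

import torch
import torch.nn.functional as F

# Shapes taken from the report above (hypothetical standalone repro).
uv_pred = torch.randn(16, 1191)
uv = torch.randint(0, 2, (16, 1191)).float()
nonpadding = torch.ones(16, 1000)  # e.g. a mel mask capped at a shorter max length

bce = F.binary_cross_entropy_with_logits(uv_pred, uv, reduction="none")  # ok: [16, 1191]
masked = bce * nonpadding  # RuntimeError: size of tensor a (1191) vs b (1000) at dim 1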

Prosody Loss

Hi, I am adding your MDN prosody modeling code to my Tacotron, but I ran into a couple of questions about the prosody modeling segment. First, the prosody loss is only added to the total loss after prosody_loss_enable_steps, yet in the training steps before prosody_loss_enable_steps the prosody representation is already added to the text encoding. Does that mean that before prosody_loss_enable_steps the prosody representation is optimized without the prosody loss?
Second, during training, the backward gradient through the prosody predictor should behave like a "stop gradient", but there seems to be little code handling this.
Thanks!


Reason for std and input scaling in cwt?

Hey, I have some questions about your pitch predictor in the CWT domain:

decoder_inp = decoder_inp.detach() + self.predictor_grad * (decoder_inp - decoder_inp.detach())
pitch_padding = mel2ph == 0


if self.pitch_type == "cwt":
    pitch_padding = None
    cwt = cwt_out = self.cwt_predictor(decoder_inp) * control
    stats_out = self.cwt_stats_layers(encoder_out[:, 0, :])  # [B, 2]
    mean = f0_mean = stats_out[:, 0]
    std = f0_std = stats_out[:, 1]
    cwt_spec = cwt_out[:, :, :10]
    if f0 is None:
        std = std * self.cwt_std_scale
        f0 = cwt2f0_norm(

I have three questions:

  1. What is the reason for the first line? Isn't the right side always zero and therefore no gradients flow back?
  2. Why do you scale inputs by 0.1?
  3. Why did you scale ground truth std by 0.8?

Thanks for any help in advance!

about duration predictor

In "learn_alignment: True" mode, the input of duration predictor is "x.detach() + self.predictor_grad * (x - x.detach())".

  1. Why do we need detach()?
  2. Why do we need add self.predictor_grad * (x - x.detach()) since it always is zero?
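Both questions about this expression come down to the same gradient-scaling trick. A minimal sketch (predictor_grad = 0.1 is an assumed value for illustration) shows that the expression equals x in the forward pass, while the gradient flowing back into x is scaled by predictor_grad instead of being cut off entirely:

import torch

predictor_grad = 0.1  # assumed value, for illustration only
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Forward value is exactly x; only the gradient path is rescaled.
y = x.detach() + predictor_grad * (x - x.detach())
print(y)        # tensor([2., 3.], grad_fn=<AddBackward0>)

y.sum().backward()
print(x.grad)   # tensor([0.1000, 0.1000]) -- gradient scaled by predictor_grad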

An error when running preprocess.py

I'm trying to preprocess the VCTK dataset and am stuck at the 'Computing statistic quantities' step. When I instead copy the preprocessed_data files from the repo, training runs successfully.

First, there is a runtime warning:

preprocessor.py

625: cont_lf0_lpf_norm = (cont_lf0_lpf - logf0s_mean_org) / logf0s_std_org
RuntimeWarning: invalid value encountered in true_divide

After applying a quick workaround to fix the value of logf0s_std_org, the next error appears:

165: energy_mean = energy_scaler.mean_[0]
'StandardScaler' object has no attribute 'mean_'

Windows 10
conda, Python 3.6.15
all packages from requirements.txt are installed
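For what it's worth, StandardScaler only gains the mean_ attribute after fit or partial_fit has been called at least once, so this error is consistent with the scaler never having received any valid energy values (for example, if the earlier pitch/energy computation produced only NaNs). A small illustration of the mechanism, not the repository's code:

from sklearn.preprocessing import StandardScaler

energy_scaler = StandardScaler()
print(hasattr(energy_scaler, "mean_"))            # False -- nothing has been fitted yet

energy_scaler.partial_fit([[1.2], [0.7], [3.4]])  # hypothetical energy values
print(energy_scaler.mean_[0])                     # mean_ now exists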

Gibberish synthesized speech from my own model

Hi,
I am training a model on the RyanSpeech dataset. It is currently at 125k+ steps, and I tried to synthesize speech with the checkpoint, but the result is rather hard to understand.

output.mp4

I tried adding --duration_control 1.3 to the command, but I got:

Traceback (most recent call last):
  File "synthesize.py", line 231, in <module>
    synthesize(device, model, args, configs, vocoder, batchs, control_values)
  File "synthesize.py", line 95, in synthesize
    output = model(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Comprehensive-Transformer-TTS/model/CompTransTTS.py", line 112, in forward
    ) = self.variance_adaptor(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Comprehensive-Transformer-TTS/model/modules.py", line 1088, in forward
    pitch_prediction, pitch_embedding = self.get_pitch_embedding(
  File "/root/Comprehensive-Transformer-TTS/model/modules.py", line 933, in get_pitch_embedding
    f0_denorm = denorm_f0(f0, uv, self.preprocess_config["preprocessing"]["pitch"], pitch_padding=pitch_padding)
  File "/root/Comprehensive-Transformer-TTS/utils/pitch_tools.py", line 79, in denorm_f0
    f0[uv > 0] = 0
IndexError: The shape of the mask [1, 154] at index 1 does not match the shape of the indexed tensor [1, 173] at index 1

My config is

block_type: "transformer_fs2"

duration_modeling:
  learn_alignment: False
  aligner_temperature: 0.0005

prosody_modeling:
  model_type: "liu2021"

What am I missing?
Thank you!
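A minimal standalone reproduction of the indexing failure, with the shapes taken from the traceback (illustrative only, not the repository's code): boolean-mask assignment requires the mask to match the indexed tensor's shape, so a uv mask built for 154 frames cannot be applied to a 173-frame f0, presumably because --duration_control changes the frame count on one side but not the other.

import torch

# Shapes from the traceback above.
f0 = torch.randn(1, 173)
uv = torch.zeros(1, 154)
f0[uv > 0] = 0  # IndexError: mask shape [1, 154] vs indexed tensor [1, 173]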

Preprocess error

█████████| 137/137 [18:49:05<00:00, 494.49s/it]
Computing statistic quantities ...
Traceback (most recent call last):
  File "preprocess.py", line 19, in <module>
    preprocessor.build_from_path()
  File "/GPUFS/sysu_hpcedu_123/Comprehensive-Transformer-TTS/preprocessor/preprocessor.py", line 267, in build_from_path
    f0s_sup_stats = compute_f0_stats(f0s_sup)
  File "/GPUFS/sysu_hpcedu_123/Comprehensive-Transformer-TTS/preprocessor/preprocessor.py", line 145, in compute_f0_stats
    return (f0_mean, f0_std)
UnboundLocalError: local variable 'f0_mean' referenced before assignment

It crashed after running preprocessing for a long time.
For the dataset, I use one similar to VCTK but in Chinese, and it did not throw any errors before this step.
Could anyone help me?
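This UnboundLocalError usually means the branch that assigns f0_mean never ran, which happens when the list of collected f0 values ends up empty (for example, if pitch extraction failed for every utterance). A simplified, hypothetical version of such a stats function, just to illustrate the failure mode, not the repository's code:

import numpy as np

def compute_f0_stats(f0s):
    # Hypothetical simplification: the assignment only happens when data exists.
    if len(f0s) > 0:
        f0_all = np.concatenate(f0s)
        f0_mean, f0_std = float(np.mean(f0_all)), float(np.std(f0_all))
    return (f0_mean, f0_std)  # UnboundLocalError when f0s is empty

compute_f0_stats([])  # raises UnboundLocalError: 'f0_mean' referenced before assignment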

Problem with Utterance-level Prosody extractor of DelightfulTTS

I've recently been experimenting with your implementation of DelightfulTTS and the voice quality is awesome. However, I found that the embedding vector output by the utterance-level prosody extractor is very small, which makes that of the utterance-level prosody predictor small as well (the L2 norm is roughly 12 and each element of the vector is roughly 0.2 to 0.3). A vector whose elements are close to zero means this layer mostly adds no information at all. Have you found any solution to this?

requirements fail to install

It seems some packages may have updated names, a specific Python version appears to be required, and C++ build tools are needed.
Suggested updates:

python~3.8.0 and <3.9

praat-parselmouth==0.3.3
g2p-en==2.1.0
scikit-learn==0.22.2.post1

if possible, since 1.7 will no longer install:
torch>=1.7.0 (==2.0.0)

weird sounding voices with MelGAN

Hello,

Audio samples generated with multi-speaker MelGAN (I haven't tried single-speaker) sound unnatural.

I know worse quality is expected, but all samples sound as if the pitch is significantly too high.

Maybe there is a bug in the implementation ported from FastSpeech?

Multi-GPU training does not work normally?

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel;
As suggested, I modified the model by adding find_unused_parameters=True, as follows: model = DistributedDataParallel(model, device_ids=[rank], find_unused_parameters=True).to(device), but I still got the same error. Could you train normally with multiple GPUs? Any suggestions to fix this?
Many Thanks.
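For comparison, a minimal sketch of the usual wrapping order (an assumption about typical DDP usage, not a confirmed fix for this repository): the model is moved to its device first and then wrapped, rather than calling .to(device) on the DDP wrapper afterwards, and find_unused_parameters=True must be passed on every rank.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def wrap_model(model: nn.Module, rank: int) -> DistributedDataParallel:
    # Assumes torch.distributed.init_process_group has already been called.
    device = torch.device(f"cuda:{rank}")
    model = model.to(device)
    return DistributedDataParallel(model, device_ids=[rank], find_unused_parameters=True)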

Bug when calculating the energy in FastSpeechSTFT

I think there is a bug in audio/stft.py, line 252:
energy = np.sqrt(np.exp(mel) ** 2).sum(-1)
This code does nothing but sum the absolute values of np.exp(mel), while we expect it to compute the sum before the sqrt.
The correct code should be:
energy = np.sqrt((np.exp(mel) ** 2).sum(-1))
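A quick numerical check of the difference (illustrative only): without the extra parentheses, the square root is applied element-wise before summing, which reduces to summing absolute values, whereas the corrected line sums first and then takes the square root.

import numpy as np

mel = np.log(np.array([[3.0, 4.0]]))         # chosen so that np.exp(mel) == [[3., 4.]]

wrong = np.sqrt(np.exp(mel) ** 2).sum(-1)    # [7.]  -> |3| + |4|
right = np.sqrt((np.exp(mel) ** 2).sum(-1))  # [5.]  -> sqrt(3**2 + 4**2)
print(wrong, right)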

Unvoiced loss is too high for me.

Hello, I'm trying to train a TTS model with frame-level pitch prediction (not CWT) on my Spanish dataset.

First, I made a small modification for training: before var_start_steps, I just detach the encoder input from the variance predictor instead of setting init_losses, like below.

if step < self.config.var_start_steps:
    x_pitch = x.detach()       
else:
    x_pitch = x

pitch_prediction, pitch_embed = self.get_pitch_embedding(x_pitch, pitch_target, uv_target, pitch_control)

When I do this and observe the losses, the mel spectrogram synthesized from ground-truth pitch, uv, and duration is perfect, and the pitch and duration losses descend satisfactorily even before var_start_steps.

But in my case only the unvoiced (uv) loss starts from 0.9 and descends too slowly (and does not descend at all before var_start_steps), over 50k ~ 100k steps. And the mel synthesized in eval mode has no pitch (uv is almost 1).

Has anyone had the same problem?
Any help would be appreciated.

The preprocessed data for my dataset looks like the image below. I think the ground-truth unvoiced segments have no problem.
[attached image of the preprocessed data]

Thank you.

unsupervised learn_alignment inference error

Dataset: LJSpeech
I use lstransformer and learn_alignment: True.
And I do not use any prosody or variance modeling, i.e.:
loss:
lambda_uv: 0.0
lambda_ph_dur: 0.0
lambda_word_dur: 0.0
lambda_sent_dur: 0.0

variance_embedding:
use_pitch_embed: False
use_energy_embed: False

But at inference time, the alignment learned by the LengthRegulator was very incorrect, often only 4-5 frames.
Below are the training TensorBoard screenshots:

[three TensorBoard screenshots from 2023-03-01]

Weird sound in LONG sentence

Hi, it's really nice work!
In my experiment, the speech goes weird after 10 s (short sentences are all fine). Losses decrease normally, and I checked the predicted duration/pitch/energy; they were all good as well. Only the mel goes weird.
Have you ever encountered this kind of problem with long sentences?
