keonlee9420 / parallel-tacotron2
PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
License: MIT License
Parallel-Tacotron2/model/modules.py
Line 208 in a589311
If I install just the specified requirements plus Pillow and fairseq, the following warning appears when training starts:
No module named 'lightconv_cuda'
If I install the lightconv layer from fairseq, the following warning is displayed:
WARNING: Unsupported filter length passed - skipping forward pass
PyTorch 1.7
CUDA 10.2
fairseq 1.0.0a0+19793a7
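For what it's worth, `lightconv_cuda` is an optional fairseq CUDA extension that is not built by a plain `pip install`; it has to be compiled from a source checkout. A sketch of the usual procedure (following fairseq's LightConv example, assuming a CUDA toolkit that matches your PyTorch build):

```shell
# Build the lightconv CUDA kernel from a fairseq source checkout.
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
# Generate and compile the lightconv_cuda extension.
cd fairseq/modules/lightconv_layer
python cuda_function_gen.py
python setup.py install
```

If the kernel still reports "Unsupported filter length", the configured convolution kernel sizes may fall outside the lengths the generated CUDA code supports, in which case fairseq falls back to (or skips) the non-CUDA path.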
Hi. The work is amazing. I noticed that you mentioned some bugs in soft-DTW under "Updates". Have you solved these problems yet?
File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/model/parallel_tacotron2.py", line 68, in forward
self.learned_upsampling(durations, V, src_lens, src_masks, max_src_len)
File "/home/huangjiahong.dracu/miniconda2/envs/parallel_tc2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/model/modules.py", line 335, in forward
mel_mask = get_mask_from_lengths(mel_len, max_mel_len)
File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/utils/tools.py", line 87, in get_mask_from_lengths
ids = torch.arange(0, max_len).unsqueeze(0).expand(batch_size, -1).to(device)
RuntimeError: upper bound and larger bound inconsistent with step sign
Thank you for your work. I got the above problem when training. I guess it's a duration prediction problem. How can I solve it?
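That `torch.arange` error fires when the computed `max_mel_len` is negative, which can happen early in training if the predicted per-token durations go negative. A minimal sketch of one possible guard (the function name and shapes here are hypothetical, not the repo's actual API):

```python
import torch

def safe_mel_lengths(durations, min_len=1):
    """Hypothetical guard: clamp predicted per-token durations [B, T] to be
    non-negative before summing them into mel lengths, so that
    torch.arange(0, max_len) never sees a negative upper bound."""
    durations = durations.clamp(min=0)
    mel_len = durations.sum(dim=1).round().long().clamp(min=min_len)
    return mel_len
```

With a guard like this, a batch whose durations are all negative still yields a length of at least 1 instead of crashing the mask construction.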
Hi @keonlee9420, have you tried the Cython version of soft-DTW from this repo?
https://github.com/mblondel/soft-dtw
Can it be applied to Parallel Tacotron 2? I am trying that repo because the usable batch size is too small with the CUDA implementation of @Maghoumi.
I just wonder: @Maghoumi reports experiments at large batch sizes in https://github.com/Maghoumi/pytorch-softdtw-cuda,
but when applying it to Parallel Tacotron the batch size ends up too small. Is there a gap?
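For reference, the recurrence both of those repos implement can be sketched in plain NumPy (a minimal, unbatched version with smoothing parameter `gamma`; the Cython/CUDA versions exist precisely because this double loop is slow):

```python
import numpy as np

def softmin(vals, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    vals = np.asarray(vals) / -gamma
    m = vals.max()
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def soft_dtw(D, gamma=1.0):
    """Soft-DTW divergence of a pairwise distance matrix D [n, m].
    As gamma -> 0 this approaches the classic DTW alignment cost."""
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = D[i - 1, j - 1] + softmin(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]
```

Note that soft-DTW can go negative (the soft minimum over-counts paths), which is expected behavior and not a bug in itself.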
I was installing your repo (to see whether I can make it converge) on a Google Cloud T4.
When I load audios with more mel-spectrogram frames than the max mel sequence length (1000 frames):
As a workaround, I tried trimming mels to fit 1024 frames, but it seemed complicated, so for now I filter out all audios with more than 1024 frames.
Any suggestions for handling long audios? I also wonder how this works at inference time.
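The filtering workaround described above can be done once at preprocessing time; a small sketch (the metadata layout here is an assumption, not the repo's actual format):

```python
def filter_metadata(entries, max_frames=1024):
    """Keep only utterances whose mel-spectrogram fits the model's maximum
    sequence length. entries: list of (utt_id, num_mel_frames) pairs."""
    return [(uid, n) for uid, n in entries if n <= max_frames]
```

Filtering at preprocessing keeps the training loader simple; trimming instead would also require trimming the transcript, which is why it tends to get complicated.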
Hi.
Based on the note in the README ("will be added more"),
I would like to suggest adding my open German "Thorsten" dataset.
Thorsten: a single-speaker open German dataset consisting of 22,668 short audio clips of a male speaker, approximately 23 hours in total (LJSpeech file/directory layout).
https://github.com/thorstenMueller/deep-learning-german-tts/
I cloned the code, prepared data according to README, and just updated:
Hi, thanks for your excellent work!
Could you possibly share your audio samples, pretrained models, and loss curves?
Thanks so much for your help!
I followed your commands to run the code, but I get the following error.
File "train.py", line 87, in main
    output = model(*(batch[2:]))
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [1, 474, 80], but expected [1, 302, 80]
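This gather error is characteristic of `nn.DataParallel`: each GPU replica returns mels of a different time length, and `gather` requires matching shapes across replicas. One possible workaround (a sketch, not the repo's code) is to pad every replica's output to the global maximum mel length before returning it from `forward`:

```python
import torch
import torch.nn.functional as F

def pad_mel_to_max(mel, max_mel_len):
    """Pad mel [B, T, n_mels] along the time axis to max_mel_len so that
    nn.DataParallel can gather outputs of equal shape from all replicas."""
    return F.pad(mel, (0, 0, 0, max_mel_len - mel.size(1)))
```

Alternatively, training on a single GPU (or with DistributedDataParallel, which does not gather outputs) sidesteps the shape constraint entirely.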
Hello,
Has anybody been able to train with the soft-DTW loss? It doesn't converge at all. I think there is a problem with the implementation, but I couldn't spot it. When I train with the real alignments, it works well.
Great work! But I encounter one problem when training this model :(
The error message:
ImportError: cannot import name 'II' from 'omegaconf'
The version of fairseq is 0.10.2 (the latest release) and omegaconf is 1.4.1. How can I fix it?
Thank you
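`II` only exists in omegaconf 2.x, so an omegaconf 1.4.1 install is too old for the fairseq code path being imported. One likely fix (an assumption based on the version mismatch, not a confirmed requirement of this repo) is to upgrade omegaconf together with hydra-core, which fairseq's config system depends on:

```shell
# omegaconf 2.0.x provides II; hydra-core 1.0.x is the matching release line.
pip install --upgrade "omegaconf>=2.0,<2.1" "hydra-core>=1.0,<1.1"
```

If the error persists, installing fairseq from source at the commit pinned in this repo's requirements (1.0.0a0+19793a7, per an earlier comment) should pull in compatible versions.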
Can someone share a link to the weights file? I couldn't synthesize with it or run its inference. If I am doing something wrong, please tell me the correct way to use it. Thanks
Thanks for sharing this nice model implementation.
When I start training, the following warning appears; do you also get the same message?
I think it's a fairseq installation problem.
No module named 'lightconv_cuda'
Also, I can only train with batch size 5 on a 24 GB RTX 3090. Could the above problem be the cause?
Just wondering if we can train on LJSpeech with this implementation. Thanks!
Hi, thanks for the implementation.
I think Parallel Tacotron 2 uses the same residual encoder as Parallel Tacotron 1.
In Parallel Tacotron, that encoder uses five 17×1 LConv blocks interleaved with strided 3×1 convolutions.
But in your implementation, the LConv block doesn't have a stride argument.
How did you handle this part?
Thanks.
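For reference, the structure being asked about can be sketched with plain `Conv1d` stand-ins for the lightweight convolutions (an illustrative assumption, not the repo's actual module): each strided 3×1 convolution halves the time axis, so five blocks downsample time by 32.

```python
import torch
import torch.nn as nn

class ResidualEncoderSketch(nn.Module):
    """Sketch of the Parallel Tacotron residual-encoder layout: five 17x1
    conv blocks (standing in for LConv) interleaved with strided 3x1 convs.
    Hypothetical stand-in, not the paper's exact lightweight convolution."""

    def __init__(self, d=256, n_blocks=5):
        super().__init__()
        layers = []
        for _ in range(n_blocks):
            layers.append(nn.Conv1d(d, d, kernel_size=17, padding=8))  # 17x1, length-preserving
            layers.append(nn.ReLU())
            layers.append(nn.Conv1d(d, d, kernel_size=3, stride=2, padding=1))  # strided 3x1
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: [B, d, T] -> [B, d, ~T/32]
        return self.net(x)
```

If the repo's LConv block has no stride argument, the downsampling step either has to be a separate strided conv between blocks, as above, or it is simply omitted.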
I think your code does not include a network for predicting the latent representation during inference.
Any numbers on inference speed?