
tfgan's Introduction

TFGAN

This repo is an unofficial PyTorch implementation of TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis.

Requirements

Tested on Python 3.6

pip install -r requirements.txt

Prepare Dataset

  • Download a dataset for training. This can be any WAV files with a sample rate of 22050 Hz (e.g. LJSpeech, which was used in the paper); if your audio uses a different rate, see the resampling sketch after this list.
  • Preprocess: python preprocess.py -c config/default.yaml -d [data's root path]
  • Edit the configuration yaml file
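
If the source audio is not already at 22050 Hz, here is a minimal resampling sketch. Using librosa and soundfile is my assumption (any resampler works), and raw_wavs/ and data/ are hypothetical directories:

import glob
import os

import librosa
import soundfile as sf

TARGET_SR = 22050  # sample rate expected by the default config

os.makedirs('data', exist_ok=True)
for path in glob.glob('raw_wavs/*.wav'):
    audio, _ = librosa.load(path, sr=TARGET_SR)  # load and resample in one step
    sf.write(os.path.join('data', os.path.basename(path)), audio, TARGET_SR)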

Train & Tensorboard

  • python trainer.py -c [config yaml file] -n [name of the run]

    • cp config/default.yaml config/config.yaml and then edit config.yaml
    • Write the root paths of the train/validation files on the 2nd/3rd lines.
  • tensorboard --logdir logs/

Inference

  • python inference.py -p [checkpoint path] -i [input mel path]

Checkpoint:

  • LJSpeech checkpoint here.

tfgan's People

Contributors

miralan, rishikksh20


tfgan's Issues

[Question] Upsample in generator net

Hi, thanks for your code. I found that in the generator net, the implementation applies reflection padding before calling conv1d in the upsampling path, which is not mentioned in the paper. The paper only says:

Then in addition to using transpose convolution, we also repeat the output of the first step by the up-sample factor directly and following a convolutional layer.

My question is: why is this reflection padding necessary? Thanks in advance.
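
For context, a minimal sketch of the repeat-then-conv branch under discussion, assuming the reflection padding exists to keep the conv output length equal to its input (the kernel size of 7 is an assumption, not taken from the repo):

import torch
import torch.nn as nn

class RepeatUpsample(nn.Module):
    # Repeat each time step by the upsample factor, then smooth with a
    # length-preserving convolution.
    def __init__(self, channels, factor, kernel_size=7):
        super().__init__()
        self.factor = factor
        # Reflection padding of kernel_size // 2 on both sides makes the
        # conv1d output the same length as its input, so this branch can
        # be summed with a transposed-convolution branch of equal length.
        self.pad = nn.ReflectionPad1d(kernel_size // 2)
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):  # x: [B, C, T]
        x = torch.repeat_interleave(x, self.factor, dim=2)  # [B, C, T*factor]
        return self.conv(self.pad(x))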

What should the time loss look like, and why lr=0.00002?

  1. Why did you set lr=0.00002? Have you tried other settings? Is lr=0.00002 the best?

  2. In the paper, the time loss is an important point, but in my training the time loss does not seem to decline.
    I use my own dataset with a sample rate of 16 kHz; the mels are extracted using the scripts from Tacotron 2. Training is still in progress, and the generated wavs become cleaner with more steps. I started training the discriminator at 100k steps.

(attached: g_loss training curve)

@rishikksh20 what is the time loss in your training?

time domain loss and CUDA

If I run:

import torch
from models.timeloss import TimeDomainLoss_v2  # module path per the traceback below

loss2 = TimeDomainLoss_v2()  # note: the loss module itself stays on the CPU
a = torch.randn(32, 1, 16384).to(device='cuda')
b = torch.randn(32, 1, 16384).to(device='cuda')
final2 = loss2(a, b)

I get this error:


Traceback (most recent call last):
  File "c:\git\melganMono\melgan\models\timeloss.py", line 111, in <module>
    final2 = loss2(a, b)
  File "C:\Users\listener17\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "c:\git\melganMono\melgan\models\timeloss.py", line 90, in forward
    y_energy = F.conv1d(y**2, getattr(self, f'filters_{i}'), stride=self.strides[i])
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
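
Judging from the traceback, the conv filters are stored as plain tensor attributes, so they never leave the CPU even when the inputs are on CUDA. A minimal sketch of the usual remedy: register the filters as buffers so that .to('cuda') moves them along with the module (the window lengths and strides here are illustrative, not copied from the repo):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDomainLossFixed(nn.Module):
    # Illustrative multi-scale energy loss; register_buffer (rather than a
    # plain attribute) makes each filter follow the module across .to(...).
    def __init__(self, win_lengths=(1, 240, 480, 960), strides=(1, 120, 240, 480)):
        super().__init__()
        self.strides = strides
        for i, w in enumerate(win_lengths):
            self.register_buffer(f'filters_{i}', torch.ones(1, 1, w) / w)

    def forward(self, y, y_hat):
        loss = 0.0
        for i in range(len(self.strides)):
            filt = getattr(self, f'filters_{i}')
            y_energy = F.conv1d(y ** 2, filt, stride=self.strides[i])
            y_hat_energy = F.conv1d(y_hat ** 2, filt, stride=self.strides[i])
            loss = loss + F.l1_loss(y_energy, y_hat_energy)
        return loss

loss2 = TimeDomainLossFixed().to('cuda')  # the filters now move to CUDA too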

Why does this network ask for a spectrogram at the final stage?

Hello, could you please tell me why the network asks for a spectrogram when producing its output?
I mean this command:
python inference.py -p [checkpoint path] -i [input mel path]
Usually, GAN networks generate from random noise by themselves, so why does this network need a mel to produce output?
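
For background (my reading, not an answer from the maintainers): TFGAN, like MelGAN and HiFi-GAN, is a neural vocoder, so the generator is conditioned on a mel spectrogram and upsamples it to a waveform rather than sampling from a noise vector. A runnable toy illustration of that input/output relationship (the hop size of 256 and n_mels of 80 are conventional assumptions, not values from the config):

import torch
import torch.nn as nn

hop_size, n_mels = 256, 80  # assumed values
toy_generator = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_size * 2,
                                   stride=hop_size, padding=hop_size // 2)
mel = torch.randn(1, n_mels, 100)  # [B, n_mels, T] conditioning input, not noise
audio = toy_generator(mel)         # [B, 1, T * hop_size] waveform
print(audio.shape)                 # torch.Size([1, 1, 25600])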

About the frequency discriminator

Hi, the time loss calculated with convolution layers is very impressive.
I have a question about the frequency discriminator.

As we know, the output of a discriminator is supposed to be a real/fake label for the input wave. Your frequency discriminator seems to produce an output of shape [B, channels, T]. Does that follow the rules for a discriminator, or is there something more I can learn from it?
Thank you!
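
For background (a common pattern, not confirmed by the maintainers): many audio GAN discriminators output a score per time step, PatchGAN-style, and the adversarial loss simply averages over the whole score map instead of over one scalar. A minimal LSGAN-style sketch under that assumption:

import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    # d_real / d_fake: per-timestep score maps of shape [B, C, T].
    # mse_loss averages over every position, so the map acts as many
    # small real/fake decisions rather than one global label.
    real_loss = F.mse_loss(d_real, torch.ones_like(d_real))
    fake_loss = F.mse_loss(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss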

Time-domain loss

Hi,

I am just wondering whether the time-domain loss is working properly. I noticed that the samples from before the discriminator comes into play (>200K steps) are a bit muffled/noisy. After that, the GAN training scheme seems to help the audio quality and reduce this effect.

TFGAN_samples.zip

What is your experience? Is this muffled noise to be expected before the discriminator network comes in?

Thanks in advance, and thank you for sharing your work.

subprocess.CalledProcessError, help please

Hello.
I have a problem with the subprocess library. I tried both on my computer and on Google Colab, and I always get the same error: subprocess.CalledProcessError: Command '['git', 'rev-parse', '--short', 'HEAD']' returned non-zero exit status 128,
after running python /content/TFGAN/trainer.py -c /content/TFGAN/config/default.yaml -n name.
Any ideas how to fix it?

Full log:

Traceback (most recent call last):
  File "/content/TFGAN/trainer.py", line 52, in <module>
    train(args, pt_dir, args.checkpoint_path, trainloader, valloader, writer, logger, hp, hp_str)
  File "/content/TFGAN/utils/train.py", line 31, in train
    githash = get_commit_hash()
  File "/content/TFGAN/utils/utils.py", line 16, in get_commit_hash
    message = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['git', 'rev-parse', '--short', 'HEAD']' returned non-zero exit status 128.
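
Exit status 128 from git rev-parse usually means the script is not running inside a git working tree (for example the repo was downloaded as a zip, or the script is launched from a directory that cannot see .git). One option is to git clone the repo and launch trainer.py from its root; another is to make the helper tolerate the failure. A sketch of the latter, modeled on the get_commit_hash shown in the traceback:

import subprocess

def get_commit_hash():
    # Return the short git hash, or a placeholder when not inside a
    # git repository (zip download, Colab upload, ...).
    try:
        message = subprocess.check_output(
            ['git', 'rev-parse', '--short', 'HEAD'],
            stderr=subprocess.DEVNULL,
        )
        return message.decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return 'nogit'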

Any samples from the checkpoint?

Hi,
I ran into two problems when testing your pretrained model "first_e0c8065_1380.pt":
1. If I run inference.py directly, an error occurs in the generator: torch.nn.modules.module.ModuleAttributeError: 'Upsample' object has no attribute 'remove_weight_norm'. I solved this by defining an extra function inside the generator.
2. After fixing problem 1, I checked the synthesized audio: clear artifacts can be heard and the quality is quite poor. I am not sure whether this is also your experience, but it is far from the description in the original paper. The audio is attached below.

LJ026-0135_reconstructed_epoch1380.wav.zip

Thanks.
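
A minimal sketch of the kind of workaround point 1 describes, assuming the generator's Upsample block wraps a weight-normed transposed convolution (the module layout here is illustrative, not copied from the repo); the generator's own remove_weight_norm() presumably calls this method on each submodule:

import torch.nn as nn
from torch.nn.utils import remove_weight_norm, weight_norm

class Upsample(nn.Module):
    def __init__(self, channels, factor):
        super().__init__()
        self.conv_t = weight_norm(
            nn.ConvTranspose1d(channels, channels, factor * 2,
                               stride=factor, padding=factor // 2))

    def forward(self, x):
        return self.conv_t(x)

    def remove_weight_norm(self):
        # Without this method, calling remove_weight_norm on the parent
        # generator raises the ModuleAttributeError quoted above.
        remove_weight_norm(self.conv_t)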

hifigan vs tfgan

Hi,
Thanks for your great work. Did you compare audio sample quality between HiFi-GAN v1 and TFGAN?
