
Comments (25)

rishikksh20 commented on June 28, 2024

@seungwonpark One more input: after 500 epochs I get a good, audible voice, but with constant noise artifacts similar to those in the no_patch_gan demo audio in the Ablation section here.
The patch_gan implementation and its importance are mentioned in the Window-based objective section of the research paper. I'm not sure this is exactly the same patch_gan issue as implemented by MelGAN, but one thing is sure: the model has difficulty generalizing to higher frequencies.
You can check the tensorboard.

from melgan.

seungwonpark commented on June 28, 2024

I'm quite convinced that using beta1=0.5, beta2=0.9 for the Adam optimizers harms the training of G, making the D loss go to near zero.

Using the default values beta1=0.9, beta2=0.999 made things better. Also, Jaehyeon Kim's implementation uses the default values.

Blue: default / Orange: values from paper
(this shows first 30 minutes of training for each case)
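As a side note (pure-Python illustration, not code from the repo), one way to see why beta1 matters: it sets how quickly Adam's first-moment estimate forgets past gradients, so beta1=0.5 reacts much more sharply to the latest gradient than the default 0.9.

```python
def ema_weight(beta1: float, steps_back: int) -> float:
    """Weight (before bias correction) that a gradient from `steps_back`
    iterations ago carries in Adam's first-moment moving average."""
    return (1 - beta1) * beta1 ** steps_back

# Paper's beta1=0.5: a gradient from 5 steps ago contributes only ~1.6%.
paper_w = ema_weight(0.5, 5)
# Default beta1=0.9: the same gradient still contributes ~5.9%,
# so updates are smoothed over a much longer window.
default_w = ema_weight(0.9, 5)
```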


seungwonpark commented on June 28, 2024

Not sure about this, but I think I've found the key point.

Observation: the order of D/G updates makes the results really different. (blue: D first, red: G first)


Hypothesis: we should use separate dataloaders for G and D.

In the original GAN algorithm, random noise is sampled within the training loop. However, when conditional information is used, feeding identical data to G and D within a single loop iteration may harm training.

I'm experimenting with this hypothesis on the twoloader branch.
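A minimal sketch of this hypothesis (hypothetical stand-in code, not the actual twoloader branch): keep two independently shuffled loaders so D and G never see the exact same conditional batch in one iteration.

```python
import random

def infinite_loader(dataset, seed):
    """Stand-in for a shuffling DataLoader: yields random samples forever."""
    rng = random.Random(seed)
    while True:
        yield dataset[rng.randrange(len(dataset))]

dataset = list(range(100))                    # stand-in for (mel, audio) pairs
loader_d = infinite_loader(dataset, seed=1)   # batches for D updates
loader_g = infinite_loader(dataset, seed=2)   # independent batches for G updates

for step in range(3):
    batch_d = next(loader_d)   # update D on this sample
    batch_g = next(loader_g)   # update G on a (usually) different sample
```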


rishikksh20 commented on June 28, 2024

@seungwonpark After 500 epochs I removed the label noise and this line:

audio = audio + (1/32768) * torch.randn_like(audio)
and fine-tuned again.
What I noticed is that the constant noise artifacts slowly go down and the audibility of the audio increases. Maybe we need to give the model many more epochs: the generator loss keeps increasing, but so does the quality of the audio. In the paper they train for more than 3200 epochs and only get good results after 800 epochs. Let's see after how many epochs the audio deteriorates.
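For context on the removed line: the amplitude 1/32768 corresponds to one quantization step of 16-bit PCM, so the line was dithering the waveform at roughly the least-significant-bit level.

```python
# int16 audio spans [-32768, 32767]; mapped to [-1.0, 1.0), one
# quantization step is 1/32768 = 2**-15, the scale of the added noise.
lsb = 1 / 32768
assert lsb == 2 ** -15
```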


rishikksh20 commented on June 28, 2024

@seungwonpark The order of G and D updates matters a lot when we model a complex objective, and the ratio of G updates to D updates (e.g. G:D = N:1) also plays a crucial role. These two act as extra hyper-parameters when dealing with GANs.
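Concretely (a hypothetical sketch, not repo code), a G:D = N:1 schedule just interleaves updates inside the training loop like this:

```python
def update_schedule(iters: int, g_per_d: int = 2) -> list:
    """Build a G:D = N:1 schedule: one D update, then N G updates."""
    schedule = []
    for _ in range(iters):
        schedule.append("D")
        schedule.extend(["G"] * g_per_d)
    return schedule

# Two iterations at G:D = 2:1
assert update_schedule(2) == ["D", "G", "G", "D", "G", "G"]
```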


rishikksh20 commented on June 28, 2024

@seungwonpark Sound quality improves drastically after 1000 epochs. I used the original code of the noiselabel branch; you can check the tensorboard here. I think training for more epochs is the key to seeing improvement, because the significance of the error plots diminishes after some epochs in speech processing. I noticed a similar thing in WaveRNN: the error stops decreasing, but the quality of the sound keeps increasing. I'm also reading the GAN-TTS paper by DeepMind; GAN-TTS is very complex compared to MelGAN, but I think some of its hyper-parameters might help improve overall performance and deepen our understanding of GANs for audio synthesis.

P.S. Thank you so much for testing my code and helping me out, @rishikksh20. Would you mind if I write your nickname (or your real name) under "Implementation Authors" in README.md to acknowledge you?

Thanks for that, and sure, I don't mind. My name is Rishikesh (@rishikksh20).


G-Wang commented on June 28, 2024

@seungwonpark @rishikksh20 Could you post some audio samples from 1000 epochs? I find that not including an L1/L2 loss for the generator, at least at the beginning of training, makes the generator do much more unnecessary "work". If not at the raw-audio level, then at least a spectral loss or similar would probably speed up training.


rishikksh20 commented on June 28, 2024

@G-Wang here are some samples generated at 1200 epochs : https://soundcloud.com/rishikesh-kumar-1/sets/melgan-output-after-1200

For the loss, you can check the tensorboard.

Note: the code used for training is from the noiselabel branch of this repo.


seungwonpark commented on June 28, 2024

Getting good quality audio while the generator loss is still going up, too.
(Audio samples at http://swpark.me/melgan/)

@rishikksh20 Even though the generator loss is going up, it doesn't seem that the discriminator is doing overwhelmingly better than the generator. What do you think? Shall we close this issue?



seungwonpark commented on June 28, 2024

This is getting interesting! After about 2000 epochs of training, the discriminator finds it almost impossible to discriminate real from fake.



seungwonpark commented on June 28, 2024
  • I also found a mistake in the average pooling of multiscale.py: the stride of each average pooling layer should be 2, not 2**i.
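To see what the stride fix changes, here is the standard 1-D pooling output-length formula in pure Python (kernel size 4 and padding 1 are assumed here just for illustration): with stride 2 every scale halves the signal, whereas a stride of 2**i shrinks later scales far too aggressively.

```python
def pool_out_len(n: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1-D average-pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

n = 16384
halved = pool_out_len(pool_out_len(n))            # stride 2 at every scale
buggy = pool_out_len(pool_out_len(n), stride=4)   # second layer with stride 2**2

assert halved == 4096   # 16384 -> 8192 -> 4096, as intended
assert buggy == 2048    # shrinks too fast
```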


seungwonpark commented on June 28, 2024

Currently working on this in the fix/3 and noisylabel branches.


seungwonpark commented on June 28, 2024

Also thinking of G:D=N:1 updates.


seungwonpark commented on June 28, 2024

The loss curve is unacceptable, but the generated audio sounds quite good when compared to the samples at https://melgan-neurips.github.io.

Our samples at epoch 275: fix3_epoch275.zip
Note: here, I used label noise ([-0.2, 0.2] for fake, [0.8, 1.2] for real) until epoch 250, then trained without label noise until epoch 275.
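The label-noise scheme mentioned in the note can be sketched like this (a hypothetical helper, not code from the repo): instead of hard 0/1 targets for the discriminator, sample targets from narrow ranges around 0 and 1.

```python
import random

def noisy_label(is_real: bool, rng: random.Random) -> float:
    """Fake targets land in [-0.2, 0.2], real targets in [0.8, 1.2]."""
    base = 1.0 if is_real else 0.0
    return base + rng.uniform(-0.2, 0.2)

rng = random.Random(0)
fake = noisy_label(False, rng)
real = noisy_label(True, rng)
```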


rishikksh20 commented on June 28, 2024

@seungwonpark I've gotten some audible audio, but g_loss keeps increasing.
tensorboard


seungwonpark commented on June 28, 2024

Thank you for sharing your results and insights, @rishikksh20.
I'm also working hard on this.


seungwonpark commented on June 28, 2024

I'm testing both swapping the order of G/D and using a separate loader for G/D, but both are worse than the original results at a similar epoch.

Maybe we need to give the model many more epochs: the generator loss keeps increasing, but so does the quality of the audio. In the paper they train for more than 3200 epochs and only get good results after 800 epochs.

Can't agree more. Let's wait for the models in all cases.

P.S. Thank you so much for testing my code and helping me out, @rishikksh20. Would you mind if I write your nickname (or your real name) under "Implementation Authors" in README.md to acknowledge you?


seungwonpark commented on June 28, 2024

When swapping the G/D order:

Using the default Adam(betas=(0.9, 0.999)) led to this. I stopped this training run.


Using the paper's Adam(betas=(0.5, 0.9)) shows better results. I'm letting this one train longer.


rishikksh20 commented on June 28, 2024

@seungwonpark This model can be exported through torch.jit.script with just one change here:

features = list()

to

features: List[torch.Tensor] = [], adding the import from typing import List.
Check: https://colab.research.google.com/drive/187IHaEvwoh35xviDfpNvxuDzuBpcElyH
Using a TorchScript model speeds up training as well as inference by optimizing the computation, and it can later be used to deploy the model to IoT devices as well as smartphones.
Meanwhile, I'm starting to get good voice quality after 1500 epochs, though a little constant artifact remains in the audio.
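A minimal sketch of the change with a toy module (not the actual discriminator): torch.jit.script cannot type an empty list created via list(), so the annotated empty-list form is needed.

```python
from typing import List

import torch

class FeatureCollector(torch.nn.Module):
    """Toy stand-in for a network that returns intermediate features."""
    def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
        # `features = list()` fails under torch.jit.script;
        # the annotation tells TorchScript the element type.
        features: List[torch.Tensor] = []
        for _ in range(3):
            x = torch.relu(x)
            features.append(x)
        return features

scripted = torch.jit.script(FeatureCollector())
outputs = scripted(torch.randn(2, 4))
```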


seungwonpark commented on June 28, 2024

Thanks, I will consider using TorchScript.

I've already merged the fixlossavg branch (see #6), but I'm not sure this is correct, since it makes the loss proportional to hp.audio.segment_length. Though I've already posted some audio samples to GitHub Pages (using the version after #6), this can change anytime.

Your results (which were from before the merge of fixlossavg) look good to me, too. Let's wait for our models to converge.


rishikksh20 commented on June 28, 2024

@seungwonpark We don't need to do this:

mel_output = self.spectral_normalize(mel_output)

As per the MelGAN paper, "using spectral normalization or removing the window-based discriminator loss makes it harder to learn sharp high frequency patterns, causing samples to sound significantly noisy."
I think this is the culprit; we don't need to do spectral normalization.


seungwonpark commented on June 28, 2024

TL;DR: that doesn't matter much.

@rishikksh20 Oh, thanks for pointing it out.
That part was copied from NVIDIA/tacotron2, and what it's doing is just normalizing the spectrogram:

return torch.log(torch.clamp(x, min=clip_val) * C)

The "spectral normalization" that the paper mentions is something different: https://arxiv.org/abs/1802.05957
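For clarity, here is what that line computes, in a scalar pure-Python form (C=1.0 and clip_val=1e-5 follow the tacotron2-style defaults, stated here as assumptions): it is just log dynamic-range compression of mel values, unrelated to the spectral normalization of arXiv:1802.05957.

```python
import math

def dynamic_range_compression(x: float, C: float = 1.0, clip_val: float = 1e-5) -> float:
    """Scalar version of torch.log(torch.clamp(x, min=clip_val) * C)."""
    return math.log(max(x, clip_val) * C)

# Clamping avoids log(0) for silent frames:
assert dynamic_range_compression(0.0) == math.log(1e-5)
assert dynamic_range_compression(1.0) == 0.0
```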


rishikksh20 commented on June 28, 2024

@seungwonpark That's my point, and the same thing was reflected in my experiments: what matters is the rate of change of the error between the generative and discriminative models, rather than the actual number. Yeah, let's close this issue.


rishikksh20 commented on June 28, 2024

The one problem common to both of our samples is the click-like artifact at the end of each sample. Any idea how to deal with it? This kind of artifact isn't present in the original paper's samples.


seungwonpark commented on June 28, 2024

@rishikksh20 Thanks.
I can also hear a click-like artifact at the end. Can you open a new issue for that, please?
EDIT: I'll make one.

