Comments (25)
@seungwonpark One more input: after 500 epochs I get good, audible voice, but with constant noise artifacts similar to those in the no_patch_gan demo audio in the Ablation section here.
The patch_gan implementation and its importance are mentioned in the Window-based objective section of the research paper. I'm not sure this is exactly the same issue as the patch_gan implemented by MelGAN, but one thing is sure: the model has difficulty generalizing to higher frequencies.
You can check the tensorboard.
from melgan.
I'm quite convinced that using beta1=0.5, beta2=0.9 for the Adam optimizers harms the training of G, making the D loss go to near zero. Using the default values beta1=0.9, beta2=0.999 made things better. Also, Jaehyeon Kim's implementation uses the default values.
Blue: default / Orange: values from paper (this shows the first 30 minutes of training for each case)
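The difference is just the `betas` argument to the two optimizers. A minimal sketch of the two settings being compared (the `Linear` modules are placeholders for the real generator and discriminator, and the learning rate here is illustrative):

```python
import torch

# Toy modules standing in for the real generator and discriminator.
G = torch.nn.Linear(80, 256)
D = torch.nn.Linear(256, 1)

# PyTorch default betas=(0.9, 0.999): the setting that behaved better here.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Values from the paper, betas=(0.5, 0.9): seemed to drive the D loss toward zero.
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
```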
Not sure about this, but I think I found the point.
Observation: the order of D/G updates makes the results really different (blue: D first, red: G first).
Hypothesis: we should use separate dataloaders for G and D. In the original GAN algorithm, random noise is sampled within the training loop. However, when conditional information is used, feeding identical data to G and D within a single loop may harm training.
Experimenting with this hypothesis on the twoloader branch.
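The hypothesis amounts to drawing independent batches for the two updates. A schematic sketch, with independently shuffled plain-Python iterators standing in for the repo's real dataloaders (all names here are illustrative, not the actual twoloader code):

```python
import itertools
import random

# Toy dataset; in the repo these would be mel-spectrogram/audio pairs.
dataset = list(range(100))

# Two independently shuffled iterators standing in for two separate dataloaders.
loader_d = iter(random.sample(dataset, len(dataset)))
loader_g = iter(random.sample(dataset, len(dataset)))

steps = []
for batch_d, batch_g in itertools.islice(zip(loader_d, loader_g), 10):
    # D would be updated on batch_d, G on batch_g -- usually different samples,
    # unlike a single-loader loop where both see the identical batch.
    steps.append((batch_d, batch_g))
```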
@seungwonpark After 500 epochs I removed the label noises and this line:
Line 49 in 3041dcc
Then I noticed that the constant noise artifacts go down slowly and the audio becomes more audible. Maybe we need to give the model many more epochs; the generator loss keeps increasing, but so does the quality of the audio. In the paper they train for more than 3200 epochs and get good results after 800 epochs. Let's see after how many epochs the audio deteriorates.
@seungwonpark The order of G and D matters a lot when we model a complex objective, and the number of updates of G relative to D (G:D = N:1 updates) plays a very crucial role. These two act as extra hyper-parameters when we deal with GANs.
@seungwonpark Sound quality drastically improves after 1000 epochs. I used the original code of the noiselabel branch; you can check the tensorboard here. I think training for more epochs is the key, because in speech processing the error plots lose significance after some epochs. I noticed a similar thing with WaveRNN, where the error stopped reducing but the quality of the sound kept increasing. I'm also reading the GAN-TTS paper by DeepMind; the GAN-TTS model is very complex compared to MelGAN, but I think some hyper-parameters might help improve overall performance and help us understand GANs for the audio-synthesis task.
P.S. Thank you so much for testing my code and helping me out, @rishikksh20 . Would you mind if I write your nickname (or your real name) in "Implementation Authors" at README.md to acknowledge you?
Thanks for that, and sure, I don't mind. My name is Rishikesh (@rishikksh20).
@seungwonpark @rishikksh20 Could you post some audio samples from 1000 epochs? I find that not including some L1/L2 loss for the generator, at least at the beginning of training, makes the generator do much more unnecessary "work". If not at the raw-audio level, then at least a spectral loss would probably speed up training.
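One way such an auxiliary spectral loss could look: an L1 distance between STFT magnitudes of generated and real audio. This is only a sketch of the idea being suggested, not code from this repo; the function name and the `n_fft`/`hop_length` defaults are illustrative assumptions.

```python
import torch

def spectral_l1(fake, real, n_fft=1024, hop_length=256):
    # L1 distance between STFT magnitude spectrograms of the two waveforms.
    # Sketch of an auxiliary generator loss; name and defaults are illustrative.
    window = torch.hann_window(n_fft)
    s_fake = torch.stft(fake, n_fft, hop_length, window=window, return_complex=True).abs()
    s_real = torch.stft(real, n_fft, hop_length, window=window, return_complex=True).abs()
    return torch.nn.functional.l1_loss(s_fake, s_real)

# Two random batched waveforms of 1 second at 16 kHz, just for shape checking.
loss = spectral_l1(torch.randn(2, 16000), torch.randn(2, 16000))
```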
@G-Wang Here are some samples generated at 1200 epochs: https://soundcloud.com/rishikesh-kumar-1/sets/melgan-output-after-1200
For the loss you can check the tensorboard.
Note: the code used for training is from the noiselabel branch of this repo.
Getting good-quality audio while the generator loss is still going up, too.
(Audio samples at http://swpark.me/melgan/)
@rishikksh20 Even though the generator loss is going up, it doesn't seem that the discriminator is overwhelmingly better than the generator. What do you think? Shall we close this issue?
This is getting interesting! After about 2000 epochs of training, the discriminator finds it almost impossible to discriminate real from fake.
- I also found a mistake in the average pooling of multiscale.py: the stride of each average pooling layer should be 2, not 2**i.
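The point of the fix: each pooling stage should halve the time resolution once more, so every layer uses a fixed stride of 2 and scale i is simply the input downsampled i times. A sketch of the fixed behavior (kernel size and number of scales here are illustrative, not the exact multiscale.py code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8192)  # (batch, channels, time)

# Fixed version: every pooling layer uses stride 2, so each successive scale
# has half the time resolution of the previous one. With stride 2**i instead,
# deeper scales would be downsampled far too aggressively.
pools = nn.ModuleList(
    [nn.AvgPool1d(kernel_size=4, stride=2, padding=1) for _ in range(2)]
)

scales = [x]
for pool in pools:
    scales.append(pool(scales[-1]))

lengths = [s.shape[-1] for s in scales]  # 8192 -> 4096 -> 2048
```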
Currently working on this on the fix/3 and noisylabel branches.
Also thinking of G:D=N:1 updates.
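The G:D = N:1 schedule is just a counter in the training loop. A schematic sketch (the actual update bodies are placeholders; N = 2 is only an example value):

```python
# G:D = N:1 update schedule: N generator updates per discriminator update.
N = 2  # example value; this is the extra hyper-parameter being discussed
g_updates = 0
d_updates = 0

for step in range(100):
    g_updates += 1       # the generator is updated at every step
    if step % N == 0:
        d_updates += 1   # the discriminator is updated once every N steps
```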
The loss curve is unacceptable, but the quality of the generated audio is quite audible compared to the samples at https://melgan-neurips.github.io.
Our samples at epoch 275: fix3_epoch275.zip
Note: here I used label noise ([-0.2, 0.2] for fake, [0.8, 1.2] for real) until epoch 250, then trained without label noise up to epoch 275.
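The label-noise scheme above amounts to sampling targets uniformly from those two ranges instead of using hard 0/1 labels. A plain-Python sketch (the repo would build tensors instead; the function name is illustrative):

```python
import random

def noisy_labels(batch_size, real=True):
    # Uniform label noise as described above:
    # targets in [0.8, 1.2] for real samples, [-0.2, 0.2] for fake samples.
    lo, hi = (0.8, 1.2) if real else (-0.2, 0.2)
    return [random.uniform(lo, hi) for _ in range(batch_size)]

real_targets = noisy_labels(16, real=True)
fake_targets = noisy_labels(16, real=False)
```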
@seungwonpark I have got some audible audio, but g_loss keeps increasing.
tensorboard
Thank you for sharing your results and insights, @rishikksh20
I'm also working hard on this.
I'm testing both swapping the order of G/D and using separate loaders for G/D, but both are worse than the original results at a similar epoch.
Maybe we need to give the model many more epochs; the generator loss keeps increasing, but so does the quality of the audio. In the paper they train for more than 3200 epochs and get good results after 800 epochs.
Couldn't agree more. Let's wait for them in all cases.
P.S. Thank you so much for testing my code and helping me out, @rishikksh20. Would you mind if I write your nickname (or your real name) in "Implementation Authors" at README.md to acknowledge you?
When swapping the G/D order:
Using the default Adam(betas=(0.9, 0.999)) led to this. I stopped this training run.
Using the paper's Adam(betas=(0.5, 0.9)) shows better results. I'm letting this one train further.
@seungwonpark This model exports easily through torch.jit.script with just one change here:
Line 44 in 1e31df8
to
features: List[torch.Tensor] = []
and adding the import from typing import List.
Check: https://colab.research.google.com/drive/187IHaEvwoh35xviDfpNvxuDzuBpcElyH
Using a TorchScript model speeds up training as well as inference by optimizing the computation graph, and it can later be used to deploy the model to IoT devices as well as smartphones.
Meanwhile, I start getting good voice quality after 1500 epochs, though a little constant artifact remains in the audio.
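The annotation matters because TorchScript needs a concrete element type for the list that accumulates feature maps. A minimal toy reproduction of the pattern (this module only illustrates the annotation; it is not the repo's actual discriminator):

```python
from typing import List

import torch
import torch.nn as nn


class Features(nn.Module):
    # Toy module that collects intermediate feature maps into a list,
    # the same pattern as the line being changed in the repo.
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(1, 1, kernel_size=3, padding=1) for _ in range(3)]
        )

    def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
        features: List[torch.Tensor] = []  # annotated so TorchScript knows the type
        for layer in self.layers:
            x = layer(x)
            features.append(x)
        return features


scripted = torch.jit.script(Features())      # compiles without type errors
outs = scripted(torch.randn(1, 1, 64))       # returns a list of 3 feature maps
```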
Thanks, I will consider using TorchScript.
I've already merged the fixlossavg branch (see #6), but I'm not sure this is correct, since it makes the loss proportional to hp.audio.segment_length. Though I've already posted some audio samples to GitHub Pages (using the version after #6), this can be changed anytime.
Your results (which were from before the fixlossavg merge) look good to me, too. Let's wait for our models to converge.
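The segment-length concern is the usual sum-vs-mean reduction question: summing element-wise errors scales the loss with hp.audio.segment_length, while averaging is length-invariant. A sketch of the contrast (the constant waveforms here are just for an exactly predictable result, not the repo's loss code):

```python
import torch
import torch.nn.functional as F

# Constant 1-second "waveforms" at 16 kHz so the losses are exact.
fake = torch.zeros(1, 16000)
real = torch.ones(1, 16000)

# 'sum' grows with segment length: doubling the segment doubles the loss.
loss_sum = F.l1_loss(fake, real, reduction="sum")

# 'mean' is invariant to segment length.
loss_mean = F.l1_loss(fake, real, reduction="mean")
```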
@seungwonpark We don't need to do this:
Line 183 in d6017e8
As per the MelGAN paper:
Using spectral normalization or removing the window-based discriminator loss makes it harder to learn sharp high frequency patterns, causing samples to sound significantly noisy.
I think this is the culprit; we shouldn't do spectral normalization.
TL;DR: that doesn't matter much.
@rishikksh20 Oh, thanks for pointing it out.
That part was copied from NVIDIA/tacotron2, and what it does is just normalize the spectrogram.
melgan/utils/audio_processing.py
Line 84 in d6017e8
The "spectral normalization" the paper mentions is different from this: https://arxiv.org/abs/1802.05957
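For reference, the tacotron2-style operation in audio_processing.py is dynamic range compression of the spectrogram magnitudes (log of a clamped, scaled value), which has nothing to do with the weight spectral normalization of Miyato et al. that the paper warns against. A plain-Python sketch of the former (C=1 and clip_val=1e-5 are the common tacotron2 defaults, assumed here; the repo uses the torch equivalent):

```python
import math

def dynamic_range_compression(x, C=1.0, clip_val=1e-5):
    # Log-compress a spectrogram magnitude: clamp to avoid log(0), scale by C.
    # Plain-Python sketch of the tacotron2-style "spectral normalization".
    return math.log(max(x, clip_val) * C)

# Magnitudes below clip_val are flattened to the same floor; 1.0 maps to 0.0.
compressed = [dynamic_range_compression(m) for m in (0.0, 1e-5, 1.0)]
```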
@seungwonpark That's my point, and the same thing was reflected in my experiments: what matters is the rate of change of the error between the generative and discriminative models, rather than the actual number. Yeah, let's close this issue.
The one problem common to both of our samples is the click-like sound artifact at the end of each sample. Any idea how to deal with it? This kind of artifact isn't present in the original paper's samples.
@rishikksh20 Thanks.
I could also hear some click-like sound artifact at the end. Can you open a new issue for that, please?
EDIT: I'll make one.