Comments (17)

Flux9665 avatar Flux9665 commented on August 20, 2024

I used to get broken pipe errors when I ran the code over SSH and the connection dropped at some point. Since I started calling every script with nohup, I haven't seen a broken pipe error. So maybe it's a server/connection issue? Assuming it's not, here are my thoughts:

The error happens when a wave is added to the list of all eligible waves. The list is handled by a resource manager to avoid multiple processes interfering with each other. The call to librosa.resample in the same line returns the identity if the sampling rate is already the same as the desired one, so that is probably not the cause. Adding something to the resource-manager list is the only possible cause I see. But then it's weird that it would work 817 times and then suddenly not.

You could try to check whether wave contains what you think it contains right as it crashes. You could also reduce the loading_processes argument when creating a HiFiGAN dataset. The default is 40, which assumes that the server it runs on has a lot of CPU cores (58 cores in the case of the one I'm using).
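
A minimal sketch of both suggestions, with made-up names (nothing here is the repo's actual code): pick a loading_processes value that matches the machine instead of the default 40, and wrap the per-file work so the offending filename is reported when something goes wrong.

import os

# The default of 40 loading processes assumes a large server; match it to the machine instead.
loading_processes = min(8, os.cpu_count() or 1)

def safe_load(path, load_fn):
    # load_fn stands in for the actual loading/resampling step
    try:
        wave = load_fn(path)
        assert wave is not None and len(wave) > 0, "empty or missing audio"
        return wave
    except Exception as err:
        # report which file broke before re-raising
        print(f"failed while loading {path}: {err}")
        raise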

Flux9665 avatar Flux9665 commented on August 20, 2024

Were you able to resolve the issue? If not, I'll add the option to create the HiFiGAN dataset without multiprocessing. Multiprocessing is not needed for this; it just makes it a lot quicker.

michael-conrad avatar michael-conrad commented on August 20, 2024

I haven't had the opportunity yet to try any changes.

It would be useful to have a single-threaded version that reports the filename if any error occurs, though.

Flux9665 avatar Flux9665 commented on August 20, 2024

Should be simple, I'll do it in the next few days.

Flux9665 avatar Flux9665 commented on August 20, 2024

I just made a commit that completely disables multiprocessing in the HiFiGAN dataset creation if you set the number of loading processes to 1.

It will take much longer to process everything this way, of course, but it's the safest way I can think of. Please let me know if there is still a problem when you have the time to test this.
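
For reference, a minimal sketch of the idea with placeholder functions (not the actual repo code): when only one loading process is requested, the cache is filled in a plain sequential loop, so no resource manager or worker pool is involved at all.

from multiprocessing import Manager, Process

def load_and_resample(path):
    ...  # placeholder for the actual wave loading + resampling

def fill_cache(paths, shared_list):
    for p in paths:
        shared_list.append(load_and_resample(p))

def build_cache(file_list, loading_processes=40):
    if loading_processes == 1:
        # single-process fallback: an ordinary list, easy to debug
        return [load_and_resample(p) for p in file_list]
    # multi-process path: a managed list shared between worker processes
    manager = Manager()
    cache = manager.list()
    chunks = [file_list[i::loading_processes] for i in range(loading_processes)]
    workers = [Process(target=fill_cache, args=(chunk, cache)) for chunk in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return list(cache)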

michael-conrad avatar michael-conrad commented on August 20, 2024

There is a memory leak:

python HiFiGAN_combined.py

Preparing
57%|██████████████████████████████████████████████████████████▌ | 33778/59360 [00:50<07:38, 55.78it/s]
Killed

I have a 32 GB machine and python was up to 33 GB VIRT, 92% MEM, before getting killed.

Not sure where to look to find the objects not being released after use...

The code is trying to hold all the wavs and mels in memory? Is there any way around that?

michael-conrad avatar michael-conrad commented on August 20, 2024

Would this work as an alternative to the in-memory list?

https://github.com/Belval/disklist
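
For illustration, a tiny sketch of the idea, assuming disklist is installed (pip install disklist) and that DiskList exposes the list-like append/indexing interface its README describes:

from disklist import DiskList

cache = DiskList()            # items are spooled to disk instead of held in RAM
for i in range(5):
    cache.append({"id": i})   # any picklable object
print(len(cache), cache[0])   # indexing reads the item back from disk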

Flux9665 avatar Flux9665 commented on August 20, 2024

It's not a memory leak; it works as intended. It's just built for servers with a lot of memory, to make the training orders of magnitude faster than when using lazy loading.

Yes, the code processes all the wavs and then holds them in memory during training. The disklist you suggest sounds cool, I'll add an option to use it, though only for single-process loading, since it's incompatible with the resource manager. The decrease in speed during training might be OK, because the torch dataloader prepares batches in the background. The HiFiGAN authors suggest training for a very long time, which is why I did everything I could think of to make the training faster, but the models already become really good after around 200,000 steps, so if the speed decrease is too bad, you can just train for fewer steps.
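
As a rough illustration of the "batches prepared in the background" point (assumed names, not the repo's code): a lazily loading torch Dataset combined with num_workers > 0 lets the DataLoader prefetch upcoming batches in worker processes while the current batch is on the GPU.

import torch
from torch.utils.data import DataLoader, Dataset

class LazyWaveDataset(Dataset):
    def __init__(self, num_items):
        self.num_items = num_items

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        # the real dataset would read a wave/mel pair from disk here
        return torch.zeros(8000)

if __name__ == "__main__":
    loader = DataLoader(LazyWaveDataset(1000), batch_size=32, shuffle=True,
                        num_workers=4, pin_memory=True)
    for batch in loader:
        pass  # training step would go here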

I'm also looking for a good way to share pretrained models. Since HiFiGAN is pretty speaker-independent, you can use the same pretrained HiFiGAN model for pretty much any purpose and wouldn't need to train one yourself.

michael-conrad avatar michael-conrad commented on August 20, 2024

> I'm also looking for a good way to share pretrained models. Since HiFiGAN is pretty speaker-independent, you can use the same pretrained HiFiGAN model for pretty much any purpose and wouldn't need to train one yourself.

Use the release feature of GitHub; it allows adding binaries manually to a release.

Tag the repo with a version, go to Releases on GitHub, mark the tag as a release, and add the binaries.

Flux9665 avatar Flux9665 commented on August 20, 2024

Thanks for the suggestion, I just did that for now.

I'll also be looking for ways to load models automatically in the future.

michael-conrad avatar michael-conrad commented on August 20, 2024

Meh... I tried to put together a HiFiGANDataset_disklist variant, but it seems that DiskList isn't able to pickle either librosa numpy data or torch tensors at all. I'll try looking into alternatives when I can.

michael-conrad avatar michael-conrad commented on August 20, 2024

Ok, I just set up direct-to-disk caching of the tensors using torch.save, and my GPU usage ranges from 93% to 97%. This is on an RTX 3090.
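
Roughly along these lines (a hedged sketch with assumed names, not the exact code I used): every tensor is written to its own file with torch.save once, and __getitem__ loads it back lazily, so nothing large stays resident in RAM.

import os
import torch
from torch.utils.data import Dataset

class DiskCachedDataset(Dataset):
    def __init__(self, tensors, cache_dir="hifigan_cache"):
        os.makedirs(cache_dir, exist_ok=True)
        self.paths = []
        for index, tensor in enumerate(tensors):
            path = os.path.join(cache_dir, f"{index}.pt")
            torch.save(tensor, path)   # cache each item in its own file
            self.paths.append(path)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])   # read the cached tensor back on demand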

Is it safe to increase the batch size? It is currently only using about 13 GB of VRAM and I have some to spare above that.

Here is my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 51%   74C    P2   328W / 350W |  13463MiB / 24259MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1333      G   /usr/lib/xorg/Xorg                207MiB |
|    0   N/A  N/A      2305      G   /usr/bin/telegram-desktop           4MiB |
|    0   N/A  N/A      2702      G   /usr/lib/firefox/firefox          189MiB |
|    0   N/A  N/A      2928      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A      2931      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A     64587      C   python                          13049MiB |
+-----------------------------------------------------------------------------+

Flux9665 avatar Flux9665 commented on August 20, 2024

Yes, if the GPU-Util is not at 100% yet, it makes sense to increase the batch size. Beyond that, however, I'm not sure if there's a benefit, even if there is VRAM left. The gradient estimate will become better, but the batches will take a little bit longer, and the gradient estimate was already good enough.

So I'd say a slight increase until GPU-Util hits 100% is the best trade-off.

michael-conrad avatar michael-conrad commented on August 20, 2024

Batch size = 32

How does this look so far? Should Generator Loss be increasing?

Epoch:              18
Time elapsed:       562 Minutes
Steps:              31518
Generator Loss:     160.41
    Mel Loss:       11.005
    FeatMatch Loss: 147.153
    Adv Loss:       2.252
Discriminator Loss: 0.292
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1854/1854 [33:02<00:00,  1.07s/it]
Epoch:              19
Time elapsed:       595 Minutes
Steps:              33372
Generator Loss:     164.81
    Mel Loss:       10.822
    FeatMatch Loss: 151.747
    Adv Loss:       2.241
Discriminator Loss: 0.294
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1854/1854 [33:01<00:00,  1.07s/it]
Epoch:              20
Time elapsed:       628 Minutes
Steps:              35226
Generator Loss:     168.52
    Mel Loss:       10.689
    FeatMatch Loss: 155.577
    Adv Loss:       2.253
Discriminator Loss: 0.292
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1854/1854 [33:11<00:00,  1.07s/it]
Epoch:              21
Time elapsed:       661 Minutes
Steps:              37080
Generator Loss:     173.101
    Mel Loss:       10.658
    FeatMatch Loss: 160.177
    Adv Loss:       2.265
Discriminator Loss: 0.29

Flux9665 avatar Flux9665 commented on August 20, 2024

It looks good to me. The generator loss increasing is fine, since the feature matching loss gets a lot higher very quickly while the discriminator is improving. If you look at the generator loss over the very long term, you'll see that it increases a lot in the beginning and then the increase gets slower and slower. The point where the generator loss starts going down is when I usually stop the training.

A good thing to watch is the Mel Loss: as long as that one is decreasing, the model is getting better.

michael-conrad avatar michael-conrad commented on August 20, 2024

Update:

Epoch:              30
Time elapsed:       972 Minutes
Steps:              381924
Generator Loss:     395.299
    Mel Loss:       9.369
    FeatMatch Loss: 382.977
    Adv Loss:       2.953
Discriminator Loss: 0.177
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1854/1854 [34:40<00:00,  1.12s/it]
Epoch:              31
Time elapsed:       1007 Minutes
Steps:              383778
Generator Loss:     393.296
    Mel Loss:       9.309
    FeatMatch Loss: 381.034
    Adv Loss:       2.953
Discriminator Loss: 0.177
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1854/1854 [35:01<00:00,  1.13s/it]
Epoch:              32
Time elapsed:       1042 Minutes
Steps:              385632
Generator Loss:     392.862
    Mel Loss:       9.336
    FeatMatch Loss: 380.578
    Adv Loss:       2.947
Discriminator Loss: 0.178

Flux9665 avatar Flux9665 commented on August 20, 2024

Closing the issue, because it is stale. A new version of the toolkit that will come soon (hopefully by the end of next week at the latest) includes a procedure that allows training HiFiGAN with less RAM by loading only a random chunk of the data for a certain number of epochs and then loading another random chunk.
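
As a rough sketch of that chunked-loading idea (assumed names and sizes, not the actual implementation): only a random subset of the corpus is cached in RAM, trained on for a few epochs, and then replaced by a fresh random chunk.

import random

def load_and_resample(path):
    ...  # placeholder for loading one wave/mel pair into memory

def run_one_epoch(cache):
    ...  # placeholder for one epoch of HiFiGAN training on the cached chunk

def train_in_chunks(all_paths, chunk_size=5000, epochs_per_chunk=5, cycles=40):
    for _ in range(cycles):
        # keep only a random chunk of the corpus in memory at a time
        chunk = random.sample(all_paths, min(chunk_size, len(all_paths)))
        cache = [load_and_resample(p) for p in chunk]
        for _ in range(epochs_per_chunk):
            run_one_epoch(cache)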
