
diffusion-svc's People

Contributors

bfloat16, cnchtu, huanlinoto, kakaruhayate, mlbv, narusemioshirakana, rainaobi, ricecakey06, ylzz1997, yxlllc


diffusion-svc's Issues

OOM during inference with 8 GB of VRAM

The command used was: python .\main.py -i 'input.wav' -model .\models\murasame.ptc -o 'output.wav' -k 0 -kstep 100 -pe rmvpe
The log is as follows:

2023-12-25 01:45:06 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
 [Loading] .\models\murasame.ptc
C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-12-25 01:45:09 | INFO | fairseq.tasks.hubert_pretraining | current directory is F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC
2023-12-25 01:45:09 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-12-25 01:45:09 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
Units Forced Mode:nearest
 [INFO] Extract f0 volume and mask: Use rmvpe, start...
 [INFO] Extract f0 volume and mask: Done. Use time:5.367121458053589
  0%|                                                                                            | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\main.py", line 195, in <module>
    out_wav, out_sr = diffusion_svc.infer_from_long_audio(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\infer_tools.py", line 380, in infer_from_long_audio
    seg_units = self.units_encoder.encode(seg_input, sr, hop_size)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\tools.py", line 471, in encode
    units = self.model(audio_res, padding_mask=padding_mask)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\tools.py", line 601, in __call__
    logits = self.hubert.extract_features(**inputs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\hubert\hubert.py", line 535, in extract_features
    res = self.forward(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\hubert\hubert.py", line 467, in forward
    x, _ = self.encoder(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1003, in forward
    x, layer_results = self.extract_features(x, padding_mask, layer)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1049, in extract_features
    x, (z, lr) = layer(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1260, in forward
    x, attn = self.self_attn(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\modules\multihead_attention.py", line 538, in forward
    return F.multi_head_attention_forward(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py", line 5440, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB. GPU 0 has a total capacty of 8.00 GiB of which 4.45 GiB is free. Of the allocated memory 1.36 GiB is allocated by PyTorch, and 239.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In theory, inference shouldn't need 8 GB of VRAM...
The model was trained on a rented 4090; inference runs on an RTX 2060S 8 GB (could that be related?)
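A workaround worth trying (just a sketch based on the allocator hint in the OOM message above, not something confirmed by the project): cap the CUDA caching allocator's split size via PYTORCH_CUDA_ALLOC_CONF before torch is imported, either in the shell before launching main.py or at the very top of the script. This only helps when the failure comes from fragmentation rather than a genuinely oversized allocation.

import os

# Hedged sketch: the option must be set before torch initializes CUDA;
# 512 is an arbitrary example value, not a recommendation from the repository.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported only after the environment variable is in place

print(torch.cuda.is_available())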

Error when using the realtime GUI

Traceback (most recent call last):
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\gui_realtime.py", line 8, in <module>
    from tools.infer_tools import DiffusionSVC
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\tools\infer_tools.py", line 11, in <module>
    from tools.tools import F0_Extractor, Volume_Extractor, Units_Encoder, SpeakerEncoder, cross_fade
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\tools\tools.py", line 12, in <module>
    from fairseq import checkpoint_utils
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\distributed\__init__.py", line 7, in <module>
    from .fully_sharded_data_parallel import (
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\distributed\fully_sharded_data_parallel.py", line 10, in <module>
    from fairseq.dataclass.configs import DistributedTrainingConfig
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\dataclass\__init__.py", line 6, in <module>
    from .configs import FairseqDataclass
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\dataclass\configs.py", line 1104, in <module>
    @dataclass
     ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 1230, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 1220, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory
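For context (my reading of the traceback, not an official note from the project): this is Python 3.11's stricter dataclass check tripping over fairseq's config classes. Since 3.11, dataclasses reject any unhashable default value, not just list/dict/set, with exactly this "mutable default ... use default_factory" error, and fairseq uses nested dataclass instances (such as CommonConfig()) as plain defaults. A minimal, self-contained illustration with stand-in class names:

from dataclasses import dataclass, field

@dataclass
class CommonConfig:      # stand-in for fairseq's nested config dataclass
    seed: int = 1

# On Python 3.11+ the commented-out form below raises the ValueError seen above,
# because a dataclass instance with eq=True is unhashable and is therefore
# treated as a "mutable default":
#
# @dataclass
# class FairseqConfig:
#     common: CommonConfig = CommonConfig()

@dataclass
class FairseqConfig:
    # The form Python 3.11 asks for: construct the default lazily per instance.
    common: CommonConfig = field(default_factory=CommonConfig)

print(FairseqConfig())   # FairseqConfig(common=CommonConfig(seed=1))

The usual workarounds are running the repository under Python 3.10 (where the older check only rejected list/dict/set defaults) or installing a fairseq build that has already moved these defaults to default_factory, if one is available.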

Vocal range issue

The introduction says the naive model has vocal range issues. In my own testing, however, without training a naive model at all and training only a shallow diffusion model or a full diffusion model, both models sound fine in the low and mid range, but high notes (above F5) come out weak: the volume is low and there are electronic artifacts (as heard when playing the audio in TensorBoard). It seems to be a problem with the diffusion model itself. The dataset is over 90 minutes, which shouldn't count as small. I tried raising the f0 upper frequency limit in the config file and re-running preprocessing and training, but it didn't seem to help.

Loss value does not decrease during training

Hello! I have been training a model on 1204 files, each between 5 and 25 seconds long, for a total of 14031 seconds of audio. However, despite training for approximately 3 hours (4722 epochs), I have noticed that the loss is not decreasing. I would appreciate any insight into why this might be happening and any suggestions on how to resolve the issue. Thank you!

configs/diffusion.yaml file:

data:
  block_size: 512
  cnhubertsoft_gate: 10
  duration: 2
  encoder: whisper-ppg
  encoder_hop_size: 320
  encoder_out_channels: 1024
  encoder_sample_rate: 16000
  extensions:
  - wav
  sampling_rate: 44100
  training_files: filelists/train.txt
  unit_interpolate_mode: nearest
  validation_files: filelists/val.txt
device: cuda
env:
  expdir: logs/44k/diffusion
  gpu_id: 0
infer:
  method: dpm-solver
  speedup: 10
model:
  n_chans: 512
  n_hidden: 256
  n_layers: 20
  n_spk: 1
  type: Diffusion
  use_pitch_aug: true
spk:
  speaker_all: 0
train:
  amp_dtype: fp32
  batch_size: 384
  cache_all_data: true
  cache_device: cpu
  cache_fp16: true
  decay_step: 100000
  epochs: 100000
  gamma: 0.5
  interval_force_save: 10000
  interval_log: 10
  interval_val: 2000
  lr: 0.00008
  num_workers: 2
  save_opt: false
  weight_decay: 0.01
vocoder:
  ckpt: pretrain/nsf_hifigan/model
  type: nsf-hifigan

Train loss and validation curves: (screenshots attached)

UPD: The model is trained as part of this repository: https://github.com/svc-develop-team/so-vits-svc
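One thing worth checking (a rough estimate on my part, assuming the dataloader simply iterates the 1204 clips once per epoch with the train.batch_size of 384 from the config above): an "epoch" here is only a handful of optimizer steps, so 4722 epochs corresponds to far fewer gradient updates than the number suggests.

num_files = 1204        # training clips reported above
batch_size = 384        # train.batch_size from the config
epochs = 4722

steps_per_epoch = -(-num_files // batch_size)   # ceiling division -> 4
total_steps = steps_per_epoch * epochs          # roughly 19,000 optimizer steps
print(steps_per_epoch, total_steps)

If that assumption holds, a loss curve that still looks flat after roughly 19k steps may simply mean training has not run long enough yet, rather than that something is broken.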

Voice stutters and lags during realtime inference

During realtime inference, the output voice sometimes stutters and lags. Changing input/output devices or combo models didn't help (decreasing the number of historical blocks may mitigate the problem). The problem can be reproduced by switching foreground programs or decreasing the speedup. When it occurs, there is no obvious abnormal resource usage (CUDA or video memory). Incidentally, the same problem also occurs in the DDSP-SVC project.

A demo video is below.

Desktop.2023.06.29.-.17.05.11.03.-.Trim.mp4

Any ideas or workarounds?

How to use whisper-ppg as encoder?

Hello! The training table for the Diffusion model has a "whisper-ppg (only can use with sovits)" row, but the config's encoder only offers the choices 'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768', 'contentvec768l12' or 'cnhubertsoftfish'. How can I use a pretrained whisper model? Thank you!

[Help Wanted] Why does the code subtract 2 from initial_global_step?

As the title asks.
In train.py, line 81:

param_group['lr'] = args.train.lr * args.train.gamma ** max((initial_global_step - 2) // args.train.decay_step, 0)

Line 83:

scheduler = lr_scheduler.StepLR(optimizer, step_size=args.train.decay_step, gamma=args.train.gamma, last_epoch=initial_global_step-2)
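Not an answer to the -2 offset itself, but for anyone reading the two lines above, here is a small self-contained sketch (with made-up lr/gamma/decay_step/resume_step values) of how StepLR resumes from a checkpointed step count. Passing last_epoch = n tells the scheduler that n steps have already happened; it then requires 'initial_lr' in each param group and keeps lr = initial_lr * gamma ** (n // step_size), which is the same formula line 81 precomputes.

import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(4, 4)
base_lr, gamma, decay_step = 8e-5, 0.5, 100000
resume_step = 250000                      # hypothetical global step read from a checkpoint

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
for group in optimizer.param_groups:
    group["initial_lr"] = base_lr         # required by StepLR when last_epoch != -1
    group["lr"] = base_lr * gamma ** (resume_step // decay_step)   # mirrors line 81

scheduler = lr_scheduler.StepLR(optimizer, step_size=decay_step, gamma=gamma,
                                last_epoch=resume_step)
print(scheduler.get_last_lr())            # [base_lr * gamma ** (resume_step // decay_step)]

One detail that may or may not be related to the offset: constructing the scheduler immediately performs one internal step, so its counter ends up at last_epoch + 1; whether that is the actual reason for the -2 is a question for the authors.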

What is the effect of using mean speaker embedding in Naive model training?

Thank you for this wonderful project sharing as open source.

I couldn't find args.model.use_speaker_encoder in any of the config files in the configs folder, but the option is there in preprocess.py.

I have a couple of questions.

  1. What is the effect of using mean speaker embedding in Naive model / shallow diffusion training? Will it increase speaker similarity along with quality?
  2. Why is a mean embedding computed instead of an utterance-level speaker embedding?
  3. Have any naive models been released with args.model.use_speaker_encoder set to true?

Thanks. I am looking forward to your comments :)

Suggestion: rename gui.py

It should be something like rtvc-gui.py, to avoid confusion with potential future GUIs for training or inference.

Some questions about v2.0

I have a few questions about the latest v2.0 that I'd like to ask the maintainers:

  1. Will the current v2.0 branch still see major changes? Can models already be trained and inferred on this branch?
  2. Are v2.0 models compatible with v1 models?
  3. What improvements does v2.0 bring over v1?
  4. To use the new hifi-vaegan as the vocoder, must the model be trained with hifi-vaegan, or can nsf-hifigan still be used?
  5. What are the advantages and disadvantages of hifi-vaegan compared with nsf-hifigan?
  6. If old and new models are incompatible, are there plans to train and release new pretrained base models?
  7. To train a base model with v2.0, does the config file need any specific changes (and are there recommended batch size and lr values)?

If any of these are inconvenient to answer, please feel free to skip them.

Does anyone know whether whisper-ppg-largev2 or v3 can be used to train diffusion models for use with the so-vits-svc model?

Does anyone know whether a diffusion model trained with whisper-ppg-largev2 or v3 can be used in the so-vits-svc project? I have actually trained a diffusion model with largev2 as the encoder, but the diff model cannot be used; it fails with the error shown in the attached screenshots. I noticed that this project states whisper-ppg cannot be used with the diff module, so why does the so-vits-svc team provide a way to train a whisper-ppg-largev2 diff model? I like the diff module and find it very useful, so I would really like to know why it cannot be used with my so-vits-svc 4.1 model, or to find a solution. Thanks!

(Error screenshots attached.)

FCPE fails to import

When trying to use FCPE as an f0 extractor, the following error appears: [Errno 2] No such file or directory: 'exp/f0bce_test_R004_cu0\\config.yaml'
It's looking for a config.yaml file for the FCPE model, but this shouldn't be the case.

Solution to serious memory leaks during preprocessing on Linux

Use the following command to force-update PyTorch to the nightly build:

cu118: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 --force-reinstall
cu121: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --force-reinstall

Hoarse sound when using realtime inference

I trained a combo model and it works well with offline inference in main.py, but gui_realtime.py only produces a hoarse sound without any clearly audible pronunciation (the pitch seems to be correct). I've tried changing input devices and upgrading torch, but nothing works.

Screenshots of my settings are below.

Offline Inference: (screenshot attached)

Realtime Inference: (screenshot attached)

Any ideas or workarounds?

Data loading takes too long

As the title says. With caching to cpu/cuda enabled, the rate is only 2-3 it/s, so loading 100k+ audio clips takes nearly seven hours.
With caching disabled, the "Load the f0, volume data from : data/train" step runs at about 20 it/s, which is still on the slow side, and training itself is also very slow.
The system is a Kubernetes pod with 12 CPU cores, 2x A800 GPUs, and Lustre cluster storage.

Samples

Do you have any samples? Of opencpop or kiritan, or anyone's, just to compare quality against the other SVC algorithms?

The sample rate of the input audio may cause problems

In my case, using only the shallow model, the input audio sample rate was 22050 Hz, while the training audio and the config_shallow.yaml sample rate were both 44100 Hz.

It reported: RuntimeError: The size of tensor a (336) must match the size of tensor b (760) at non-singleton dimension 2

After converting the sample rate to 44100 Hz, it works fine.
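A minimal sketch of the fix described above (assuming librosa and soundfile are installed, which is common in this stack): resample the input to the sampling_rate from the config (44100 Hz here) before running inference, so the feature frame counts line up with what the shallow model expects.

import librosa
import soundfile as sf

target_sr = 44100                                  # must match sampling_rate in config_shallow.yaml
audio, sr = librosa.load("input.wav", sr=None)     # keep the file's native rate
if sr != target_sr:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
sf.write("input_44k.wav", audio, target_sr)        # feed this resampled file to main.py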
