
diffusion-svc's People

Contributors

bfloat16, cnchtu, huanlinoto, kakaruhayate, mlbv, narusemioshirakana, rainaobi, ricecakey06, ylzz1997, yxlllc


diffusion-svc's Issues

OOM during inference with 8 GB of VRAM

The command used was: python .\main.py -i 'input.wav' -model .\models\murasame.ptc -o 'output.wav' -k 0 -kstep 100 -pe rmvpe
The log is as follows:

2023-12-25 01:45:06 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
 [Loading] .\models\murasame.ptc
C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-12-25 01:45:09 | INFO | fairseq.tasks.hubert_pretraining | current directory is F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC
2023-12-25 01:45:09 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-12-25 01:45:09 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
Units Forced Mode:nearest
 [INFO] Extract f0 volume and mask: Use rmvpe, start...
 [INFO] Extract f0 volume and mask: Done. Use time:5.367121458053589
  0%|                                                                                            | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\main.py", line 195, in <module>
    out_wav, out_sr = diffusion_svc.infer_from_long_audio(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\infer_tools.py", line 380, in infer_from_long_audio
    seg_units = self.units_encoder.encode(seg_input, sr, hop_size)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\tools.py", line 471, in encode
    units = self.model(audio_res, padding_mask=padding_mask)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\tools.py", line 601, in __call__
    logits = self.hubert.extract_features(**inputs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\hubert\hubert.py", line 535, in extract_features
    res = self.forward(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\hubert\hubert.py", line 467, in forward
    x, _ = self.encoder(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1003, in forward
    x, layer_results = self.extract_features(x, padding_mask, layer)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1049, in extract_features
    x, (z, lr) = layer(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1260, in forward
    x, attn = self.self_attn(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\modules\multihead_attention.py", line 538, in forward
    return F.multi_head_attention_forward(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py", line 5440, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB. GPU 0 has a total capacty of 8.00 GiB of which 4.45 GiB is free. Of the allocated memory 1.36 GiB is allocated by PyTorch, and 239.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In theory, inference shouldn't need 8 GB of VRAM...
The model was trained on a rented 4090; inference runs on an RTX 2060S 8 GB (could that be related?)
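A workaround worth trying (just a sketch based on the allocator hint in the OOM message above, not something confirmed by the project): cap the CUDA caching allocator's split size via PYTORCH_CUDA_ALLOC_CONF before torch is imported, either in the shell before launching main.py or at the very top of the script. This only helps when the failure comes from fragmentation rather than a genuinely oversized allocation.

import os

# Hedged sketch: the option must be set before torch initializes CUDA;
# 512 is an arbitrary example value, not a recommendation from the repository.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported only after the environment variable is in place

print(torch.cuda.is_available())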

Error when using the realtime GUI

Traceback (most recent call last):
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\gui_realtime.py", line 8, in <module>
    from tools.infer_tools import DiffusionSVC
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\tools\infer_tools.py", line 11, in <module>
    from tools.tools import F0_Extractor, Volume_Extractor, Units_Encoder, SpeakerEncoder, cross_fade
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\tools\tools.py", line 12, in <module>
    from fairseq import checkpoint_utils
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\distributed\__init__.py", line 7, in <module>
    from .fully_sharded_data_parallel import (
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\distributed\fully_sharded_data_parallel.py", line 10, in <module>
    from fairseq.dataclass.configs import DistributedTrainingConfig
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\dataclass\__init__.py", line 6, in <module>
    from .configs import FairseqDataclass
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\dataclass\configs.py", line 1104, in <module>
    @dataclass
     ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 1230, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 1220, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory
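For context (my reading of the traceback, not an official note from the project): this is Python 3.11's stricter dataclass check tripping over fairseq's config classes. Since 3.11, dataclasses reject any unhashable default value, not just list/dict/set, with exactly this "mutable default ... use default_factory" error, and fairseq uses nested dataclass instances (such as CommonConfig()) as plain defaults. A minimal, self-contained illustration with stand-in class names:

from dataclasses import dataclass, field

@dataclass
class CommonConfig:      # stand-in for fairseq's nested config dataclass
    seed: int = 1

# On Python 3.11+ the commented-out form below raises the ValueError seen above,
# because a dataclass instance with eq=True is unhashable and is therefore
# treated as a "mutable default":
#
# @dataclass
# class FairseqConfig:
#     common: CommonConfig = CommonConfig()

@dataclass
class FairseqConfig:
    # The form Python 3.11 asks for: construct the default lazily per instance.
    common: CommonConfig = field(default_factory=CommonConfig)

print(FairseqConfig())   # FairseqConfig(common=CommonConfig(seed=1))

The usual workarounds are running the repository under Python 3.10 (where the older check only rejected list/dict/set defaults) or installing a fairseq build that has already moved these defaults to default_factory, if one is available.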

Vocal range issue

The introduction says the naive model has vocal range issues. In my own testing, however, without training a naive model at all and training only a shallow diffusion model or a full diffusion model, both models sound fine in the low and mid range, but high notes (above F5) come out weak: the volume is low and there are electronic artifacts (as heard when playing the audio in TensorBoard). It seems to be a problem with the diffusion model itself. The dataset is over 90 minutes, which shouldn't count as small. I tried raising the f0 upper frequency limit in the config file and re-running preprocessing and training, but it didn't seem to help.

Loss value does not decrease during training

Hello! I have been training a model on 1204 files, each between 5 and 25 seconds long, for a total of 14031 seconds of audio. However, despite training for approximately 3 hours (4722 epochs), I have noticed that the loss is not decreasing. I would appreciate any insight into why this might be happening and any suggestions on how to resolve the issue. Thank you!

configs/diffusion.yaml file:

data:
  block_size: 512
  cnhubertsoft_gate: 10
  duration: 2
  encoder: whisper-ppg
  encoder_hop_size: 320
  encoder_out_channels: 1024
  encoder_sample_rate: 16000
  extensions:
  - wav
  sampling_rate: 44100
  training_files: filelists/train.txt
  unit_interpolate_mode: nearest
  validation_files: filelists/val.txt
device: cuda
env:
  expdir: logs/44k/diffusion
  gpu_id: 0
infer:
  method: dpm-solver
  speedup: 10
model:
  n_chans: 512
  n_hidden: 256
  n_layers: 20
  n_spk: 1
  type: Diffusion
  use_pitch_aug: true
spk:
  speaker_all: 0
train:
  amp_dtype: fp32
  batch_size: 384
  cache_all_data: true
  cache_device: cpu
  cache_fp16: true
  decay_step: 100000
  epochs: 100000
  gamma: 0.5
  interval_force_save: 10000
  interval_log: 10
  interval_val: 2000
  lr: 0.00008
  num_workers: 2
  save_opt: false
  weight_decay: 0.01
vocoder:
  ckpt: pretrain/nsf_hifigan/model
  type: nsf-hifigan

Train loss and validation curves: (screenshots attached)

UPD: The model is trained as part of this repository: https://github.com/svc-develop-team/so-vits-svc
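One thing worth checking (a rough estimate on my part, assuming the dataloader simply iterates the 1204 clips once per epoch with the train.batch_size of 384 from the config above): an "epoch" here is only a handful of optimizer steps, so 4722 epochs corresponds to far fewer gradient updates than the number suggests.

num_files = 1204        # training clips reported above
batch_size = 384        # train.batch_size from the config
epochs = 4722

steps_per_epoch = -(-num_files // batch_size)   # ceiling division -> 4
total_steps = steps_per_epoch * epochs          # roughly 19,000 optimizer steps
print(steps_per_epoch, total_steps)

If that assumption holds, a loss curve that still looks flat after roughly 19k steps may simply mean training has not run long enough yet, rather than that something is broken.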

Voice stutters and lags during realtime inference

During realtime inference, the output voice sometimes stutters and lags. Changing input/output devices or combo models didn't help (decreasing the number of historical blocks may mitigate the problem). The problem can be reproduced by switching foreground programs or decreasing the speedup. When it occurs, there is no obvious abnormal resource usage (CUDA or video memory). Incidentally, the same problem also occurs in the DDSP-SVC project.

A demo video is below.

Desktop.2023.06.29.-.17.05.11.03.-.Trim.mp4

Any ideas or workarounds?

How to use whisper-ppg as encoder?

Hello! The training table for the Diffusion model has a "whisper-ppg (only can use with sovits)" row, but the config's encoder only offers the choices 'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768', 'contentvec768l12' or 'cnhubertsoftfish'. How can I use a pretrained whisper model? Thank you!

[Help Wanted] Why does the code subtract 2 from initial_global_step?

As the title asks.
In train.py, line 81:

param_group['lr'] = args.train.lr * args.train.gamma ** max((initial_global_step - 2) // args.train.decay_step, 0)

Line 83:

scheduler = lr_scheduler.StepLR(optimizer, step_size=args.train.decay_step, gamma=args.train.gamma, last_epoch=initial_global_step-2)
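Not an answer to the -2 offset itself, but for anyone reading the two lines above, here is a small self-contained sketch (with made-up lr/gamma/decay_step/resume_step values) of how StepLR resumes from a checkpointed step count. Passing last_epoch = n tells the scheduler that n steps have already happened; it then requires 'initial_lr' in each param group and keeps lr = initial_lr * gamma ** (n // step_size), which is the same formula line 81 precomputes.

import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(4, 4)
base_lr, gamma, decay_step = 8e-5, 0.5, 100000
resume_step = 250000                      # hypothetical global step read from a checkpoint

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
for group in optimizer.param_groups:
    group["initial_lr"] = base_lr         # required by StepLR when last_epoch != -1
    group["lr"] = base_lr * gamma ** (resume_step // decay_step)   # mirrors line 81

scheduler = lr_scheduler.StepLR(optimizer, step_size=decay_step, gamma=gamma,
                                last_epoch=resume_step)
print(scheduler.get_last_lr())            # [base_lr * gamma ** (resume_step // decay_step)]

One detail that may or may not be related to the offset: constructing the scheduler immediately performs one internal step, so its counter ends up at last_epoch + 1; whether that is the actual reason for the -2 is a question for the authors.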

What is the effect of using mean speaker embedding in Naive model training?

Thank you for this wonderful project sharing as open source.

I couldn't find args.model.use_speaker_encoder in any of the config files in the configs folder, but the option is there in preprocess.py.

I have a couple of questions.

  1. What is the effect of using mean speaker embedding in Naive model / shallow diffusion training? Will it increase speaker similarity along with quality?
  2. Why is a mean embedding computed instead of an utterance-level speaker embedding?
  3. Have any naive models been released with args.model.use_speaker_encoder set to true?

Thanks. I am looking forward to your comments :)

Suggestion: rename gui.py

It should be something like rtvc-gui.py, to avoid confusion with potential future GUIs for training or inference.

Some questions about v2.0

I have a few questions about the latest v2.0 that I'd like to ask the maintainers:

  1. Will the current v2.0 branch still see major changes? Can models already be trained and inferred on this branch?
  2. Are v2.0 models compatible with v1 models?
  3. What improvements does v2.0 bring over v1?
  4. To use the new hifi-vaegan as the vocoder, must the model be trained with hifi-vaegan, or can nsf-hifigan still be used?
  5. What are the advantages and disadvantages of hifi-vaegan compared with nsf-hifigan?
  6. If old and new models are incompatible, are there plans to train and release new pretrained base models?
  7. To train a base model with v2.0, does the config file need any specific changes (and are there recommended batch size and lr values)?

If any of these are inconvenient to answer, please feel free to skip them.

Does anyone know whether whisper-ppg-largev2 or v3 can be used to train diffusion models for use with the so-vits-svc model?

Does anyone know whether a diffusion model trained with whisper-ppg-largev2 or v3 can be used in the so-vits-svc project? I have actually trained a diffusion model with largev2 as the encoder, but the diff model cannot be used; it fails with the error shown in the attached screenshots. I noticed that this project states whisper-ppg cannot be used with the diff module, so why does the so-vits-svc team provide a way to train a whisper-ppg-largev2 diff model? I like the diff module and find it very useful, so I would really like to know why it cannot be used with my so-vits-svc 4.1 model, or to find a solution. Thanks!

(Error screenshots attached.)

FCPE fails to import

When trying to use FCPE as an f0 extractor, the following error appears: [Errno 2] No such file or directory: 'exp/f0bce_test_R004_cu0\\config.yaml'
It's looking for a config.yaml file for the FCPE model, but this shouldn't be the case.

Solution to serious memory leaks during preprocessing on Linux

Use the following command to force-update PyTorch to the nightly build:

cu118: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 --force-reinstall
cu121: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --force-reinstall

Hoarse sound when using realtime inference

I trained a combo model and it works well with offline inference in main.py, but gui_realtime.py only produces a hoarse sound without any clearly audible pronunciation (the pitch seems to be correct). I've tried changing input devices and upgrading torch, but nothing works.

Screenshots of my settings are below.

Offline Inference: (screenshot attached)

Realtime Inference: (screenshot attached)

Any ideas or workarounds?

Data loading takes too long

As the title says. With caching to cpu/cuda enabled, the rate is only 2-3 it/s, so loading 100k+ audio clips takes nearly seven hours.
With caching disabled, the "Load the f0, volume data from : data/train" step runs at about 20 it/s, which is still on the slow side, and training itself is also very slow.
The system is a Kubernetes pod with 12 CPU cores, 2x A800 GPUs, and Lustre cluster storage.

Samples

Do you have any samples? Of opencpop or kiritan, or anyone's, just to compare quality against the other SVC algorithms?

The sample rate of the input audio may cause problems

In my case, using only the shallow model, the input audio sample rate was 22050 Hz, while the training audio and the config_shallow.yaml sample rate were both 44100 Hz.

It reported: RuntimeError: The size of tensor a (336) must match the size of tensor b (760) at non-singleton dimension 2

After converting the sample rate to 44100 Hz, it works fine.
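A minimal sketch of the fix described above (assuming librosa and soundfile are installed, which is common in this stack): resample the input to the sampling_rate from the config (44100 Hz here) before running inference, so the feature frame counts line up with what the shallow model expects.

import librosa
import soundfile as sf

target_sr = 44100                                  # must match sampling_rate in config_shallow.yaml
audio, sr = librosa.load("input.wav", sr=None)     # keep the file's native rate
if sr != target_sr:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
sf.write("input_44k.wav", audio, target_sr)        # feed this resampled file to main.py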
