
knn-vc's Introduction

Voice Conversion With Just Nearest Neighbors (kNN-VC)

The official code repo! This repo contains training and inference code for kNN-VC -- an any-to-any voice conversion model from our paper, "Voice Conversion With Just k-Nearest Neighbors". The trained checkpoints are available under the 'Releases' tab and through torch.hub.

Links:

kNN-VC method

Figure: kNN-VC setup. The source and reference utterance(s) are encoded into self-supervised features using WavLM. Each source feature is assigned to the mean of the k closest features from the reference. The resulting feature sequence is then vocoded with HiFi-GAN to arrive at the converted waveform output.
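For intuition, here is a minimal sketch of that matching step (this is not the repo's matcher.py, just an illustration assuming query_seq and matching_set are (T, D) and (N, D) WavLM feature tensors):

    import torch

    def knn_match_sketch(query_seq: torch.Tensor, matching_set: torch.Tensor, k: int = 4) -> torch.Tensor:
        """Replace each source frame with the mean of its k nearest reference frames (cosine distance)."""
        q = query_seq / query_seq.norm(dim=-1, keepdim=True)        # (T, D) unit-normalised source features
        m = matching_set / matching_set.norm(dim=-1, keepdim=True)  # (N, D) unit-normalised reference features
        sims = q @ m.T                                              # (T, N) cosine similarities
        idx = sims.topk(k, dim=-1).indices                          # (T, k) nearest reference frames per source frame
        return matching_set[idx].mean(dim=1)                        # (T, D) converted features, ready for vocoding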

Authors:

*Equal contribution

Quickstart

We use torch.hub to make loading the model easy -- no cloning of the repo needed. The steps to perform inference are simple:

  1. Install dependencies: there are only three inference dependencies -- torch, torchaudio, and numpy. Python must be version 3.10 or greater, and torch must be v2.0 or greater.
  2. Load models: load the WavLM encoder and HiFiGAN vocoder:
import torch, torchaudio

knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)
# Or, if you would like the vocoder that was trained without prematched data, set prematched=False.
  3. Compute features for input and reference audio:
src_wav_path = '<path to arbitrary 16kHz waveform>.wav'
ref_wav_paths = ['<path to arbitrary 16kHz waveform from target speaker>.wav', '<path to 2nd utterance from target speaker>.wav', ...]

query_seq = knn_vc.get_features(src_wav_path)
matching_set = knn_vc.get_matching_set(ref_wav_paths)
  4. Perform the kNN matching and vocoding:
out_wav = knn_vc.match(query_seq, matching_set, topk=4)
# out_wav is a (T,) tensor: the converted 16kHz output waveform, using k=4 for the kNN matching.
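# For example, to write the converted audio to disk (the filename below is just a placeholder;
# out_wav is mono, so it needs a channel dimension for torchaudio.save):
torchaudio.save('output.wav', out_wav[None].cpu(), 16000)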

That's it! These default settings provide pretty good results, but feel free to modify the kNN topk or use the non-prematched vocoder. Note: the reference audio in ref_wav_paths can be anything, but it should be clean speech from the desired target speaker. The longer the cumulative duration of all reference waveforms, the better the quality will be (but the longer it will take to run). The improvement in quality diminishes beyond 5 minutes of reference speech.

Checkpoints

Under the releases tab of this repo we provide three checkpoints:

  • The frozen WavLM encoder taken from the original WavLM authors, which we host here for convenience and torch hub integration.
  • The HiFiGAN vocoder trained on layer 6 of WavLM features.
  • The HiFiGAN vocoder trained on prematched layer 6 of WavLM features (the best model in the paper).

For the HiFiGAN models we provide both the generator inference checkpoint and full training checkpoint with optimizer states.

The performance on the LibriSpeech dev-clean set is summarized below:

    checkpoint                        WER (%)   CER (%)   EER (%)
    kNN-VC with prematched HiFiGAN    6.29      2.34      35.73
    kNN-VC with regular HiFiGAN       6.39      2.41      32.55

Training

We follow the typical encoder-converter-vocoder setup for voice conversion. The encoder is WavLM, the converter is k-nearest neighbors regression, and the vocoder is HiFiGAN. The only component that requires training is the vocoder:

  1. WavLM encoder: we simply use the pretrained WavLM-Large model and do not train it for any part of our work. We suggest checking out the original WavLM repo to train your own SSL encoder.
  2. kNN conversion model: kNN is non-parametric and does not require any training :)
  3. HiFiGAN vocoder: we adapt the original HiFiGAN authors' repo for vocoding WavLM features. This is the only part which requires training.

HiFiGAN training

For training we require the same dependencies as the original HiFiGAN training -- namely librosa, tensorboard, matplotlib, fastprogress, and scipy. Then, to train the HiFiGAN:

  1. Precompute WavLM features of the vocoder dataset: we provide a utility for this for the LibriSpeech dataset in prematch_dataset.py:

    usage: prematch_dataset.py [-h] --librispeech_path LIBRISPEECH_PATH
                            [--seed SEED] --out_path OUT_PATH [--device DEVICE]
                            [--topk TOPK] [--matching_layer MATCHING_LAYER]
                            [--synthesis_layer SYNTHESIS_LAYER] [--prematch]
                            [--resume]

    where --prematch determines whether prematching is applied when generating the features (see the sketch after these steps for what prematching does conceptually). For example, to generate the dataset used to train the prematched HiFiGAN from the paper: python prematch_dataset.py --librispeech_path /path/to/librispeech/root --out_path /path/where/you/want/outputs/to/go --topk 4 --matching_layer 6 --synthesis_layer 6 --prematch

  2. Train HiFiGAN: we adapt the training script from the original HiFiGAN repo to work for WavLM features in hifigan/train.py. To train a hifigan model on the features you produced above:

    python -m hifigan.train --audio_root_path /path/to/librispeech/root/ --feature_root_path /path/to/the/output/of/previous/step/ --input_training_file data_splits/wavlm-hifigan-train.csv --input_validation_file data_splits/wavlm-hifigan-valid.csv --checkpoint_path /path/where/you/want/to/save/checkpoint --fp16 False --config hifigan/config_v1_wavlm.json --stdout_interval 25 --training_epochs 1800 --fine_tuning

    That's it! Once it has run for up to 2.5M updates (or once the output starts to sound worse), you can stop training and use the trained checkpoint.
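For context, prematching (the --prematch flag in step 1) roughly means applying the same kNN regression used at inference time to the training data itself: each utterance's features are replaced by kNN-matched features drawn from the same speaker's other utterances, so the vocoder is trained on the kind of features it will see at test time. A rough sketch of that idea, reusing the knn_match_sketch helper from earlier (the variable names are illustrative only, not the repo's code):

    # feats_list: list of (T_i, D) WavLM feature tensors, one per utterance of a single speaker
    prematched_feats = []
    for i, feats in enumerate(feats_list):
        # The matching pool is every *other* utterance from the same speaker.
        pool = torch.cat([f for j, f in enumerate(feats_list) if j != i], dim=0)
        prematched_feats.append(knn_match_sketch(feats, pool, k=4))  # targets the vocoder learns to invert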

Repository structure

├── data_splits                             # csv train/validation splits for librispeech train-clean-100
│   ├── wavlm-hifigan-train.csv
│   └── wavlm-hifigan-valid.csv
├── hifigan                                 # adapted hifigan code to vocode wavlm features
│   ├── config_v1_wavlm.json                # hifigan config for use with wavlm features
│   ├── meldataset.py                       # mel-spectrogram transform used during hifigan training
│   ├── models.py                           # hifigan model definition
│   ├── train.py                            # hifigan training script
│   └── utils.py                            # utilities used for hifigan inference/training
├── hubconf.py                              # torchhub integration
├── matcher.py                              # kNN matching logic and model wrapper
├── prematch_dataset.py                     # script to precompute wavlm features for librispeech
├── README.md                               
└── wavlm                                   
    ├── modules.py                          # wavlm helper functions (from original WavLM repo)
    └── WavLM.py                            # wavlm modules (from original WavLM repo)

Acknowledgements

Parts of code for this project are adapted from the following repositories -- please make sure to check them out! Thank you to the authors of:

Citation

@inproceedings{baas2023knnvc,
  author={Matthew Baas and Benjamin van Niekerk and Herman Kamper},
  title={Voice Conversion With Just Nearest Neighbors},
  year=2023,
  booktitle={Interspeech},
}

knn-vc's People

Contributors

bandanban1, eschmidbauer, keikinn, rf5


knn-vc's Issues

Considering context around source features

Hi,

I had an idea, wanted to run it by you. Right now, for each source feature, you are doing k-nearest-neighbour matching against the reference features. I'm thinking that the surrounding source features might also contain useful information that could help better pin down the correct reference feature.

So for example my source features are [s1 s2 ... s100] and reference features (let's just assume k = 1) are [r1 r2 ... r100]. If you consider the source features by themselves, maybe s1 maps to r22 and s2 maps to r77. But if you were to consider s1 and s2 together, they might jointly map to [r23, r24], which is more correct.
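For concreteness, here is a hypothetical sketch of one way to add such context: stack each frame with its neighbours before the distance computation, while still averaging the original reference frames for the output (the helper name and window size are made up for illustration):

    import torch
    import torch.nn.functional as F

    def stack_context(feats: torch.Tensor, context: int = 1) -> torch.Tensor:
        """Concatenate each frame of a (T, D) sequence with its `context` left/right neighbours (edge-padded)."""
        padded = F.pad(feats.T[None], (context, context), mode='replicate')[0].T  # (T + 2*context, D)
        windows = [padded[i:i + feats.shape[0]] for i in range(2 * context + 1)]
        return torch.cat(windows, dim=-1)  # (T, D * (2*context + 1))

    # Distances would then be computed between stack_context(source) and stack_context(reference),
    # while the output frames are still averaged from the original (un-stacked) reference features.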

Let me know what you think about this. Does this make sense/is my scenario plausible?

Thank you.

SoX effect fails on Windows with SoundFile backend

Traceback (most recent call last):
  File "\.cache\torch\hub\bshall_knn-vc_master\matcher.py", line 72, in get_matching_set
    feats.append(self.get_features(p, weights=self.weighting if weights is None else weights, vad_trigger_level=vad_trigger_level))
  File "\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "\.cache\torch\hub\bshall_knn-vc_master\matcher.py", line 105, in get_features
    waveform_reversed, sr = apply_effects_tensor(x_front_trim, sr, [["reverse"]])
  File "\lib\site-packages\torchaudio\_internal\module_utils.py", line 73, in wrapped
    raise RuntimeError(f"{func.__module__}.{func.__name__} {message}")
RuntimeError: torchaudio.sox_effects.sox_effects.apply_effects_tensor requires sox extension, but TorchAudio is not compiled with it. Please build TorchAudio with libsox support.

It seems to work correctly after patching the apply_effects_tensor method, since reverse is the only effect used, but this is probably not the most elegant solution.

torchaudio.sox_effects.apply_effects_tensor = lambda waveform, sample_rate, _: (
    torch.flip(waveform, (-1,)),
    sample_rate,
)

Some questions about implementation

  1. Since the structure of the LibriSpeech dataset is root/subset/speaker/chapter/file, should
    uttrs_from_same_spk = sorted(list(path.parent.rglob('**/*.flac')))
    be modified to uttrs_from_same_spk = sorted(list(path.parent.parent.rglob('**/*.flac'))) to return other utterances of the same speaker? (See the path sketch below.)
  2. Since matching and synthesis do not necessarily use the same features, can I use ASR features (like from Whisper) for matching and its corresponding original Mel spectrogram for synthesis?
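For reference, a small sketch of what path.parent versus path.parent.parent point at for a typical LibriSpeech file (the exact path below is just an example):

    from pathlib import Path

    # LibriSpeech layout: root/subset/speaker/chapter/file
    p = Path('LibriSpeech/train-clean-100/19/198/19-198-0001.flac')
    print(p.parent)         # LibriSpeech/train-clean-100/19/198  (chapter directory)
    print(p.parent.parent)  # LibriSpeech/train-clean-100/19      (speaker directory)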

prematch argument

Hi,

Thank you for this great repo!

In the readme file you mention that the prematch option applies prematching to the features. But in the code I see that prematch saves all the source features as-is, without matching them to each other. Can you please elaborate? I think I'm missing something.

Thanks

loss issues encountered in fine-tuning the model

Hello author, this project is great. I am trying to add some Chinese speech for fine-tuning, but my 'validation/mel_spec_error' has almost stopped decreasing at 15k steps, and 'training/gen_loss_total' has also increased. I would like to ask whether this loss behaviour is normal. Thank you so much.

An error when loading models

I try:
import torch, torchaudio
knn_vc = torch.hub.load('/home/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)

There is an error when loading models:
Traceback (most recent call last):
  File "/home/knn-vc/inf.py", line 3, in <module>
    knn_vc = torch.hub.load('knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)
  File "/home/miniconda3/envs/knvc/lib/python3.10/site-packages/torch/hub.py", line 555, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/home/miniconda3/envs/knvc/lib/python3.10/site-packages/torch/hub.py", line 199, in _get_cache_or_reload
    repo_owner, repo_name, ref = _parse_repo_info(github)
  File "/home/miniconda3/envs/knvc/lib/python3.10/site-packages/torch/hub.py", line 135, in _parse_repo_info
    repo_owner, repo_name = repo_info.split('/')
ValueError: not enough values to unpack (expected 2, got 1)

Torch Hub CPU inference support

Currently it seems your repository only supports running on GPU, and gives the error

knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True) 
Using cache found in C:\Users\Skyler/.cache\torch\hub\bshall_knn-vc_master
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\hub.py", line 558, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\hub.py", line 587, in _load_local
    model = entry(*args, **kwargs)
  File "C:\Users\Skyler/.cache\torch\hub\bshall_knn-vc_master\hubconf.py", line 20, in knn_vc
    hifigan, hifigan_cfg = hifigan_wavlm(pretrained, progress, prematched, device)
  File "C:\Users\Skyler/.cache\torch\hub\bshall_knn-vc_master\hubconf.py", line 36, in hifigan_wavlm
    generator = HiFiGAN(h).to(device)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\cuda\__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Can you modify the hubconf to support CPU-only systems too?

Discriminator checkpoint

Hey! Thank you for sharing your work, I really like your idea!

I am trying to fine-tune the vocoder for a specific voice to see whether it would improve voice matching, because the voice doesn't match well in the zero-shot setting. Could you please share the discriminator weights of the vocoder with prematching?

Request

Do you think you could make this 48kHz compatible?

extend to other SSL model features

Hi authors,

This is an interesting piece of work on VC! Have you tried applying the same idea to codec latents as well? I read that you've tried it on HuBERT features and it worked too, but I'm wondering if you tested models like EnCodec / SoundStream, or if you have any insights on them. Thanks!

Best,
Dongyao

prematch_dataset run very slow

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1C:00.0 On | N/A |
| 0% 46C P8 18W / 170W | 3902MiB / 12288MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1362 G /usr/lib/xorg/Xorg 53MiB |
| 0 N/A N/A 2014 G /usr/lib/xorg/Xorg 119MiB |
| 0 N/A N/A 2144 G /usr/bin/gnome-shell 54MiB |
| 0 N/A N/A 384873 C python 3659MiB |
+-----------------------------------------------------------------------------+


The process takes up GPU memory, but the GPU doesn't actually seem to be used -- utilization stays low and the power draw has not increased. Is there a problem?

Training HiFiGAN on higher quality data

Hey, I was wondering what changes to the training script would be needed to train HiFiGAN on higher quality data like LibriTTS or LibriTTS-R. That dataset uses wav files instead of flac files and has a 24kHz sampling rate. I can preprocess the dataset to 16kHz and change the files in data_splits to work with wav files, but I wanted to know the best way to work with this kind of data. If there are other ways to help improve the general quality of the outputs, I'd be happy to explore those too. Any help would be great, thanks!
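As a starting point, the 24kHz audio can be resampled to the 16kHz that the released models expect; a minimal sketch with torchaudio (file paths are placeholders):

    import torchaudio

    wav, sr = torchaudio.load('libritts_utterance.wav')        # e.g. a 24kHz LibriTTS file
    wav_16k = torchaudio.functional.resample(wav, sr, 16000)   # resample to 16kHz
    torchaudio.save('libritts_utterance_16k.wav', wav_16k, 16000)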

Some questions about kNN on features

Have you tried using a WavLM model fine-tuned on an ASR dataset to extract semantic features for the kNN query, instead of using the SSL features directly? That is, use kNN only to obtain the matching timestamps, and then use the reference WavLM SSL features at those timestamps to generate the output.

Using another batch size in training

I encountered a strange bug, or rather strange behaviour, which I cannot really pinpoint.
I used the standard training as you described and it worked fine. However, when I changed the batch_size parameter to 12 in config_v1_wavlm.json, train.py only executed up to line 136, for i, batch in pb:. It's not a memory issue, as I still have more than 12GB free on my GPU, but for some reason the script seems to skip the for loop if you increase the batch size in the json file.

Error in quickstart

Hi, I'd like to test the KNN-VC model, but I'm getting an error at the very beginning:

Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.1+cu117'
>>> import torch, torchaudio
>>> knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)
Downloading: "https://github.com/bshall/knn-vc/zipball/master" to /local/home/fa125436/.cache/torch/hub/master.zip
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data1/is156025/fa125436/N2D2/env/lib/python3.8/site-packages/torch/hub.py", line 558, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/data1/is156025/fa125436/N2D2/env/lib/python3.8/site-packages/torch/hub.py", line 584, in _load_local
    hub_module = _import_module(MODULE_HUBCONF, hubconf_path)
  File "/data1/is156025/fa125436/N2D2/env/lib/python3.8/site-packages/torch/hub.py", line 98, in _import_module
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/local/home/fa125436/.cache/torch/hub/bshall_knn-vc_master/hubconf.py", line 15, in <module>
    from matcher import KNeighborsVC
  File "/local/home/fa125436/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 32, in <module>
    class KNeighborsVC(nn.Module):
  File "/local/home/fa125436/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 58, in KNeighborsVC
    def get_matching_set(self, wavs: list[Path] | list[Tensor], weights=None, vad_trigger_level=7) -> Tensor:
TypeError: 'type' object is not subscriptable
>>>

Can you please help? Thanks a lot

WavLM Base+ over Large?

First, thanks for the paper and the code, this is very interesting!
Did you happen to do any testing with other versions of WavLM, such as Base or Base+? I was wondering if it would be possible to make this lighter without impacting the quality too much.

torchaudio version

Hi, When I execute the program according to the Quickstart in README.md, the following error is reported:

AttributeError: module 'torchaudio.functional' has no attribute 'loudness'

I guess it is not supported in the torchaudio version I am using; my versions are pytorch==1.12.1 and torchaudio==0.12.1.

Can you tell me which version you are using? Thanks.

Will this work for singing voice conversion (svc)?

Great repo! I ran some tests with it and it sounds good for speech, but the limited singing tests I did didn't sound too great. Is this expected, and is there a way to adapt it to work well with singing? Perhaps switch it to use NSF-HiFiGAN as so-vits-svc does?

P.S. I especially like the zero-shot any-to-any nature of this model; I'm not sure if there are other projects out there now for zero-shot SVC.

Size mismatch error

Hi! I'm trying to run the basic quickstart script, but it gives me:

Traceback (most recent call last):
  File "/data/code_jb/knn-vc/test_run.py", line 11, in <module>
    out_wav = knn_vc.match(query_seq, matching_set, topk=4)
  File "/data/SOFT/miniconda/envs/ml2/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/i.beskrovnyy/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 158, in match
    dists = fast_cosine_dist(query_seq, matching_set, device=device)
  File "/home/i.beskrovnyy/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 25, in fast_cosine_dist
    dotprod = -torch.cdist(source_feats[None].to(device), matching_pool[None], p=2)[0]**2 + source_norms[:, None]**2 + matching_norms[None]**2
RuntimeError: The size of tensor a (782) must match the size of tensor b (543) at non-singleton dimension 2

My source and target files differ in sample rate and length; could that be the problem?

Output is a bit shaky, how to fix that?

Thanks for the great work and making code with all weights available!
Really appreciate it..

Can you please guide me on how to improve the output further?
If we change the vocoder to HiFi-GAN V2 or train on more data, how do you think the output will change?

Also, how much time does it take to train on the train-clean-100 data from LibriSpeech?

Hints on improvements for training and matching

First of all, thanks for the great model! I have tested it extensively by now and ran across a few problems and performance issues which you might be able to help with.

  1. Matching takes a lot of time with big datasets (1000+ two-minute files), since it is not multi-GPU. Do you intend to change that in the future?

  2. General behaviour: for training in general, it seems to be better to have a few big files rather than many small files (2min vs 10sec). I think this might be related to the overhead introduced by all the small .pt "models". Can you confirm this, or does it seem plausible?

  3. My biggest issue so far is when I try to fine-tune the HiFi-GAN vocoder. My notebook with an A4500 seems to be on par with, and even outperform, my DGX Station with 4 x V100 32GB GPUs, which is strange.

I identified the following things:

During validation the operation is performed on all files rather than on a batch (1000+ files). The station and the notebook are both about equally fast at validating all files. However, the station uses 4 GPUs which are working at 100% all the time and should be a lot faster. Since this is really slow, how often do you think I should perform validation?

During batch training the notebook also slightly outperforms the station, completing one epoch in about 40sec (station: 48sec).
However, when I look at nvidia-smi on the station, the GPU usage is at 0% all the time.

Unfortunately there seem to be some serious issues with the multi-GPU approach. If I only use one GPU on the station, one epoch takes about 17 seconds. Maybe you have an idea what goes wrong here?

Edit:
When I monitored the epochs in a multi-GPU setting, it seems the epoch itself was trained really fast, in about 5 seconds. However, before the progress bar appears there seems to be some loading which takes the other 50 seconds. Do you know which process is responsible for that time gap or how to minimize it?


What I tried so far:
I tried different batch sizes and adjusted the number of workers in the config, but it did not really change the results that much.

Choice for k

Hi, the paper goes over the choice of k very briefly, so I was wondering if you could share some results of the preliminary experiments. It says "when more reference audio is available (e.g. ≥10 mins), the conversion quality may even be improved by using larger values of k (in the order of k = 20)"; does the quality keep getting better past k=20, or does it start degrading after a certain point? Also, did you try k=1, which happens to be the approach this project uses? If so, what were the results?

Matching pool empty

I sometimes experience a bug when performing matching with big datasets (20k samples+).

This is the Stacktrace:

Feature has shape:  torch.Size([445, 1024])---------------------------------------------------------| 0.02% [1/5293 00:10<15:54:29 train_clean2/102/102-83.flac]
Feature has shape:  torch.Size([400, 1024])---------------------------------------------------------| 0.04% [2/5293 00:15<11:44:33 train_clean2/102/102-60.flac]
Done 1,000/5,293████-------------------------------------------------------------------------------| 18.89% [1000/5293 05:40<24:20 train_clean2/14/14-55.flac]c]
Traceback (most recent call last):-----------------------------------------------------------------| 21.56% [1141/5293 06:36<24:02 train_clean2/15/15-5.flac]]]
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 172, in <module>
    main(args)
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 51, in main
    extract(ls_df, wavlm, args.device, Path(args.librispeech_path), Path(args.out_path), SYNTH_WEIGHTINGS, MATCH_WEIGHTINGS)
  File "/raid/nils/projects/knn-vc/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 128, in extract
    matching_pool, synth_pool = path2pools(row.path, wavlm, match_weights, synth_weights, device)
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 75, in path2pools
    matching_pool = torch.concat(matching_pool, dim=0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

However, the problem seems not to be the file itself. When I change the matching directory to train_clean2/15, the algorithm runs through without any problems. I used the German Distant Speech Data Corpus 2014/2015 to run this experiment. I wonder what the root of this error might be; I have had directories of 5000+ files run through without a problem, but sometimes this bug still appears. For some reason it seems not to find a matching vector.

Question about WavLM layer choice

In your paper, you say:

Recent work confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification – layer 6 in WavLM-Large – was necessary for good speaker similarity and retention of the prosody information from the source utterance.

The reference associated with that passage, though, doesn't seem to examine WavLM-Large, only Base, and my reading of it is that WavLM-Base's earlier layers (0-2) are more correlated with pitch and energy reconstruction, common speaker ID features.

I'm wondering how you came to use layer 6 of the Large model and whether you tried other layers. I'm having trouble locating other research that dives into layerwise feature correlations for these models, so any pointers you can provide are helpful.

Thanks!

Link to paper

Hi, I was wondering if you could provide a link to the research paper. Thanks!

Maybe mention memory consumption in readme.md?

I just tried to use and test your model; unfortunately I only have a GPU with 16GB of memory. Apparently WavLM takes about 12GB and HiFiGAN needs another 5GB, so you need at least 20GB of GPU memory to run inference. It would be nice to clarify that in the requirements section :)
