

Contrastive Voice Conversion (CVC)




This implementation is based on CUT; thanks to Taesung and Junyan for sharing their code.

We provide a PyTorch implementation of non-parallel voice conversion based on patch-wise contrastive learning and adversarial learning. Compared to the baseline CycleGAN-VC, CVC requires only one-way GAN training for non-parallel one-to-one voice conversion, while improving speech quality and reducing training time.
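To make the patch-wise contrastive objective concrete, below is a minimal PyTorch sketch of a PatchNCE-style loss. The shapes, the temperature value, and the function name are illustrative assumptions, not the exact code in this repo.

import torch
import torch.nn.functional as F

def patch_nce_loss(feat_src, feat_tgt, temperature=0.07):
    # feat_src, feat_tgt: (num_patches, dim) L2-normalized features sampled
    # from corresponding positions of the input and converted mel-spectrograms.
    # Each source patch should match the target patch at the same position
    # (positive pair) and mismatch every other position (negatives).
    logits = feat_src @ feat_tgt.t() / temperature   # (N, N) similarity matrix
    labels = torch.arange(feat_src.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage with random normalized features
src = F.normalize(torch.randn(64, 256), dim=1)
tgt = F.normalize(torch.randn(64, 256), dim=1)
print(patch_nce_loss(src, tgt))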

Prerequisites

  • Linux or macOS
  • Python 3
  • CPU or NVIDIA GPU + CUDA CuDNN

Kick Start

  • Clone this repo:
git clone https://github.com/Tinglok/CVC
cd CVC
  • Install PyTorch 1.6 and other dependencies.

    For pip users, please type the command pip install -r requirements.txt.

    For Conda users, you can create a new Conda environment using conda env create -f environment.yaml.

  • Download the pre-trained Parallel WaveGAN vocoder to ./checkpoints/vocoder (a loading sketch follows this list).
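Once downloaded, the vocoder converts mel-spectrograms back into waveforms. A minimal loading sketch, assuming the kan-bayashi/ParallelWaveGAN package is installed; the checkpoint filename matches the one referenced in the issues below, but treat the exact paths as assumptions:

import torch
from parallel_wavegan.utils import load_model

# load_model reads config.yml from the checkpoint directory by default;
# the path below is an assumption -- point it at the file you downloaded
vocoder = load_model("./checkpoints/vocoder/checkpoint-1000000steps.pkl")
vocoder.remove_weight_norm()
vocoder.eval()

with torch.no_grad():
    mel = torch.randn(100, 80)      # dummy (frames, mel_bins) mel-spectrogram
    wav = vocoder.inference(mel)    # synthesized waveform tensor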

CVC Training and Test

  • Download the VCTK dataset:
cd dataset
wget http://datashare.is.ed.ac.uk/download/DS_10283_2651.zip
unzip DS_10283_2651.zip
unzip VCTK-Corpus.zip
cp -r ./VCTK-Corpus/wav48/p270 ./voice/trainA
cp -r ./VCTK-Corpus/wav48/p256 ./voice/trainB

where p270 and p256 can be replaced by any two speaker folders (for one-to-one conversion, trainA holds the source speaker and trainB the target speaker).

  • Train the CVC model:
python train.py --dataroot ./datasets/voice --name CVC

The checkpoints will be stored at ./checkpoints/CVC/.

  • Test the CVC model:
python test.py --dataroot ./datasets/voice --validation_A_dir ./datasets/voice/trainA --output_A_dir ./checkpoints/CVC/converted_sound

The converted utterance will be saved at ./checkpoints/CVC/converted_sound.
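To sanity-check a converted file, you can load it with the soundfile package (the filename below is hypothetical; use any file the test script wrote):

import soundfile as sf

wav, sr = sf.read("./checkpoints/CVC/converted_sound/converted.wav")  # hypothetical filename
print(f"{len(wav) / sr:.2f} s at {sr} Hz")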

Baseline CycleGAN-VC Training and Test

  • Train the CycleGAN-VC model:
python train.py --dataroot ./datasets/voice --name CycleGAN --model cycle_gan
  • Test the CycleGAN-VC model:
python test.py --dataroot ./datasets/voice --validation_A_dir ./datasets/voice/trainA --output_A_dir ./checkpoints/CycleGAN/converted_sound --model cycle_gan

The converted utterance will be saved at ./checkpoints/CycleGAN/converted_sound.

Pre-trained CVC Model

Pre-trained models for p270-to-p256 and many-to-p249 conversion are available at this URL.

TensorBoard Visualization

To view loss plots, run tensorboard --logdir=./checkpoints and open http://localhost:6006/ in your browser.

Citation

If you use this code for your research, please cite our paper.

@inproceedings{li2021cvc,
  author={Tingle Li and Yichen Liu and Chenxu Hu and Hang Zhao},
  title={{CVC: Contrastive Learning for Non-Parallel Voice Conversion}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1324--1328}
}


Issues

Could you upload the CycleGAN-VC3 baseline's code?

The CycleGAN baseline in CVC is the baseline from CUT. Do you apply any feature extraction to the waveforms, such as MCEP or filterbank features? Could you upload the CycleGAN-VC3 baseline's code? Thanks for your help.

Also, when I ran the CycleGAN in CVC on the VCC2020 database, I got these errors:

"UserWarning: Using a target size (torch.Size([1, 80, 135])) that is different to the input size (torch.Size([1, 80, 136])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.

The size of tensor a (136) must match the size of tensor b (135) at non-singleton dimension 2"

I found that the feature tensors of individual waveforms have different sizes. Did you encounter this kind of error when running on VCTK? Thank you so much.
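A generic workaround for this kind of off-by-one frame mismatch is to trim both tensors to the shorter frame count before computing the loss; this is a sketch, not the repo's official fix:

import torch

def match_length(a, b):
    # trim two (batch, mel_bins, frames) tensors to the shorter frame count
    frames = min(a.size(-1), b.size(-1))
    return a[..., :frames], b[..., :frames]

a, b = match_length(torch.randn(1, 80, 136), torch.randn(1, 80, 135))
print(a.shape, b.shape)   # both torch.Size([1, 80, 135])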

error "there should be a subclass of BaseModel with class name that...."

Hey, thank you for the implementation!
I'm getting an error on line 45 of models/__init__.py when starting CVC training as described in the README:
"In models.cvc_model.py, there should be a subclass of BaseModel with class name that matches cvcmodel in lowercase."
Also, could you please specify the Parallel WaveGAN config you used? The one in the parameters is hard-coded to your folder.
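For context, this error comes from the dynamic model lookup used in CUT-style codebases: models/cvc_model.py must define a BaseModel subclass whose lowercased class name is exactly cvcmodel. A minimal sketch of such a file, inferred from the error text (details may differ from the actual repo):

# models/cvc_model.py -- BaseModel is assumed to live in models/base_model.py,
# as in the CUT codebase this repo is based on
from .base_model import BaseModel

class CVCModel(BaseModel):
    # the loader lowercases the class name and expects it to equal "cvcmodel"
    pass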

Testing error

Hi, I tried to run CVC testing and it showed this error message:

FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/vocoder/checkpoint-1000000steps.pkl'

Is there some setting I missed? I can't find the checkpoint-1000000steps.pkl file after training; there is no .pkl file, just .pth files. Thanks for your help.

Evaluation of CycleGAN-VC3 in your paper

Hi,
Thank you for sharing your work!
I wonder whether you trained the ResNet 9-blocks generator for the CycleGAN-VC3 evaluation in your paper, since when I followed your command instructions for training CycleGAN-VC3, it used the same "netG" as the CVC framework.

About Python and CUDA versions

Hi, may I ask which Python and CUDA versions are required for this project? When I tried to install it, I got:

The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

I'm running an RTX 2080 with CUDA 10.1 on Ubuntu 18.04. Thanks.
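The message means the installed PyTorch wheel was built against a newer CUDA version than the driver supports. A quick way to check what your build expects:

import torch

print("torch:", torch.__version__)            # installed PyTorch version
print("built for CUDA:", torch.version.cuda)  # CUDA version the wheel targets
print("GPU available:", torch.cuda.is_available())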

Many-to-One implementation

Hi, the training code provided seems to be for one-to-one conversion, whereas the associated paper suggests the model should also be capable of many-to-one conversion. Is any change required to train the model for many-to-one conversion on the VCTK dataset?
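One plausible data layout for many-to-one training, assuming the same folder convention as the one-to-one setup above (a hypothetical sketch, not confirmed by the authors), is to place every source speaker in trainA and only the target speaker in trainB:

import shutil
from pathlib import Path

vctk = Path("./VCTK-Corpus/wav48")
# p249 matches the pre-trained many-to-p249 model mentioned above
for spk in vctk.glob("p*"):
    if spk.name != "p249":
        shutil.copytree(spk, Path("./voice/trainA") / spk.name, dirs_exist_ok=True)
shutil.copytree(vctk / "p249", Path("./voice/trainB/p249"), dirs_exist_ok=True)  # Python 3.8+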

About training time

Hi, thanks for implementing this in PyTorch.
In the paper, CVC's training time was 518 minutes (1000 epochs), but when I ran the code, it took an hour per epoch.

I think the size of the dataset is the problem: when I prepared the dataset,
I copied all of the speakers in VCTK to ./voice/trainA and ./voice/trainB.

Is it right to use all speakers in VCTK for training, or should I just sample two speakers A and B?

Thanks!
