Code Monkey home page Code Monkey logo

fastersvc's Introduction

FasterSVC: Fast voice conversion based distillated models and kNN method

(This repository is in the experimental stage. The content may change without notice.)

Other languages

Model architecture

Architecture The structure of the decoder is designed with reference to FastSVC, StreamVC, Hifi-GAN, etc. Low latency is achieved by using a "causal" convolution layer that does not refer to future information.

Features

  • Realtime conversion
  • Low latency (approximately 0.2 seconds, subject to change based on the environment and optimizations)
  • Stable phase and pitch (based on the source-filter model)
  • Speaker style conversion using k-nearest neighbors (kNN) method

Requirements

  • Python 3.10 or later
  • PyTorch 2.0 or later with GPU environment
  • When training from scratch, prepare a large amount of human speech data (e.g., LJ Speech, JVS Corpus)

Installation

  1. clone this repository.
git clone https://github.com/uthree/fastersvc.git
  1. install requirements
pip3 install -r requirements.txt

Download pretrained model

The model pretrained with the JVS corpus is published here.

Pre-training

Train a model for basic voice conversion. At this stage, the model is not specialized for a specific speaker, but having a model that can perform basic voice synthesis allows for easy adaptation to a specific speaker with minimal adjustments.

Here are the steps:

  1. Preprocess.
python3 preprocess.py <Dataset directory>
  1. Train pitch estimator. Distill pitch estimation using a fast and parallelizable 1D CNN with the harvest algorithm from WORLD.
python3 train_pe.py
  1. Train content encoder Distill HuBERT-base.
python3 train_ce.py
  1. Train decoder The goal of the decoder is to reconstruct the original waveform from pitch and content.

sh

python3 train_dec.py

Fine-tuning

By adjusting the pre-trained model to a model specialized for conversion to a specific speaker, it is possible to create a more accurate model. This process takes much less time than pre-learning.

  1. Combine only the audio files of a specific speaker into one folder, and preprocess.
python3 preorpcess.py <Dataset directory>
  1. Fine tune the decoder.
python3 train_dec.py
  1. Create a dictionary for vector search. This eliminates the need to encode audio files each time.
python3 extract_index.py -o <Dictionary output destination (optional)>
  1. When inferring, you can load arbitrary dictionary data by adding the -idx <dictionary file> option.

Training Options

  • add -fp16 True to accelerate training with float16 if you have RTX series GPU.
  • add -b <number> to set batch size. default is 16.
  • add -e <number> to set epoch. default is 60.
  • add -d <device name> to set training device, default is cuda.

Inference

  1. Create an directory inputs
  2. Put audio files in inputs
  3. Run inference script
python3 infer.py -t <target audio file>

Additional options

  • You can set the transparency of the original audio information with -a <number from 0.0 to 1.0>.
  • You can normalize the volume with --normalize True.
  • You can change the calculation device with =d <device name>. Although it may not make much sense since it is originally high speed.
  • Pitch shift can be performed with -p <scale>. Useful for voice conversion between men and women.

Realtime Inference with PyAudio (This is a feature in the testing stage)

  1. Confirm the ID of the audio device
python3 audio_device_list.py
  1. Run inference
python3 infer_streaming.py -i <input device id> -o <output device id> -l <loopback device id> -t <target audio file>

(The loopback option is optional.)

References

This document is translated from Japanese using ChatGPT.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.