This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time.
- Both Windows and Linux are supported. A GPU is recommended for training and for inference speed, but is not mandatory.
- Python 3.7 is recommended. Python 3.5 or greater should work, but you'll probably have to tweak the dependencies' versions. I recommend setting up a virtual environment using venv, but this is optional.
- Install ffmpeg. This is necessary for reading audio files.
- Install PyTorch. Pick the latest stable version, your operating system, your package manager (pip by default) and finally pick any of the proposed CUDA versions if you have a GPU, otherwise pick CPU. Run the given command.
- Install the remaining requirements with
pip install -r requirements.txt
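
Before running anything, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch using only the Python standard library; the function name `check_environment` and the keys it returns are hypothetical and not part of this repository. It assumes ffmpeg is expected on your `PATH` and that PyTorch is installed under its usual package name `torch`.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 5)):
    """Return a dict saying which prerequisites appear to be satisfied.

    Hypothetical helper for illustration; not part of the repository.
    """
    return {
        # The setup notes say Python 3.5 or greater should work.
        "python_ok": sys.version_info[:2] >= min_python,
        # ffmpeg is needed for reading audio files; look for it on PATH.
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
        # find_spec locates the package without actually importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print("{}: {}".format(name, "yes" if ok else "NO"))
```

A "NO" next to an item only means this simple check could not find it; for example, ffmpeg may be installed somewhere that is not on `PATH`.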
Run the wrapper code with:
python3 wrapper_code.py
The wrapper code is built on top of the framework described above. It takes an audio file as input, clones the voice in that file using the framework, and then uses the cloned voice to generate an audio file for the given text.
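
The flow described above can be sketched as three stages, following the SV2TTS design: a speaker encoder turns the reference audio into an embedding, a synthesizer turns text plus that embedding into a spectrogram, and a vocoder turns the spectrogram into a waveform. This is only an illustration of the data flow, not the contents of `wrapper_code.py`: the function `clone_and_speak` and its three callable parameters are hypothetical placeholders for whatever the framework actually provides.

```python
def clone_and_speak(reference_audio, text, embed_speaker, synthesize_mel, vocode):
    """Chain the three SV2TTS stages. All names here are hypothetical.

    embed_speaker(audio)        -> speaker embedding
    synthesize_mel(text, embed) -> mel spectrogram
    vocode(mel)                 -> waveform
    """
    # Stage 1: derive a fixed-size voice representation from the audio file.
    embedding = embed_speaker(reference_audio)
    # Stage 2: generate a spectrogram for the text, conditioned on that voice.
    mel = synthesize_mel(text, embedding)
    # Stage 3: convert the spectrogram into audio samples.
    return vocode(mel)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without the real models.
    waveform = clone_and_speak(
        "reference.wav",
        "Hello there",
        embed_speaker=lambda audio: [0.0] * 4,
        synthesize_mel=lambda text, emb: [[len(text)] * len(emb)],
        vocode=lambda mel: [float(x) for row in mel for x in row],
    )
    print(len(waveform), "samples from the stand-in pipeline")
```

Passing the stages in as arguments keeps the sketch independent of any particular model-loading code; a real wrapper would load the framework's trained models once and reuse them across calls.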