Thanks to StyleSpeech; our code is built on top of its implementation (Link).
- The LibriTTS dataset (train-clean-100 and train-clean-360) is used.
- The sampling rate is set to 16000 Hz.
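Since LibriTTS is distributed at 24 kHz, the waveforms presumably need downsampling to 16 kHz at some point (the `prepare_align.py` step below may already handle this). A minimal sketch of one way to do it, using SciPy's polyphase resampler; the function name `to_16k` is ours, not part of this repo:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_16k(wav, orig_sr=24000, target_sr=16000):
    """Polyphase resampling; 24000 -> 16000 reduces to the ratio 2/3."""
    g = gcd(orig_sr, target_sr)
    return resample_poly(wav, target_sr // g, orig_sr // g)

# 1 second of a 440 Hz tone at 24 kHz -> about 16000 samples at 16 kHz
wav24 = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
wav16 = to_16k(wav24)
```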
- This is the implementation of SC-StyleSpeech. For SC-TransferTTS, please refer to the branch `SC-TransferTTS`.
- Clone this repository.
- Install the Python requirements. Please refer to `requirements.txt`.
- Run `python prepare_align.py --data_path [LibriTTS DATAPATH]` for data preparation.
- Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
  1-1. Download MFA following the commands on its website.
  1-2. Run the codes below:

    $ conda activate aligner
    $ mfa model download acoustic english_mfa
    $ mfa align ......LibriTTS/wav16 lexicon.txt english_us_arpa .........LibriTTS/Textgrid
- Run `python preprocess.py`.
  2-0. Check the input and output data paths.
- Run `python train.py`, changing the default settings via `--data_path [Preprocessed LibriTTS DATAPATH]` and `--save_path [Experiment SAVEPATH]`.
- You can change the hyperparameters of SC-CNN (kernel_size, channels) and other model configurations in `configs/config.json`.
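For intuition on what the kernel_size and channels options control: the idea of SC-CNN is to condition the model on the style vector by predicting convolution kernels from it, rather than using fixed learned weights. A minimal depthwise NumPy sketch of that mechanism (all function and parameter names here are hypothetical illustrations, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def sc_conv1d(x, style, w_pred, b_pred, kernel_size=9):
    """Style-conditioned 1D convolution: the depthwise conv kernel is
    predicted from the style vector instead of being a fixed weight.
    x:     (channels, time)  hidden phoneme sequence
    style: (style_dim,)      utterance-level style embedding
    w_pred, b_pred: parameters of the (hypothetical) kernel predictor
    """
    c, t = x.shape
    # Predict one kernel of shape (channels, kernel_size) per utterance
    kernel = (w_pred @ style + b_pred).reshape(c, kernel_size)
    pad = kernel_size // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.empty_like(x)
    for i in range(t):  # same-padded depthwise convolution
        out[:, i] = np.sum(xp[:, i:i + kernel_size] * kernel, axis=1)
    return out

# Example with illustrative sizes
channels, style_dim, kernel_size, T = 8, 16, 9, 20
x = rng.standard_normal((channels, T))
style = rng.standard_normal(style_dim)
w_pred = rng.standard_normal((channels * kernel_size, style_dim)) * 0.1
b_pred = np.zeros(channels * kernel_size)
y = sc_conv1d(x, style, w_pred, b_pred, kernel_size)
```

A larger kernel_size widens the temporal context each style-predicted kernel can shape, at the cost of a bigger predictor output.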
- Mel generation:

    python synthesize.py --checkpoint_path [CKPT PATH] --ref_audio [REF AUDIO PATH] --text [INPUT TEXT]
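Note that `synthesize.py` produces mel-spectrograms, so a separate vocoder is needed to obtain waveforms. As a rough sanity check (not the intended pipeline), Griffin-Lim phase reconstruction can be used; the sketch below works on a linear magnitude spectrogram, so a mel output would first have to be mapped back to the linear-frequency domain (e.g., with the pseudo-inverse of the mel filterbank). All STFT parameters here are illustrative, not this repo's settings:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=1024, noverlap=768):
    """Iteratively estimate a phase for a linear magnitude spectrogram
    and invert it to a waveform. mag: (freq_bins, frames)."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    x = None
    for _ in range(n_iter):
        _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        # Keep the phase estimate the same shape as the target magnitude
        spec = spec[:, :mag.shape[1]]
        if spec.shape[1] < mag.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        angles = np.exp(1j * np.angle(spec))
    return x

# Round-trip sanity check on a pure tone
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
_, _, spec = stft(tone, nperseg=1024, noverlap=768)
wav = griffin_lim(np.abs(spec), n_iter=8)
```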