Speaker Embedding for Diarization in PyTorch
Installation
Conda environment
To use this conda environment, install Miniconda, then run this command:
conda env create -f environment.yml
Install OpenVINO (Optional)
If you would like to use OpenVINO for inference, please check the official OpenVINO documentation for installation instructions. This code uses OpenVINO 2020.1.023.
Step 0: Download VCTK Dataset
We can easily download the VCTK dataset using torchaudio. Please use Part 0 - Download VCTK Dataset.ipynb if you are stuck.
Step 1: Training
Simply run the code in the Part 1 - Training.ipynb notebook and you are good to go.
Here is the training process if you want to learn more:
- Dataset Preparation
If you would like to use your own dataset, use the same folder structure as VCTK:
vctk_dataset
    txt
        speaker_1
            sentence_1.txt
            sentence_2.txt
        speaker_2
            sentence_1.txt
            sentence_2.txt
    wav48
        speaker_1
            sentence_1.wav
            sentence_2.wav
        speaker_2
            sentence_1.wav
            sentence_2.wav
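As a quick check that a custom dataset follows this layout, the sketch below pairs every wav file with its transcript. It is only an illustration: `pair_wav_and_txt` is a hypothetical helper that is not part of this repo, and it assumes the `txt` and `wav48` subfolders described above.

```python
from pathlib import Path


def pair_wav_and_txt(root):
    """Return {speaker: [(wav_path, txt_path), ...]} for a VCTK-style folder.

    Hypothetical helper, not part of this repo. Wav files without a
    matching transcript are skipped.
    """
    root = Path(root)
    pairs = {}
    for speaker_dir in sorted((root / "wav48").iterdir()):
        if not speaker_dir.is_dir():
            continue
        matched = []
        for wav in sorted(speaker_dir.glob("*.wav")):
            # The transcript lives under txt/<speaker>/ with the same stem.
            txt = root / "txt" / speaker_dir.name / (wav.stem + ".txt")
            if txt.exists():
                matched.append((wav, txt))
        pairs[speaker_dir.name] = matched
    return pairs
```

A speaker that ends up with an empty list most likely has misnamed or missing transcripts.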
- Dataset & Dataloader
I have created VCTKTripletDataset and VCTKTripletDataloader, which prepare triplet data to train the speaker embedding. Here is a quick look:
dataset = VCTKTripletDataset(wav_path, txt_path, n_data=3000, min_dur=1.5)
dataloader = VCTKTripletDataloader(dataset, batch_size=32)
In a nutshell, here is what it does:
- Only audio longer than min_dur seconds is considered as data.
- Randomly choose 2 speakers, A and B, from the dataset folder.
- Randomly choose 2 audio clips from A and 1 from B, and mark them as anchor, positive, and negative.
- Repeat n_data times. Now you have a dataset.
- The dataset is loaded in minibatches of size batch_size.
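The sampling steps above can be sketched in plain Python. This is an illustrative stand-in, not the actual VCTKTripletDataset code: `sample_triplets` and its `clips_by_speaker` input (speaker name mapped to a list of clip name and duration pairs) are hypothetical, and it assumes the anchor and positive are two different clips.

```python
import random


def sample_triplets(clips_by_speaker, n_data, min_dur, seed=None):
    """Sample n_data (anchor, positive, negative) triplets of clip names.

    clips_by_speaker maps speaker -> list of (clip_name, duration_seconds).
    Hypothetical sketch; the real dataset class may differ in details.
    """
    rng = random.Random(seed)
    # Keep only clips longer than min_dur; drop speakers left with nothing.
    usable = {}
    for speaker, clips in clips_by_speaker.items():
        long_enough = [name for name, dur in clips if dur > min_dur]
        if long_enough:
            usable[speaker] = long_enough
    # Speaker A must supply two distinct clips (anchor and positive).
    anchor_speakers = [s for s, c in usable.items() if len(c) >= 2]
    if not anchor_speakers or len(usable) < 2:
        raise ValueError("need at least 2 speakers, one with 2+ usable clips")

    triplets = []
    for _ in range(n_data):
        a = rng.choice(anchor_speakers)
        b = rng.choice([s for s in usable if s != a])
        anchor, positive = rng.sample(usable[a], 2)
        negative = rng.choice(usable[b])
        triplets.append((anchor, positive, negative))
    return triplets
```

Passing a fixed `seed` makes the sampled dataset reproducible between runs, which helps when comparing hyperparameters.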
- Architecture & Config
Simply create the model architecture you would like to use. I have made a sample for you in src/model.py. You can use it directly by changing the hyperparameters. Please follow the notebook if you are confused.
Note: If you would like to use a custom architecture with OpenVINO, make sure it is compatible with the OpenVINO Model Optimizer and Inference Engine.
- Training Preparation
Set up the model, criterion, optimizer, and callback here.
- Training
As the name says, running the code will train the model.
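This README does not name the criterion, but triplet data is commonly trained with a triplet margin loss. The sketch below computes that loss in plain Python for a single triplet. The margin value and the Euclidean distance are assumptions for illustration, and the criterion used in the notebook may differ.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


# Negative is far from the anchor: the triplet is already satisfied.
easy = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Negative is closer than the positive: the loss is positive.
hard = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In PyTorch, the same criterion is available as torch.nn.TripletMarginLoss.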
Step 2: Sanity Check
While training, you can visualize the embedding to confirm that the model is actually learning. Similarly, I have prepared VCTKSpeakerDataset and VCTKSpeakerDataloader for this purpose. Here is a sample visualization using Uniform Manifold Approximation and Projection (UMAP).
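As a numeric complement to the plot, one could compare the average distance between embeddings of the same speaker against the average distance between different speakers. The sketch below is an assumption about how such a check might look, not code from this repo: `intra_inter_distance` is a hypothetical helper, and cosine distance is an arbitrary choice.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def intra_inter_distance(embeddings, speakers):
    """Return (mean same-speaker distance, mean different-speaker distance).

    Assumes at least one same-speaker pair and one different-speaker pair.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = cosine_distance(embeddings[i], embeddings[j])
        if speakers[i] == speakers[j]:
            intra.append(d)
        else:
            inter.append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

If the model is learning, the first value should shrink relative to the second as training progresses.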
Step 3: Speaker Diarization
I have prepared PyTorchPredictor and OpenVINOPredictor objects for speaker diarization. Here is a sneak peek:
p = PyTorchPredictor(config_path, model_path, max_frame=45, hop=3)
p = OpenVINOPredictor(model_xml, model_bin, config_path, max_frame=45, hop=3)
To use it, simply pass in the arguments and call .predict(wav_path); it will return the diarization timestamps and speakers. The timestamps are in seconds.
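This README does not spell out what max_frame and hop mean. One plausible reading is a sliding window of max_frame frames that advances hop frames at a time, with each window embedded and assigned a speaker label. Under that assumption, the sketch below shows how per-window labels could be merged into timestamps in seconds. Both function names and the `frame_sec` parameter (seconds per frame) are hypothetical, and the predictors in this repo may work differently.

```python
def window_starts(n_frames, max_frame=45, hop=3):
    """Start indices of windows of max_frame frames, stepping by hop."""
    if n_frames < max_frame:
        return []
    return list(range(0, n_frames - max_frame + 1, hop))


def merge_labels(starts, labels, hop=3, frame_sec=0.01):
    """Merge runs of identical window labels into (start_sec, end_sec, speaker).

    Each window is credited with the hop frames beginning at its start,
    so consecutive windows tile the timeline without overlap.
    """
    segments = []
    for start, label in zip(starts, labels):
        begin, end = start * frame_sec, (start + hop) * frame_sec
        if segments and segments[-1][2] == label:
            # Same speaker as the previous window: extend that segment.
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((begin, end, label))
    return segments
```

For example, `merge_labels(window_starts(60), [0, 0, 0, 1, 1, 0])` groups the six windows into three segments, one per change of speaker.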
You can use this sample dataset on Kaggle to test the speaker diarization. For example:
p = PyTorchPredictor("model/configs.pth", "model/weights_best.pth")
timestamps, speakers = p.predict("dataset/test/test_1.wav", plot=True)
Setting plot=True also gives you a visualization of the diarization result.
If you would like to use OpenVINO, use .to_onnx(fname, outdir) to convert the model into ONNX format.
p.to_onnx("speaker_diarization.onnx", "model/openvino")
End notes
I hope you like the project. I made this repo for educational purposes, so it might need further tweaking to reach production level. Feel free to create an issue if you need help, and I hope I'll have the time to help you. Thank you.