Code Monkey home page Code Monkey logo

pytorch_speaker_embedding_for_diarization's Introduction

Speaker Embedding for Diarization in PyTorch

Installation

Conda environment

To use this conda environment, you need to install Miniconda, then run this command

conda env create -f environment.yml

Install OpenVINO (Optional)

If you would like to use OpenVINO for inference. Please check OpenVINO official Documentation for the installation. This code utilizes OpenVINO 2020.1.023.

Step 0: Download VCTK Dataset

We can easily download VCTK dataset using torchaudio. Please do to use Part 0 - Download VCTK Dataset.ipynb if you are stuck.

Step 1: Training ๐Ÿ’ช

Simply run the code in the Part 1 - Training.ipynb notebook and you are good to go.

Here is the training process if you would want to learn more:

  • Dataset Preparation
    If you would like to use your own dataset, use folder structure as in VCTK:

    vctk_dataset
        txt
            speaker_1
                sentence_1.txt
                sentence_2.txt
            speaker_2
                sentence_1.txt
                sentence_2.txt
        wav48
            speaker_1
                sentence_1.wav
                sentence_2.wav
            speaker_2
                sentence_1.wav
                sentence_2.wav
    
  • Dataset & Dataloader
    I have created VCTKTripletDataset and VCTKTripletDataloader dan would prepare Triplet data to train the speaker embedding. Here is a quick look for you

    dataset = VCTKTripletDataset(wav_path, txt_path, n_data=3000, min_dur=1.5)
    dataloader = VCTKTripletDataloader(dataset, batch_size=32)
    

    In a nutshell, here is what it does:

    • Only audio longer than min_dur seconds is considered as data.
    • Randomly choose 2 speakers, A and B, from the dataset folder.
    • Randomly choose 2 audios from A and 1 from B, mark it as anchor, positive, and negative.
    • Repeat n_data times. Now you have a dataset.
    • The dataset is loaded as minibatch of size batch_size.
  • Architecture & Config
    Simply create the model architecture you would like to use. I have made one sample for you in src/model.py. You can directly use it by simply changing the hyperparameters. Please follow the notebook if you are confused.

    โ— Note: If you would like to use custom architecture in OpenVINO, make sure it is compatible with OpenVINO model optimizer and inference engine.

  • Training Preparation
    Set up the model, criterion, optimizer, and callback here.

  • Training
    As what it says, running the code will train the model

Step 2: Sanity Check ๐Ÿ’Š

While training, you can visualize the embedding to confirm if the model actually learns. Similarly, I have prepared VCTKSpeakerDataset and VCTKSpeakerDataloader for this purpose. Here is a sample visualization using Uniform Manifold Approximation and Projection.

Visualized Diarization

Step 3: Speaker Diarization ๐Ÿค–

I have prepared a PyTorchPredictor and OpenVINOPredictor object for speaker diarization. Here is a sneak peek:

p = PyTorchPredictor(config_path, model_path, max_frame=45, hop=3)
p = OpenVINOPredictor(model_xml, model_bin, config_path, max_frame=45, hop=3)

To use it, simply input the arguments and use .predict(wav_path) and it will return the diarization timestamp and speakers. The timestamps are in seconds.

You can use this sample dataset in Kaggle to test the speaker diarization. For example:

p = PyTorchPredictor("model/weights_best.pth", "model/configs.pth")
timestamps, speakers = p.predict("dataset/test/test_1.wav", plot=True)

setting plot=True provides you with the diarization visualization like this Visualized Diarization

If you would like to use OpenVINO, use .to_onnx(fname, outdir) to convert the model into onnx format.

p.to_onnx("speaker_diarization.onnx", "model/openvino")

End notes โค๏ธ

I hope you like the projects. I made this repo for educational purposes so it might need further tweaking to reach production level. Feel free to create an issue if you need help, and I hope I'll have the time to help you. Thank you.

pytorch_speaker_embedding_for_diarization's People

Contributors

wiradkp avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.