HuBERT

Training and inference scripts for the HuBERT content encoders from the paper "A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion". For more details see soft-vc. Audio samples can be found here. Colab demo can be found here.

Fig 1: Architecture of the voice conversion system. a) The discrete content encoder clusters audio features to produce a sequence of discrete speech units. b) The soft content encoder is trained to predict the discrete units. The acoustic model transforms the discrete/soft speech units into a target spectrogram. The vocoder converts the spectrogram into an audio waveform.

Example Usage

Programmatic Usage

import torch, torchaudio

# Load checkpoint (either hubert_soft or hubert_discrete)
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo=True).cuda()

# Load audio
wav, sr = torchaudio.load("path/to/wav")
assert sr == 16000
wav = wav.unsqueeze(0).cuda()

# Extract speech units
units = hubert.units(wav)

Script-Based Usage

usage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir

Encode an audio dataset.

positional arguments:
  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

optional arguments:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .flac).

Training

Step 1: Dataset Preparation

Download and extract the LibriSpeech corpus. The training script expects the following tree structure for the dataset directory:

│   lengths.json
│
└───wavs
    ├───dev-*
    │   ├───84
    │   ├───...
    │   └───8842
    └───train-*
        ├───19
        ├───...
        └───8975

The train-* and dev-* directories should contain the training and validation splits respectively. Note that there can be multiple train and dev folders e.g., train-clean-100, train-other-500, etc. Finally, the lengths.json file should contain key-value pairs with the file path and number of samples:

{
    "dev-clean/1272/128104/1272-128104-0000": 93680,
    "dev-clean/1272/128104/1272-128104-0001": 77040
}
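The repository does not describe how to produce lengths.json, so the following is only a minimal sketch of one way to build it. It assumes the keys are paths relative to the wavs directory with the file extension removed (as in the example above). The helper names `count_samples` and `build_lengths` are hypothetical, and reading sample counts with the third-party `soundfile` package is an assumption rather than something this project prescribes.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def count_samples(path: Path) -> int:
    """Return the number of samples (frames) in an audio file.

    Assumption: relies on the third-party `soundfile` package, which this
    README does not mention. Any function with the same signature works.
    """
    import soundfile as sf  # imported lazily so the rest runs without it

    return sf.info(str(path)).frames


def build_lengths(
    wavs_dir: Path,
    extension: str = ".flac",
    num_samples: Callable[[Path], int] = count_samples,
) -> Dict[str, int]:
    """Map each audio file under wavs_dir to its number of samples.

    Keys are POSIX-style paths relative to wavs_dir without the extension,
    e.g. "dev-clean/1272/128104/1272-128104-0000".
    """
    lengths = {}
    for path in sorted(wavs_dir.rglob(f"*{extension}")):
        key = path.relative_to(wavs_dir).with_suffix("").as_posix()
        lengths[key] = num_samples(path)
    return lengths


def write_lengths(dataset_dir: Path, extension: str = ".flac") -> None:
    """Write lengths.json next to the wavs directory inside dataset_dir."""
    lengths = build_lengths(dataset_dir / "wavs", extension)
    with open(dataset_dir / "lengths.json", "w") as file:
        json.dump(lengths, file, indent=4)
```

Calling `write_lengths(Path("path/to/LibriSpeech"))` would then place lengths.json at the top of the dataset directory, matching the tree shown above.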

Step 2: Extract Discrete Speech Units

Encode LibriSpeech using the HuBERT-Discrete model and encode.py script:

usage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir

Encode an audio dataset.

positional arguments:
  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

optional arguments:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .flac).

For example:

python encode.py discrete path/to/LibriSpeech/wavs path/to/LibriSpeech/discrete

At this point the directory tree should look like:

│   lengths.json
│
├───discrete
│   ├───...
└───wavs
    ├───...

Step 3: Train the HuBERT-Soft Content Encoder

usage: train.py [-h] [--resume RESUME] [--warmstart] [--mask] [--alpha ALPHA] dataset-dir checkpoint-dir

Train HuBERT soft content encoder.

positional arguments:
  dataset-dir      path to the data directory.
  checkpoint-dir   path to the checkpoint directory.

optional arguments:
  -h, --help       show this help message and exit
  --resume RESUME  path to the checkpoint to resume from.
  --warmstart      whether to initialize from the fairseq HuBERT checkpoint.
  --mask           whether to use input masking.
  --alpha ALPHA    weight for the masked loss.

Citation

If you found this work helpful, please consider citing our paper:

@inproceedings{
    soft-vc-2022,
    author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
    booktitle={ICASSP}, 
    title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion}, 
    year={2022}
}
