RepCodec: A Speech Representation Codec for Speech Tokenization

Introduction

RepCodec is a speech tokenization method for converting a speech waveform into a sequence of discrete semantic tokens. The main idea is to train a representation codec which learns a vector quantization codebook through reconstructing the input speech representations from speech encoders like HuBERT or data2vec. Extensive experiments show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Also, RepCodec generalizes well across various speech encoders and languages.

RepCodec Models

Feature Type	Speech Data	RepCodec Model
HuBERT base layer 9	Librispeech train-clean-100	hubert_base_l9
HuBERT large layer 18	Librispeech train-clean-100	hubert_large_l18
data2vec base layer 6	Librispeech train-clean-100	data2vec_base_l6
data2vec large layer 18	Librispeech train-clean-100	data2vec_large_l18
Whisper medium layer 24	Librispeech train-clean-100	whisper_medium_l24
Whisper large-v2 layer 32	Librispeech train-clean-100	whisper_large_l32

Speech Tokenization Using Pre-Trained Models

Installation

First, install RepCodec:

git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .

We used Python 3.9.18 and PyTorch 1.12.1 to test the usage, but the code should be compatible with other recent Python and PyTorch versions.

Representation Preparation

We adapt the dump_hubert_feature.py script from fairseq to support dumping representations from data2vec, HuBERT, or Whisper encoders.

If you use our script (examples/dump_feature.py), please also install the following packages:

pip install npy_append_array soundfile

Additionally, if you want to dump representations from

data2vec or HuBERT: please follow fairseq's instructions to install the latest fairseq.
Whisper: please follow Whisper's instructions to install the latest Whisper.

Then, you can follow the given examples to dump representations:

# Example 1: dump from HuBERT base layer 9 
# (for data2vec, simply change "model_type" to data2vec and "ckpt_path" to the path of data2vec model)

layer=9

python3 examples/dump_feature.py \
    --model_type hubert \
    --tsv_path /path/to/tsv/file \
    --ckpt_path /path/to/HuBERT/model  \
    --layer ${layer} \
    --feat_dir /dir/to/save/representations


# Example 2: dump from Whisper medium layer 24

layer=24

python3 examples/dump_feature.py \
    --model_type whisper \
    --tsv_path /path/to/tsv/file \
    --whisper_root /directory/to/save/whisper/model \
    --whisper_name medium \
    --layer ${layer} \
    --feat_dir /dir/to/save/representations

Explanations about the args:

model_type: choose from data2vec, hubert, and whisper.
tsv_path: path of the tsv file. Should have the format of

/dir/to/dataset
path_of_utterance_1 number_of_frames
path_of_utterance_2 number_of_frames

You can use the wav2vec_manifest.py script from fairseq to generate the tsv file.

For example, by running

python wav2vec_manifest.py \
  /dir/to/LibriSpeech/dev-clean \
  --dest /dir/to/manifest \
  --ext flac \
  --valid-percent 0

you can obtain the dev-clean.tsv in /dir/to/manifest for LibriSpeech. (By default, the output file name is train.tsv. Remember to rename the file.)

It should be similar to:

/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac	78720
2277/149896/2277-149896-0005.flac	89600
2277/149896/2277-149896-0033.flac	45520

ckpt_path: required for data2vec and HuBERT. It is the path to the data2vec/HuBERT checkpoint, which you need to download yourself from the data2vec website or the HuBERT website.
whisper_root and whisper_name: BOTH --whisper_root and --whisper_name are required for Whisper. If the corresponding model is not found in --whisper_root, the script will download it for you.
layer: the Transformer encoder layer from which the representations are extracted. It is 1-based. For example, if layer=9, the outputs of the 9th Transformer encoder layer are dumped. Range: [1, number of Transformer encoder layers]
feat_dir: The output representations will be saved to ${feat_dir}/0_1.npy and ${feat_dir}/0_1.len.

For other useful functionalities (e.g., sharding), please check the argument list in examples/dump_feature.py.
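To make the file layout above concrete, here is a minimal sketch that parses the tsv manifest and turns the contents of a .len file into per-utterance row ranges. It assumes, following the fairseq dump convention this script was adapted from, that 0_1.len holds one frame count per line in manifest order and that 0_1.npy stacks all frames along its first axis. The helper names parse_manifest and frame_offsets, and the frame counts in the example, are hypothetical and not part of RepCodec.

```python
from itertools import accumulate


def parse_manifest(text):
    """Return (root_dir, [(relative_path, num_samples), ...]) from tsv manifest text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    root = lines[0].strip()
    entries = []
    for ln in lines[1:]:
        # Each entry is "<relative path>\t<number of samples>".
        rel_path, num_samples = ln.rsplit("\t", 1)
        entries.append((rel_path, int(num_samples)))
    return root, entries


def frame_offsets(len_text):
    """Turn the text of a .len file into (start, end) row ranges, one per utterance."""
    lengths = [int(tok) for tok in len_text.split()]
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Illustrative inputs; the frame counts below are made up for the example.
manifest_text = (
    "/dir/to/LibriSpeech/dev-clean\n"
    "2277/149896/2277-149896-0026.flac\t78720\n"
    "2277/149896/2277-149896-0005.flac\t89600\n"
)
len_text = "245\n279\n"

root, entries = parse_manifest(manifest_text)
offsets = frame_offsets(len_text)

for (rel_path, _), (start, end) in zip(entries, offsets):
    # With NumPy, the features of this utterance would be feats[start:end],
    # where feats = numpy.load("<feat_dir>/0_1.npy", mmap_mode="r").
    print(rel_path, start, end)
```

Keeping the offsets separate from the array means the .npy file can be memory-mapped and sliced per utterance without loading everything at once.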

Command Line Usage

The command expects 0_1.npy and 0_1.len to be in the provided directory /dir/to/representations (the ${feat_dir} from Representation Preparation).

Also, the tsv file should be the same as the one used in Representation Preparation.

repcodec /dir/to/representations \
    --model /path/to/repcodec/model \
    --tsv_path /path/to/tsv/file \
    [--use_gpu] \
    [--out_dir /path/to/output]

This command will tokenize the representations and the output discrete tokens will be saved to ${out_dir}/tokens. The tokens are in the same order as the provided tsv file.

An example of the output file:

/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac	696 696 198 198 198 498 ...
2277/149896/2277-149896-0005.flac	696 696 198 198 198 907 ...
2277/149896/2277-149896-0033.flac	696 696 198 198 198 696 ...

Under examples/tokens, we provide some token files as references. They are obtained from LibriSpeech dev-clean subset using the 6 types of representations and corresponding RepCodec Models. Your results should be very similar to ours.
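For downstream use, the token file can be read back with a few lines of Python. The sketch below assumes the layout shown in the example output: the dataset root on the first line, then one line per utterance with the relative path, a tab, and space-separated token ids. The function name parse_token_file is hypothetical, and the token sequences are shortened for illustration.

```python
def parse_token_file(text):
    """Return (root_dir, {relative_path: [token ids]}) from token file text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    root = lines[0].strip()
    tokens = {}
    for ln in lines[1:]:
        # Split once on the tab that separates the path from the token ids.
        rel_path, ids = ln.split("\t", 1)
        tokens[rel_path] = [int(tok) for tok in ids.split()]
    return root, tokens


# Shortened, illustrative content in the same layout as the example output.
token_text = (
    "/dir/to/LibriSpeech/dev-clean\n"
    "2277/149896/2277-149896-0026.flac\t696 696 198 198 198 498\n"
    "2277/149896/2277-149896-0005.flac\t696 696 198 198 198 907\n"
)

token_root, utt_tokens = parse_token_file(token_text)
for rel_path, ids in utt_tokens.items():
    print(rel_path, len(ids), ids[:3])
```

Because the tokens follow the order of the tsv file, the dictionary keys can be matched directly against the manifest entries.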

Python Usage

import torch
import yaml

from repcodec.RepCodec import RepCodec

# for feature types of HuBERT base & data2vec base, please use repcodec_dim768.yaml;
# for feature types of HuBERT large & data2vec large & Whisper medium, please use repcodec_dim1024.yaml;
# for feature types of Whisper large-v2, please use repcodec_dim1280.yaml
config = "repcodec/configs/repcodec_dim768.yaml"
with open(config) as fp:
    conf = yaml.load(fp, Loader=yaml.FullLoader)

model = RepCodec(**conf)
model.load_state_dict(torch.load("./hubert_base_l9.pkl", map_location="cpu")["model"]["repcodec"])
model.quantizer.initial()
model.eval()

# input shape: (batch size, hidden dim, sequence length)
random_features = torch.randn(size=(1, 768, 100))
with torch.no_grad():
    x = model.encoder(random_features)
    z = model.projector(x)
    _, idx = model.quantizer.codebook.forward_index(z.transpose(2, 1))
    tokens = idx.cpu().data.numpy().tolist()[0]
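The config choice described in the comments above can also be written as a small lookup, which avoids pairing a checkpoint with the wrong hidden dimension. This is only a sketch: the model names come from the RepCodec Models table, the dimensions are read off the config file names, and it assumes all three config files live in repcodec/configs. The helper config_for is hypothetical and not part of the repcodec package.

```python
# Hidden dimension of the input features for each released RepCodec model.
MODEL_DIM = {
    "hubert_base_l9": 768,
    "data2vec_base_l6": 768,
    "hubert_large_l18": 1024,
    "data2vec_large_l18": 1024,
    "whisper_medium_l24": 1024,
    "whisper_large_l32": 1280,
}


def config_for(model_name):
    """Return (config_path, hidden_dim) for a RepCodec model name."""
    if model_name not in MODEL_DIM:
        raise ValueError(f"unknown RepCodec model: {model_name}")
    dim = MODEL_DIM[model_name]
    # Assumed location, matching the dim768 path used in the example above.
    return f"repcodec/configs/repcodec_dim{dim}.yaml", dim


config_path, hidden_dim = config_for("hubert_base_l9")
print(config_path, hidden_dim)
```

The returned dimension is also the size of the second axis of the input tensor, so for whisper_large_l32 the input would have shape (batch size, 1280, sequence length).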

Acknowledgements

Our implementation is based on facebookresearch/AudioDec. We thank them for open-sourcing their code!

Citation

If you find our work useful, please cite the following article.

@misc{huang2023repcodec,
      title={RepCodec: A Speech Representation Codec for Speech Tokenization}, 
      author={Zhichao Huang and Chutong Meng and Tom Ko},
      year={2023},
      eprint={2309.00169},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
