RepCodec: A Speech Representation Codec for Speech Tokenization

Introduction

RepCodec is a speech tokenization method for converting a speech waveform into a sequence of discrete semantic tokens. The main idea is to train a representation codec which learns a vector quantization codebook through reconstructing the input speech representations from speech encoders like HuBERT or data2vec. Extensive experiments show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Also, RepCodec generalizes well across various speech encoders and languages.
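To make the idea of a vector quantization codebook concrete, here is a minimal sketch of the lookup step only: each frame vector is replaced by the index of its nearest codebook entry. The function name `quantize`, the toy codebook, and the use of squared Euclidean distance are illustrative assumptions, not RepCodec's actual implementation, which learns the codebook together with an encoder and decoder by reconstructing the input representations.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        # The token is the index of the closest entry (first one on ties).
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy example: a 3-entry codebook over 2-dimensional "representations".
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
toy_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.2, 0.1]]
toy_tokens = quantize(toy_frames, toy_codebook)
```

Real speech representations have hundreds of dimensions per frame and the codebook is much larger, but the token sequence is produced by the same kind of per-frame index lookup.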

RepCodec Models

Feature Type               Speech Data                    RepCodec Model
HuBERT base layer 9        Librispeech train-clean-100    hubert_base_l9
HuBERT large layer 18      Librispeech train-clean-100    hubert_large_l18
data2vec base layer 6      Librispeech train-clean-100    data2vec_base_l6
data2vec large layer 18    Librispeech train-clean-100    data2vec_large_l18
Whisper medium layer 24    Librispeech train-clean-100    whisper_medium_l24
Whisper large-v2 layer 32  Librispeech train-clean-100    whisper_large_l32

Speech Tokenization Using Pre-Trained Models

Installation

First, install RepCodec:

git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .

We tested with Python 3.9.18 and PyTorch 1.12.1, but the code should be compatible with other recent Python and PyTorch versions.

Representation Preparation

We adapt the dump_hubert_feature.py script from fairseq to support dumping representations from data2vec, HuBERT, or Whisper encoders.

If you use our script (examples/dump_feature.py), please also install the following packages:

pip install npy_append_array soundfile 

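Additionally, if you want to dump representations from data2vec or HuBERT, please install fairseq by following its installation instructions; for Whisper, please install Whisper by following its installation instructions.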

Then, you can follow the given examples to dump representations:

# Example 1: dump from HuBERT base layer 9 
# (for data2vec, simply change "model_type" to data2vec and "ckpt_path" to the path of data2vec model)

layer=9

python3 examples/dump_feature.py \
    --model_type hubert \
    --tsv_path /path/to/tsv/file \
    --ckpt_path /path/to/HuBERT/model  \
    --layer ${layer} \
    --feat_dir /dir/to/save/representations


# Example 2: dump from Whisper medium layer 24

layer=24

python3 examples/dump_feature.py \
    --model_type whisper \
    --tsv_path /path/to/tsv/file \
    --whisper_root /directory/to/save/whisper/model \
    --whisper_name medium \
    --layer ${layer} \
    --feat_dir /dir/to/save/representations
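If you prefer to launch the dump script from Python, the two examples above can be assembled programmatically. The following is a minimal sketch: the helper name `build_dump_command` is hypothetical, it uses only the flags shown in the examples, and its checks mirror the requirements listed in the argument explanations (a checkpoint path for data2vec and HuBERT, both Whisper options for Whisper, a 1-based layer).

```python
from typing import List, Optional


def build_dump_command(model_type: str, tsv_path: str, layer: int, feat_dir: str,
                       ckpt_path: Optional[str] = None,
                       whisper_root: Optional[str] = None,
                       whisper_name: Optional[str] = None) -> List[str]:
    """Assemble the argument list for examples/dump_feature.py."""
    if model_type not in ("data2vec", "hubert", "whisper"):
        raise ValueError(f"unknown model_type: {model_type}")
    if layer < 1:
        raise ValueError("layer is 1-based and must be >= 1")

    cmd = ["python3", "examples/dump_feature.py",
           "--model_type", model_type,
           "--tsv_path", tsv_path,
           "--layer", str(layer),
           "--feat_dir", feat_dir]

    if model_type == "whisper":
        # Whisper needs both the download directory and the model name.
        if not (whisper_root and whisper_name):
            raise ValueError("whisper needs both whisper_root and whisper_name")
        cmd += ["--whisper_root", whisper_root, "--whisper_name", whisper_name]
    else:
        # data2vec and HuBERT need a locally downloaded checkpoint.
        if not ckpt_path:
            raise ValueError(f"{model_type} needs ckpt_path")
        cmd += ["--ckpt_path", ckpt_path]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.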

Explanations about the args:

  • model_type: choose from data2vec, hubert, and whisper.

  • tsv_path: path of the tsv file. Should have the format of

/dir/to/dataset
path_of_utterance_1 number_of_frames
path_of_utterance_2 number_of_frames

You can use the wav2vec_manifest.py script from fairseq to generate the tsv file.

For example, by running

python wav2vec_manifest.py \
  /dir/to/LibriSpeech/dev-clean \
  --dest /dir/to/manifest \
  --ext flac \
  --valid-percent 0

you can obtain the dev-clean.tsv in /dir/to/manifest for LibriSpeech. (By default, the output file name is train.tsv. Remember to rename the file.)

It should be similar to:

/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac	78720
2277/149896/2277-149896-0005.flac	89600
2277/149896/2277-149896-0033.flac	45520
  • ckpt_path: required for data2vec and HuBERT. It is the path to the data2vec/HuBERT model, which you need to download yourself from the data2vec website or the HuBERT website.

  • whisper_root and whisper_name: BOTH --whisper_root and --whisper_name are required for Whisper. If --whisper_root does not contain the corresponding model, the script will download it for you.

  • layer: the Transformer encoder layer from which the representations are extracted. It is 1-based: for example, if layer=9, the outputs of the 9th Transformer encoder layer are dumped. Range: [1, number of Transformer encoder layers].

  • feat_dir: The output representations will be saved to ${feat_dir}/0_1.npy and ${feat_dir}/0_1.len.
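The tsv manifest format described under tsv_path is simple to read back in Python, which can help when checking a manifest before a long dump. This is a minimal sketch: the function name `parse_manifest` is hypothetical, and it assumes the path and frame count are separated by a tab, as in the LibriSpeech sample above.

```python
from typing import List, Tuple


def parse_manifest(text: str) -> Tuple[str, List[Tuple[str, int]]]:
    """Split manifest text into (root_dir, [(relative_path, num_frames), ...])."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first line is the dataset root; all paths below are relative to it.
    root_dir = lines[0].strip()
    entries = []
    for line in lines[1:]:
        # Each entry is "<relative path><TAB><number of frames>".
        rel_path, num_frames = line.rsplit("\t", 1)
        entries.append((rel_path, int(num_frames)))
    return root_dir, entries


example = (
    "/dir/to/LibriSpeech/dev-clean\n"
    "2277/149896/2277-149896-0026.flac\t78720\n"
    "2277/149896/2277-149896-0005.flac\t89600\n"
)
root, utterances = parse_manifest(example)
```

To parse a file on disk, read it with `open(path).read()` and pass the contents to `parse_manifest`.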

For other useful functionalities (e.g., sharding), please check the argument list in examples/dump_feature.py.
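To recover per-utterance features from the dumped files, you need to know where each utterance starts and ends. The sketch below assumes that the .len file holds one frame count per line, in manifest order, and that the .npy file stores all frames concatenated along the first axis, as in fairseq's dump_hubert_feature.py on which the script is based. Please verify this against examples/dump_feature.py before relying on it. The function name `utterance_slices` and the sample counts are hypothetical.

```python
from typing import List, Tuple


def utterance_slices(lengths: List[int]) -> List[Tuple[int, int]]:
    """Turn per-utterance frame counts into (start, end) row ranges."""
    slices = []
    start = 0
    for length in lengths:
        # Rows [start, start + length) belong to this utterance.
        slices.append((start, start + length))
        start += length
    return slices


# Hypothetical contents of a .len file with three utterances.
len_file_text = "245\n279\n142\n"
lengths = [int(line) for line in len_file_text.split()]
ranges = utterance_slices(lengths)
```

Under the same assumption, loading the array with `numpy.load(path, mmap_mode="r")` and indexing it with `feats[start:end]` would return the frames of a single utterance without reading the whole file into memory.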

Command Line Usage

The directory passed to the command (/dir/to/representations below) should contain the 0_1.npy and 0_1.len files that were written to ${feat_dir}.

Also, the tsv file should be the same as the one used in Representation Preparation.

repcodec /dir/to/representations \
    --model /path/to/repcodec/model \
    --tsv_path /path/to/tsv/file \
    [--use_gpu] \
    [--out_dir /path/to/output]

This command will tokenize the representations and the output discrete tokens will be saved to ${out_dir}/tokens. The tokens are in the same order as the provided tsv file.

An example of the output file:

/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac	696 696 198 198 198 498 ...
2277/149896/2277-149896-0005.flac	696 696 198 198 198 907 ...
2277/149896/2277-149896-0033.flac	696 696 198 198 198 696 ...
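A token file in this layout can be loaded into Python for downstream use. The following is a minimal sketch: the function name `parse_token_file` is hypothetical, it assumes a tab between the path and the space-separated token ids as in the example above, and the sample line is shortened to the first six tokens shown there.

```python
from typing import Dict, List, Tuple


def parse_token_file(text: str) -> Tuple[str, Dict[str, List[int]]]:
    """Split token-file text into (root_dir, {relative_path: [token ids]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    # As in the tsv manifest, the first line is the dataset root.
    root_dir = lines[0].strip()
    tokens_by_path = {}
    for line in lines[1:]:
        # Each entry is "<relative path><TAB><space-separated token ids>".
        rel_path, token_str = line.split("\t", 1)
        tokens_by_path[rel_path] = [int(tok) for tok in token_str.split()]
    return root_dir, tokens_by_path


# Shortened sample; real lines contain one token per frame.
sample = (
    "/dir/to/LibriSpeech/dev-clean\n"
    "2277/149896/2277-149896-0026.flac\t696 696 198 198 198 498\n"
)
token_root, tokens = parse_token_file(sample)
```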

Under examples/tokens, we provide some token files as references. They are obtained from LibriSpeech dev-clean subset using the 6 types of representations and corresponding RepCodec Models. Your results should be very similar to ours.
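One simple way to check how close your tokens are to the reference files is a position-wise agreement rate. This metric is an assumption chosen for illustration, since no specific measure of similarity is prescribed here, and the function name `token_agreement` is hypothetical.

```python
from typing import Sequence


def token_agreement(ours: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of positions where two token sequences agree."""
    longest = max(len(ours), len(reference))
    if longest == 0:
        # Two empty sequences are trivially identical.
        return 1.0
    matches = sum(a == b for a, b in zip(ours, reference))
    # Positions missing from the shorter sequence count as mismatches.
    return matches / longest
```

Small differences can arise from hardware or library versions, so a value slightly below 1.0 does not necessarily indicate a mistake.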

Python Usage

import torch
import yaml

from repcodec.RepCodec import RepCodec

# for feature types of HuBERT base & data2vec base, please use repcodec_dim768.yaml;
# for feature types of HuBERT large & data2vec large & Whisper medium, please use repcodec_dim1024.yaml;
# for feature types of Whisper large-v2, please use repcodec_dim1280.yaml
config = "repcodec/configs/repcodec_dim768.yaml"
with open(config) as fp:
    conf = yaml.load(fp, Loader=yaml.FullLoader)

model = RepCodec(**conf)
model.load_state_dict(torch.load("./hubert_base_l9.pkl", map_location="cpu")["model"]["repcodec"])
model.quantizer.initial()
model.eval()

# input shape: (batch size, hidden dim, sequence length)
random_features = torch.randn(size=(1, 768, 100))
with torch.no_grad():
    x = model.encoder(random_features)
    z = model.projector(x)
    _, idx = model.quantizer.codebook.forward_index(z.transpose(2, 1))
    tokens = idx.cpu().data.numpy().tolist()[0]
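The comments in the snippet above pair each feature type with a config file. A small lookup table can make that choice explicit when switching between the released models. This is a minimal sketch: the names `CONFIG_BY_MODEL` and `config_for` are hypothetical, and it assumes the 1024- and 1280-dimensional configs live in the same repcodec/configs directory as repcodec_dim768.yaml.

```python
# Model names come from the RepCodec Models table; the hidden dimension
# of each feature type determines which config file applies.
CONFIG_BY_MODEL = {
    "hubert_base_l9": "repcodec/configs/repcodec_dim768.yaml",
    "data2vec_base_l6": "repcodec/configs/repcodec_dim768.yaml",
    "hubert_large_l18": "repcodec/configs/repcodec_dim1024.yaml",
    "data2vec_large_l18": "repcodec/configs/repcodec_dim1024.yaml",
    "whisper_medium_l24": "repcodec/configs/repcodec_dim1024.yaml",
    "whisper_large_l32": "repcodec/configs/repcodec_dim1280.yaml",
}


def config_for(model_name: str) -> str:
    """Return the config path for a released RepCodec model name."""
    try:
        return CONFIG_BY_MODEL[model_name]
    except KeyError:
        known = ", ".join(sorted(CONFIG_BY_MODEL))
        raise ValueError(f"unknown model {model_name!r}; expected one of: {known}")
```

For example, `config_for("whisper_large_l32")` selects the 1280-dimensional config, in which case the input features must have shape (batch size, 1280, sequence length).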

Acknowledgements

Our implementation is based on facebookresearch/AudioDec. We thank them for open-sourcing their code!

Citation

If you find our work useful, please cite the following article.

@misc{huang2023repcodec,
      title={RepCodec: A Speech Representation Codec for Speech Tokenization}, 
      author={Zhichao Huang and Chutong Meng and Tom Ko},
      year={2023},
      eprint={2309.00169},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
