IndicWav2Vec2

All data preparation scripts are in the data_prep_scripts directory.

The data preparation pipeline consists of the following steps:

Download the data
Pass the data through the VAD pipeline (using vad.py)
Pass the resulting data through the SNR filtering pipeline (using snr_filter.py)
Chunk the data (using chunking.py)

All four steps are handled automatically by process_data.sh.

Click here for more extended documentation on how to execute these individual steps.

Manifest Creation

To create a language-wise pretraining manifest:

$ python path/to/lang_wise_manifest_creation.py /path/to/wav/files --dest /manifest/path --ext $ext --valid-percent $valid

For /path/to/wav/files/, we expect one folder per language directly under the parent directory.
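
As a quick sanity check before building manifests, the expected layout can be inspected with a short script. This is a minimal sketch and not part of the repository: the function name `count_audio_per_language` and the default `wav` extension are assumptions made for illustration.

```python
from pathlib import Path


def count_audio_per_language(root, ext="wav"):
    """Return {language_folder_name: number of *.ext files found beneath it}."""
    root = Path(root)
    counts = {}
    # Only immediate subdirectories are treated as language folders.
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Search recursively, since a language folder may contain subfolders.
        counts[lang_dir.name] = sum(1 for _ in lang_dir.rglob(f"*.{ext}"))
    return counts
```

Calling `count_audio_per_language("/path/to/wav/files")` returns one entry per language folder, which makes an empty or misplaced folder easy to spot.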

In our pretraining, we use a --valid-percent of 0.03.
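
To make the effect of `--valid-percent` concrete, here is a minimal sketch of a random per-file split. It assumes the manifest script behaves like fairseq's `wav2vec_manifest.py`, where each file is sent to the validation set with probability `valid_percent`; the function name and the seed are hypothetical.

```python
import random


def split_train_valid(paths, valid_percent=0.03, seed=42):
    """Assign each path to valid with probability valid_percent, else to train."""
    rng = random.Random(seed)
    train, valid = [], []
    for path in paths:
        # rng.random() is uniform in [0, 1), so this sends roughly
        # valid_percent of the files to the validation list.
        (train if rng.random() > valid_percent else valid).append(path)
    return train, valid
```

Because the split is probabilistic, the validation share is only approximately 3% of the files, not exactly 3%.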

To create a combined validation file for all languages, we concatenate the individual *_valid.tsv files into a single valid.tsv file.

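import pandas as pd
import glob

# Collect the per-language validation manifests.
filenames = glob.glob("*_valid.tsv")

combined = []
for f in filenames:
    # Skip the first line of each manifest (the root directory path).
    df = pd.read_csv(f, skiprows=1, names=['f', 'd'], sep='\t')
    combined.append(df)

# Stack all rows into one manifest and write it without a header row.
df_combined = pd.concat(combined, axis=0, ignore_index=True)
df_combined.to_csv('valid.tsv', index=True, header=False, sep='\t')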

We then add /path/to/wav/files/ as the first line of the valid.tsv file.
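
Adding the root path can also be scripted. The following is a minimal sketch, assuming the root path is the same one that was passed to the manifest script; the helper name `prepend_root` is hypothetical.

```python
from pathlib import Path


def prepend_root(tsv_path, root_dir):
    """Insert root_dir as the first line of the manifest at tsv_path."""
    tsv_path = Path(tsv_path)
    body = tsv_path.read_text(encoding="utf-8")
    # Rewrite the file with the root directory on its own leading line.
    tsv_path.write_text(f"{root_dir}\n{body}", encoding="utf-8")
```

For example, `prepend_root("valid.tsv", "/path/to/wav/files/")`. The helper does not check whether the line is already present, so it should be run only once per file.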

Pretraining

We pretrain the model with multi-node training and schedule the runs with Slurm.

The following is the invocation script for training IndicWav2Vec base, starting from the Wav2Vec2.0 English base checkpoint:

$ sbatch --job-name <NAME> --gres gpu:<N_GPU_PER_NODE> --cpus-per-task <N_CPUS> \
    --nodes <N_NODES> --ntasks-per-node <N_TASKS> \
    --wrap "srun --output train.log.node%t --error train.stderr.node%t.%j \
        $(which fairseq-hydra-train) \
        task.data=/path/to/manifest/directory \
        common.wandb_project=<wandb project name> \
        task._name=temp_sampled_audio_pretraining \
        +task.sampling_alpha=0.7 \
        common.log_interval=200 \
        common.log_format=tqdm \
        dataset.max_tokens=3000000 \
        common.user_dir=/path/to/custom_task/directory \
        checkpoint.save_dir=/path/to/save/model/checkpoints \
        checkpoint.restore_file=/path/to/wav2vec2-english-base/checkpoint.pt \
        +optimization.update_freq='[2]' \
        optimization.clip_norm=0.5 \
        checkpoint.reset_optimizer=true \
        distributed_training.distributed_world_size=<total GPUs> \
        distributed_training.distributed_port=$PORT \
        --config-dir /path/to/configs/directory \
        --config-name wav2vec2_base_librispeech"

For the Large model, we override the above configuration with:

checkpoint.restore_file=/path/to/wav2vec2-english-large/checkpoint.pt \
+optimization.update_freq='[6]' \
lr_scheduler.warmup_updates=0 \
--config-name wav2vec2_large_librivox"

Configs for both models are provided in the configs directory.
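
The `+task.sampling_alpha=0.7` override in the pretraining command controls how often each language is sampled. As a rough illustration, assuming the custom task uses the temperature-based sampling common in multilingual pretraining, where a language with n_l hours out of N total is drawn with probability proportional to (n_l / N)^alpha, the probabilities can be computed as below. The hour counts are made-up numbers, and the function is a sketch, not the repository's implementation.

```python
def sampling_probs(hours_per_lang, alpha=0.7):
    """Return {lang: p} with p proportional to (n_l / N) ** alpha."""
    total = sum(hours_per_lang.values())
    weights = {lang: (n / total) ** alpha for lang, n in hours_per_lang.items()}
    norm = sum(weights.values())
    # Normalize so the probabilities sum to 1.
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical data sizes (hours of audio per language).
probs = sampling_probs({"hi": 1000.0, "ta": 100.0, "or": 10.0})
```

Under this formulation, alpha below 1 gives low-resource languages a larger share than their share of the data, while alpha equal to 1 reproduces the natural data distribution.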

Fine-tuning process

Data Processing and Manifest Creation

The scripts for data processing and manifest creation can be found here.

Fine-tune

The following is the invocation script for fine-tuning IndicWav2Vec large on a particular language:

$ sbatch --job-name <NAME> --gres gpu:<N_GPU_PER_NODE> --cpus-per-task <N_CPUS> \
    --nodes <N_NODES> --ntasks-per-node <N_TASKS> \
    --wrap "srun --output finetune.log.node%t --error finetune.stderr.node%t.%j \
        $(which fairseq-hydra-train) \
        task.data=/path/to/finetune/manifest/directory/for/a/particular/language \
        common.wandb_project=<wandb project name> \
        model.w2v_path=/path/to/pretrained/model_large.pt \
        common.log_interval=50 \
        common.log_format=tqdm \
        dataset.max_tokens=1000000 \
        checkpoint.save_dir=/path/to/save/model/fine_tune_checkpoints \
        +optimization.update_freq='[1]' \
        distributed_training.distributed_world_size=<total GPUs> \
        --config-dir /path/to/configs/directory \
        --config-name ai4b_xlsr"

For the IndicWav2Vec Base model, we override the above configuration with:

model.w2v_path=/path/to/pretrained/model_base.pt \
--config-name ai4b_base"

Configs for both models are provided in the finetune_configs directory.
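
The `dataset.max_tokens`, `optimization.update_freq`, and `distributed_world_size` settings together determine how much audio goes into each optimizer step. The sketch below works through that arithmetic, assuming that `max_tokens` counts raw audio samples per GPU batch and that the audio is sampled at 16 kHz. The world size of 8 is a hypothetical stand-in for `<total GPUs>`, and the result is an upper bound, since batches are not always full.

```python
def audio_seconds_per_update(max_tokens, update_freq, world_size, sample_rate=16000):
    """Upper bound on seconds of audio consumed per optimizer step."""
    seconds_per_gpu_batch = max_tokens / sample_rate
    # Gradients are accumulated over update_freq batches on each of world_size GPUs.
    return seconds_per_gpu_batch * update_freq * world_size


# world_size=8 is a hypothetical value standing in for <total GPUs>.
finetune = audio_seconds_per_update(1_000_000, update_freq=1, world_size=8)
pretrain_base = audio_seconds_per_update(3_000_000, update_freq=2, world_size=8)
```

With these assumed values, the fine-tuning settings give at most 500 seconds of audio per update and the base pretraining settings give at most 3000 seconds. Changing the number of GPUs without adjusting `update_freq` therefore changes the effective batch size.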

Training Language Model

Scripts for installation, data preparation, and language model training are in the lm_training folder.

Inference/Evaluation

Evaluation scripts with complete documentation are in the w2v_inference folder.

Link to arXiv

The link to arXiv can be found here

Cite

Please cite as:

@inproceedings{javed2021building,
    title = {Towards Building ASR Systems for the Next Billion Users},
    author = {Tahir Javed and Sumanth Doddapaneni and Abhigyan Raman and Kaushal Santosh Bhogale and Gowtham Ramesh and Anoop Kunchukuttan and Pratyush Kumar and Mitesh M. Khapra},
    booktitle = "Proceedings of the AAAI Conference on Artificial Intelligence",
    year = "2022 (to appear)",
}

License

IndicWav2Vec is MIT-licensed. The license applies to all the pretrained, fine-tuned, and language models.
