WhisperX

Made by Max Bain • 🌐 https://www.maxbain.com

Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy using forced alignment.

What is it 🔎

This repository refines the timestamps of OpenAI's Whisper model via forced alignment with phoneme-based ASR models (e.g. wav2vec2.0), and supports multilingual use.

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds.

Phoneme-Based ASR: a suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.

Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation.
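To make the idea concrete, here is a minimal sketch of forced alignment as a Viterbi-style dynamic programme. It is deliberately simplified and is not the code used in this repository: it assumes every frame belongs to exactly one token and ignores the CTC blank symbol that models such as wav2vec2.0 emit. The function name `forced_align` and the toy emission matrix are made up for illustration.

```python
import math


def forced_align(log_probs, tokens):
    """Align a known token sequence to frames.

    log_probs[t][c] is the log-probability of class c at frame t.
    tokens is the transcript as a list of class indices.
    Returns one (first_frame, last_frame) span per token.
    """
    num_frames, num_tokens = len(log_probs), len(tokens)
    if num_tokens == 0 or num_frames < num_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # score[t][j]: best log-prob of a path that puts frame t on token j.
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    # advanced[t][j]: True if the best path entered token j at frame t.
    advanced = [[False] * num_tokens for _ in range(num_frames)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, num_frames):
        for j in range(min(t + 1, num_tokens)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            advanced[t][j] = move > stay
            score[t][j] = max(stay, move) + log_probs[t][tokens[j]]

    # Backtrack from the last frame on the last token.
    spans = [None] * num_tokens
    j, end = num_tokens - 1, num_frames - 1
    for t in range(num_frames - 1, 0, -1):
        if advanced[t][j]:
            spans[j] = (t, end)
            j, end = j - 1, t - 1
    if j != 0:
        raise ValueError("no valid alignment found")
    spans[0] = (0, end)
    return spans


# Toy example: 5 frames, 3 classes; the transcript is class 1 then class 2.
early = [math.log(0.1), math.log(0.8), math.log(0.1)]  # favours class 1
late = [math.log(0.1), math.log(0.1), math.log(0.8)]   # favours class 2
toy_log_probs = [early, early, early, late, late]
toy_spans = forced_align(toy_log_probs, [1, 2])
print(toy_spans)  # [(0, 2), (3, 4)]
```

Multiplying the frame indices by the alignment model's frame duration turns these spans into timestamps in seconds.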

Setup ⚙️

Install this package using

pip install git+https://github.com/m-bain/whisperx.git

You may also need to install ffmpeg, rust etc. Follow the OpenAI setup instructions here: https://github.com/openai/whisper#setup.

Example usage💬

English

Run WhisperX on an example segment (using default parameters):

whisperx examples/sample01.wav

For increased timestamp accuracy, at the cost of higher GPU memory usage, use a bigger alignment model, e.g.

whisperx examples/sample01.wav --model medium.en --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --output_dir examples/whisperx

Result using WhisperX with forced alignment to wav2vec2.0 large:

sample01.mp4

Compare this to the original Whisper out of the box, where many transcriptions are out of sync:

sample_whisper_og.mov

Other Languages

For non-English ASR, it is best to use the large Whisper model.

French

whisperx examples/sample_fr_01.wav --model large --language fr --align_model VOXPOPULI_ASR_BASE_10K_FR --output_dir examples/whisperx

sample_fr_01_vis.mov

German

whisperx examples/sample_de_01.wav --model large --language de --align_model VOXPOPULI_ASR_BASE_10K_DE --output_dir examples/whisperx

sample_de_01_vis.mov

Italian

whisperx examples/sample_it_01.wav --model large --language it --align_model VOXPOPULI_ASR_BASE_10K_IT --output_dir examples/whisperx

sample_it_01_vis.mov

Japanese

whisperx --model large --language ja examples/sample_ja_01.wav --align_model jonatasgrosman/wav2vec2-large-xlsr-53-japanese --output_dir examples/whisperx --align_extend 2

sample_ja_01_vis.mov
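When processing several languages from a script, the per-language commands above can be assembled programmatically. The sketch below is only an illustration: the `ALIGN_MODELS` table and the `build_whisperx_command` helper are hypothetical names, not part of WhisperX, and the flags and model names are copied from the commands shown above.

```python
import shlex

# Alignment models copied from the example commands; the table itself is
# a hypothetical convenience, not something WhisperX provides.
ALIGN_MODELS = {
    "fr": "VOXPOPULI_ASR_BASE_10K_FR",
    "de": "VOXPOPULI_ASR_BASE_10K_DE",
    "it": "VOXPOPULI_ASR_BASE_10K_IT",
    "ja": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese",
}


def build_whisperx_command(audio_path, language, output_dir="examples/whisperx"):
    """Return the whisperx command line for one file as a list of arguments."""
    if language not in ALIGN_MODELS:
        raise KeyError(f"no alignment model listed for language {language!r}")
    command = [
        "whisperx", audio_path,
        "--model", "large",
        "--language", language,
        "--align_model", ALIGN_MODELS[language],
        "--output_dir", output_dir,
    ]
    if language == "ja":
        # The Japanese example above also passes --align_extend 2.
        command += ["--align_extend", "2"]
    return command


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(build_whisperx_command("examples/sample_fr_01.wav", "fr")))
```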

Limitations ⚠️

Not thoroughly tested, especially for non-English audio, so results may vary. Please post an issue to let me know the results on your data.
Whisper normalises spoken numbers, e.g. "fifty seven" becomes the Arabic numerals "57". This normalisation needs to be performed after alignment so that the phonemes can be aligned; currently, numbers are simply ignored.
Assumes the initial Whisper timestamps are accurate to some degree (within a margin of 2 seconds; adjust if needed, but bigger margins are more prone to alignment errors).
This was hacked up quite quickly, so there might be some errors. Please raise an issue if you encounter any.
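The timestamp margin mentioned above can be pictured as widening each Whisper segment before alignment. The following is a rough sketch of that idea, not the repository's implementation: `extend_window` and `to_samples` are hypothetical helpers, the 2-second default mirrors the margin quoted above (presumably what `--align_extend` controls), and the 16 kHz sample rate is an assumption based on the rate Whisper works with.

```python
SAMPLE_RATE = 16_000  # assumption: audio is resampled to 16 kHz


def extend_window(start, end, audio_duration, extend=2.0):
    """Widen a segment by `extend` seconds on each side, clamped to the audio."""
    if not 0.0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")
    new_start = max(0.0, start - extend)
    new_end = min(audio_duration, end + extend)
    return new_start, new_end


def to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a time in seconds to a sample index."""
    return int(round(seconds * sample_rate))


# A segment Whisper placed at 10-12 s in a 60 s recording is searched over 8-14 s.
window = extend_window(10.0, 12.0, audio_duration=60.0)
print(window)  # (8.0, 14.0)
```

A wider window tolerates larger timestamp errors but gives the aligner more audio in which to place each word wrongly, which is the trade-off noted in the list above.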

Coming Soon 🗓

[x] Multilingual init

[x] Subtitle .ass output

[x] Automatic align model selection based on language detection

[ ] Incorporating word-level speaker diarization

[ ] Inference speedup with batch processing

Contact

Contact maxbain[at]robots[dot]ox[dot]ac[dot]uk if using this commercially.

Acknowledgements 🙏

Of course, this is mostly just a modification to OpenAI's Whisper. Credit also goes to this PyTorch tutorial on forced alignment.

Citation

If you use this in your research, just cite the repo,

@misc{bain2022whisperx,
  author = {Bain, Max},
  title = {WhisperX},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/m-bain/whisperX}},
}

as well as the whisper paper,

@article{radford2022robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

and any alignment model used, e.g. wav2vec2.0.

@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}
