thonburian-whisper's Introduction

🤖 Model | 📔 Jupyter Notebook | 🤗 Huggingface Space Demo | 📃 Medium Blog (Thai)

Thonburian Whisper is an Automatic Speech Recognition (ASR) model for Thai, fine-tuned from OpenAI's Whisper model. The model was released as part of Huggingface's Whisper fine-tuning event (December 2022). We fine-tuned Whisper models for Thai on the Common Voice 13, Gowajee corpus, Thai Elderly Speech, and Thai Dialect datasets. Our models are robust to environmental noise and can be adapted through fine-tuning to domain-specific audio, such as the financial and medical domains. We release both the fine-tuned models and distilled models on the Huggingface model hub (see below).

Usage

Use the model with Huggingface's transformers as follows:

import torch
from transformers import pipeline

MODEL_NAME = "biodatlab/whisper-th-medium-combined"  # see alternative model names below
lang = "th"

# Use the first GPU if one is available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)

# Perform ASR with the created pipe.
text = pipe(
    "audio.mp3",
    generate_kwargs={"language": f"<|{lang}|>", "task": "transcribe"},
    batch_size=16,
)["text"]
print(text)

Requirements

Use pip to install the requirements as follows:

!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!sudo apt install ffmpeg
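
Before loading a model, it can help to confirm that the packages and the `ffmpeg` binary listed above are actually visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `check_requirements` and its default lists are illustrative assumptions, not part of this repository.

```python
import importlib.util
import shutil


def check_requirements(
    modules=("torch", "transformers", "librosa"),
    binaries=("ffmpeg",),
):
    """Return a dict mapping each requirement name to True if it was found.

    `modules` are top-level importable Python packages; `binaries` are
    executables looked up on PATH. Both defaults are assumptions based on
    the install commands above.
    """
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_requirements().items():
        print(f"{requirement}: {'ok' if found else 'MISSING'}")
```

Anything reported as missing can be installed with the commands above.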

Model checkpoint and performance

We measure the word error rate (WER) of the model with the deepcut tokenizer after normalizing special tokens (▁ to _ and — to -) and applying simple text post-processing (เเ to แ and ํา to ำ).

| Model | WER (Common Voice 13) | Model URL |
|---|---|---|
| Thonburian Whisper (small) | 13.14 | Link |
| Thonburian Whisper (medium) | 7.42 | Link |
| Thonburian Whisper (large-v2) | 7.69 | Link |
| Thonburian Whisper (large-v3) | 6.59 | Link |

Thonburian Whisper is fine-tuned on a combined dataset of Thai speech that includes Common Voice, Google FLEURS, and curated datasets. The Common Voice test split follows the original split shipped with the dataset.
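
The WER measurement described above can be sketched as follows. This is an illustrative reimplementation, not the evaluation script used for the table: the names `normalize`, `edit_distance`, and `wer` are hypothetical, and the default whitespace tokenizer is only a stand-in so the example runs without extra dependencies. For Thai text, a word tokenizer such as deepcut's `tokenize` function would be passed in instead.

```python
from typing import Callable, List, Sequence

# Replacements described in the text: special tokens first, then Thai vowel fixes.
REPLACEMENTS = [
    ("\u2581", "_"),             # LOWER ONE EIGHTH BLOCK -> underscore
    ("\u2014", "-"),             # EM DASH -> hyphen
    ("\u0e40\u0e40", "\u0e41"),  # two SARA E -> SARA AE
    ("\u0e4d\u0e32", "\u0e33"),  # NIKHAHIT + SARA AA -> SARA AM
]


def normalize(text: str) -> str:
    """Apply the character-level normalization before tokenizing."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text.strip()


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(
    reference: str,
    hypothesis: str,
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: stand-in tokenizer
) -> float:
    """Word error rate = token edit distance / number of reference tokens."""
    ref_tokens = tokenize(normalize(reference))
    hyp_tokens = tokenize(normalize(hypothesis))
    if not ref_tokens:
        raise ValueError("reference contains no tokens")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

With deepcut installed, calling `wer(reference, hypothesis, tokenize=deepcut.tokenize)` would segment Thai text into words before scoring. Multiply the result by 100 to compare it with the percentages in the table.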

Inference time

We benchmarked the average inference time on 1 minute of audio for different model sizes (small, medium, and large) on an NVIDIA A100 with FP32 precision and a batch size of 32. The medium model offers a balanced trade-off between WER and computational cost.

| Model | Memory usage (MB) | Inference time (sec / 1 min) | Number of parameters | Model URL |
|---|---|---|---|---|
| Thonburian Whisper (small) | 7,194 | 4.83 | 242M | Link |
| Thonburian Whisper (medium) | 10,878 | 7.11 | 764M | Link |
| Thonburian Whisper (large) | 18,246 | 9.61 | 1540M | Link |
| Distilled Thonburian Whisper (small) | 4,944 | TBA | 166M | Link |
| Distilled Thonburian Whisper (medium) | 7,084 | TBA | 428M | Link |
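
The exact benchmarking procedure is not given here, so the following is only a sketch of how an average inference time per clip could be measured. The helper `benchmark`, its parameters, and the warm-up step are assumptions; `fn` stands for any zero-argument callable, for example a lambda that runs the pipeline on a 1 minute audio file.

```python
import time
from statistics import mean
from typing import Callable, Dict


def benchmark(
    fn: Callable[[], object],
    n_runs: int = 5,
    n_warmup: int = 1,
    audio_seconds: float = 60.0,
) -> Dict[str, float]:
    """Time repeated calls to `fn` and report the mean wall-clock duration.

    Warm-up calls are executed but not timed, so one-off costs such as
    model loading or kernel compilation do not skew the average.
    """
    for _ in range(n_warmup):
        fn()

    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    avg = mean(timings)
    return {
        "mean_sec": avg,
        # Real-time factor: processing time divided by audio duration.
        "real_time_factor": avg / audio_seconds,
    }
```

For example, `benchmark(lambda: pipe("one_minute.wav", batch_size=32))` would time the pipeline from the Usage section, where `one_minute.wav` is a placeholder file name. On a GPU, peak memory could be read afterwards with `torch.cuda.max_memory_allocated()`.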

Long-form Inference

Thonburian Whisper can be used for long-form audio transcription by combining voice activity detection (VAD), a Thai word tokenizer, and chunking for word-level alignment. We found that this approach is more robust and produces a lower insertion error rate (IER) than using Whisper with timestamps. See README.md in the longform_transcription folder for detailed usage.
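
To illustrate the chunking step only, the sketch below merges speech segments reported by a VAD into chunks no longer than 30 seconds, matching the `chunk_length_s=30` setting used in the Usage section. It is not the implementation in the longform_transcription folder: the function `merge_segments`, the `max_gap_s` parameter, and the greedy merging strategy are assumptions made for this example, and the VAD itself is assumed to have already produced `(start, end)` times in seconds.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def merge_segments(
    segments: Iterable[Segment],
    max_chunk_s: float = 30.0,
    max_gap_s: float = 1.0,
) -> List[Segment]:
    """Greedily merge VAD speech segments into chunks of at most `max_chunk_s`.

    Neighbouring segments are merged when the silence between them is at most
    `max_gap_s` and the merged chunk still fits within `max_chunk_s`.
    """
    chunks: List[Segment] = []
    for start, end in sorted(segments):
        # Split any single segment longer than the limit into fixed windows.
        while end - start > max_chunk_s:
            chunks.append((start, start + max_chunk_s))
            start += max_chunk_s

        if chunks:
            chunk_start, chunk_end = chunks[-1]
            fits = end - chunk_start <= max_chunk_s
            close = start - chunk_end <= max_gap_s
            if fits and close:
                # Extend the current chunk instead of starting a new one.
                chunks[-1] = (chunk_start, max(chunk_end, end))
                continue

        chunks.append((start, end))
    return chunks
```

Each resulting chunk can then be cut from the waveform by converting its start and end times to sample indices and transcribed separately, which keeps silent regions away from the model.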

Developers

Citation

If you use the model, you can cite it with the following BibTeX entry.

@misc {thonburian_whisper_med,
    author       = { Zaw Htet Aung and Thanachot Thavornmongkol and Atirut Boribalburephan and Vittavas Tangsriworakan and Knot Pipatsrisawat and Titipat Achakulvisut },
    title        = { Thonburian Whisper: A fine-tuned Whisper model for Thai automatic speech recognition },
    year         = 2022,
    url          = { https://huggingface.co/biodatlab/whisper-th-medium-combined },
    doi          = { 10.57967/hf/0226 },
    publisher    = { Hugging Face }
}

thonburian-whisper's People

Contributors

titipata, nameatirut, z-zawhtet-a
