
ADAPTING THE ADAPTERS FOR CODE-SWITCHING IN MULTILINGUAL ASR

Improving the performance of Meta AI's MMS on code-switched speech.

Atharva Kulkarni, Ajinkya Kulkarni, Miguel Couceiro, Hanan Aldarmaki

ABSTRACT

Recently, large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition (ASR) to many low-resource languages. Some of these models employ language adapters in their formulation, which helps to improve monolingual performance and avoids some of the drawbacks of multi-lingual modeling on resource-rich languages. However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. We also model code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level. The proposed approaches are evaluated on three code-switched datasets encompassing Arabic, Mandarin, and Hindi languages paired with English, showing consistent improvements in code-switching performance with at least 10% absolute reduction in CER across all test sets.

Brief description of our approaches

We modify the Wav2Vec2 transformer blocks used in MMS to use two pretrained adapter modules, corresponding to the matrix and embedded languages, so that information from both is incorporated. Based on this modification, we propose two code-switching approaches:

*(Figure 1: architectures of the two proposed approaches)*

1) Post Adapter Switching

We add a Post-Adapter-Code-Switcher network (PACS) inside every transformer block, after the two adapter modules (see Figure 1a). The outputs of the adapter modules are concatenated and fed to PACS, which learns to assimilate information from both. The base model and the two pretrained adapter modules are kept frozen during training, so only PACS and the output layer are trainable. PACS follows the same architecture as the adapter modules used in MMS: two feedforward layers with a LayerNorm layer and a linear projection to 16 dimensions with ReLU activation.
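The PACS wiring described above can be sketched in PyTorch as follows. This is an illustrative sketch, not code from this repository: the class and argument names are our own, and the hidden size (1280, the MMS-1B model dimension) and 16-dimensional bottleneck are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class PostAdapterCodeSwitcher(nn.Module):
    """Illustrative PACS sketch: fuse the outputs of the two language adapters."""

    def __init__(self, hidden_size: int = 1280, bottleneck: int = 16):
        super().__init__()
        # The two adapter outputs are concatenated along the feature axis,
        # so the switcher sees 2 * hidden_size features per frame.
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.down = nn.Linear(2 * hidden_size, bottleneck)  # project to 16 dims
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, hidden_size)        # back to model dim

    def forward(self, matrix_out: torch.Tensor, embedded_out: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([matrix_out, embedded_out], dim=-1)
        return self.up(self.act(self.down(self.norm(fused))))

# Example: fuse adapter outputs for a batch of 2 utterances, 50 frames each.
pacs = PostAdapterCodeSwitcher()
x_matrix = torch.randn(2, 50, 1280)    # matrix-language adapter output
x_embedded = torch.randn(2, 50, 1280)  # embedded-language adapter output
fused = pacs(x_matrix, x_embedded)
print(fused.shape)  # torch.Size([2, 50, 1280])
```

During training only a module like this (plus the output layer) would receive gradient updates, with the base model and both adapters frozen.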

2) Transformer Code Switching

We use a transformer network with a sigmoid output activation as a Transformer Code Switcher (TCS). It learns to predict a code-switch sequence O_CS from the output of the Wav2Vec2 Feature Projection block (Figure 1b). The code-switch sequence is a latent binary sequence that identifies code-switching boundaries at the frame level. It regulates the flow of information from the two adapters, enabling the network to handle code-switched speech by dynamically masking out one of the languages per the switching equation:

*(Equation: frame-level adapter switching via O_CS)*

We apply a threshold of 0.5 to the output of the sigmoid activation to create the binarized latent codes O_CS. The base model and adapters are kept frozen; only TCS and the output layer are trained on code-switched data.
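The frame-level switching described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: a single linear layer stands in for the TCS transformer, and all shapes and variable names are assumptions (1280 being the MMS-1B model dimension).

```python
import torch
import torch.nn as nn

hidden_size = 1280

# Stand-in for the TCS: the real model is a transformer over the
# Wav2Vec2 Feature Projection outputs; a linear + sigmoid suffices
# to illustrate the gating mechanism.
switcher = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

features = torch.randn(2, 50, hidden_size)      # feature-projection output
matrix_out = torch.randn(2, 50, hidden_size)    # matrix-language adapter output
embedded_out = torch.randn(2, 50, hidden_size)  # embedded-language adapter output

# Binarize the sigmoid output at 0.5 to obtain the latent code O_CS.
o_cs = (switcher(features) >= 0.5).float()      # shape (2, 50, 1), values in {0, 1}

# Per frame, one adapter is masked out and the other passes through.
output = o_cs * matrix_out + (1.0 - o_cs) * embedded_out
```

Since O_CS is 0 or 1 at every frame, exactly one adapter's output survives per frame, which is the dynamic masking behaviour described above.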

Usage

Installation

Clone this repository

git clone https://github.com/Atharva7K/MMS-Code-Switching

NOTE: This repo includes the entire codebase of Hugging Face transformers. We implement our modifications on top of their codebase. Most of our modified code is in this file.

Install dependencies

We recommend creating a new conda environment, especially if you already have transformers installed: this repo installs a modified version of the transformers library, which can conflict with an existing installation. Create and activate a new environment using

conda create -n mms-code-switching python=3.10.2
conda activate mms-code-switching 

Install modified transformers code

cd transformers/
pip install -e .

Install other dependencies

pip install -r requirements.txt

Download model checkpoints:

| Model | ASCEND (MER / CER) | ESCWA (WER / CER) | MUCS (WER / CER) |
|---|---|---|---|
| **MMS with single language adapter:** | | | |
| English | 98.02 / 87.85 | 92.73 / 71.14 | 101.72 / 74.02 |
| Matrix-language | 71.98 / 66.76 | 75.98 / 46.38 | 58.05 / 49.20 |
| **Proposed models for fine-tuning:** | | | |
| Matrix-language-FT | 45.97 / 44.13 (Download) | 77.47 / 37.69 (Download) | 66.19 / 41.10 (Download) |
| Post Adapter Code Switching | 44.41 / 40.24 (Download) | 75.50 / 46.69 (Download) | 63.32 / 42.66 (Download) |
| Transformer Code Switching | 41.07 / 37.89 (Download) | 74.42 / 35.54 (Download) | 57.95 / 38.26 (Download) |

We also provide MMS checkpoints after fine-tuning the matrix-language adapters on the 3 datasets. NOTE: To run inference with these fine-tuned checkpoints, use the standard MMS implementation from Hugging Face instead of our modified transformers code.

Do inference

Use the main branch for Transformer Code Switching (TCS) and the post-adapter-switching branch for Post Adapter Code Switching (PACS).

Check demo.ipynb for an inference demo.

Output transcripts

We also share the transcripts generated by our proposed systems on the 3 datasets in generated_transcripts/.
