
DCMA: Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation

The code for the EMNLP 2022 main conference paper Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation, which aims to train an end-to-end speech translation model in a zero-shot fashion (only ASR and MT data are available).

Training a Model on MuST-C

We take training an En-De model as an example.

Installation

The following environment is required:

  • Python == 3.7
  • torch == 1.8, torchaudio == 0.8.0, CUDA == 10.1
  • Python libraries:
pip install pandas sentencepiece editdistance PyYAML tqdm soundfile
  • fairseq == 1.0.0a0+741fd13

Download the corresponding fairseq and install it:

cd fairseq
pip install --editable ./
cd ..

NOTE: fairseq == 1.0.0a0 is not a stable release, and our code is not compatible with current fairseq releases. Please install the version we provide, or you will need to modify the model code.
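As a quick sanity check (our suggestion, not part of the official setup), you can confirm that the installed versions match the requirements above:

python -c "import torch, torchaudio, fairseq; print(torch.__version__, torchaudio.__version__, fairseq.__version__)"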

Data Preparation

  1. Set configuration

Please set the global variables WMT_DATA_ROOT, SPEECH_DATA_ROOT and SAVE_ROOT, which specify where to store the WMT datasets, the MuST-C dataset and checkpoints, respectively. Set the global variable target to specify the translation direction. For example:

export WMT_DATA_ROOT=~/WMT
export SPEECH_DATA_ROOT=~/MUSTC
export SAVE_ROOT=~/checkpoints
export target=de
mkdir -p $WMT_DATA_ROOT $SPEECH_DATA_ROOT $SAVE_ROOT
  2. Download and uncompress the En-De MuST-C dataset to $SPEECH_DATA_ROOT/en-$target.

  3. Download the WMT data to $WMT_DATA_ROOT/orig via:

bash egs/prepare_data/download-wmt.sh --wmt14 --data-dir $WMT_DATA_ROOT --target $target
  4. Prepare the MuST-C datasets and produce a joint spm dictionary:
bash egs/prepare_data/prepare-mustc-en2any.sh \
    --speech-data-root $SPEECH_DATA_ROOT --subword unigram --subword-tokens 10000

After this step, the directory $SPEECH_DATA_ROOT should look like:

├── en-de
│   ├── config_wave.yaml
│   ├── data
│   ├── para_text
│   ├── spm_unigram10000_wave_joint.model
│   ├── spm_unigram10000_wave_joint.txt
│   ├── spm_unigram10000_wave_joint.vocab
│   ├── train_wave_triple.tsv
│   ├── train_wave_en_asr.tsv
│   ├── dev_wave_triple.tsv
│   ├── dev_wave_en_asr.tsv
│   ├── tst-COMMON_wave_triple.tsv
│   ├── tst-COMMON_wave_en_asr.tsv
│   ├── tst-HE_wave_triple.tsv
│   └── tst-HE_wave_en_asr.tsv
└── MUSTC_v1.0_en-de.tar.gz

Each .tsv file is formatted as:

id	audio	n_frames	src_text	tgt_text	speaker

tgt_text contains the source (English) transcription in the XXX_en_asr.tsv files, and the target translation in the XXX_wave_triple.tsv files.
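To spot-check a split, you can print the id, src_text and tgt_text columns of the first few rows (a quick inspection with standard tools, assuming the layout above):

head -n 3 $SPEECH_DATA_ROOT/en-$target/train_wave_triple.tsv | cut -f 1,4,5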

  5. Prepare the WMT datasets:

bash egs/prepare_data/prepare-wmt-en2any.sh

This step produces two folders, mt_data and mt_data_expand, in $SPEECH_DATA_ROOT/en-$target. The former contains only the parallel text from the WMT datasets, while the latter also includes the in-domain MuST-C text data.
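You can confirm that both folders were created:

ls -d $SPEECH_DATA_ROOT/en-$target/mt_data $SPEECH_DATA_ROOT/en-$target/mt_data_expand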

Training

  1. Pre-training on MT data:
export CUDA_VISIBLE_DEVICES=0,1
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT

Our experiments are carried out on 2 V100 GPUs. If you want to use more or fewer GPUs, please adjust the update-freq in the training scripts accordingly (see the scaling note after the command below). Also, if you want to leverage the in-domain parallel text data from MuST-C, add the --expand flag like this:

bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT --expand
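The effective batch size is the product of GPU count, per-GPU batch size, and update-freq, so keep that product constant when changing the number of GPUs. For illustration (the concrete update-freq value lives inside the training script, so the numbers here are assumptions):

# illustration: if the script sets --update-freq 2 for 2 GPUs,
# a single-GPU run needs --update-freq 4 for the same effective batch size
export CUDA_VISIBLE_DEVICES=0
# after doubling update-freq inside egs/scripts/train-en2any-MT.sh:
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT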
  2. Zero-shot fine-tuning:
bash egs/scripts/train-en2any-zero-shot-ST.sh

Note: This step will automatically use the same MT data as the pre-training step.

Evaluation

  1. Average checkpoints and evaluate

We average the last 5 checkpoints and evaluate the averaged model.

bash egs/scripts/eval-en2any-ST.sh
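The script above is expected to handle the averaging internally. If you prefer to average checkpoints by hand, fairseq ships scripts/average_checkpoints.py; a minimal sketch, where the checkpoint directory name is an assumption about your setup:

# hypothetical checkpoint directory under $SAVE_ROOT; adjust to your run
CKPT_DIR=$SAVE_ROOT/en-$target
python fairseq/scripts/average_checkpoints.py \
    --inputs $CKPT_DIR \
    --num-epoch-checkpoints 5 \
    --output $CKPT_DIR/avg_last_5.pt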

License

The code depends on the fairseq code base and therefore carries the MIT License of the original code.

Contact

If you have any questions, please feel free to contact us by sending an email to [email protected]

