This code reproduces the results from the ACL 2023 paper "Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages".
First, install the required dependencies:
pip install hatch gdown
Download all datasets for training the Cross-Lingual Contextualised Token Embedding Alignment and the Zero-Shot Cross-Lingual transfer to downstream tasks:
bash scripts/datasets/download_datasets.sh data
The archive contains the original files from Flickr30k, SNLI and NLVR2, which are all in English, as well as the translated files for each language required by the downstream tasks.
Note that the train/dev sets of the Flickr30k, SNLI and NLVR2 datasets were translated with the Googletrans package by running the following commands:
bash scripts/datasets/prepare_flickr30k.sh
bash scripts/datasets/prepare_snli.sh
bash scripts/datasets/prepare_nlvr2.sh
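The translation step in the scripts above relies on Googletrans. Below is a minimal, illustrative sketch of that step; the `chunked` helper and the batch size are hypothetical, not code from this repository:

```python
# Illustrative sketch of a Googletrans-based translation step.
# The `chunked` helper and batch size are hypothetical, not repo code.

def chunked(items, size):
    """Split a list of captions into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# With googletrans installed (pip install googletrans), each batch could be
# translated roughly like this (network call, so shown commented out):
# from googletrans import Translator
# translator = Translator()
# for batch in chunked(captions, 32):
#     results = translator.translate(batch, src="en", dest="de")
#     translated = [r.text for r in results]

captions = ["A dog runs.", "Two men talk.", "A child plays."]
print(chunked(captions, 2))
# → [['A dog runs.', 'Two men talk.'], ['A child plays.']]
```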
Then run the contextualised token alignment for each dataset:
bash scripts/alignment/token_alignment_flickr30k.sh
bash scripts/alignment/token_alignment_snli.sh
bash scripts/alignment/token_alignment_nlvr2.sh
This should create aligned word-pair files in the data folder for each dataset, as follows:
data/
    flickr30k/
        word_pairs_dev_en-de.json
        word_pairs_dev_en-es.json
        word_pairs_dev_en-id.json
        word_pairs_dev_en-ru.json
        word_pairs_dev_en-tr.json
        word_pairs_train_en-de.json
        word_pairs_train_en-es.json
        word_pairs_train_en-id.json
        word_pairs_train_en-ru.json
        word_pairs_train_en-tr.json
    nlvr2/
        word_pairs_dev_en-id.json
        word_pairs_dev_en-sw.json
        word_pairs_dev_en-ta.json
        word_pairs_dev_en-tr.json
        word_pairs_dev_en-zh-cn.json
        word_pairs_train_en-id.json
        word_pairs_train_en-sw.json
        word_pairs_train_en-ta.json
        word_pairs_train_en-tr.json
        word_pairs_train_en-zh-cn.json
    snli/
        word_pairs_dev_en-ar.json
        word_pairs_dev_en-es.json
        word_pairs_dev_en-fr.json
        word_pairs_dev_en-ru.json
        word_pairs_train_en-ar.json
        word_pairs_train_en-es.json
        word_pairs_train_en-fr.json
        word_pairs_train_en-ru.json
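Once generated, a word-pair file can be inspected with a few lines of Python. Note that the schema used below (a list of aligned [source, target] token pairs) is an assumption for illustration; adjust it to match the real files:

```python
import json
import os
import tempfile

# Illustrative only: the schema of the word_pairs_*.json files is assumed
# here to be a list of [source_token, target_token] pairs.
sample = [["dog", "Hund"], ["street", "Straße"]]

path = os.path.join(tempfile.mkdtemp(), "word_pairs_train_en-de.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Reading the file back gives the aligned token pairs for one language pair.
with open(path, encoding="utf-8") as f:
    pairs = json.load(f)

print(len(pairs))  # → 2
```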
Train CLiCoTEA by running the following commands (default options can be modified in the bash scripts):
# train CLiCoTEA for image/text retrieval on flickr30k in German
bash scripts/embeddings/train_clicotea.sh flickr30k albef_retrieval flickr de
# train CLiCoTEA for visual reasoning on NLVR2 in Swahili
bash scripts/embeddings/train_clicotea.sh nlvr2 albef_nlvr nlvr sw
# train CLiCoTEA for visual entailment on SNLI in French
bash scripts/embeddings/train_clicotea.sh snli albef_classification ve fr
Note that we start from pre-trained ALBEF models, which are available in the LAVIS package.
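The three invocations above all follow the same pattern: dataset, LAVIS model name, model type, target language. The helper below makes that mapping explicit; it is illustrative only, not code from this repository:

```python
# Illustrative helper (not repo code): build a train_clicotea.sh invocation
# from the dataset and target language, using the mapping shown above.
TASKS = {
    "flickr30k": ("albef_retrieval", "flickr"),       # image/text retrieval
    "nlvr2": ("albef_nlvr", "nlvr"),                  # visual reasoning
    "snli": ("albef_classification", "ve"),           # visual entailment
}

def train_command(dataset, lang):
    model_name, model_type = TASKS[dataset]
    return (f"bash scripts/embeddings/train_clicotea.sh "
            f"{dataset} {model_name} {model_type} {lang}")

print(train_command("nlvr2", "sw"))
# → bash scripts/embeddings/train_clicotea.sh nlvr2 albef_nlvr nlvr sw
```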
- Download the images for the downstream tasks from their official websites.
- Download the text data from the IGLUE benchmark with:
bash scripts/zero-shot/download_datasets.sh
- Run zero-shot evaluation:
DATA_DIR="<path to folder containing test files>"
LANG="<language to test>"
FLICKR30K_IMAGE_ROOT="<path to Flickr30k image folder>"
COCO_IMAGE_ROOT="<path to COCO image folder>"
MARVL_IMAGE_ROOT="<path to MaRVL image folder>"
PATH_TO_CHECKPOINT="<path to model checkpoint>"
- Retrieval task on xFlickrCO
bash scripts/zero-shot/zeroshot_retrieval.sh $DATA_DIR $LANG $FLICKR30K_IMAGE_ROOT $COCO_IMAGE_ROOT $PATH_TO_CHECKPOINT
- Visual entailment task on XVNLI
bash scripts/zero-shot/zeroshot_ve.sh $DATA_DIR $LANG $FLICKR30K_IMAGE_ROOT $PATH_TO_CHECKPOINT
- Visual reasoning task on MaRVL
bash scripts/zero-shot/zeroshot_vr.sh $DATA_DIR $LANG $MARVL_IMAGE_ROOT $PATH_TO_CHECKPOINT
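The retrieval script reports standard image/text retrieval metrics. As a reference, recall@K can be computed from a similarity matrix as sketched below; this is a minimal illustration, not the repository's evaluation code:

```python
# Minimal sketch of recall@K for retrieval evaluation (illustrative, not the
# repository's evaluation code). sim[i][j] is the score of query i against
# candidate j; the correct candidate for query i is assumed to be item i.
def recall_at_k(sim, k):
    hits = 0
    for i, scores in enumerate(sim):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

sim = [
    [0.9, 0.1, 0.2],  # query 0: correct candidate ranked first
    [0.3, 0.2, 0.8],  # query 1: correct candidate not in top-1
    [0.1, 0.4, 0.9],  # query 2: correct candidate ranked first
]
print(recall_at_k(sim, 1))  # → 0.6666666666666666
```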
Running all tests:
hatch run test:run
Or running a specific test:
hatch run test:run -k test_get_token_pairs
Please cite as:
@inproceedings{clicotea,
title = "Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages",
author = "Karoui, Yasmine and
Lebret, R{\'e}mi and
Foroutan Eghlidi, Negar and
Aberer, Karl",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-short.32",
pages = "366--375",
}