GLUECoS: An Evaluation Benchmark for Code-Switched NLP

NEW (Oct - 2020): Please check our updated policy about making submissions for evaluation here

NEW (Sep - 2020): NLI dataset preprocess script updated to fix repetitions in data. If you have downloaded the datasets before, please check this section

NEW (Aug - 2020): Evaluation is now automated and results are presented instantly. Please check this section

This is the repo for the ACL 2020 paper GLUECoS: An Evaluation Benchmark for Code-Switched NLP

GLUECoS is a benchmark comprising multiple code-mixed tasks across 2 language pairs (En-Es and En-Hi).

Recording of talk given at ACL: Link

Below are instructions for obtaining the datasets that comprise the benchmark and for training transformer-based models on this data. The two steps can be run on separate systems, and the instructions are structured accordingly. To train on a different system, all you have to do is copy the Data/Processed_Data folder over to it.

Obtaining Datasets

Follow the instructions below to download and process the datasets. All the steps should work in a brand new conda environment with python==3.6.10 or a docker container with the python:3.6 image. Please note that the splits for some of the datasets are different from their original releases.

Install the requirements for the preprocessing scripts
```
pip install -r requirements.txt
```
Create a Twitter developer account and fill in the 4 keys, one per line, in twitter_authentication.txt. The file should look like this:
```
consumer_key
secret_key
access_token
access_secret_token
```
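Before starting the download, it can help to confirm that the file really has the four-line layout shown above. The following is a minimal sketch, not part of the repo: `read_twitter_keys` is a hypothetical helper, and it assumes the keys appear in exactly the order listed above, with blank lines ignored.

```python
from pathlib import Path

# Order of the four keys as listed in the README (assumed to be significant).
EXPECTED_KEYS = ["consumer_key", "secret_key", "access_token", "access_secret_token"]


def read_twitter_keys(path="twitter_authentication.txt"):
    """Hypothetical helper: return the four keys as a dict, or raise ValueError."""
    lines = [line.strip() for line in Path(path).read_text().splitlines()]
    lines = [line for line in lines if line]  # ignore blank lines
    if len(lines) != len(EXPECTED_KEYS):
        raise ValueError(
            "expected {} keys, found {}".format(len(EXPECTED_KEYS), len(lines))
        )
    return dict(zip(EXPECTED_KEYS, lines))
```

The preprocessing scripts in the repo may read the file differently; this check only guards against a missing or extra line.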
Obtain a key for Microsoft Translator. This is needed because the preprocessing steps involve converting Romanized datasets into Devanagari. Instructions for obtaining this key can be found here. While creating the translator instance, please set the region to global. The number of queries made falls within the free tier. This key will be referred to as SUBSCRIPTION_KEY in the next step.
To download the data, run the command below. This will download the original datasets, perform all the preprocessing needed and bring them into a format that the training scripts can use
```
./download_data.sh SUBSCRIPTION_KEY
```
The downloaded and processed data is stored in Data/Processed_Data.

Some of the datasets did not have predefined splits, so the splits used for those can be found in Data/Original_Data.

Please note that the labels for the test sets are not the gold labels. They have been assigned a separate token to maintain fairness in the benchmarking.

This will not download/preprocess the QA dataset. For that, please follow the next step.
The original QA dataset (Chandu et al., 2018) contains contexts only for some examples. To obtain contexts for the rest, DrQA is used to retrieve them from a Wikipedia dump. To run this, you will need at least 20GB of disk storage (to store the Wikipedia dump) and 16GB+ of RAM (to run DrQA). DrQA uses PyTorch, so having a GPU will help speed it up (although it isn't necessary).
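As a rough pre-flight check for the requirements above, the sketch below compares free disk space and installed RAM against the quoted figures. It is only an illustration: `check_qa_resources` is a hypothetical helper, the thresholds are treated as GiB, and the RAM check is skipped on platforms where `os.sysconf` does not expose the needed values.

```python
import os
import shutil

GIB = 1024 ** 3


def check_qa_resources(path=".", min_disk_gib=20, min_ram_gib=16):
    """Hypothetical helper: return a list of warnings (empty if both checks pass)."""
    warnings = []
    free_gib = shutil.disk_usage(path).free / GIB
    if free_gib < min_disk_gib:
        warnings.append(
            "only {:.1f} GiB free disk, need about {}".format(free_gib, min_disk_gib)
        )
    try:
        ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (AttributeError, ValueError, OSError):
        ram_gib = None  # not available on this platform, so skip the RAM check
    if ram_gib is not None and ram_gib < min_ram_gib:
        warnings.append(
            "only {:.1f} GiB RAM, need about {}".format(ram_gib, min_ram_gib)
        )
    return warnings
```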

First, install a suitable version of PyTorch for your system. In most cases, running pip install torch should suffice.

To download and process the QA dataset, run the following command
```
bash Data/Preprocess_Scripts/preprocess_qa.sh
```

NLI Preprocess Script Update

The data downloading and preprocessing scripts were updated in Sep - 2020 to fix an issue with the creation of the NLI train and test sets. Running the scripts as is will download all the datasets, so you do not have to make any changes if you're doing it for the first time. If you downloaded the datasets before this fix was added, you can follow these steps to get the updated NLI data alone.

Make sure you have the latest version of the repo
Comment out lines 390-397 and 399-401 of download_data.sh
Run the updated download_data.sh to create the new NLI dataset alone
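If you would rather not edit the script by hand, the commenting-out step can be sketched as below. This is an illustrative helper, not part of the repo: `comment_out` is a hypothetical name, and the line ranges are the ones quoted above, which are only meaningful for the latest version of download_data.sh.

```python
def comment_out(lines, ranges, prefix="# "):
    """Return a copy of `lines` with the 1-indexed inclusive `ranges` commented out.

    Lines that already start with '#' are left unchanged, so the helper
    can be applied twice without doubling the prefix.
    """
    targets = {n for start, end in ranges for n in range(start, end + 1)}
    return [
        prefix + line
        if i in targets and not line.lstrip().startswith("#")
        else line
        for i, line in enumerate(lines, start=1)
    ]


# Line ranges quoted in the steps above (valid only for the latest script).
NLI_ONLY_RANGES = [(390, 397), (399, 401)]
```

To apply it, read the script with `open("download_data.sh").read().splitlines(keepends=True)`, pass the result and `NLI_ONLY_RANGES` to `comment_out`, and write the returned lines back. Keeping a backup copy of the original script is advisable.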

Training models on the data

The code contains 4 different evaluation scripts

One script for the token level tasks:
- LID (en_es/en_hi)
- NER (en_es/en_hi)
- POS (en_es/en_hi_fg/en_hi_ud)

One script for the sentence level tasks:
- Sentiment (en_es/en_hi)

One script for the QA task:
- QA (en_hi)

One script for the NLI task:
- NLI (en_hi)

You can train the models on your system or via Azure Machine Learning. To know more about the latter, please refer to this README.

Install the training requirements

Note: The requirements for dataset preprocessing and training are listed separately, as you may run the two steps on different systems.

Install a suitable version of PyTorch for your system; pip install torch should work in most cases.
Then install the requirements listed in Code/requirements.txt:
```
pip install -r Code/requirements.txt
```

Training

Run the command below to fine-tune your model on any of the tasks. The training scripts use the Huggingface library and support models based on BERT, XLM, XLM-Roberta and similar architectures.

bash train.sh MODEL MODEL_TYPE TASK

Example Usage :

bash train.sh bert-base-multilingual-cased bert POS_EN_HI_FG

You can also run fine-tuning for all tasks with the following command :

bash train.sh bert-base-multilingual-cased bert ALL
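To fine-tune on a chosen subset of tasks rather than a single task or ALL, the commands above can be generated in a loop. The sketch below only builds and prints the command lines; `train_command` and `train_commands` are hypothetical helpers, and the task names used are the ones that appear elsewhere in this README (POS_EN_HI_FG and NLI_EN_HI).

```python
import shlex


def train_command(model, model_type, task):
    """Build the argument list for `bash train.sh MODEL MODEL_TYPE TASK`."""
    return ["bash", "train.sh", model, model_type, task]


def train_commands(model, model_type, tasks):
    """Build one train.sh invocation per task, in the order given."""
    return [train_command(model, model_type, task) for task in tasks]


if __name__ == "__main__":
    commands = train_commands(
        "bert-base-multilingual-cased", "bert", ["POS_EN_HI_FG", "NLI_EN_HI"]
    )
    for cmd in commands:
        # Printing only; pass `cmd` to subprocess.run(cmd, check=True) to train.
        print(" ".join(shlex.quote(part) for part in cmd))
```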

Submitting Predictions for Evaluation

Submission is done by uploading the results to a fork of this repo and making a pull request to the main repo. The evaluation is done automatically by a set of actions that run for the PR.

The training scripts supplied write predictions for the test set into the Results folder.

Zip this folder into results.zip with zip results.zip -r Results.
Create a fork of microsoft/GLUECoS on GitHub.
Add this results.zip file to the root directory of your fork and make a pull request to the main repo.

A set of actions will run for your pull request. Clicking on "Show all checks" will reveal that one of these is named "Eval script". Clicking on "Details" will take you to the sequence of steps run for the action. Expanding the "Run Eval" stage will show you the results of the eval script.

If you would like to make another submission, you can update the same PR with the new results.zip file and the action will run again. You DO NOT need to open a new PR each time. Please wait till the current action finishes running before updating the PR with the new submission.

Please ensure that the zip file has exactly the structure shown below. The eval script will fail if there are any differences in the names or the structure:

results.zip
    └── Results
        ├── NLI_EN_HI
        │   └── test_predictions.txt
        ├── QA_EN_HI
        │   └── predictions.json
        .
        .
        .
        └── Sentiment_EN_HI
            └── test_predictions.txt
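Since the eval script fails on any deviation from this layout, it may be worth checking the archive locally before opening the pull request. The sketch below is one possible check, not the repo's eval script: `check_results_zip` is a hypothetical helper, and it assumes that QA tasks use predictions.json while every other task uses test_predictions.txt, which is inferred from the tree above.

```python
import zipfile


def check_results_zip(path="results.zip"):
    """Hypothetical helper: return a list of layout problems (empty if none)."""
    problems = []
    with zipfile.ZipFile(path) as zf:
        # Skip directory entries; only files are checked.
        files = [name for name in zf.namelist() if not name.endswith("/")]
    if not files:
        problems.append("archive contains no files")
    for name in files:
        parts = name.split("/")
        if len(parts) != 3 or parts[0] != "Results":
            problems.append("unexpected path: " + name)
            continue
        task, filename = parts[1], parts[2]
        # Assumption: only QA tasks write predictions.json.
        expected = "predictions.json" if task.startswith("QA_") else "test_predictions.txt"
        if filename != expected:
            problems.append("expected {} in {}, found {}".format(expected, task, filename))
    return problems
```

An empty list means the archive matches the assumed layout; it does not guarantee that the prediction files themselves are well formed.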

You can make as many submissions as you want. Beyond the 5th submission, your best score will be added to the leaderboard. We will use your GitHub username for the leaderboard. If you would instead like your group's name/affiliation to appear on the leaderboard, please mention this, along with details about the model, in the pull request.

Citation

Please use the following citation if you use this benchmark:

@inproceedings{khanuja-etal-2020-gluecos,
    title = "{GLUEC}o{S}: An Evaluation Benchmark for Code-Switched {NLP}",
    author = "Khanuja, Simran  and
      Dandapat, Sandipan  and
      Srinivasan, Anirudh  and
      Sitaram, Sunayana  and
      Choudhury, Monojit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.329",
    pages = "3575--3585"
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
