translation-for-code-switching-acl's Introduction

From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

Full paper: https://aclanthology.org/2021.acl-long.245.pdf
Poster: TODO

Introduction

Data

All-CS

The dataset we collected is provided in All-CS.json.
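A quick way to check what the file contains is to load it and count the entries under each top-level key. The following is a minimal sketch only: the internal layout of All-CS.json is not described here, so the helper `summarize_json` (a hypothetical name) makes no assumption about key names and simply reports sizes. It assumes the file sits in the current directory.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a mapping from top-level key to the number of items under it."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        # One entry per top-level key; non-container values count as 1.
        return {
            key: len(value) if isinstance(value, (list, dict)) else 1
            for key, value in data.items()
        }
    if isinstance(data, list):
        return {"<top-level list>": len(data)}
    return {"<scalar>": 1}


if __name__ == "__main__":
    json_path = Path("All-CS.json")  # assumed location: current directory
    if json_path.exists():
        for key, count in summarize_json(json_path).items():
            print(f"{key}: {count}")
    else:
        print(f"{json_path} not found")
```

Comparing these counts with the numbers reported in the paper is a simple sanity check that the download is complete.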

The data for PRETRAIN, OpSub-LEX and OpSub-EMT are not uploaded to GitHub because of their size. Please contact Ishan Tarunesh ([email protected]) and Syamantak Kumar ([email protected]) to get the data.

Pre-Processing

Run preprocess_data.sh to download the required tools for preprocessing.

Moses found in: /Users/ishan/Desktop/UnsupervisedMT/NMT/tools/mosesdecoder
fastBPE found in: /Users/ishan/Desktop/UnsupervisedMT/NMT/tools/fastBPE
fastBPE compiled in: /Users/ishan/Desktop/UnsupervisedMT/NMT/tools/fastBPE/fastBPE/fast
fastText found in: /Users/ishan/Desktop/UnsupervisedMT/NMT/tools/fastText
fastText compiled in: /Users/ishan/Desktop/UnsupervisedMT/NMT/tools/fastText/fasttext
Extracting vocabulary...

Errors you might encounter when running preprocess_data.sh

Traceback (most recent call last):
  File "/Users/ishan/Desktop/UnsupervisedMT/NMT/preprocess.py", line 28, in <module>
    dico = Dictionary.read_vocab(voc_path)
  File "/Users/ishan/Desktop/UnsupervisedMT/NMT/src/data/dictionary.py", line 111, in read_vocab
    assert line[0] not in word2id and line[1].isdigit(), (i, line)
AssertionError: (4, ['<unk>', '1322901'])

=> In that case you are probably running the original src/data/dictionary.py file. We slightly modified lines 110-112 of that file to accommodate the <unk> token. Use the src/data/dictionary.py file from this repository.
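To illustrate why the assertion fires, here is a hypothetical, standalone sketch of a vocabulary reader that tolerates a special token such as `<unk>` appearing in the vocabulary file. It is not the code in this repository's src/data/dictionary.py: the function name `read_vocab_lines`, the special-token list and its order are all assumptions made for illustration.

```python
SPECIAL_TOKENS = ["<s>", "</s>", "<pad>", "<unk>"]  # assumed set and order


def read_vocab_lines(lines, special_tokens=SPECIAL_TOKENS):
    """Build a word -> id map from 'word count' lines.

    Special tokens get reserved ids first. If one of them also appears
    in the vocabulary file, it is skipped instead of tripping a
    'word already seen' check.
    """
    word2id = {tok: idx for idx, tok in enumerate(special_tokens)}
    for i, raw in enumerate(lines):
        fields = raw.rstrip().split()
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError(f"malformed vocabulary line {i}: {raw!r}")
        word = fields[0]
        if word in special_tokens:
            # Already has a reserved id; skip rather than fail.
            continue
        if word in word2id:
            raise ValueError(f"duplicate word on line {i}: {word!r}")
        word2id[word] = len(word2id)
    return word2id


if __name__ == "__main__":
    sample = ["the 10", "<unk> 1322901", "cat 3"]
    print(read_vocab_lines(sample))
```

A strict reader that rejects any word already present in the map would fail on the `<unk>` line, which matches the shape of the traceback above.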

Traceback (most recent call last):
  File "/Users/ishan/Desktop/UnsupervisedMT/NMT/preprocess.py", line 31, in <module>
    data = Dictionary.index_data(txt_path, bin_path, dico)
  File "/Users/ishan/Desktop/UnsupervisedMT/NMT/src/data/dictionary.py", line 132, in index_data
    assert dico == data['dico']
AssertionError

=> This means the vocab dictionary stored in your .pth files doesn't match the vocab.all file, so the .pth files are probably outdated. Delete the .pth files and re-run preprocess_data.sh => rm -r data//.pth
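If you prefer to review what will be deleted first, the cleanup can be sketched in Python. This is an illustrative helper, not part of the repository: `remove_stale_pth` is a hypothetical name, and it assumes the binarized .pth files live somewhere under a `data` directory. It defaults to a dry run that only lists the files.

```python
from pathlib import Path


def remove_stale_pth(data_dir, dry_run=True):
    """List (and optionally delete) every .pth file under data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    # Search recursively so files in nested subdirectories are included.
    targets = sorted(p for p in root.rglob("*.pth") if p.is_file())
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


if __name__ == "__main__":
    # Dry run: print the files that would be removed, delete nothing.
    for path in remove_stale_pth("data", dry_run=True):
        print(path)
```

Call it again with `dry_run=False` once the list looks right, then re-run preprocess_data.sh to regenerate the files.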

Running TCS Model

Other Experiments

translation-for-code-switching-acl's Issues

AllCS Dataset Download

Hi, I downloaded the AllCS Dataset as given here, but noticed that it only contains around 14.6k samples between the train, valid and test splits. In your paper you've mentioned there being 21.4k samples in total which is significantly greater. Is there anywhere I can download the full dataset?