
mtet's Introduction

MTet: Multi-domain Translation for English-Vietnamese


Introduction

We are excited to introduce MTet, a new, larger, and higher-quality machine translation dataset; the name stands for Multi-domain Translation for English and VieTnamese. In this release, we extend our previous dataset (v1.0) with more high-quality English-Vietnamese sentence pairs across a variety of domains. We also show that our new, larger Transformer models achieve state-of-the-art results on multiple test sets.

Get the data and models on Google Cloud Storage.

Visit our 📄 paper or 📝 blog post for more details.


HuggingFace 🤗

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


model_name = "VietAI/envit5-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)  
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.cuda()

inputs = [
    "vi: VietAI là tổ chức phi lợi nhuận với sứ mệnh ươm mầm tài năng về trí tuệ nhân tạo và xây dựng một cộng đồng các chuyên gia trong lĩnh vực trí tuệ nhân tạo đẳng cấp quốc tế tại Việt Nam.",
    "vi: Theo báo cáo mới nhất của Linkedin về danh sách việc làm triển vọng với mức lương hấp dẫn năm 2020, các chức danh công việc liên quan đến AI như Chuyên gia AI (Artificial Intelligence Specialist), Kỹ sư ML (Machine Learning Engineer) đều xếp thứ hạng cao.",
    "en: Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.",
    "en: We're on a journey to advance and democratize artificial intelligence through open source and open science."
    ]

batch = tokenizer(inputs, return_tensors="pt", padding=True).to('cuda')
# Pass the full batch (input_ids + attention_mask) so padded positions are masked correctly.
outputs = model.generate(**batch, max_length=512)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# ['en: VietAI is a non-profit organization with the mission of nurturing artificial intelligence talents and building an international - class community of artificial intelligence experts in Vietnam.',
#  'en: According to the latest LinkedIn report on the 2020 list of attractive and promising jobs, AI - related job titles such as AI Specialist, ML Engineer and ML Engineer all rank high.',
#  'vi: Nhóm chúng tôi khao khát tạo ra những khám phá có ảnh hưởng đến mọi người, và cốt lõi trong cách tiếp cận của chúng tôi là chia sẻ nghiên cứu và công cụ để thúc đẩy sự tiến bộ trong lĩnh vực này.',
#  'vi: Chúng ta đang trên hành trình tiến bộ và dân chủ hoá trí tuệ nhân tạo thông qua mã nguồn mở và khoa học mở.']
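
Note that each input is prefixed with its source-language tag ("vi:" or "en:"), and the model prefixes its output with the target-language tag, as in the expected outputs above. If no GPU is available, a minimal device-agnostic sketch of the same snippet (assuming the model and tokenizer loaded above):

import torch

# Fall back to CPU when no GPU is available; keep model and inputs on one device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
batch = tokenizer(inputs, return_tensors="pt", padding=True).to(device)
outputs = model.generate(**batch, max_length=512)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))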

Results

(Results table image omitted; see the paper or blog post above for the benchmark comparisons on multiple test sets.)

Citation

@misc{https://doi.org/10.48550/arxiv.2210.05610,
  doi = {10.48550/ARXIV.2210.05610},
  url = {https://arxiv.org/abs/2210.05610},
  author = {Ngo, Chinh and Trinh, Trieu H. and Phan, Long and Tran, Hieu and Dang, Tai and Nguyen, Hieu and Nguyen, Minh and Luong, Minh-Thang},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {MTet: Multi-domain Translation for English and Vietnamese},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Using the code

This code is built on top of vietai/dab.

To prepare for training, generate tfrecords from raw text files:

python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem

To train a Transformer model on the generated tfrecords:

python t2t_trainer.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_to_save_checkpoints

To run inference on the trained model:

python t2t_decoder.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_contains_checkpoints \
--checkpoint_path=$path_to_checkpoint

In this Colab notebook, we demonstrate how to run these three phases with data and models hosted on Google Cloud Storage.
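
For example, a sketch of the training step with data hosted directly on GCS (tensor2tensor reads gs:// paths through tf.gfile; the bucket name here is a placeholder):

python t2t_trainer.py \
--data_dir=gs://your_bucket/mtet/data \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=gs://your_bucket/mtet/checkpoints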


Dataset

Our data contains roughly 4.2 million pairs of texts spanning multiple domains, such as medical publications, religious texts, engineering articles, literature, news, and poems. A more detailed breakdown of our data is shown in the table below.

Domain                      v1           v2 (MTet)
Fictional Books             333,189      473,306
Legal Document              1,150,266    1,134,813
Medical Publication         5,861        13,410
Movies Subtitles            250,000      721,174
Software                    79,912       79,132
TED Talk                    352,652      303,131
Wikipedia                   645,326      1,094,248
News                        18,449       18,389
Religious texts             124,389      48,927
Educational content         397,008      213,284
No tag                      5,517        63,896
Total                       3,362,569    4,163,710

Data sources are described in more detail here.
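
As an illustration only, a minimal sketch for reading the corpus, assuming it is distributed as parallel plain-text files with one sentence per line (the file names train.en and train.vi are hypothetical; check the GCS bucket for the actual layout):

# Hypothetical file names; sentences are aligned by line number.
with open("train.en", encoding="utf-8") as f_en, open("train.vi", encoding="utf-8") as f_vi:
    pairs = [(en.strip(), vi.strip()) for en, vi in zip(f_en, f_vi)]

print(f"{len(pairs):,} sentence pairs loaded")
print(pairs[0])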

Acknowledgment

We would like to thank Google for supporting us with Cloud credits and TPU quota!

mtet's People

Contributors

heraclex12, justinphan3110, lmthang, ntkchinh, thtrieu


mtet's Issues

Lack of t2t_datagen.py

I followed your notebook and got the following error.

python3: can't open file '/content/sat/t2t_datagen.py': [Errno 2] No such file or directory

Issue: Difficulty in Extracting mTet V1.0 and V2.0 Datasets

Hi VietAI team,

I want to express my gratitude for publishing the second version of the mTet dataset (v1.0 and v2.0). I followed the provided instructions for downloading the dataset from the shared file. However, I encountered an issue when using the source code provided on GitHub to extract the dataset into the tfrecord format.

Steps Taken:

  • Original command used for generating tfrecords from raw text files:

python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem

Environment Setting:

  • Tensorflow version: 1.15
  • Tensor2tensor version: 1.13.3

Issue Details:
The provided code only downloads the EnviIwslt32k dataset and converts it to a readable format and tfrecord files. However, the two data files from the mTet dataset (highlighted in the attached image) are not being extracted.

Do the authors have any suggestions or updates on how to successfully extract these datasets? Thank you so much for your assistance and your valuable contributions.

Demo website is not working

Hi,
this seems like the easiest place to reach out: https://demo.vietai.org/ is down, and it looks like the page tried to serve a 404 error page.

Connection failed with status 404; the response was Google's standard error page: "Error 404 (Not Found). The requested URL /healthz was not found on this server. That's all we know."

Question about loading model

I have a question about loading the model. I trained a Russian-to-Vietnamese model based on your code and tensor2tensor. Every time I want to translate a new sentence, the model is loaded again, even if I translated another sentence right before. Is there a way to avoid reloading the model for every new sentence? Thank you very much.
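
Not an official answer, but one common pattern is to load the model once and loop over sentences; tensor2tensor's t2t_decoder also has a --decode_interactive flag that keeps the model in memory and reads from stdin. A minimal sketch using the HuggingFace checkpoint from the README above:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer once at startup...
tokenizer = AutoTokenizer.from_pretrained("VietAI/envit5-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/envit5-translation").cuda()

# ...then translate any number of sentences without reloading.
def translate(sentence):
    batch = tokenizer([sentence], return_tensors="pt").to("cuda")
    outputs = model.generate(**batch, max_length=512)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(translate("en: Hello, how are you?"))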

Where to download v1 data files

Hello,

Since the v2 dataset files are not yet available, I'm wondering where I can download the v1 dataset files. It seems that none of the files within the GCS bucket can be downloaded (I get 403 Forbidden every time).

Thank you and best regards.

Data leakage issue in evaluation?

Hi team @lmthang @thtrieu @heraclex12 @hqphat @KienHuynh

The results obtained by a Transformer-based model on the PhoMT test set surprised me. My first thought was that, since the VietAI and PhoMT datasets have several overlapping domains (e.g. Wikihow, TED talks, Opensubtitles, news), there might be a potential data leakage issue in your evaluation (e.g. PhoMT English-Vietnamese test pairs appearing in the VietAI training set).

In particular, we find that 6294/19151 PhoMT English-Vietnamese test pairs appear in the VietAI training set (v2). When evaluating your model on the PhoMT test set, did you retrain the model on a variant of the VietAI training set that does not contain the PhoMT English-Vietnamese test pairs?

Cheers,
Dat.
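
For readers who want to reproduce this kind of overlap check, a minimal sketch (file names are hypothetical; pairs are compared after simple whitespace normalization):

def normalize(s):
    return " ".join(s.split()).lower()

# Hypothetical file layout: one tab-separated English-Vietnamese pair per line.
with open("vietai_train_v2.tsv", encoding="utf-8") as f:
    train_pairs = {tuple(normalize(x) for x in line.rstrip("\n").split("\t")) for line in f}

with open("phomt_test.tsv", encoding="utf-8") as f:
    test_pairs = [tuple(normalize(x) for x in line.rstrip("\n").split("\t")) for line in f]

overlap = sum(pair in train_pairs for pair in test_pairs)
print(f"{overlap}/{len(test_pairs)} test pairs also appear in the training set")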

Got RuntimeError when running on Google Colab

I ran the README samples on Google Colab with a GPU and got this error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)".

Error code:
outputs = model.generate(tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to('cuda'), max_length=512)
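
Not part of the original report, but this error typically means the model stayed on the CPU while the inputs were moved to CUDA (e.g. the model.cuda() line from the README was skipped); moving both to the same device resolves it:

model = model.cuda()  # the model must be on the same device as the inputs
batch = tokenizer(inputs, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**batch, max_length=512)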

I have an issue running the decoder

Data loss: Unable to open table file /content/drive/MyDrive/SAT/checkpoint: Failed precondition: /content/drive/MyDrive/SAT/checkpoint; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?

I used a pretrained model: model.augmented.envi.ckpt-1415000.data-00000-of-00001, model.augmented.envi.ckpt-1415000.index, and model.augmented.envi.ckpt-1415000.meta. All 3 files are placed in the checkpoint directory.

Could somebody help me with this issue?
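
Not an official answer, but this TensorFlow error usually appears when --checkpoint_path points at a directory instead of a checkpoint prefix. Assuming the three files above sit in /content/drive/MyDrive/SAT/checkpoint, passing the prefix may help:

--checkpoint_path=/content/drive/MyDrive/SAT/checkpoint/model.augmented.envi.ckpt-1415000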
