Code Monkey home page Code Monkey logo

mtet's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

mtet's Issues

Data leakage issue in evaluation?

Hi team @lmthang @thtrieu @heraclex12 @hqphat @KienHuynh

The obtained results of a Transformer-based model on the PhoMT test set surprised me. My first thought was that as VietAI and PhoMT datasets have several overlapping domains (e.g. Wikihow, TED talks, Opensubtitles, news..): whether there might be a potential data leakage issue in your evaluation (e.g. PhoMT English-Vietnamese test pairs appear in the VietAI training set)?

In particular, we find that 6294/19151 PhoMT English-Vietnamese test pairs appear in the VietAI training set (v2). When evaluating your model on the PhoMT test set, did you guys retrain the model on a VietAI training set variant that does not contain PhoMT English-Vietnamese test pairs?

Cheers,
Dat.

Got RuntimeError when run on Google Colab

I ran the Readme.md samples on Google Colab with GPU and got this Error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)".

Error code:
outputs = model.generate(tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to('cuda'), max_length=512)

Lack of t2t_datagen.py

I followed your notebook and got the following error.

python3: can't open file '/content/sat/t2t_datagen.py': [Errno 2] No such file or directory

Where to download v1 data files

Hello,

Since v2 dataset files have not yet been available at the moment, I'm wondering where can I get download the v1 dataset files? It seems that all files within the GCS bucket cannot be downloaded (I get 403 forbidden every time).

Thank you and best regards.

I have a issue about running decoder

Data loss: Unable to open table file /content/drive/MyDrive/SAT/checkpoint: Failed precondition: /content/drive/MyDrive/SAT/checkpoint; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?

I used a pretrain model : model.augmented.envi.ckpt-1415000.data-00000-of-00001, model.augmented.envi.ckpt-1415000.index, model.augmented.envi.ckpt-1415000.meta. All 3 file are put in checkpoint

Could somebody help me with this issue ?

Question about loading model

I have a question about loading model. I have trained a Russian-to-Vietnamese model base on your code and tensor2tensor. Every time I want to predict a new sentence, it always load the model again, even before that I have already predicted another sentence. I want to ask that if there is a way not to have reload the model when predict a new sentence. Thank you very much

Issue: Difficulty in Extracting mTet V1.0 and V2.0 Datasets

Hi VietAI team,

I want to express my gratitude for publishing the 2nd version of the mTet dataset, V1.0 and V2.0. I followed the provided instructions for downloading the dataset from the shared file. However, I encountered an issue when using the source code provided on GitHub to extract the dataset into the tfrecord format.

Steps Taken:

  • Original command used for generating tfrecords from raw text files:

python t2t_datagen.py
--data_dir=$path_to_folder_contains_vocab_file
--tmp_dir=$path_to_folder_that_contains_training_data
--problem=$problem

Environment Setting:
Screenshot 2023-09-09 at 9 29 16 PM

  • Tensorflow version: 1.15
  • Tensor2tensor version: 1.13.3

Issue Details:
The output of the provided code only downloads the EnviIwslt32k dataset and converts it to a readable format and tfrecord files. However, the two data files from the mTet dataset (highlighted in the attached image) are not being extracted.

Authors have any suggestions or updates on how to successfully extract these datasets?. Thank you so much for your assistance and your valuable contributions.

Demo website is not working

Hi,
seems like the easiest to reach out here but https://demo.vietai.org/ is down, looks like the page tried to serve a 404 error page.

Connection failed with status 404, and response "<!DOCTYPE html> <html lang=en> <meta charset=utf-8> <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"> <title>Error 404 (Not Found)!!1</title> <style> *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px} </style> <a href=//www.google.com/><span id=logo aria-label=Google></span></a> <p><b>404.</b> <ins>That’s an error.</ins> <p>The requested URL <code>/healthz</code> was not found on this server. <ins>That’s all we know.</ins> ".

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.