soumendrak / mtenglish2odia

Machine Translation from English to Odia language.

Home Page: https://mte2o.com

License: GNU General Public License v3.0

Python 0.43% Jupyter Notebook 99.57% HTML 0.01%

machine-translation odia-language odia indic-languages parallel-corpus python3

mtenglish2odia's Introduction

Project moved

This project has moved to the OdiaNLP GitHub organization. For more details, please visit: https://odianlp.github.io/

MTEnglish2Odia

Approximate number of reviewed En-Or parallel pairs: 42,000

Machine Translation from English to Odia language

This is a healthy start toward building automated machine translation from English to Odia. It has been built mainly to help increase the number of quality Odia Wikipedia articles by translating from English Wikipedia. The approach is to start by building a parallel corpus between English and Odia, which interested people can later use for SMT (Statistical Machine Translation) or NMT (Neural Machine Translation).

A dump of around 9,000 un-reviewed, raw English-Odia parallel pairs is available in this file as pipe-separated phrases or sentences.
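
A minimal sketch of how such a pipe-separated dump could be read. It assumes one pair per line with the English text before the pipe and the Odia text after it; the function name `parse_pairs` and the sample lines are hypothetical and not taken from this repository.

```python
from typing import List, Tuple


def parse_pairs(lines: List[str]) -> List[Tuple[str, str]]:
    """Parse 'English|Odia' lines into (english, odia) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first pipe only (assumption: the English side has no pipe).
        english, sep, odia = line.partition("|")
        if not sep or not english.strip() or not odia.strip():
            continue  # skip malformed lines that are missing a side
        pairs.append((english.strip(), odia.strip()))
    return pairs


# Hypothetical sample input: one valid pair, one blank line, one malformed line.
sample = ["Thank you|ଧନ୍ୟବାଦ", "", "no separator here"]
print(parse_pairs(sample))
```

Malformed lines are skipped silently here; a real curation script might prefer to log them for review.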

For more details visit the website of this repository : MTEnglish2Odia

How can I contribute to this repository?

Click here to read a general guide on how to contribute to a Github open source project for beginners.

What can I contribute?

You can send English-Odia word, phrase or sentence pairs in the format below, in a new file named after you and the type of data.
Please put the file under Individual_files.
For example, if your name is Satyabrata and you want to upload generic phrases:

Key	Example
Filename	`satyabrata.txt`
File upload path	`data/Individual_files/satyabrata.txt`
File text format	`Why are you so lazy?

Please make sure you have the correct permissions to upload this data under the GPL license.

Tutorials on how to fork a repository and send a PR can be found in this video, this video, this GitHub doc tutorial on forking and this one on pull requests.
Your pull request will be reviewed first.
Please follow up if any comments or modifications are needed on your pull request.
In case of any confusion, please contact [email protected]. You will get a response within a day or two.

License

GPL v3.0

This project has been created to translate from English to Odia through machine translation. It has been built mainly to increase the number of quality pages on Odia Wikipedia. Under the current plan, parallel English-Odia translation data will be collected first. Once enough data has been collected, it will be used first with statistical machine translation and later with neural machine translation, and the accuracy of the translations will be measured. Once acceptable accuracy is achieved, it will be dedicated to the general public.

Contributors ✨

Thanks go to these wonderful people (emoji key):

subhadarship
💻 🎨 🤔

kamakshyaP
🖋

Soumendra kumar sahoo
🤔 🎨 📖 💻 🖋

This project follows the all-contributors specification. Contributions of any kind welcome!

mtenglish2odia's People

codacy-badger akpanda fossabot colorfulpanda88 kamakshyap pruthwik h0m3brew

mtenglish2odia's Issues

CI/CD

Add CI/CD to this repository.

Prepare the list of items needed to go live

The translation system needs to go live with however many pairs are available (5k-50k). This issue is for preparing the list of tasks to be done before going live.

Prepare basic English-Odia corpus

Start the En-Or parallel corpus with the help of the Wikipedia dumps.

Curate Wiki Translate Corpus

Create separate files for word-to-word and phrase-to-phrase pairs
Divide multi-sentence entries into smaller chunks, maybe sentence-wise?
Convert the Jupyter notebook code into pure Python code?
Handle lemmatization, stemming, inflection and derivation
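
As an illustration of the sentence-wise chunking idea, here is a naive sketch that splits text after `.`, `?`, `!` and the danda `।` (U+0964) used in Odia. The function name `split_sentences` is hypothetical, and the rule is deliberately simple: it will wrongly split on abbreviations such as "e.g." and is not the method used in this repository.

```python
import re
from typing import List

# Split on whitespace that follows a sentence terminator:
# '.', '?', '!' for English and the danda (U+0964) for Odia.
_SENTENCE_END = re.compile(r"(?<=[.?!\u0964])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the non-empty sentences found in text, terminators included."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


print(split_sentences("Hello. How are you? Fine!"))
```

Splitting both sides of a pair this way only helps if the English and Odia texts yield the same number of sentences, so a mismatch is a useful signal that the pair needs manual review.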

Depfu Error: No dependency files found

Hello,

We've tried to activate or update your repository on Depfu and couldn't find any supported dependency files. If we were to guess, we would say that this is not actually a project Depfu supports and has probably been activated by error.

Monorepos

Please note that Depfu currently only searches for your dependency files in the root folder. We do support monorepos and non-root files, but don't auto-detect them. If that's the case with this repo, please send us a quick email with the folder you want Depfu to work on and we'll set it up right away!

How to deactivate the project

Go to the Settings page of either your own account or the organization you've used
Go to "Installed Integrations"
Click the "Configure" button on the Depfu integration
Remove this repo (soumendrak/MTEnglish2Odia) from the list of accessible repos.

Please note that using the "All Repositories" setting doesn't make a lot of sense with Depfu.

If you think that this is a mistake

Please let us know by sending an email to [email protected].

This is an automated issue by Depfu. You're getting it because someone configured Depfu to automatically update dependencies on this project.

Unit Test Cases

Follow TDD or write unit test cases

Where possible, write the unit test cases before writing the code.

Splitting the input training pairs into train-test

What?
The finalized input pairs need to be further split into Train-Test pairs.
We may start with 80% train and 20% test pairs.

Why?
This is needed to evaluate the model's translation accuracy before going live, so that we know where we stand.
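
A minimal sketch of the proposed 80/20 split, using only the standard library. The function name `train_test_split`, the fixed seed and the placeholder pairs are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def train_test_split(
    pairs: Sequence[Pair], test_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle the pairs reproducibly and return (train, test)."""
    shuffled = list(pairs)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data standing in for the finalized corpus.
pairs = [(f"en{i}", f"or{i}") for i in range(10)]
train, test = train_test_split(pairs)
print(len(train), len(test))
```

Fixing the seed keeps the split stable between runs, so accuracy numbers from different experiments stay comparable.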

Blocked by: Issue #9

Build the translation model

Figure out which translation framework/library will be used to build the NMT model.
Our options are:

Keras
Tensorflow
PyTorch
OpenNMT

We will most likely go with a seq2seq model.

Jupyter Notebook or Python

Whether to go ahead with Jupyter Notebook or Python files?

Jupyter Notebook

Pros

Pleasant to work with and developer friendly
Handles Odia fonts, Yuktakhyars (ଯୁକ୍ତାକ୍ଷର) and Unicode characters well

Cons

Difficult to implement
May cause problems later, when we move to production
On GitHub, I hardly see any serious projects built solely on Jupyter Notebook

Python files

Pros

Production ready
Can easily be published to PyPI in the future
Unit testing will be easier
Easy for others to follow

Cons

Some IDEs do not support the Yuktakhyars (ଯୁକ୍ତାକ୍ଷର)
Analysing and visualizing large amounts of data might be a problem

Server to deploy the model

We have to choose a server on which to build and host the model. The server should have the following features:

Must have

A good free tier to begin with
The ability to export the data after the free tier expires
A database (MongoDB) to store the pairs, up to 500 MB in size
The capacity to handle at least 5 queries per second
Python 3.6+ support

Good to have

Support for training the system online, on the spot
Active learning, which we will look into later on
CI/CD with GitHub

On radar

Google Cloud Platform
Microsoft Azure
Amazon Web Services
IBM Cloud
Heroku

User Feedback support

As users use the system, we need to collect feedback from them. The model should learn from this feedback.

Not all feedback will be constructive, so a review mechanism is also needed to validate it.

We will receive three primary kinds of feedback from users:

Thumbs up, i.e. positive: the user is happy with the translated text
Thumbs down, i.e. negative: the user is unhappy with the translation
No feedback, i.e. the user gives no feedback at all

The review process should start with the most frequently queried texts and the texts that have received the most feedback.
It's fine if the review process cannot be automated in this release.
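
One way the three feedback kinds and the review ordering could be modelled is sketched below. The names `Feedback` and `review_queue` are hypothetical, and the tie-breaking rule (most feedback first, then most queried) is an assumption, since the issue does not say how the two criteria should be combined.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List, Tuple


class Feedback(Enum):
    THUMBS_UP = "up"
    THUMBS_DOWN = "down"
    NONE = "none"  # the user gave no feedback


def review_queue(events: Iterable[Tuple[str, Feedback]]) -> List[str]:
    """Order query texts for review: most feedback first, then most queried."""
    queried = Counter()
    rated = Counter()
    for text, feedback in events:
        queried[text] += 1
        if feedback is not Feedback.NONE:
            rated[text] += 1
    # Assumed ordering; the final key keeps the result deterministic on ties.
    return sorted(queried, key=lambda t: (-rated[t], -queried[t], t))


# Hypothetical log of (query text, feedback) events.
events = [
    ("a", Feedback.NONE),
    ("a", Feedback.NONE),
    ("b", Feedback.THUMBS_UP),
    ("c", Feedback.THUMBS_DOWN),
    ("c", Feedback.THUMBS_UP),
]
print(review_queue(events))
```

A human reviewer would then work through this list from the top, which matches the note that the review itself need not be automated yet.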

Finalize the input corpus pairs

The input English-Odia pairs need to be finalized, including how many pairs we are going to start with.

We should go ahead with 5k curated, high-quality pairs.
It's fine if the pairs are sentences, phrases or words.

The pairs need to be retrieved from all the Individual files and the Combined file.

The final dataset will be added into this repository.
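
A rough sketch of how pairs from several sources could be merged and de-duplicated before curation. The function name `merge_pairs`, the case-insensitive match on the English side and the `limit` parameter are assumptions made for illustration; cutting the list at 5k does not by itself select the high-quality pairs, which still need manual review.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def merge_pairs(sources: Iterable[Iterable[Pair]], limit: int = 5000) -> List[Pair]:
    """Merge pairs from several sources, dropping duplicates, up to limit."""
    seen = set()
    merged: List[Pair] = []
    for source in sources:
        for english, odia in source:
            english, odia = english.strip(), odia.strip()
            # Assumption: English text is compared case-insensitively.
            key = (english.lower(), odia)
            if key in seen:
                continue
            seen.add(key)
            merged.append((english, odia))
            if len(merged) >= limit:
                return merged
    return merged


# Hypothetical contents of one individual file and the combined file.
individual = [("Hi", "x")]
combined = [("hi ", "x"), ("Bye", "y")]
print(merge_pairs([individual, combined]))
```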

The front end of the website needs to be developed

The UI of the system needs to be developed, with the following features:

Must have

Two text boxes. One box to type/paste English text and another to display the Odia translation.
One button to click when the user is ready to translate the English text to Odia.
The button needs to call an API hosted in the cloud server.

Required

Thumbs up and thumbs down buttons to capture the user's feedback on the translated Odia response.
On a thumbs down, the user should be asked to edit the output and provide the correct translation.

Good to have

Uncluttered webpage
SSO-based login, followed by [Backend Support] the number of users who have tried our translation
[Backend Support] Number of translation pairs trained so far
[Backend Support] Number of pairs translated by users through the website
