mtenglish2odia's Introduction

Project moved

This project has been moved under the OdiaNLP GitHub organization. For more details, please visit: https://odianlp.github.io/

MTEnglish2Odia

Approx. Number of En-Or parallel reviewed pairs: 42,000

Machine Translation from English to Odia language

This project is a starting point for building automated machine translation from English to Odia. Its main goal is to help increase the number of quality Odia Wikipedia articles by translating from English Wikipedia. The approach is to first build a parallel corpus between English and Odia, which interested people can later use for SMT (Statistical Machine Translation) or NMT (Neural Machine Translation).

A dump of around 9,000 un-reviewed raw English-Odia parallel pairs is available in this file as pipe-separated phrases or sentences.
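As a rough illustration of how such a pipe-separated dump could be read, here is a minimal Python sketch. It assumes one pair per line with the English text to the left of the first `|` and the Odia text to the right; that layout, the function name `parse_pairs`, and the sample lines are assumptions made for illustration, not something taken from the repository.

```python
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str], sep: str = "|") -> List[Tuple[str, str]]:
    """Parse 'English|Odia' lines into (english, odia) tuples.

    Assumed layout: one pair per line, split on the first separator.
    Blank lines and lines with a missing side are skipped.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        english, found, odia = line.partition(sep)
        english, odia = english.strip(), odia.strip()
        if found and english and odia:
            pairs.append((english, odia))
    return pairs


# Illustrative sample lines (not taken from the actual dump).
sample = ["Hello|ନମସ୍କାର", "", "no separator here", "Thank you | ଧନ୍ୟବାଦ"]
pairs = parse_pairs(sample)
print(pairs)
```

Skipping malformed lines silently keeps the sketch short; for an un-reviewed dump it may be more useful to log them so they can be fixed by hand.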

For more details, visit the website of this repository: MTEnglish2Odia

How can I contribute to this repository?

  • Click here to read a general guide on how to contribute to a Github open source project for beginners.

What can I contribute?

  • You can send English-Odia word, phrase, or sentence pairs in the format below, in a new file named after you and the type of data.
  • Please put the file under Individual_files.
  • For example, if your name is Satyabrata and you want to upload generic phrases:

    Key                Example
    Filename           satyabrata.txt
    File upload path   data/Individual_files/satyabrata.txt
    File text format   `Why are you so lazy?

Please make sure you have the right to release this data under the GPL license.

  • A tutorial on how to fork a repository and send a pull request (PR) can be found in this video or this video, or in this GitHub doc tutorial for forking and this one for pull requests.
  • Your pull request will be reviewed first.
  • Please follow up if any comments or modifications are needed on your pull request.
  • If anything is unclear, please contact [email protected]. You should get a response within a day or two.

License

GPL v3.0


This project has been created to translate from English to Odia using machine translation. It has been built mainly to increase the number of quality pages on Odia Wikipedia. Under the current plan, parallel English-Odia translation data will be collected first. Once enough data has been collected, it will be used first with statistical machine translation and later with neural machine translation, and the accuracy of the translations will be measured. Once acceptable accuracy is reached, the system will be dedicated to the public.

Contributors ✨

Thanks go to these wonderful people (emoji key):


  • subhadarship: 💻 🎨 🤔
  • kamakshyaP: 🖋
  • Soumendra kumar sahoo: 🤔 🎨 📖 💻 🖋

This project follows the all-contributors specification. Contributions of any kind welcome!

mtenglish2odia's People

Contributors

allcontributors[bot], fossabot, kamakshyap, soumendrak, subhadarship

mtenglish2odia's Issues

CI/CD

Add CI/CD to this repository.

Curate Wiki Translate Corpus

  • Create separate files for word-to-word and phrase-to-phrase pairs
  • Divide multi-sentence entries into smaller chunks, perhaps sentence by sentence?
  • Convert the Jupyter notebook code into pure Python code?
  • Handle lemmatization, stemming, inflection, and derivation

Depfu Error: No dependency files found

Hello,

We've tried to activate or update your repository on Depfu and couldn't find any supported dependency files. If we were to guess, we would say that this is not actually a project Depfu supports and has probably been activated by error.

Monorepos

Please note that Depfu currently only searches for your dependency files in the root folder. We do support monorepos and non-root files, but don't auto-detect them. If that's the case with this repo, please send us a quick email with the folder you want Depfu to work on and we'll set it up right away!

How to deactivate the project

  • Go to the Settings page of either your own account or the organization you've used
  • Go to "Installed Integrations"
  • Click the "Configure" button on the Depfu integration
  • Remove this repo (soumendrak/MTEnglish2Odia) from the list of accessible repos.

Please note that using the "All Repositories" setting doesn't make a lot of sense with Depfu.

If you think that this is a mistake

Please let us know by sending an email to [email protected].


This is an automated issue by Depfu. You're getting it because someone configured Depfu to automatically update dependencies on this project.

Unit Test Cases

Follow TDD, or at least write unit test cases.

  • Where possible, write the unit test cases before writing the code

Splitting the input training pairs into train-test

What?
The finalized input pairs need to be split further into train and test sets.
We may start with 80% train and 20% test pairs.

Why?
This is needed to evaluate the model's translation accuracy before going live, so that we know where we stand.
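The 80/20 split described above could be sketched as follows. This is only a minimal illustration using the standard library; the function name `train_test_split`, the fixed seed, and the placeholder pairs are assumptions, and the project may well prefer a library helper instead.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def train_test_split(
    pairs: Sequence[Pair], test_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle the pairs reproducibly and split them into (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder pairs standing in for the finalized corpus.
pairs = [(f"english {i}", f"odia {i}") for i in range(10)]
train, test = train_test_split(pairs)
print(len(train), len(test))
```

Fixing the seed means everyone evaluating a model sees the same test set, which makes accuracy numbers comparable between runs.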

Blocked by: Issue #9

Build the translation model

Figure out which translation framework/library will be used to build the NMT model.
Our options are:

  • Keras
  • TensorFlow
  • PyTorch
  • OpenNMT

Most likely we will go with a seq2seq model.

Jupyter Notebook or Python

Should we go ahead with Jupyter Notebook or plain Python files?

Jupyter Notebook

Pros

  • Pleasant to work with and developer friendly
  • Friendly to Odia fonts, Yuktakhyars (ଯୁକ୍ତାକ୍ଷର), and Unicode characters

Cons

  • Difficult to implement
  • May cause problems later when we move to production
  • On GitHub, I hardly see any serious projects built solely on Jupyter Notebook

Python files

Pros

  • Production ready
  • Can be easily added to PyPI in future
  • Unit testing will be easier
  • Easy for others to follow

Cons

  • Some IDEs do not support the Yuktakhyars (ଯୁକ୍ତାକ୍ଷର)
  • Analysing and visualizing large amounts of data might be a problem

Server to deploy the model

We have to choose a server on which to build and host the model. The server should have the following features:

Must have

  • A good free tier to begin with
  • Ability to export the data after the free tier expires
  • A database (MongoDB) to store the pairs, up to 500 MB in size
  • Capacity to handle at least 5 queries per second
  • Python 3.6+ support

Good to have

  • Support for training the system online, then and there
  • Active learning, which we will look into later
  • CI/CD with GitHub

On radar

  • Google Cloud Platform
  • Microsoft Azure
  • Amazon Web Services
  • IBM Cloud
  • Heroku

User Feedback support

As users use the system, we need to collect feedback from them, and the model should learn from this feedback.

Not all feedback will be constructive, so a review mechanism is also needed to validate user feedback.

Users will give three primary kinds of feedback:

  1. Thumbs up, i.e. positive: the user is happy with the translated text.
  2. Thumbs down, i.e. negative: the user is unhappy with the translation.
  3. No feedback, i.e. the user gives no feedback at all.

The review process should start with the most frequently queried text and the text that has received the most feedback.
It's fine if the review process cannot be automated in this release.
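One way to read the review ordering above is as a simple priority sort. The sketch below assumes each usage event is recorded as a `(query_text, feedback)` tuple, where feedback is `"up"`, `"down"`, or `None`; this event format and the function name `review_order` are hypothetical, and the tie-breaking rule (query count first, then feedback count) is only one possible interpretation.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Hypothetical event format: (query_text, feedback), where feedback is
# "up", "down", or None when the user gave no feedback.
Event = Tuple[str, Optional[str]]


def review_order(events: Iterable[Event]) -> List[str]:
    """Return query texts ordered for manual review.

    Most frequently queried texts come first; ties are broken by the
    number of feedback events received, then alphabetically.
    """
    query_counts = Counter()
    feedback_counts = Counter()
    for text, feedback in events:
        query_counts[text] += 1
        if feedback is not None:
            feedback_counts[text] += 1
    return sorted(
        query_counts,
        key=lambda t: (-query_counts[t], -feedback_counts[t], t),
    )


events = [
    ("good morning", "up"),
    ("good morning", None),
    ("thank you", "down"),
    ("thank you", "down"),
    ("how are you", None),
]
queue = review_order(events)
print(queue)
```

Because the review does not have to be automated in this release, a sorted list like this would be enough to hand to a human reviewer.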

Finalize the input corpus pairs

The input English-Odia pairs need to be finalized, including how many pairs we are going to start with.

We should go ahead with 5k curated, high-quality pairs.
It's fine if the pairs are sentences, phrases, or words.

The pairs need to be retrieved from all the individual files and the combined file.

The final dataset will be added to this repository.
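Gathering pairs from several sources usually produces duplicates, so a de-duplication pass is a natural first step before manual curation. The following is a minimal sketch under the assumption that each source has already been parsed into `(english, odia)` tuples; the function name `merge_pairs` and the placeholder data are illustrative only, and the sketch does not attempt any quality filtering.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def merge_pairs(sources: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Merge pairs from several sources, dropping duplicates.

    Two pairs count as duplicates when their English sides match
    case-insensitively and their Odia sides match exactly, after
    trimming surrounding whitespace. First occurrence wins.
    """
    seen: Set[Pair] = set()
    merged: List[Pair] = []
    for source in sources:
        for english, odia in source:
            english, odia = english.strip(), odia.strip()
            key = (english.lower(), odia)
            if key in seen:
                continue
            seen.add(key)
            merged.append((english, odia))
    return merged


# Placeholder data standing in for the individual files and the combined file.
individual = [("Hello", "odia-1"), ("Thank you", "odia-2")]
combined = [("hello ", "odia-1"), ("Good night", "odia-3")]
dataset = merge_pairs([individual, combined])
print(dataset)
```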

The front end of the website needs to be developed

The UI of the system needs to be developed, with the following features:

Must have

  • Two text boxes: one to type or paste English text and another to display the Odia translation.
  • One button to click when the user is ready to translate the English text to Odia.
  • The button needs to call an API hosted on the cloud server.
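To make the button-to-API interaction above more concrete, here is a hedged sketch of what the server-side handler might look like. Nothing here comes from the project: the JSON field names (`text`, `english`, `odia`), the function `handle_translate`, and the stand-in `fake_translate` are all assumptions, and the handler is deliberately kept independent of any web framework.

```python
import json
from typing import Callable, Tuple


def handle_translate(body: str, translate: Callable[[str], str]) -> Tuple[int, str]:
    """Validate a JSON request body and return (status_code, json_response).

    Assumed request shape: {"text": "<English text>"}.
    Assumed response shape: {"english": "...", "odia": "..."}.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "invalid JSON"})
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, json.dumps({"error": "field 'text' is required"})
    text = text.strip()
    # ensure_ascii=False keeps Odia characters readable in the response.
    return 200, json.dumps(
        {"english": text, "odia": translate(text)}, ensure_ascii=False
    )


def fake_translate(text: str) -> str:
    """Stand-in for the real model, which does not exist yet."""
    return f"<odia translation of: {text}>"


status, response = handle_translate('{"text": "Good morning"}', fake_translate)
print(status, response)
```

Passing the translation function in as a parameter means the handler can be unit tested before a model is trained, and the real model can be plugged in later.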

Required

  • Thumbs-up and thumbs-down buttons to capture the user's feedback on the translated Odia response.
  • In case of a thumbs down, the user should be asked to edit the text and provide the correct translation.

Good to have

  • Uncluttered webpage
  • SSO-based login, followed by [Backend Support] the number of users who have tried our translation
  • [Backend Support] Number of translation pairs trained so far
  • [Backend Support] Number of pairs translated by users through the website
