MTUnityDocs

Machine translation of Unity's documentation, inspired by IBM Model 1.

This is a work in progress. It uses Stanford's CoreNLP for the NLP preprocessing, while the machine translation itself is implemented from scratch.

Pipeline:

  1. Extraction The parallel corpus belongs to Unity and comes formatted for its translation platform. From the extracted information, the system prepares the data it will train, develop, and test on, split 60% for training and 20% each for development and testing. First, the system receives the training data and un-formats it; second, it removes noise by cleaning the data with regular expressions.
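
The split and cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the real cleaning regular expressions are not shown in this README, so `TAG_RE` is a hypothetical stand-in.

```python
import random
import re

# Hypothetical cleaning pattern; the project's actual regular
# expressions for noise removal are not shown in the README.
TAG_RE = re.compile(r"<[^>]+>")

def clean(text):
    # Strip markup-like noise left over from the platform's formatting.
    return TAG_RE.sub(" ", text).strip()

def split_corpus(pairs, seed=0):
    # 60% train / 20% dev / 20% test, as described above.
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_dev = int(n * 0.6), int(n * 0.2)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])
```
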

  2. Analyzing The system uses Natural Language Processing to tokenize the parallel corpus, producing a tokenized parallel corpus. Stanford's CoreNLP library handles this part of the project.
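
For illustration, tokenization can be approximated with a regular expression. This is only a rough stand-in for what CoreNLP's tokenizer does, not the project's actual code:

```python
import re

def tokenize(sentence):
    # Split into words and punctuation marks, lowercasing for
    # dictionary lookups; a crude approximation of CoreNLP tokenization.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())
```
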

  3. IBM Model 1 Each aligned entry in the parallel corpus pairs a Spanish sentence of l words with an English sentence of m words. Before any computation, an empty dictionary is created to store the word pairs that will arise later; it starts empty because nothing is yet known about how any word in the corpus translates. Likewise, since the correct alignment between the words of each paired sentence is unknown, every sentence pair is given an initial alignment function that enumerates all possible alignments between its words, so each pair starts with l x m candidate alignments. Every word pair produced by an alignment is stored in the dictionary; by the end of the process, the dictionary is expected to hold all possible word pairs together with the likelihood of each.

     At this point the system has l x m alignment proposals for each aligned sentence, and each proposal (WordPair wp) has a computed probability. The system collects into an array the highest-probability alignment proposal of each aligned sentence. Once it has finished extracting these highest-probability pairs, it updates the dictionary with the word pairs stored in each proposal, which also updates the likelihood of each word pair.
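
The training loop described above resembles the standard expectation-maximization procedure for IBM Model 1. A minimal Python sketch of that procedure (not the project's actual implementation) looks like this:

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    # corpus: list of (spanish_tokens, english_tokens) sentence pairs.
    # Initialize t(s|e) uniformly over all co-occurring word pairs,
    # reflecting that nothing is known about translations at the start.
    pairs = {(s, e) for es, en in corpus for s in es for e in en}
    t = defaultdict(float, {p: 1.0 / len(pairs) for p in pairs})

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for each word pair
        total = defaultdict(float)  # expected counts for each English word
        for es, en in corpus:
            for s in es:
                # Normalize over all l x m candidate alignments for this word.
                z = sum(t[(s, e)] for e in en)
                for e in en:
                    delta = t[(s, e)] / z
                    count[(s, e)] += delta
                    total[e] += delta
        # Re-estimate the likelihood of each word pair from the counts.
        for (s, e) in count:
            t[(s, e)] = count[(s, e)] / total[e]
    return t
```

With repeated iterations, word pairs that consistently co-occur (e.g. "la"/"the") accumulate higher likelihoods than coincidental pairings, which is what lets the dictionary converge on plausible translations.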

  4. Translation The system receives the English testing files as input and analyzes them with NLP in the same way it did at the beginning of the pipeline, extracting tokenized English text. Each word of each English sentence is then translated by looking up, in the system's dictionary, the Spanish word with the maximum likelihood for that English word.
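
The lookup step can be sketched as a word-by-word argmax over the trained dictionary. This is an illustrative sketch assuming `t` maps `(spanish, english)` pairs to probabilities, as in the training step; unseen English words are passed through unchanged as a fallback:

```python
def translate(english_tokens, t):
    # For each English word e, pick the Spanish word s that
    # maximizes t[(s, e)]; fall back to the source word if unseen.
    out = []
    for e in english_tokens:
        candidates = [(p, s) for (s, e2), p in t.items() if e2 == e]
        out.append(max(candidates)[1] if candidates else e)
    return out
```
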
