
pytorch-transformer-kor-eng's Introduction

Transformer PyTorch implementation

This repository contains a Transformer implementation used to translate Korean sentences into English.

I used a translation dataset for NMT, but you can apply this model to any sequence-to-sequence (i.e. text generation) task, such as text summarization or response generation.

In this project, I specifically used the Korean-English translation corpus from AI Hub in order to apply torchtext to a Korean dataset.

I also used the soynlp library to tokenize Korean sentences. It is really nice and easy to use, so you should try it if you plan to handle Korean sentences :)

Currently, the lowest validation and test losses are 2.047 and 3.488, respectively.


Overview

  • Number of train data: 92,000
  • Number of validation data: 11,500
  • Number of test data: 11,500
Example: 
{
  'kor': '['부러진', '날개로', '다시한번', '날개짓을', '하라']',
  'eng': '['wings', 'once', 'again', 'with', 'broken', 'wings']'
}
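
The sizes listed above add up to 115,000 sentence pairs, i.e. an 80/10/10 split. A minimal sketch of how such a split could be produced is shown below. The helper name `split_pairs`, the seed, and the placeholder pairs are assumptions made for illustration; they are not taken from this repository.

```python
import random


def split_pairs(pairs, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle sentence pairs and cut them into train/validation/test lists.

    Hypothetical helper: the repository's own preprocessing may differ.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test


# Placeholder (Korean, English) pairs standing in for the real corpus.
pairs = [(f"kor-{i}", f"eng-{i}") for i in range(115_000)]
train, valid, test = split_pairs(pairs)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three subsets stable across runs, which matters when validation and test losses are compared between experiments.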

Requirements

  • The following libraries are fundamental to this repository.
  • You should install PyTorch via the official installation guide.
  • To use the spaCy model that tokenizes English sentences, download the English model by running python -m spacy download en_core_web_sm.
en-core-web-sm==2.1.0
matplotlib==3.1.1
numpy==1.16.4
pandas==0.25.1
scikit-learn==0.21.3
soynlp==0.0.493
spacy==2.1.8
torch==1.2.0
torchtext==0.4.0

Usage

  • Before training the model, you should train the soynlp tokenizer on your training dataset and build the vocabularies using the following command.
  • You can choose the vocabulary size for the Korean and English datasets.
  • In general, a Korean dataset produces a larger vocabulary than an English dataset. To keep the two balanced, you have to choose suitable vocabulary sizes.
  • Running the following command produces tokenizer.pickle, kor.pickle and eng.pickle, which are used to train and test the model and to translate a user's input sentence.
python build_pickle.py --kor_vocab KOREAN_VOCAB_SIZE --eng_vocab ENGLISH_VOCAB_SIZE
  • For training, run main.py in train mode (the default option).
python main.py
  • For testing, run main.py in test mode.
python main.py --mode test
  • For prediction, run predict.py with your Korean input sentence.
  • Don't forget to wrap your input in double quotation marks!
python predict.py --input "YOUR_KOREAN_INPUT"
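
To make the vocabulary-size options of build_pickle.py more concrete, here is a minimal sketch of how a size cap can work: count tokens, keep the most frequent ones, and map everything else to an unknown token. This is not the code in build_pickle.py. The function names, the special tokens, and the choice to add the special tokens on top of the cap are all assumptions made for illustration.

```python
from collections import Counter

# Assumed special tokens; the repository may use different ones.
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]


def build_vocab(tokenized_sentences, max_size):
    """Map the `max_size` most frequent tokens to integer ids.

    Special tokens are added on top of `max_size` (an assumption of this sketch).
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(tokens, vocab):
    """Convert tokens to ids, falling back to <unk> for out-of-vocabulary tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


kor_sentences = [
    ["부러진", "날개로", "다시한번", "날개짓을", "하라"],
    ["날개로", "하라"],
]
kor_vocab = build_vocab(kor_sentences, max_size=2)
print(len(kor_vocab))
print(encode(["날개로", "부러진"], kor_vocab))
```

With a cap of 2, only the two most frequent tokens keep their own ids, and the rarer token "부러진" falls back to the unknown id. A smaller cap shrinks the embedding and output layers, at the cost of more unknown tokens.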

Example

kor> 내일 여자친구를 만나러 가요
eng> I am going to meet my girlfriend tomorrow

kor> 감기 조심하세요
eng> Be careful not to catch a cold

To do

  • Add Beam Search for decoding step
  • Add Label Smoothing technique #1, #2, #3

References

Most of my code is based on the original paper. However, I found that there are differences between the original paper and the practical implementation in the tensor2tensor framework, so I changed parts of the code to follow that implementation and got better results. To follow these changes, you should check out the last reference article.
