
yoruba-adr's Introduction

Automatic Diacritic Restoration of Yorùbá Text

Motivations

Nigeria’s dying languages!

Applications

  • Generating very large, high quality Yorùbá text corpora

    • [physical books] → OCR → [undiacritized text] → ADR → [clean diacritized text]
      Physical books written in Yorùbá (novels, manuals, school books, dictionaries) are digitized via Optical Character Recognition (OCR), which may not fully preserve tonal or orthographic diacritics. The undiacritized text is then processed with ADR to restore the correct diacritics.
    • Correcting digital texts scraped from Twitter, Naija forums, articles, etc.
    • Suggesting corrections during manual text entry (spell/diacritic checker)
  • Preprocessing text for training Yorùbá

    • language models
    • word embeddings
    • text-language identification (so Twitter can stop claiming Yorùbá text is Vietnamese haba!)
    • part-of-speech taggers
    • named-entity recognition
    • text-to-speech (TTS) models (speech synthesis)
    • speech-to-text (STT) models (speech recognition)

Pretrained ADR Models

Datasets

https://github.com/Niger-Volta-LTI/yoruba-text

Train a Yorùbá ADR model

Dependencies

  • Python3 (tested on 3.5, 3.6, 3.7)
  • Install all dependencies: pip3 install -r requirements.txt

We train models on an Amazon EC2 p2.xlarge instance running the Deep Learning AMI (Ubuntu) Version 5.0 (ami-c27af5ba). These machine images (AMIs) come with Python 3, PyTorch, and CUDA pre-installed for training on the GPU. We use the OpenNMT-py framework for training and restoration.

  • To install PyTorch 0.4 manually, follow instructions for your {OS, package manager, python, CUDA} versions

  • git clone https://github.com/Niger-Volta-LTI/yoruba-adr.git

  • git clone https://github.com/Niger-Volta-LTI/yoruba-text.git

  • Install dependencies: pip3 install -r requirements.txt

  • Note that NLTK will need some extra hand-holding if you've installed it for the first time:

     Resource punkt not found.
     Please use the NLTK Downloader to obtain the resource:
    
     >>> import nltk
     >>> nltk.download('punkt')
    

Training an ADR sequence-to-sequence model

To start data prep and training of the Bahdanau-style soft-attention model, execute a training script from the top-level directory: ./01_run_training.sh or ./01_run_training_transformer.sh
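
For orientation, here is a minimal sketch of the kind of OpenNMT-py pipeline such a script drives. The entry points, data paths, and flags below are assumptions (the real scripts derive their inputs from the yoruba-text checkout and may use different options); see 01_run_training.sh for what is actually run.

    # Assumed paths and flags; not the literal contents of 01_run_training.sh.
    python3 preprocess.py -train_src data/train.src -train_tgt data/train.tgt \
                          -valid_src data/valid.src -valid_tgt data/valid.tgt \
                          -save_data data/demo

    # Add encoder/attention options as the script configures them; -gpuid 0 for GPU training
    python3 train.py -data data/demo -save_model models/yo_adr -gpuid 0

    # Restore diacritics on undiacritized input text (placeholder checkpoint name)
    python3 translate.py -model models/yo_adr_checkpoint.pt \
                         -src data/test.src -output pred.txt -replace_unk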

Learn more

yoruba-adr's People

Contributors

ruohoruotsi


yoruba-adr's Issues

Module nltk not found

Can you help with this? I am using Python 3, and whenever I run the .sh script in the terminal, the error "module nltk not found" appears, even though I have installed NLTK.

[FIX] training script for python3.5

The paved road for this project is Python 3.6 (soon 3.7), but we should support Python 3.5. @dadelani has reported some errors with the training script. Ensure that we retain backward compatibility with Python 3.5.

I tried running the code on our server with Python 3.5 and it gave some encoding errors. I tried it on another system with Python 3.6 and it works. Since I don't have root permission, I need to find a way to make the code work with Python 3.5.

    Traceback (most recent call last):
      File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 172, in <module>
        main()
      File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 159, in main
        examples = list(make_data(ARGS.source_file, ARGS.min_len, ARGS.max_len))
      File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 111, in make_data
        print("Skipping: " + line2)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)
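
One possible workaround for machines whose locale forces an ASCII stdout (an assumption about the root cause, not necessarily the fix that was adopted): force UTF-8 output, either with PYTHONIOENCODING=utf-8 in the environment or by rewrapping stdout near the top of make_parallel_text.py.

    # Assumed workaround: ensure stdout is UTF-8 so print() of Yorùbá text
    # cannot raise UnicodeEncodeError under an ASCII locale (works on Python 3.5+).
    import io
    import sys

    if (sys.stdout.encoding or '').lower() not in ('utf-8', 'utf8'):
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')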

Tune ADR decoder parameters

Fine-tune the seq2seq decoder parameters (like beam width) for the ADR task. This also includes error analysis on the validation & test sets so we have a deep understanding of the model's performance.

Please refer to this document from the CMU-LTI: https://github.com/neubig/nmt-tips
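
As a starting point, a hedged sketch of sweeping the beam width with OpenNMT-py's translate.py; the checkpoint and data paths are placeholders, and exact flag names should be checked against the installed version.

    # Assumed checkpoint/data paths; -beam_size is OpenNMT-py's beam-width option.
    for BEAM in 1 2 5 10 20; do
        python3 translate.py -model models/yo_adr_checkpoint.pt \
                             -src data/valid.src -output pred_beam_${BEAM}.txt \
                             -beam_size ${BEAM} -replace_unk
        # score pred_beam_${BEAM}.txt against data/valid.tgt with the project's scorer
    done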

Remove OpenNMT-py code

  • Since OpenNMT-py is now on PyPI, we don't need to keep a full fork of the src in src/onmt

  • We do need scorers and other utilities (like code to prepare a model for release, stripping out optimizer info and keeping only model weights and biases)

TODO:
Refactor the top-level scripts and the code in src to use a pip-installed OpenNMT-py for training and evaluation, keeping only the custom scoring and utils source where necessary.
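
A minimal sketch of the release-preparation step mentioned above, assuming the checkpoint is the usual OpenNMT-py dictionary that stores optimizer state under an 'optim' key (verify the key names against the version in use; the paths are placeholders).

    # Drop the optimizer state from a saved checkpoint, keeping only what is
    # needed for inference; this typically shrinks the file substantially.
    import torch

    checkpoint = torch.load('models/yo_adr_checkpoint.pt', map_location='cpu')
    checkpoint.pop('optim', None)   # assumed key for the optimizer state
    torch.save(checkpoint, 'models/yo_adr_release.pt')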

[ADD] Travis CI

Add Travis CI (or another continuous integration system) so that when we push breaking changes, say to dependencies like yoruba-text or OpenNMT-py, we catch issues early.

Also just good practice for a confidence-inspiring open-source project. Confam!

Prepare partially diacritized input dataset

To more easily normalize Yorùbá Wikipedia articles, create a partially diacritized dataset with diacritic marks below the vowels.

The dataset can be used in the following ways:

  1. Train on partially diacritized text, i.e. sentences with correct under-marks as input and the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If this gives very high accuracy, we can then consider:
  2. Training a model that maps non-diacritized text to partially diacritized text, and from that output training the fully diacritized text, i.e. [non-diacritized text] → [partially diacritized text] → [fully diacritized text]

Motivation:
From my observations of how Yorùbá text is written, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) above the vowel letters, but many people know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially with the availability of Google Gboard on Android phones.
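
A minimal sketch of producing such partially diacritized input by stripping the tonal marks while keeping the under-dots; the set of combining marks to drop (grave U+0300, acute U+0301, macron U+0304) is an assumption about the orthography being targeted.

    # Keep under-dot characters (ẹ, ọ, ṣ) but remove tonal marks, producing
    # the "partially diacritized" form described above.
    import unicodedata

    TONAL_MARKS = {'\u0300', '\u0301', '\u0304'}  # grave, acute, macron

    def strip_tonal_marks(text):
        decomposed = unicodedata.normalize('NFD', text)
        kept = ''.join(ch for ch in decomposed if ch not in TONAL_MARKS)
        return unicodedata.normalize('NFC', kept)

    print(strip_tonal_marks('Yorùbá ọmọdé'))  # -> Yoruba ọmọde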

Can I train a model with a larger text corpus?

I have trained with 8 million sentences and it works well.
But with a larger corpus (more than 5 times larger) I have a problem with memory. How do I deal with it? Which parameter do I have to change?
I use a Tesla K40m with 12 GiB of memory.
Thank you ;)
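
Not an authoritative answer, but two OpenNMT-py knobs that commonly help (flag names assumed from the 0.x CLI; the data paths are placeholders): shard the preprocessed data so it is not all loaded at once, and lower the training batch size to fit GPU memory.

    # Placeholder paths; check preprocess.py -h / train.py -h for the exact flags.
    python3 preprocess.py -train_src data/train.src -train_tgt data/train.tgt \
                          -valid_src data/valid.src -valid_tgt data/valid.tgt \
                          -save_data data/demo -shard_size 500000

    python3 train.py -data data/demo -save_model models/yo_adr \
                     -batch_size 32 -gpuid 0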

[FIX] drop-out option warning

Fix the drop-out option warning, especially as it's distracting and muddies the output of the Jupyter prediction notebook.

/Users/iroro/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
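
The warning is benign with a single recurrent layer: PyTorch simply ignores the between-layer dropout. A hedged sketch of two ways to silence it via the OpenNMT-py training options (flag names assumed from the 0.x CLI; the training scripts may set these elsewhere):

    # Option 1: use two recurrent layers so the between-layer dropout applies
    python3 train.py -data data/demo -save_model models/yo_adr -layers 2

    # Option 2: keep one layer and set the (ignored) RNN dropout to zero
    python3 train.py -data data/demo -save_model models/yo_adr -dropout 0.0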

Issues with reproduction

I'm having some challenges reproducing the code on my local machine. The issue seems to be an error with torchtext, but I honestly can't figure out what exactly is causing it.

Here's a stacktrace.
[screenshot of the torchtext stack trace]

To confirm it's not an environment issue, I tried running on Google Colab too, but the same issue comes up, so it doesn't seem to be a versioning issue.

[screenshot of the same error on Google Colab]

Do you have any idea what I might be missing?

Are there any limitations on the size of the vocab?

With small data there is nothing strange about a gap between the src vocab size and the tgt vocab size.
But with larger data, is something wrong if the src vocab size is 50002 and the tgt vocab size is 50004?
I think they should be bigger.
Thank you.

...
[2019-03-05 17:58:34,600 INFO]  * reloading ./data/demo.train.8.pt.
[2019-03-05 17:58:36,239 INFO]  * tgt vocab size: 50004.
[2019-03-05 17:58:36,661 INFO]  * src vocab size: 50002.
[INFO] running Bahdanau seq2seq training, for GPU training add: -gpuid 0 
[2019-03-05 17:58:40,955 INFO]  * src vocab size = 50002
[2019-03-05 17:58:40,956 INFO]  * tgt vocab size = 50004
[2019-03-05 17:58:40,956 INFO] Building model...
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
[2019-03-05 17:58:42,292 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 128, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(628, 128)
      )
    )
    (attn): GlobalAttention(
      (linear_out): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=128, out_features=50004, bias=True)
    (1): LogSoftmax()
  )
)
[2019-03-05 17:58:42,292 INFO] encoder: 25323560
[2019-03-05 17:58:42,292 INFO] decoder: 31873380
...


[FIX] ADR model size

The ADR model is too big.

  • The training script emits a ~200MB model, but normal people cannot be downloading 200MB, haba!!
  • What makes up the large size? My suspicion is that PyTorch is also saving other data along with the weights/biases. Investigate and optimize the size of the model so that we can store it either locally (within GitHub's limits) or at least make it an easier download.

Ìrànlọ́wọ́:

Reduce model sizes in preparation for Productization

  • For SageMaker the model size is too big.
  • Use the model release preparation code to reduce the size, so that we don't incur additional expenses on AWS.
  • Apply this to all trained models. It will be interesting to see how big the Transformers end up being.

[ADD] enhancements for new training session

Add enhancements to the model that include:

  • new data from the text reserve (TImi_Wuraola text, new books, dictionaries & proverbs, 1–5 grams, name decompositions such as Agbanilolúwa ==> a-gba-ẹni-ni-olúwa from Yorùbá Name), taking the Yorùbá word vocabulary as input to constrain predictions to that canonical set.

  • During prediction there can be a lookup (perhaps best implemented in Ìrànlọ́wọ́) that validates that an entered word is in the dictionary, and either rejects it or looks up a nearest neighbour from a pretrained text embedding (see the sketch after this list).

  • Prepare Iroyin as a validation dataset

  • Once training dataprep is complete, hand over to David to retrain on his GPU.

  • Twitter Yorùbá scraper for conversational text (create a new Ìrànlọ́wọ́ issue when we get here)
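
A minimal, stdlib-only sketch of the dictionary lookup mentioned above; the word-list path is hypothetical, and difflib's similarity matching stands in for the embedding nearest-neighbour lookup.

    # Hypothetical post-processing: accept a predicted word only if it is in a
    # canonical Yorùbá word list, otherwise suggest the closest known entry.
    import difflib

    def load_vocabulary(path='data/yoruba_wordlist.txt'):   # hypothetical path
        with open(path, encoding='utf-8') as f:
            return {line.strip() for line in f if line.strip()}

    def validate_word(word, vocabulary):
        if word in vocabulary:
            return word
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        return matches[0] if matches else word   # fall back to the raw prediction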

For reference, this Slack thread captures all of the detailed discussion about next steps:
https://yorubaname.slack.com/archives/C16A699LY/p1564362548029800

Training in new languages

I want to train in a different language. What parts do I have to modify: the data, tokenizer, vocabulary set, ...?
Thank you for your response.
