
yoruba-adr's Introduction

Automatic Diacritic Restoration of Yorùbá Text

Motivations

Nigeria’s dying languages!

Applications

  • Generating very large, high quality Yorùbá text corpora

    • [physical books] → OCR → [undiacritized text] → ADR → [clean diacritized text]
      Physical books written in Yorùbá (novels, manuals, school books, dictionaries) are digitized via Optical Character Recognition (OCR), which may not fully preserve tonal or orthographic diacritics. The undiacritized text is then processed with ADR to restore the correct diacritics.
    • Correcting digital texts scraped from Twitter, Naija forums, articles, etc.
    • Suggesting corrections during manual text entry (spell/diacritic checker)
  • Preprocessing text for training Yorùbá

    • language models
    • word embeddings
    • text-language identification (so Twitter can stop claiming Yorùbá text is Vietnamese haba!)
    • part-of-speech taggers
    • named-entity recognition
    • text-to-speech (TTS) models (speech synthesis)
    • speech-to-text (STT) models (speech recognition)

Pretrained ADR Models

Datasets

https://github.com/Niger-Volta-LTI/yoruba-text

Train a Yorùbá ADR model

Dependencies

  • Python3 (tested on 3.5, 3.6, 3.7)
  • Install all dependencies: pip3 install -r requirements.txt

We train models on an Amazon EC2 p2.xlarge instance running the Deep Learning AMI (Ubuntu) Version 5.0 (ami-c27af5ba). These machine images (AMIs) come with Python 3, PyTorch, and CUDA pre-installed for training on the GPU. We use the OpenNMT-py framework for training and restoration.

  • To install PyTorch 0.4 manually, follow instructions for your {OS, package manager, python, CUDA} versions

  • git clone https://github.com/Niger-Volta-LTI/yoruba-adr.git

  • git clone https://github.com/Niger-Volta-LTI/yoruba-text.git

  • Install dependencies: pip3 install -r requirements.txt

  • Note that NLTK will need some extra hand-holding if you've installed it for the first time:

     Resource punkt not found.
     Please use the NLTK Downloader to obtain the resource:
    
     >>> import nltk
     >>> nltk.download('punkt')
    

Training an ADR sequence-to-sequence model

To start data prep and training of the Bahdanau-style soft-attention model, execute a training script from the top-level directory: ./01_run_training.sh or ./01_run_training_transformer.sh
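
For orientation, here is a minimal sketch of the kind of OpenNMT-py pipeline such a script drives. The entry points, data paths, and flags below are assumptions (the real scripts derive their inputs from the yoruba-text checkout and may use different options); see 01_run_training.sh for what is actually run.

    # Assumed paths and flags; not the literal contents of 01_run_training.sh.
    python3 preprocess.py -train_src data/train.src -train_tgt data/train.tgt \
                          -valid_src data/valid.src -valid_tgt data/valid.tgt \
                          -save_data data/demo

    # Add encoder/attention options as the script configures them; -gpuid 0 for GPU training
    python3 train.py -data data/demo -save_model models/yo_adr -gpuid 0

    # Restore diacritics on undiacritized input text (placeholder checkpoint name)
    python3 translate.py -model models/yo_adr_checkpoint.pt \
                         -src data/test.src -output pred.txt -replace_unk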

Learn more

yoruba-adr's People

Contributors

ruohoruotsi


yoruba-adr's Issues

Module nltk not found

Can you help with this? I am using Python 3, and whenever I run the .sh script in the terminal, the error "module nltk not found" appears, even though I have installed NLTK.

[FIX] training script for python3.5

The paved road for this project is Python 3.6 (soon 3.7), but we should support Python 3.5. @dadelani has reported some errors with the training script. Ensure that we retain backward compatibility with Python 3.5.

I tried running the code on our server with Python 3.5 and it gave some encoding errors. I tried it on another system with Python 3.6 and it works. Since I don't have root permission, I need to find a way to make the code work with Python 3.5.

    Traceback (most recent call last):
      File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 172, in <module>
        main()
      File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 159, in main
        examples = list(make_data(ARGS.source_file, ARGS.min_len, ARGS.max_len))
      File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 111, in make_data
        print("Skipping: " + line2)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)
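
One possible workaround for machines whose locale forces an ASCII stdout (an assumption about the root cause, not necessarily the fix that was adopted): force UTF-8 output, either with PYTHONIOENCODING=utf-8 in the environment or by rewrapping stdout near the top of make_parallel_text.py.

    # Assumed workaround: ensure stdout is UTF-8 so print() of Yorùbá text
    # cannot raise UnicodeEncodeError under an ASCII locale (works on Python 3.5+).
    import io
    import sys

    if (sys.stdout.encoding or '').lower() not in ('utf-8', 'utf8'):
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')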

Tune ADR decoder parameters

Fine-tune the seq2seq decoder parameters (like beam width) for the ADR task. This also includes error analysis on the validation & test sets so we have a deep understanding of the model's performance.

Please refer to this document from the CMU-LTI: https://github.com/neubig/nmt-tips
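
As a starting point, a hedged sketch of sweeping the beam width with OpenNMT-py's translate.py; the checkpoint and data paths are placeholders, and exact flag names should be checked against the installed version.

    # Assumed checkpoint/data paths; -beam_size is OpenNMT-py's beam-width option.
    for BEAM in 1 2 5 10 20; do
        python3 translate.py -model models/yo_adr_checkpoint.pt \
                             -src data/valid.src -output pred_beam_${BEAM}.txt \
                             -beam_size ${BEAM} -replace_unk
        # score pred_beam_${BEAM}.txt against data/valid.tgt with the project's scorer
    done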

Remove OpenNMT-py code

  • Since OpenNMT-py is now on PyPI, we don't need to keep a full fork of the src in src/onmt

  • We do need scorers and other utilities (like code to prepare a model for release, stripping out optimizer info and keeping only model weights and biases)

TODO:
Refactor the top-level scripts and the code in src to use a pip-installed OpenNMT-py for training and evaluation, keeping only the custom scoring and utils source where necessary.
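
A minimal sketch of the release-preparation step mentioned above, assuming the checkpoint is the usual OpenNMT-py dictionary that stores optimizer state under an 'optim' key (verify the key names against the version in use; the paths are placeholders).

    # Drop the optimizer state from a saved checkpoint, keeping only what is
    # needed for inference; this typically shrinks the file substantially.
    import torch

    checkpoint = torch.load('models/yo_adr_checkpoint.pt', map_location='cpu')
    checkpoint.pop('optim', None)   # assumed key for the optimizer state
    torch.save(checkpoint, 'models/yo_adr_release.pt')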

[ADD] Travis CI

Add Travis CI (or another continuous integration system) so that when we push breaking changes, say to dependencies like yoruba-text or OpenNMT-py, we catch issues early.

Also just good practice for a confidence-inspiring open-source project. Confam!

Prepare partially diacritized input dataset

To more easily normalize Yorùbá Wikipedia articles, create a partially diacritized dataset with diacritic marks below the vowels.

The dataset can be used in the following ways:

  1. Train on partially diacritized text, i.e. sentences with correct under-marks as input and the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If this gives very high accuracy, we can then consider:
  2. Training a model that maps non-diacritized text to partially diacritized text, and from that output training the fully diacritized text, i.e. [non-diacritized text] → [partially diacritized text] → [fully diacritized text]

Motivation:
From my observations of how Yorùbá text is written, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) above the vowel letters, but many people know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially with the availability of Google Gboard on Android phones.
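
A minimal sketch of producing such partially diacritized input by stripping the tonal marks while keeping the under-dots; the set of combining marks to drop (grave U+0300, acute U+0301, macron U+0304) is an assumption about the orthography being targeted.

    # Keep under-dot characters (ẹ, ọ, ṣ) but remove tonal marks, producing
    # the "partially diacritized" form described above.
    import unicodedata

    TONAL_MARKS = {'\u0300', '\u0301', '\u0304'}  # grave, acute, macron

    def strip_tonal_marks(text):
        decomposed = unicodedata.normalize('NFD', text)
        kept = ''.join(ch for ch in decomposed if ch not in TONAL_MARKS)
        return unicodedata.normalize('NFC', kept)

    print(strip_tonal_marks('Yorùbá ọmọdé'))  # -> Yoruba ọmọde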

Can I train a model with a larger text corpus?

I have trained with 8 million sentences and it works well.
But with a larger corpus (more than 5 times larger) I have a problem with memory. How do I deal with it? Which parameter do I have to change?
I use a Tesla K40m with 12 GiB of memory.
Thank you ;)
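
Not an authoritative answer, but two OpenNMT-py knobs that commonly help (flag names assumed from the 0.x CLI; the data paths are placeholders): shard the preprocessed data so it is not all loaded at once, and lower the training batch size to fit GPU memory.

    # Placeholder paths; check preprocess.py -h / train.py -h for the exact flags.
    python3 preprocess.py -train_src data/train.src -train_tgt data/train.tgt \
                          -valid_src data/valid.src -valid_tgt data/valid.tgt \
                          -save_data data/demo -shard_size 500000

    python3 train.py -data data/demo -save_model models/yo_adr \
                     -batch_size 32 -gpuid 0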

[FIX] drop-out option warning

Fix the drop-out option warning, especially as it's distracting and muddies the output of the Jupyter prediction notebook.

/Users/iroro/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
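
The warning is benign with a single recurrent layer: PyTorch simply ignores the between-layer dropout. A hedged sketch of two ways to silence it via the OpenNMT-py training options (flag names assumed from the 0.x CLI; the training scripts may set these elsewhere):

    # Option 1: use two recurrent layers so the between-layer dropout applies
    python3 train.py -data data/demo -save_model models/yo_adr -layers 2

    # Option 2: keep one layer and set the (ignored) RNN dropout to zero
    python3 train.py -data data/demo -save_model models/yo_adr -dropout 0.0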

Issues with reproduction

I'm having some challenges reproducing the code on my local machine. The issue seems to be an error with torchtext, but I honestly can't figure out what exactly is causing it.

Here's a stacktrace.
[screenshot of the torchtext stack trace]

To confirm it's not an environment issue, I tried running on Google Colab too, but the same issue comes up, so it doesn't seem to be a versioning issue.

[screenshot of the same error on Google Colab]

Do you have any idea what I might be missing?

Are there any limitations on the size of the vocab?

With small data there is nothing strange about a gap between the src vocab size and the tgt vocab size.
But with larger data, is something wrong if the src vocab size is 50002 and the tgt vocab size is 50004?
I think they should be bigger.
Thank you.

...
[2019-03-05 17:58:34,600 INFO]  * reloading ./data/demo.train.8.pt.
[2019-03-05 17:58:36,239 INFO]  * tgt vocab size: 50004.
[2019-03-05 17:58:36,661 INFO]  * src vocab size: 50002.
[INFO] running Bahdanau seq2seq training, for GPU training add: -gpuid 0 
[2019-03-05 17:58:40,955 INFO]  * src vocab size = 50002
[2019-03-05 17:58:40,956 INFO]  * tgt vocab size = 50004
[2019-03-05 17:58:40,956 INFO] Building model...
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
[2019-03-05 17:58:42,292 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 128, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(628, 128)
      )
    )
    (attn): GlobalAttention(
      (linear_out): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=128, out_features=50004, bias=True)
    (1): LogSoftmax()
  )
)
[2019-03-05 17:58:42,292 INFO] encoder: 25323560
[2019-03-05 17:58:42,292 INFO] decoder: 31873380
...


[FIX] ADR model size

The ADR model is too big.

  • The training script emits a ~200MB model, but normal people cannot be downloading 200MB, haba!!
  • What makes up the large size? My suspicion is that PyTorch is also saving other data along with the weights/biases. Investigate and optimize the size of the model so that we can store it either locally (within GitHub's limits) or at least make it an easier download.

Ìrànlọ́wọ́:

Reduce model sizes in preparation for Productization

  • For SageMaker the model size is too big.
  • Use the model release preparation code to reduce the size, so that we don't incur additional expenses on AWS.
  • Apply this to all trained models. It will be interesting to see how big the Transformers end up being.

[ADD] enhancements for new training session

Add enhancements to the model that include:

  • new data from the text reserve (TImi_Wuraola text, new books, dictionaries & proverbs, 1–5 grams, name decompositions such as Agbanilolúwa ==> a-gba-ẹni-ni-olúwa from Yorùbá Name), taking the Yorùbá word vocabulary as input to constrain predictions to that canonical set.

  • During prediction there can be a lookup (perhaps best implemented in Ìrànlọ́wọ́) that validates that an entered word is in the dictionary, and either rejects it or looks up a nearest neighbour from a pretrained text embedding (see the sketch after this list).

  • Prepare Iroyin as a validation dataset

  • Once training dataprep is complete, hand over to David to retrain on his GPU.

  • Twitter Yorùbá scraper for conversational text (create a new Ìrànlọ́wọ́ issue when we get here)
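
A minimal, stdlib-only sketch of the dictionary lookup mentioned above; the word-list path is hypothetical, and difflib's similarity matching stands in for the embedding nearest-neighbour lookup.

    # Hypothetical post-processing: accept a predicted word only if it is in a
    # canonical Yorùbá word list, otherwise suggest the closest known entry.
    import difflib

    def load_vocabulary(path='data/yoruba_wordlist.txt'):   # hypothetical path
        with open(path, encoding='utf-8') as f:
            return {line.strip() for line in f if line.strip()}

    def validate_word(word, vocabulary):
        if word in vocabulary:
            return word
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        return matches[0] if matches else word   # fall back to the raw prediction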

For reference, this Slack thread captures all of the detailed discussion about next steps:
https://yorubaname.slack.com/archives/C16A699LY/p1564362548029800

Training in new languages

I want to train in a different language. What parts do I have to modify: the data, tokenizer, vocabulary set, ...?
Thank you for your response.
