Natural Language Toolkit for Indic Languages (iNLTK)

iNLTK aims to provide out-of-the-box support for the NLP tasks that an application developer might need for Indic languages.

Documentation

Check out the detailed docs, along with installation instructions, at https://inltk.readthedocs.io

Supported languages

Language Code
Hindi hi
Punjabi pa
Gujarati gu
Kannada kn
Malayalam ml
Oriya or
Marathi mr
Bengali bn
Tamil ta
Urdu ur
Nepali ne
Sanskrit sa
English en
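
The codes in the table above are the short identifiers used to select a language. As a minimal sketch (the helper names `language_code` and `is_supported` are hypothetical and not part of iNLTK), the table can be kept as a small lookup so that an application validates user input before handing a code to the library:

```python
# Language names and codes copied from the "Supported languages" table.
SUPPORTED_LANGUAGES = {
    "Hindi": "hi",
    "Punjabi": "pa",
    "Gujarati": "gu",
    "Kannada": "kn",
    "Malayalam": "ml",
    "Oriya": "or",
    "Marathi": "mr",
    "Bengali": "bn",
    "Tamil": "ta",
    "Urdu": "ur",
    "Nepali": "ne",
    "Sanskrit": "sa",
    "English": "en",
}


def language_code(name: str) -> str:
    """Return the code for a language name (case-insensitive).

    Hypothetical helper for illustration; raises ValueError for
    languages that are not in the table.
    """
    lookup = {key.lower(): value for key, value in SUPPORTED_LANGUAGES.items()}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {name!r}") from None


def is_supported(code: str) -> bool:
    """Check whether a code appears in the table."""
    return code in SUPPORTED_LANGUAGES.values()


print(language_code("Tamil"))  # prints: ta
```

See the docs linked above for the actual iNLTK functions that accept these codes.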

Repositories containing models used in iNLTK

| Language | Repository | Dataset used for Language modeling | Perplexity of ULMFiT LM (on validation set) | Perplexity of TransformerXL LM (on validation set) | Dataset used for Classification | Classification: Test set Accuracy | Classification: Test set MCC | Classification: Notebook for Reproducibility | ULMFiT Embeddings visualization | TransformerXL Embeddings visualization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hindi | NLP for Hindi | Hindi Wikipedia Articles - 172k; Hindi Wikipedia Articles - 55k | 34.06; 35.87 | 26.09; 34.78 | BBC News Articles; IIT Patna Movie Reviews; IIT Patna Product Reviews | 78.75; 57.74; 75.71 | 71.61; 37.23; 59.76 | Notebook; Notebook; Notebook | Hindi Embeddings projection | Hindi Embeddings projection |
| Bengali | NLP for Bengali | Bengali Wikipedia Articles | 41.2 | 39.3 | Bengali News Articles (Soham Articles) | 90.71 | 87.92 | Notebook | Bengali Embeddings projection | Bengali Embeddings projection |
| Gujarati | NLP for Gujarati | Gujarati Wikipedia Articles | 34.12 | 28.12 | iNLTK Headlines Corpus - Gujarati | 91.05 | 86.09 | Notebook | Gujarati Embeddings projection | Gujarati Embeddings projection |
| Malayalam | NLP for Malayalam | Malayalam Wikipedia Articles | 26.39 | 25.79 | iNLTK Headlines Corpus - Malayalam | 95.56 | 93.29 | Notebook | Malayalam Embeddings projection | Malayalam Embeddings projection |
| Marathi | NLP for Marathi | Marathi Wikipedia Articles | 18 | 17.42 | iNLTK Headlines Corpus - Marathi | 92.40 | 85.23 | Notebook | Marathi Embeddings projection | Marathi Embeddings projection |
| Tamil | NLP for Tamil | Tamil Wikipedia Articles | 19.80 | 17.22 | iNLTK Headlines Corpus - Tamil | 95.22 | 92.70 | Notebook | Tamil Embeddings projection | Tamil Embeddings projection |
| Punjabi | NLP for Punjabi | Punjabi Wikipedia Articles | 24.40 | 14.03 | IndicNLP News Article Classification Dataset - Punjabi | 97.12 | 96.17 | Notebook | Punjabi Embeddings projection | Punjabi Embeddings projection |
| Kannada | NLP for Kannada | Kannada Wikipedia Articles | 70.10 | 61.97 | IndicNLP News Article Classification Dataset - Kannada | 98.87 | 98.30 | Notebook | Kannada Embeddings projection | Kannada Embeddings projection |
| Oriya | NLP for Oriya | Oriya Wikipedia Articles | 26.57 | 26.81 | IndicNLP News Article Classification Dataset - Oriya | 98.83 | 98.44 | Notebook | Oriya Embeddings Projection | Oriya Embeddings Projection |
| Sanskrit | NLP for Sanskrit | Sanskrit Wikipedia Articles | ~6 | ~3 | Sanskrit Shlokas Dataset | 84.3 (valid set) | | | Sanskrit Embeddings projection | Sanskrit Embeddings projection |
| Nepali | NLP for Nepali | Nepali Wikipedia Articles | 31.5 | 29.3 | Nepali News Dataset | 98.5 (valid set) | | | Nepali Embeddings projection | Nepali Embeddings projection |
| Urdu | NLP for Urdu | Urdu Wikipedia Articles | 13.19 | 12.55 | Urdu News Dataset | 95.28 (valid set) | | | Urdu Embeddings projection | Urdu Embeddings projection |

Note: The English model has been taken directly from fast.ai.
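
For readers unfamiliar with the metrics reported in the table above, here is a minimal sketch of how perplexity, accuracy, and the Matthews correlation coefficient (MCC) are conventionally defined. It uses the standard textbook definitions and toy labels. It is not taken from the iNLTK repositories, whose exact evaluation code may differ. The table's accuracy and MCC values appear to be scaled to percentages, which is an assumption on my part.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Reduces to the familiar binary formula when there are two classes.
    Returns 0.0 when the denominator is zero (e.g. a single predicted class).
    """
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)

    numerator = correct * total - sum(
        true_counts[k] * pred_counts[k] for k in classes
    )
    denom = math.sqrt(
        (total ** 2 - sum(v * v for v in pred_counts.values()))
        * (total ** 2 - sum(v * v for v in true_counts.values()))
    )
    return numerator / denom if denom else 0.0


# Toy labels, for illustration only (not from any dataset in the table).
y_true = ["sports", "india", "india", "sports"]
y_pred = ["sports", "india", "sports", "sports"]
print(f"accuracy: {100 * accuracy(y_true, y_pred):.2f}")
print(f"MCC:      {100 * mcc(y_true, y_pred):.2f}")
```

MCC takes class imbalance into account, which is why it can sit well below accuracy on the same test set, as in the Hindi rows above.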

Effect of using Transfer Learning + Data-Augmentation from iNLTK

| Language | Repository | Dataset used for Classification | Results on using complete training set | Percentage Decrease in Training set size | Results on using reduced training set without Data Aug | Results on using reduced training set with Data Aug |
| --- | --- | --- | --- | --- | --- | --- |
| Hindi | NLP for Hindi | IIT Patna Movie Reviews | Accuracy: 57.74; MCC: 37.23 | 80% (2480 -> 496) | Accuracy: 47.74; MCC: 20.50 | Accuracy: 56.13; MCC: 34.39 |
| Bengali | NLP for Bengali | Bengali News Articles (Soham Articles) | Accuracy: 90.71; MCC: 87.92 | 99% (11284 -> 112) | Accuracy: 69.88; MCC: 61.56 | Accuracy: 74.06; MCC: 65.08 |
| Gujarati | NLP for Gujarati | iNLTK Headlines Corpus - Gujarati | Accuracy: 91.05; MCC: 86.09 | 90% (5269 -> 526) | Accuracy: 80.88; MCC: 70.18 | Accuracy: 81.03; MCC: 70.44 |
| Malayalam | NLP for Malayalam | iNLTK Headlines Corpus - Malayalam | Accuracy: 95.56; MCC: 93.29 | 90% (5036 -> 503) | Accuracy: 82.38; MCC: 73.47 | Accuracy: 84.29; MCC: 76.36 |
| Marathi | NLP for Marathi | iNLTK Headlines Corpus - Marathi | Accuracy: 92.40; MCC: 85.23 | 95% (9672 -> 483) | Accuracy: 84.13; MCC: 68.59 | Accuracy: 84.55; MCC: 69.11 |
| Tamil | NLP for Tamil | iNLTK Headlines Corpus - Tamil | Accuracy: 95.22; MCC: 92.70 | 95% (5346 -> 267) | Accuracy: 86.25; MCC: 79.42 | Accuracy: 89.84; MCC: 84.63 |

For more details on the implementation, or to reproduce the results, check out the respective repositories.
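
The "percentage decrease" column and the effect of data augmentation in the table above can be checked with simple arithmetic. The sketch below copies the training-set sizes and accuracies from that table. The function names are hypothetical helpers for illustration, not part of iNLTK.

```python
def percentage_decrease(full_size, reduced_size):
    """Percentage by which the training set was shrunk."""
    return 100.0 * (full_size - reduced_size) / full_size


def augmentation_gain(without_aug, with_aug):
    """Accuracy points recovered by adding data augmentation."""
    return with_aug - without_aug


# (full size, reduced size, accuracy without aug, accuracy with aug),
# copied from the table above.
rows = {
    "Hindi": (2480, 496, 47.74, 56.13),
    "Bengali": (11284, 112, 69.88, 74.06),
    "Gujarati": (5269, 526, 80.88, 81.03),
    "Malayalam": (5036, 503, 82.38, 84.29),
    "Marathi": (9672, 483, 84.13, 84.55),
    "Tamil": (5346, 267, 86.25, 89.84),
}

for language, (full, reduced, acc_plain, acc_aug) in rows.items():
    print(
        f"{language}: training set {percentage_decrease(full, reduced):.1f}% smaller, "
        f"{augmentation_gain(acc_plain, acc_aug):+.2f} accuracy points with augmentation"
    )
```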

Contributing

Add support for a new language

If you would like to add support for a language of your choice to iNLTK, please start by checking for an existing issue, or raising a new one, here

Please check out the steps I mentioned here for Telugu to begin with. They should be nearly the same for other languages.

Improving models/using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repositories in the table above for the language of your choice. Those repositories contain links to the datasets, pretrained models, classifiers, and all of the code.

Add new functionality

If you would like a particular piece of functionality in iNLTK, start by checking for an existing issue, or raising a new one, here

What's next

..and being worked upon

Shout out if you want to help :)

..and NOT being worked upon

Shout out if you want to lead :)
