word-spell-checker's Introduction

Urdu Spell Corrector

This is an Urdu spell corrector based on the Noisy Channel Model, implemented in Python 3.

It involves the following steps:

  • Train a bigram language model on the language corpus (jang.txt).
    biwordCount(word[0], word[1]) / unigramCount(word[0])
  • For every error word in the error words corpus (jang_errors.txt), find the candidate words that are one or two edits away from it. Use the dictionary (wordlist.txt) to reduce the search space, i.e. discard candidates that are not valid words.
    (newBiwordCount(word[0], word[1]) + 1) / (unigramCount(word[0]) + lengthOf(‘jang.txt’))
  • For every error word, rank the candidate words by the prior probability obtained from the language model (jang_nonerrors.txt).

  • For prediction and correction, select the top 10 candidate words for each error word. If any of the 10 words appears in ‘jang_nonerrors.txt’ at the same position in the same sentence, that true word is highlighted. Otherwise, all candidate words are listed with their probabilities.
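
The training and ranking steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's code: the function names (`train_bigram_model`, `smoothed_bigram_prob`, `rank_candidates`) are hypothetical, the corpus is a toy English one, and the smoothing denominator adds the vocabulary size, which is an assumption standing in for the `lengthOf(‘jang.txt’)` term.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def smoothed_bigram_prob(prev_word, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev_word).

    Assumption: the denominator adds the vocabulary size.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab_size)


def rank_candidates(prev_word, candidates, unigrams, bigrams, top_n=10):
    """Score each candidate given the preceding word and keep the best top_n."""
    scored = [
        (cand, smoothed_bigram_prob(prev_word, cand, unigrams, bigrams))
        for cand in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy corpus standing in for jang.txt
corpus = ["the cat sat on the mat", "the cat ate the fish"]
unigrams, bigrams = train_bigram_model(corpus)
ranked = rank_candidates("the", ["sat", "mat", "cat"], unigrams, bigrams)
```

Here `ranked` lists `cat` first, because the pair `the cat` occurs twice in the toy corpus, followed by `mat` and then `sat`, which never follows `the`.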



Using it with another language corpus

This spell corrector can be trained on a corpus in any other language just by changing the path of the corpus file and supplying the character set of that language.

In Urdu Spell Corrector.ipynb, cell#2 contains the following code:

with open('./jang_errors.txt', 'r', encoding='utf8', errors='ignore') as f:
    erorrsFile = f.readlines() # wrongly spelled file

This sets the path of the file whose sentences contain the misspelled words.

with open('./jang_nonerrors.txt', 'r', encoding='utf8', errors='ignore') as f:
    correctedFile = f.readlines() # correctly spelled file

This sets the path of the file containing the same sentences without any errors. It is used to check the results at the end.
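
To check results against the corrected file, one needs to know which word in each sentence is wrong and what its true form is. One way to sketch that, assuming the two files are aligned line by line and each sentence pair has the same number of whitespace-separated tokens (neither is confirmed by the source), is to walk both sentences in parallel. `find_error_positions` is a hypothetical helper, not a function from the notebook.

```python
def find_error_positions(error_sentence, correct_sentence):
    """Return (index, wrong_word, true_word) for every token that differs.

    Assumes both sentences split into the same number of tokens.
    """
    wrong_tokens = error_sentence.split()
    right_tokens = correct_sentence.split()
    return [
        (i, wrong, right)
        for i, (wrong, right) in enumerate(zip(wrong_tokens, right_tokens))
        if wrong != right
    ]


# Toy English pair standing in for one line of each file
errors_found = find_error_positions("the cta sat", "the cat sat")
```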

with open('./wordlist.txt', 'r', encoding='utf8', errors='ignore') as f:
    wordsFile = f.readlines() # list of valid urdu words, dictionary

This sets the path of the file listing the valid vocabulary of the language.
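
Because `readlines()` keeps the trailing newline on every line, the word list usually needs a little cleaning before it can be used for lookups. The sketch below assumes one word per line (the source does not state the file layout) and uses a hypothetical helper, `build_dictionary`, to turn the lines into a set so that membership tests are fast.

```python
def build_dictionary(lines):
    """Strip surrounding whitespace, drop empty lines, and deduplicate."""
    return {line.strip() for line in lines if line.strip()}


# Stand-in for the list returned by f.readlines()
sample_lines = ["کتاب\n", "قلم\n", "\n", "کتاب\n"]
dictionary = build_dictionary(sample_lines)
```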

In Urdu Spell Corrector.ipynb, cell#3 contains the following code:

urdu_charset='ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھءیے' # urdu charset

This is the character set of the language. Replace it with the alphabet of the language you are targeting.
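
The character set matters because candidate generation inserts and substitutes letters drawn from it. A minimal sketch of generating candidates one and two edits away and filtering them against the dictionary is shown here. The choice of edit operations (delete, transpose, replace, insert) follows the commonly used Norvig-style approach and is an assumption, as are the function names; the example uses a toy English alphabet, where the notebook would pass `urdu_charset`.

```python
def edits1(word, charset):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in charset]
    inserts = [left + ch + right for left, right in splits for ch in charset]
    return set(deletes + transposes + replaces + inserts)


def edits2(word, charset):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word, charset) for e2 in edits1(e1, charset)}


def candidates(word, charset, dictionary):
    """Keep only the one- and two-edit variants that are valid dictionary words."""
    return (edits1(word, charset) | edits2(word, charset)) & dictionary


toy_charset = "abcdefghijklmnopqrstuvwxyz"
toy_dictionary = {"cat", "cot", "act", "dog"}
found = candidates("cta", toy_charset, toy_dictionary)
```

In this toy run `found` contains `cat` (one transposition away) plus `cot` and `act` (two edits away), while `dog` is excluded because it is three edits from `cta`.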

Now you are ready to build your own word spell corrector using the Noisy Channel Model.


Author 👋

You can get in touch with me on my LinkedIn Profile:

Ahmad Shafique

You can also follow my GitHub profile to stay updated on my latest projects.

If you liked the repo then please support it by giving it a star ⭐!


Contributions Welcome ✨

If you find a bug in the code or have improvements in mind, feel free to open a pull request.


License 📄

MIT

Copyright (c) 2020, Ahmad Shafique

