urdu-characters's Introduction

Urduhack: A Python NLP library for Urdu language

Urduhack is an NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data as easily as possible.

You can reach the core contributor, Mr Ikram Ali, at https://github.com/akkefa

Our Goal

  • Academic users: Easier experimentation to test hypotheses without coding from scratch.
  • NLP beginners: Learn how to build an NLP project with production-level code quality.
  • NLP developers: Build a production-level application within minutes.

🔥 Features Support

  • Normalization
  • Preprocessing
  • Tokenization
  • Pipeline Module
  • Models
    • POS tagger
    • Lemmatizer
    • Named entity recognition
    • Sentiment analysis
    • Image to text
    • Question answering system
  • Datasets loader

🛠 Installation

Urduhack officially supports Python 3.6–3.7, and runs great on PyPy.

To install with the CPU version of TensorFlow:

$ pip install urduhack[tf]

To install with the GPU version of TensorFlow:

$ pip install urduhack[tf-gpu]

Usage

import urduhack

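# Download the models (needed before the pipeline is first used)
urduhack.download()

# Build the processing pipeline
nlp = urduhack.Pipeline()
text = ""  # put the Urdu text to analyse here
doc = nlp(text)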

for sentence in doc.sentences:
    print(sentence.text)
    for word in sentence.words:
        print(f"{word.text}\t{word.pos}")

    for token in sentence.tokens:
        print(f"{token.text}\t{token.ner}")

🔗 Documentation

Fantastic documentation is available at https://urduhack.readthedocs.io/

  • Installation: How to install Urduhack and download models.
  • Quickstart: New to Urduhack? Here's everything you need to know!
  • API Reference: The detailed reference for Urduhack's API.
  • Contribute: How to contribute to the code base.

๐Ÿ‘ Contributors

Special thanks to everyone who contributed to getting Urduhack to its current state.

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. [Become a sponsor]

๐Ÿ“ Copyright and license

Code released under the MIT License.

urdu-characters's People

Contributors

akkefa, dependabot-preview[bot], imgbot[bot], muhammadfahid51, mujadadrao

urdu-characters's Issues

There is no difference between Urdu and Arabic characters in Unicode

I came here with very high hopes after seeing the report on propakistani.pk. This could be the start of something great and I admire your initiative. You are doing great work.

However, your claim that 'Arabic' and 'Urdu' have different ranges, which propakistani.pk repeats, is flawed. The range 0600-06FF differs from what you call the 'Arabic' range in that it contains the characters used in normal text, while the second range holds presentation forms. (Imagine trying to write out in a child's qaeda how many different shapes 'Bay' can take, or forcing a certain form of a letter somewhere.) FB50-FBFF are presentation forms. Since the characters in this range are meant to be used as standalone characters, they won't work with fonts that change the shape, height, or size of a character based on context. This effect can be seen in any modern non-monospaced font, and it is very clearly visible in Nastaleeq-type fonts. The usual range for Arabic text (as you can verify from any Arabic source on the web) is 0600-06FF.
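The relationship between base letters and presentation forms can be checked with Python's standard `unicodedata` module. The following is a minimal sketch and not part of Urduhack; the helper name `to_base_letters` is hypothetical. It relies on NFKC normalization, which replaces compatibility characters such as presentation forms with the base letters they stand for. The four contextual forms of Beh used here sit in the second presentation-forms block (FE70-FEFF), and the Peh form comes from the FB50-FBFF range mentioned above.

```python
import unicodedata


def to_base_letters(text: str) -> str:
    """Fold presentation forms back to base letters via NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Presentation forms of Beh: isolated, final, initial, medial.
beh_forms = "\uFE8F\uFE90\uFE91\uFE92"
# Isolated presentation form of Peh, from the FB50-FBFF range.
peh_isolated = "\uFB56"

for ch in beh_forms + peh_isolated:
    base = to_base_letters(ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        f" -> U+{ord(base):04X} {unicodedata.name(base)}"
    )
```

Every form of Beh folds to the single base letter U+0628, which is the code point normal text is expected to contain; choosing the visible shape is left to the font.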

Unicode doesn't have separate code points for Urdu and Arabic. Urdu, Arabic, Persian, and Sindhi all occupy the same code points, i.e. 0600-06FF.
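As a quick illustration, the sketch below checks a few letters associated with different languages against the 0600-06FF range. The choice of sample letters and the helper name `in_arabic_block` are assumptions made for this example; only the standard `unicodedata` module is used.

```python
import unicodedata

ARABIC_BLOCK = range(0x0600, 0x0700)  # U+0600..U+06FF inclusive

# Illustrative letters associated with each language (sample choice is an assumption).
SAMPLES = {
    "Arabic (Beh)": "\u0628",
    "Persian (Peh)": "\u067E",
    "Urdu (Tteh)": "\u0679",
    "Urdu (Yeh Barree)": "\u06D2",
    "Sindhi (Beeh)": "\u067B",
}


def in_arabic_block(ch: str) -> bool:
    """Return True if the character lies in the 0600-06FF range."""
    return ord(ch) in ARABIC_BLOCK


for label, ch in SAMPLES.items():
    print(f"{label:20} U+{ord(ch):04X} {unicodedata.name(ch):28} {in_arabic_block(ch)}")
```

All of these letters carry Unicode names beginning with "ARABIC LETTER", whichever language uses them most.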

I really want a separate range for languages written in the Nastaleeq script, and this got my hopes too high. A separate range makes some sense when you see how much Urdu text is mixed with Arabic, and how many Urdu users also use Arabic but won't understand Urdu in Naskh or Arabic in Nastaleeq. Separate code ranges would allow a single font to use Naskh for Arabic and Nastaleeq for Urdu and Persian, but they don't exist. Such fonts wouldn't require Apple to have region-specific font settings (or however they do the Nastaleeq bit, idk), and sites like medium.com could have a font that defaults to serif in English, Naskh in Arabic, and Nastaleeq in Urdu/Persian/South Asian languages.
