# Tessdata

What have we done differently?

Although Tesseract supports Indic scripts, it trains models for languages like Tamil, Malayalam, Oriya, Gujarati, Kannada and Telugu the same way it does for English, French or Spanish.

This often fails for Indic scripts because, in the languages mentioned above, some characters that depend on a consonant are written before that consonant, and these characters come out wrong when Tesseract scans the image from left to right.

We have trained Tesseract to interpret these characters as individual glyphs so that they can be post-processed later.
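
The post-processing step itself is not described here. As a rough illustration only, the sketch below assumes the OCR output lists a Tamil pre-base vowel sign before its consonant (visual order) and moves it after the consonant (Unicode logical order). The function names and the restriction to Tamil are assumptions made for this example, not part of the project.

```python
# Illustrative sketch: reorder Tamil pre-base vowel signs from visual order
# (sign, consonant) to Unicode logical order (consonant, sign).
# Assumption: the recognizer emits each pre-base sign as a separate glyph
# immediately before the consonant it belongs to.

# Tamil vowel signs that are drawn to the left of their consonant.
PRE_BASE_SIGNS = {
    "\u0bc6",  # TAMIL VOWEL SIGN E
    "\u0bc7",  # TAMIL VOWEL SIGN EE
    "\u0bc8",  # TAMIL VOWEL SIGN AI
}


def is_tamil_consonant(ch: str) -> bool:
    """Return True if ch falls in the Tamil consonant range (KA to HA)."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text: str) -> str:
    """Move each pre-base vowel sign after the consonant that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        next_is_consonant = i + 1 < len(text) and is_tamil_consonant(text[i + 1])
        if ch in PRE_BASE_SIGNS and next_is_consonant:
            out.append(text[i + 1])  # consonant first
            out.append(ch)           # then the dependent vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

This sketch handles only a single sign followed by a single consonant. Two-part vowel signs and consonant clusters would need extra rules, and other scripts would need their own tables.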

Trained Models for Indian Languages.

Tesseract models (traineddata files) are being made available here for all the Indic scripts, including Santali and Meetei Mayek. We used Noto fonts to train all the scripts. These models are expected to be more accurate than the ones provided through the Tesseract site.

### The languages currently covered are

  • Bengali (ben)
  • Gujarati (guj)
  • Hindi (hin)
  • Kannada (kan)
  • Malayalam (mal)
  • Meetei Mayek (mni)
  • Oriya (ori)
  • Punjabi (pan)
  • Santali (sat)
  • Tamil (tam)
  • Telugu (tel)
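
The codes in parentheses are the values passed to Tesseract's `-l` option. As a minimal sketch, assuming the `tesseract` executable is on the `PATH` and the matching traineddata file is installed, the helper below builds and runs the command line for one of the languages above. The function names and file names are hypothetical.

```python
import subprocess

# Language codes taken from the list above.
LANGUAGE_CODES = {
    "Bengali": "ben",
    "Gujarati": "guj",
    "Hindi": "hin",
    "Kannada": "kan",
    "Malayalam": "mal",
    "Meetei Mayek": "mni",
    "Oriya": "ori",
    "Punjabi": "pan",
    "Santali": "sat",
    "Tamil": "tam",
    "Telugu": "tel",
}


def build_tesseract_command(image_path, output_base, lang):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    if lang not in LANGUAGE_CODES.values():
        raise ValueError(f"unsupported language code: {lang!r}")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang):
    """Run Tesseract; the text is written to OUTPUTBASE.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
```

For example, `run_ocr("page.png", "page", "tam")` would recognise a Tamil page image and write the result to `page.txt`.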

Installation

Please install Tesseract for your operating system (see https://github.com/tesseract-ocr/tesseract/wiki) and then copy these models (traineddata files) to the tessdata directory.
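
The copy step can also be scripted. The sketch below is one possible way to do it; the location of the tessdata directory depends on how Tesseract was installed, so both paths are supplied by the caller and the function name is hypothetical.

```python
import shutil
from pathlib import Path


def install_models(source_dir, tessdata_dir):
    """Copy every *.traineddata file from source_dir into tessdata_dir.

    Returns the sorted list of file names that were copied.
    """
    source = Path(source_dir)
    target = Path(tessdata_dir)
    if not target.is_dir():
        # Refuse to guess: the tessdata location varies between systems.
        raise FileNotFoundError(f"tessdata directory not found: {target}")

    copied = []
    for model in sorted(source.glob("*.traineddata")):
        shutil.copy2(model, target / model.name)
        copied.append(model.name)
    return copied
```

Call it with the folder holding the downloaded models and your own tessdata path, for example `install_models("tessdata", "/usr/share/tesseract-ocr/tessdata")`. That path is only an example and may differ on your system.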

Future plans

We plan to release models for Sinhalese and Thai in the future.

Authors and Contributors

@rkvsraman

Support or Contact

If you can help, or need help, with training a new font or a new language whose script is similar to Indic scripts (Khmer, Lao, Thai, etc.), please feel free to join the team and contribute. -Team Indic OCR
