General Readme

Install pytesseract, tesseract-ocr
cv2
transformers
unidecode
googletrans=4.0
flask
sentencepiece
tesseract-ocr-ben

Offline using MarianMT:

MarianMT not available.

OCR application

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python3 ocr_from_image_marianmt.py
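
The command above sets two environment variables that the Hugging Face libraries read. The same flags can also be set from inside Python, as in the minimal sketch below. It assumes the helper runs before transformers is imported; the name enable_offline_mode is hypothetical and does not come from this repository.

```python
import os


def enable_offline_mode(env=None):
    """Set the two offline flags used in the command line above.

    `env` defaults to os.environ; pass a plain dict to build an
    environment for a subprocess instead.
    """
    target = os.environ if env is None else env
    target["HF_DATASETS_OFFLINE"] = "1"
    target["TRANSFORMERS_OFFLINE"] = "1"
    return target


# Typical use (assumption: called at the very top of the script):
#   enable_offline_mode()
#   import transformers  # imported only after the flags are set
```

Setting the flags first matters because the libraries are expected to read them when they are imported.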

Installation on Linux

Important link: https://linuxhint.com/install-tesseract-ocr-linux/

Install the pip requirements: pip3 install -r requirements.txt
Install Tesseract on Linux: sudo apt-get install tesseract-ocr
To install all language packs: sudo apt-get install tesseract-ocr-all
For Bengali only: sudo apt-get install tesseract-ocr-ben

Or download the language pack manually from GitHub.
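
Once a language pack is installed, pytesseract can be pointed at it through the lang argument. The following is a minimal sketch, not the repository's own script: the function names build_lang and ocr_image are hypothetical, and it assumes the Bengali pack (ben) is available to Tesseract.

```python
def build_lang(codes):
    """Join Tesseract language codes, e.g. ["ben", "eng"] -> "ben+eng"."""
    return "+".join(codes)


def ocr_image(image_path, codes=("ben",)):
    """Run Tesseract on one image file and return the recognised text."""
    # Imported lazily so build_lang works without the OCR stack installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    # OpenCV loads images as BGR; convert to RGB before passing to Tesseract.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang=build_lang(codes))
```

For a page that mixes Bengali and English, calling ocr_image with codes=("ben", "eng") asks Tesseract to use both packs.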

Initially using tesseract.

Other tools to leverage for OCR

Google Translate

It works well with Python.

Tesseract 4.1

From: https://nanonets.com/blog/ocr-with-tesseract/

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. It has its origins in OCRopus' Python-based LSTM implementation but has been redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL) that is also available for TensorFlow. To recognize an image containing a single character, we typically use a Convolutional Neural Network (CNN). Text of arbitrary length is a sequence of characters, and such problems are solved using RNNs; LSTM is a popular form of RNN.

LSTMs are great at learning sequences but slow down a lot when the number of states is too large. There are empirical results that suggest it is better to ask an LSTM to learn a long sequence than a short sequence of many classes. Tesseract developed from the OCRopus model in Python, which was a fork of an LSTM in C++ called CLSTM. CLSTM is an implementation of the LSTM recurrent neural network model in C++, using the Eigen library for numerical computations.

On macOS, the only language pack installed with Tesseract is English, which is contained in the eng.traineddata file.

So what are these Tesseract files?

eng.traineddata is the language pack for English.
osd.traineddata is a special data file related to orientation and scripts.
snum.traineddata is an internal serial number used by Tesseract.
pdf.ttf is a TrueType font file to support PDF rendering.
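
To check which language packs are present, the .traineddata files in the tessdata directory can be listed. This is a small sketch under the assumption that the directory path is known; the location varies by platform and install method, so it is passed in as a parameter, and the function name installed_languages is hypothetical.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted codes of all *.traineddata files in a directory."""
    # "eng.traineddata" -> "eng"; other files such as pdf.ttf are ignored.
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
```

Note that special files such as osd.traineddata show up in the result as well, since they share the same extension. Recent pytesseract releases should also offer a get_languages function that asks Tesseract directly.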

Got Bengali text:

Checklist:

make pytesseract work - done

Tesseract documentation:

https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc

Tesseract sorted!

Bengali translation - done

get boxes around text - not done

https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
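
For the open item about boxes around text, one possible approach is pytesseract's image_to_data, which returns word-level coordinates. The sketch below is not implemented in this repository; the names word_boxes and draw_boxes, the confidence threshold, and the green rectangle colour are all illustrative assumptions.

```python
def word_boxes(data, min_conf=0.0):
    """Extract (left, top, width, height, text) for each recognised word.

    `data` is the dict returned by pytesseract.image_to_data when called
    with output_type=Output.DICT.
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        # conf is -1 for rows that are not words; depending on the
        # pytesseract version it may be a string or a number.
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i], text))
    return boxes


def draw_boxes(image_path, out_path, lang="ben"):
    """Draw a rectangle around every recognised word and save the image."""
    # Imported lazily so word_boxes can be used without the OCR stack.
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    data = pytesseract.image_to_data(rgb, lang=lang, output_type=Output.DICT)
    for left, top, width, height, _ in word_boxes(data):
        cv2.rectangle(image, (left, top), (left + width, top + height),
                      (0, 255, 0), 2)
    cv2.imwrite(out_path, image)
```

Keeping the filtering in word_boxes separate from the drawing makes it possible to check the box logic without an image or a Tesseract install.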

To run

Put the PDF you want to extract in the pdf folder.
Run pdf_to_img first to convert the PDF to images:

python3 pdf_to_img.py

Run ocr_from_multiple_img to convert the images to text:

python3 ocr_from_multiple_img.py

To use the Flask server:

python3 app.py
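
The PDF-to-image step can be pictured roughly as follows. This is only a sketch of the idea and may differ from what pdf_to_img.py actually does: it assumes the pdf2image package (which needs poppler installed), and the img output folder, the file naming scheme, and the function names are hypothetical.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="img"):
    """Build the output file name for one page, e.g. img/report_page001.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_number:03d}.png"


def pdf_to_images(pdf_path, out_dir="img", dpi=300):
    """Render every page of a PDF to a PNG file and return the paths."""
    # Imported lazily; pdf2image is an assumption, not a listed requirement.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target)  # format is inferred from the .png extension
        paths.append(target)
    return paths
```

Zero-padded page numbers keep the images in page order when a later step sorts the file names.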

Things to do

Improve the UI
Add a language detection feature
Add the possibility to convert directly from PDF
Integrate docx documents

For offline installation

Check out the offline folder.

Using MarianMT

pip3 install transformers
pip3 install sentencepiece
pip3 install mosestokenizer
You also need to install PyTorch in the base environment.
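
With those packages in place, translating OCR output with MarianMT might look like the sketch below. The model name Helsinki-NLP/opus-mt-bn-en (Bengali to English) is an assumption, since the README does not say which checkpoint is used, and the function names are hypothetical. Passing offline=True assumes the model files are already in the local cache.

```python
def non_empty_lines(text):
    """Split OCR output into stripped lines, dropping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def translate_lines(lines, model_name="Helsinki-NLP/opus-mt-bn-en",
                    offline=False):
    """Translate a list of sentences with a MarianMT checkpoint."""
    # Imported lazily so non_empty_lines works without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(
        model_name, local_files_only=offline)
    model = MarianMTModel.from_pretrained(
        model_name, local_files_only=offline)
    # Pad to a common length so the whole list runs as one batch.
    batch = tokenizer(lines, return_tensors="pt", padding=True,
                      truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Translating line by line rather than the whole page at once keeps each input short, which suits the sentence-level training of these models.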
