- install pytesseract, tesseract-ocr
- cv2
- transformers
- unidecode
- googletrans=4.0
- flask
- sentencepiece
- tesseract-ocr-ben
Offline using MarianMT:
MarianMT not available.
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python3 ocr_from_image_marianmt.py
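The two environment variables in the command above can also be set from Python before transformers is imported. The following is a minimal sketch, not the contents of ocr_from_image_marianmt.py: the function names are hypothetical, and `model_name` is an assumption standing for any MarianMT checkpoint that was downloaded once while online and is still in the local cache.

```python
import os

# Flags taken from the command above. They need to be set before
# transformers/datasets are imported to take effect.
OFFLINE_ENV = {"HF_DATASETS_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}


def enable_offline_mode(env=os.environ):
    """Set the Hugging Face offline flags in the given environment mapping."""
    env.update(OFFLINE_ENV)
    return env


def translate_offline(texts, model_name):
    """Translate a list of strings with a locally cached MarianMT model.

    model_name is an assumption: a checkpoint already present in the cache.
    """
    enable_offline_mode()
    # Imported lazily so the offline flags are already in place.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name, local_files_only=True)
    model = MarianMTModel.from_pretrained(model_name, local_files_only=True)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

If the checkpoint is not in the cache, `from_pretrained` is expected to raise an error instead of trying to reach the network.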
Important link:
https://linuxhint.com/install-tesseract-ocr-linux/
- Installing pip requirements
pip3 install -r requirements.txt
- Installing on Linux
sudo apt-get install tesseract-ocr
- To install all language packs, use:
sudo apt-get install tesseract-ocr-all
- For Bengali, use:
sudo apt-get install tesseract-ocr-ben
Or download the language pack manually from GitHub.
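After installing a language pack, it can help to confirm that Tesseract actually sees it. A small sketch using pytesseract's `get_languages`; the helper names and the default `("eng", "ben")` tuple are assumptions for illustration.

```python
def missing_languages(required, installed):
    """Return the required language codes that are not installed, in order."""
    installed_set = set(installed)
    return [code for code in required if code not in installed_set]


def check_tesseract_languages(required=("eng", "ben")):
    """Ask Tesseract which language packs it can see and report missing ones."""
    # Lazy import: needs pytesseract and the tesseract binary on PATH.
    import pytesseract

    installed = pytesseract.get_languages(config="")
    return missing_languages(required, installed)
```

Calling `check_tesseract_languages()` should return an empty list once tesseract-ocr-ben is installed.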
- OCRopus - https://github.com/ocropus/ocropy
- Ocular - https://github.com/tberg12/ocular
- https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d
It works well with Python.
From : https://nanonets.com/blog/ocr-with-tesseract/
Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. It has its origins in OCRopus' Python-based LSTM implementation but has been redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL) that is also available for TensorFlow. To recognize an image containing a single character, we typically use a Convolutional Neural Network (CNN). Text of arbitrary length is a sequence of characters, and such problems are solved using RNNs, of which LSTM is a popular form.
LSTMs are great at learning sequences but slow down a lot when the number of states is too large. There are empirical results that suggest it is better to ask an LSTM to learn a long sequence than a short sequence of many classes. Tesseract developed from the OCRopus model in Python, which was a fork of an LSTM in C++ called CLSTM. CLSTM is an implementation of the LSTM recurrent neural network model in C++, using the Eigen library for numerical computations.
The only language pack installed in macOS Tesseract is English, which is contained in the eng.traineddata file.
So what are these Tesseract files?
- eng.traineddata is the language pack for English.
- osd.traineddata is a special data file related to orientation and scripts.
- snum.traineddata is an internal serial number used by Tesseract.
- pdf.ttf is a True Type Format Font file to support pdf renderings.
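The osd.traineddata file is what lets Tesseract report page orientation and script. A rough sketch of reading that report through pytesseract's `image_to_osd`, assuming its text output consists of `Key: value` lines such as `Script: Latin`; the function names are hypothetical.

```python
def parse_osd(osd_text):
    """Parse 'Key: value' lines of Tesseract OSD output into a dict."""
    result = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            result[key.strip()] = value.strip()
    return result


def detect_script(image_path):
    """Return the script name Tesseract detects for an image, or None."""
    # Lazy imports: need pytesseract, Pillow and osd.traineddata installed.
    import pytesseract
    from PIL import Image

    osd = pytesseract.image_to_osd(Image.open(image_path))
    return parse_osd(osd).get("Script")
```

The detected script could then be used to pick the `lang` argument for OCR, although a script is not the same thing as a language.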
- make pytesseract work - done
tesseract documentation:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
Tesseract sorted!
- Bengali translation - done
- get boxes around text - not done
https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
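For the open "boxes around text" item, one possible approach is pytesseract's `image_to_data`, which returns per-word coordinates, combined with `cv2.rectangle`. This is a sketch under assumptions, not the project's code: the function names, the green colour and the `min_conf` threshold are all illustrative.

```python
def word_boxes(data, min_conf=0.0):
    """Turn image_to_data output (dict form) into (text, x, y, w, h) tuples."""
    boxes = []
    for i, text in enumerate(data["text"]):
        # Rows for blocks, paragraphs and lines have empty text and conf -1.
        if text.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append((text, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def draw_boxes(image_path, out_path, lang="ben", min_conf=0.0):
    """Draw a rectangle around every recognised word and save the image."""
    # Lazy imports: need opencv-python, pytesseract and the tesseract binary.
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    data = pytesseract.image_to_data(rgb, lang=lang, output_type=Output.DICT)
    for _text, x, y, w, h in word_boxes(data, min_conf):
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite(out_path, image)
    return out_path
```

Raising `min_conf` drops low-confidence words, which may reduce spurious boxes on noisy scans.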
- Put the PDF you want to extract in the pdf folder
- Run pdf_to_img first to convert the PDF to images:
python3 pdf_to_img.py
- Run ocr_from_multiple_img to convert the images to text:
python3 ocr_from_multiple_img.py
- To use the Flask server:
python3 app.py
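The PDF-to-image and image-to-text steps above could also be chained in one function. The sketch below is not what pdf_to_img.py or ocr_from_multiple_img.py do internally (their contents are not shown here); it assumes the pdf2image package (which needs poppler), and the `img` output folder and file naming scheme are hypothetical.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="img"):
    """Build the output image path for one PDF page (1-based page number)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number:03d}.png"


def pdf_to_text(pdf_path, out_dir="img", lang="ben"):
    """Convert each PDF page to a PNG, OCR it, and return the joined text."""
    # Lazy imports: pdf2image needs poppler, pytesseract needs tesseract.
    from pdf2image import convert_from_path
    import pytesseract

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    texts = []
    for number, page in enumerate(convert_from_path(pdf_path), start=1):
        page.save(page_image_path(pdf_path, number, out_dir))
        texts.append(pytesseract.image_to_string(page, lang=lang))
    return "\n".join(texts)
```

A call such as `pdf_to_text("pdf/report.pdf")` would write one PNG per page and return the recognised text; `report.pdf` is only an example name.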
- improve UI
- Add language detection feature
- Add possibility to directly convert from pdf
- integrate docx documents
Check out the offline folder
pip3 install transformers
pip3 install sentencepiece
pip3 install mosestokenizer
- You also need to install PyTorch in the base