- First, make sure PyTorch 1.7.1 (or later) and torchvision are installed.
- pip install git+https://github.com/openai/CLIP.git
  (OpenAI's CLIP model for matching text with images)
- pip install numpy pandas ftfy regex tqdm PyPDF2 python-dotenv openai
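The installs above can be sanity-checked with a short script. The following is a minimal sketch, not part of the project: the helper names (`parse_version`, `torch_is_recent_enough`, `missing_modules`) are made up for illustration, and the module list assumes the usual import names for the packages above (for example, python-dotenv is imported as `dotenv` and the CLIP package as `clip`).

```python
import importlib.util
import re
from importlib import metadata

# Minimum PyTorch version named in the first step.
MIN_TORCH = (1, 7, 1)

# Import names (assumed); these differ from the pip names in places.
REQUIRED_MODULES = [
    "torch", "torchvision", "clip", "numpy", "pandas", "ftfy",
    "regex", "tqdm", "PyPDF2", "dotenv", "openai",
]


def parse_version(text):
    """Turn '1.13.1+cu117' or '2.0' into a (major, minor, patch) tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def torch_is_recent_enough(version_text, minimum=MIN_TORCH):
    """Return True when the given version is at least the minimum."""
    return parse_version(version_text) >= minimum


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
    if "torch" not in missing:
        try:
            installed = metadata.version("torch")
            print("PyTorch", installed, "ok:", torch_is_recent_enough(installed))
        except metadata.PackageNotFoundError:
            print("torch is importable but its package metadata was not found")
```

Using `find_spec` rather than importing each package keeps the check fast, since importing torch itself can take several seconds.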
- Set up pdf2image. Instructions are given here:
  Linux and macOS
  - Set up poppler using the instructions given at https://pdf2image.readthedocs.io/en/latest/installation.html
  - pip install pdf2image
  Windows
  - Download the latest poppler package from https://github.com/oschwartz10612/poppler-windows/releases/, which is the most up-to-date version.
  - Move the extracted directory to the desired place on your system.
  - Add the bin/ directory to your PATH.
  - Test that all went well by opening cmd and making sure that you can call pdftoppm -h.
  - If it is still not working, point the poppler_path argument to the \bin folder, as is already done inside the file.
  - pip install pdf2image
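The poppler_path fallback described in the Windows steps can be sketched as follows. This is an illustration under assumptions, not the project's own code: `resolve_poppler_path` and `pdf_to_images` are hypothetical helpers, and `FALLBACK_POPPLER_BIN` is a placeholder that should be replaced with the folder where you extracted poppler. The only pdf2image call used is `convert_from_path` with its `dpi` and `poppler_path` arguments.

```python
import shutil
from typing import Callable, Optional

# Hypothetical location; replace with the bin folder of your extracted poppler.
FALLBACK_POPPLER_BIN = r"C:\poppler\Library\bin"


def resolve_poppler_path(
    fallback: str = FALLBACK_POPPLER_BIN,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return None if pdftoppm is already on PATH, otherwise the fallback folder.

    Passing poppler_path=None lets pdf2image look the binaries up on PATH.
    """
    if which("pdftoppm") is not None:
        return None
    return fallback


def pdf_to_images(pdf_path: str, dpi: int = 200):
    """Convert each page of a PDF to a PIL image."""
    # Imported lazily so the helper above works even before pdf2image is installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi, poppler_path=resolve_poppler_path())
```

For example, `pages = pdf_to_images("sample.pdf")` would return one image per page, where `sample.pdf` stands for any PDF on your machine.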
- Set up pytesseract. Instructions are given here:
  Linux and macOS
  - Set up the latest version of Tesseract (5+) using https://studysection.com/blog/quick-guide-to-install-and-remove-tesseract-ocr-5-on-ubuntu-18-04/
  - Make sure the correct Tesseract language packages are installed for your use. Helpful guide: https://ocrmypdf.readthedocs.io/en/latest/languages.html
  Windows
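On any platform, once Tesseract is installed, the language-pack check can be done from Python. The sketch below is illustrative only: `missing_languages` and `check_tesseract` are hypothetical helpers, and the language codes in the usage note are examples, not the project's required set. It relies on `pytesseract.get_tesseract_version()` and `pytesseract.get_languages()`, which are available in recent pytesseract releases.

```python
from typing import Iterable, List


def missing_languages(required: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return the required language codes that are not installed, in order."""
    available = set(installed)
    return [lang for lang in required if lang not in available]


def check_tesseract(required: Iterable[str]) -> List[str]:
    """Print the Tesseract version and return any missing language packs."""
    # Imported lazily so missing_languages can be used without pytesseract.
    import pytesseract

    print("Tesseract version:", pytesseract.get_tesseract_version())
    installed = pytesseract.get_languages(config="")
    return missing_languages(required, installed)
```

Calling `check_tesseract(["eng", "hin"])` would return an empty list if both the English and Hindi packs are present.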
shrivastava95 / docparser
A multilingual document parser that processes PDFs. Built using Google's open source Tesseract OCR and OpenAI's CLIP (Contrastive Language-Image Pre-training).
License: MIT License