
shrivastava95 / docparser

A multilingual document parser that processes PDFs. Built using Google's open source Tesseract OCR, and OpenAI's CLIP (Contrastive Language Image Pretraining).

License: MIT License

Python 99.94% Dockerfile 0.06%

docparser's Introduction

Requirements

First, make sure PyTorch 1.7.1 (or later) and torchvision are installed.
pip install git+https://github.com/openai/CLIP.git - OpenAI's CLIP model for matching text with images
pip install numpy pandas ftfy regex tqdm PyPDF2 python-dotenv openai
Set up pdf2image. Instructions are given here:

Linux and MacOS
1. Set up poppler using the instructions given at https://pdf2image.readthedocs.io/en/latest/installation.html
2. pip install pdf2image
Windows
1. Download the latest poppler package from https://github.com/oschwartz10612/poppler-windows/releases/, which is the most up-to-date.
2. Move the extracted directory to the desired place on your system
3. Add the bin/ directory to your PATH
4. Test that all went well by opening cmd and making sure that you can call pdftoppm -h
5. If it still does not work, point the poppler_path argument to the \bin folder, as is already done inside the file.
6. pip install pdf2image
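The PATH check and the poppler_path fallback from the Windows steps can be sketched in Python. This is a minimal sketch, not the repository's own code: `resolve_poppler_path` and `pdf_to_images` are hypothetical helper names, and the fallback directory is whatever location you extracted poppler to. The only pdf2image call used is `convert_from_path` with its `poppler_path` argument.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_poppler_path(fallback_dir: Optional[str] = None) -> Optional[str]:
    """Return a value suitable for pdf2image's poppler_path argument.

    Hypothetical helper: returns None when pdftoppm is already on PATH
    (pdf2image then finds it by itself), otherwise the fallback bin
    directory if it really contains pdftoppm.
    """
    if shutil.which("pdftoppm") is not None:
        return None
    if fallback_dir is not None:
        candidates = ("pdftoppm", "pdftoppm.exe")
        if any((Path(fallback_dir) / name).is_file() for name in candidates):
            return fallback_dir
    raise FileNotFoundError(
        "pdftoppm was not found on PATH or in the fallback directory"
    )


def pdf_to_images(pdf_path: str, fallback_dir: Optional[str] = None):
    """Convert a PDF to a list of PIL images, one per page."""
    # Imported lazily so the path check above works without pdf2image.
    from pdf2image import convert_from_path

    return convert_from_path(
        pdf_path, poppler_path=resolve_poppler_path(fallback_dir)
    )
```

Calling `resolve_poppler_path()` with no argument is the scripted equivalent of step 4: it raises `FileNotFoundError` if `pdftoppm` cannot be called.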
Set up pytesseract. Instructions are given here:

Linux and MacOS
1. Set up the latest version of Tesseract OCR (5+) using https://studysection.com/blog/quick-guide-to-install-and-remove-tesseract-ocr-5-on-ubuntu-18-04/
2. Make sure the correct Tesseract language packages are installed for your use. Helpful guide: https://ocrmypdf.readthedocs.io/en/latest/languages.html
Windows

docparser's Issues

Word aggregation strategy

For some or all PDF documents, the parser returns a newline after every word. Figure out what is causing this problem and resolve it.

Upload an image showing the problem related to this issue.
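One possible aggregation strategy is sketched below. The cause of the problem is not identified in this issue, so this is an assumption, not a diagnosis: it supposes the words come from `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT`, which labels each word with page, block, paragraph, and line numbers. The helper name `group_words_into_lines` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_words_into_lines(data: Dict[str, list]) -> List[str]:
    """Join OCR words that share the same page/block/paragraph/line ids.

    `data` is expected to have the shape of pytesseract's
    image_to_data(..., output_type=Output.DICT) result.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        word = str(word).strip()
        if not word:
            # Non-word rows (page, block, line entries) carry empty text.
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # Keys are unique, so sorting orders lines by their position ids;
    # words keep the order in which the OCR engine emitted them.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

With Tesseract installed, the input could be obtained with `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, where `image` is one page produced by pdf2image.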

Dockerize

Create a Docker image for the repo so that it is easy to set up without installing all the requirements. Publish the image on Docker Hub.

Documentation, README, setup
