Parsing PDFs using YOLOV3

Many Python libraries can parse PDFs, and Camelot is one of the best. It performs well on text, but it struggles with tables, especially those embedded within paragraphs.
Camelot lets you specify the regions to parse through the table_areas parameter, given as "x1,y1,x2,y2", where (x1, y1) is the left-top corner and (x2, y2) the right-bottom corner in PDF coordinate space. When this parameter is provided, the results improve significantly.
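
As a minimal sketch of how a region might be passed to Camelot: the helper names format_table_area and extract_tables are hypothetical, and the two-decimal rounding is our own choice. camelot.read_pdf accepts a list of area strings through table_areas; the import is done lazily so the formatting helper can be used without Camelot installed.

```python
def format_table_area(x1, y1, x2, y2):
    """Build an "x1,y1,x2,y2" string from the left-top (x1, y1) and
    right-bottom (x2, y2) corners, expressed in PDF coordinate space."""
    return ",".join(f"{value:.2f}" for value in (x1, y1, x2, y2))


def extract_tables(pdf_path, page, area):
    """Parse one region of one page and return a list of pandas DataFrames.

    Hypothetical wrapper: it only forwards the region to Camelot.
    """
    import camelot  # third-party dependency, imported only when needed

    tables = camelot.read_pdf(pdf_path, pages=str(page), table_areas=[area])
    return [table.df for table in tables]
```

For instance, extract_tables("pdfs/boeings.pdf", 2, format_table_area(72, 700, 540, 400)) would restrict parsing to that rectangle; the coordinates here are placeholders, not values taken from the project.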

Explaining the basic idea

One way to automate table parsing is to train an algorithm that returns the coordinates of the bounding box enclosing the table, as detailed in the following pipeline:

If the original PDF page is image-based, we can use ocrmypdf to convert it into a text-based one so that the text inside the table can be extracted. We then carry out the following operations:

  • Convert the PDF page into an image using pdf2img.
  • Use a trained algorithm to detect the table regions.
  • Normalize the bounding boxes using the image dimensions, which enables us to map the regions into PDF space using the page dimensions obtained through PyPDF2.
  • Feed the regions to Camelot and get the corresponding pandas DataFrames.
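
The normalization step above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: image_box_to_pdf_area is a hypothetical name, the page image is assumed to cover the full page without rotation or cropping, image coordinates are assumed to start at the top-left corner, and PDF coordinates at the bottom-left corner. The page width and height are assumed to have been read beforehand (for example with PyPDF2).

```python
def image_box_to_pdf_area(box, image_size, pdf_size):
    """Map a pixel bounding box to PDF coordinates.

    box        -- (left, top, right, bottom) in pixels, origin at top-left
    image_size -- (width, height) of the page image in pixels
    pdf_size   -- (width, height) of the PDF page in points
    Returns (x1, y1, x2, y2) with (x1, y1) left-top and (x2, y2) right-bottom.
    """
    left, top, right, bottom = box
    img_w, img_h = image_size
    pdf_w, pdf_h = pdf_size

    # Normalize to the [0, 1] range using the image dimensions.
    n_left, n_right = left / img_w, right / img_w
    n_top, n_bottom = top / img_h, bottom / img_h

    # Scale to PDF units; the vertical axis is flipped because PDF space
    # (as assumed here) measures y upward from the bottom of the page.
    x1, x2 = n_left * pdf_w, n_right * pdf_w
    y1, y2 = (1 - n_top) * pdf_h, (1 - n_bottom) * pdf_h
    return x1, y1, x2, y2
```

The four returned values can then be joined into the "x1,y1,x2,y2" string expected by table_areas.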


When a table is detected in the page image, we expand the bounding box to guarantee that the table is fully included, as follows:
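
One possible way to expand a box is sketched below. The function name expand_box and the 10-pixel padding are assumptions chosen for illustration; the source does not state how much margin is used. The box is clamped so it never leaves the image.

```python
def expand_box(box, image_size, pad=10):
    """Grow a (left, top, right, bottom) pixel box by `pad` pixels on each
    side, clamped to the image bounds given by image_size = (width, height)."""
    left, top, right, bottom = box
    img_w, img_h = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(img_w, right + pad),
        min(img_h, bottom + pad),
    )
```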

Tables detection

The algorithm used to detect tables is YOLOv3; I advise you to read my previous article about object detection. We fine-tune the algorithm to detect tables, retraining the entire architecture. To do so, we carry out the following steps:

  • Create a training dataset using Makesense, a tool that can export annotations in YOLO format:

  • Train a YOLOv3 repository, modified to fit our purpose, on AWS EC2. We get the following results:
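
For readers unfamiliar with the YOLO annotation format mentioned above: each line of a label file conventionally holds a class id followed by the box center, width, and height, all normalized by the image size. The sketch below converts one such line back to pixel corners; yolo_line_to_pixel_box is a hypothetical helper, not part of the repository.

```python
def yolo_line_to_pixel_box(line, image_size):
    """Convert "class x_center y_center width height" (normalized values)
    into (class_id, (left, top, right, bottom)) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    img_w, img_h = image_size

    # Scale the normalized values back to pixels.
    xc, w = float(x_center) * img_w, float(width) * img_w
    yc, h = float(y_center) * img_h, float(height) * img_h

    # Move from center/size to corner coordinates.
    return int(class_id), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)
```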

Requirements

All Python requirements are listed in the file packages.txt. To install them, run the following command:

pip install -r packages.txt

Prediction

You can run a prediction on a PDF page with the following command:

python predict_table.py --pdf_path pdfs/boeings.pdf --page 2

It takes two arguments:

  • pdf_path: the path to the original PDF file
  • page: the page to parse
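
A command-line interface with these two arguments could be declared roughly as below. This is a sketch only: the actual predict_table.py may differ, and treating both arguments as required and page as an integer are assumptions.

```python
import argparse


def build_parser():
    """Return a parser accepting --pdf_path and --page, as in the command above."""
    parser = argparse.ArgumentParser(description="Detect and parse tables on one PDF page.")
    # Assumption: both arguments are mandatory.
    parser.add_argument("--pdf_path", required=True, help="path to the original PDF file")
    # Assumption: the page is given as an integer.
    parser.add_argument("--page", type=int, required=True, help="page to parse")
    return parser
```

Calling build_parser().parse_args() inside the script would then expose the values as args.pdf_path and args.page.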

Examples

NB: following the same steps, we can train the algorithm to detect any other object in a PDF page, such as graphics and images, which can then be extracted from the page image.

Contributors

ismail-mebsout, ismailmebsout