PDF Content Converter

The PDF Content Converter is a Python tool that converts the text and structural features of a PDF into a pandas dataframe. It retrieves information about textual content, fonts, positions, character frequencies and surrounding visual PDF elements.

How-to

  • Pass the path of the PDF file to be converted to PDFContentConverter.
  • Call the function pdf2pandas(). The PDF content is then returned as a pandas dataframe.
  • Media boxes of a PDF can be accessed using get_media_boxes(), the page count using get_page_count() and the document text using pdf2text().
  • Call convert() to get the pandas dataframe, textual document content, media boxes and page count together in a single dictionary.

Example call:

converter = PDFContentConverter(pdf)
result = converter.pdf2pandas()

A more detailed example usage is also given in Tester.py.
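
The how-to steps can also be sketched as a small helper. This is only an illustration: collect_outputs is a hypothetical name that is not part of the project, and it takes an already constructed converter because this README does not state the import path. The dictionary keys mirror the ones described for convert() under Output Format.

```python
def collect_outputs(converter):
    """Bundle the results of the four documented accessor methods.

    `converter` is expected to be a PDFContentConverter instance, or any
    object exposing the same four methods. This helper is hypothetical
    and only mirrors what convert() is described as returning.
    """
    return {
        "content": converter.pdf2pandas(),            # dataframe of PDF elements
        "media_boxes": converter.get_media_boxes(),   # page and crop box coordinates
        "text": converter.pdf2text(),                 # plain document text
        "page_count": converter.get_page_count(),     # number of pages
    }
```

With a real converter this would be called as collect_outputs(PDFContentConverter(pdf)), where pdf is the path of the PDF file.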

Project Structure

  • PDFContentConverter.py: contains the PDFContentConverter class for converting PDF documents.
  • util:
    • constants: paths to input and output data, pdfminer parameters
    • StorageUtil: store/load functionalities
  • Tester.py: Python script for testing the PDFContentConverter
  • csv: example csv output files for tests
  • pdf: example pdf input files for tests

Output Format

The converted PDF data is returned as a pandas dataframe in which each PDF element is stored as one row. The dataframe contains the following columns:

  • id: unique identifier of the PDF element
  • page: page number, starting with 0
  • text: text of the PDF element
  • x_0: left x coordinate
  • x_1: right x coordinate
  • y_0: top y coordinate
  • y_1: bottom y coordinate
  • pos_x: center x coordinate
  • pos_y: center y coordinate
  • abs_pos: tuple containing a page-independent representation of the (pos_x, pos_y) coordinates
  • original_font: font as extracted by pdfminer
  • font_name: name of the font extracted from original_font
  • code: font code as provided by pdfminer
  • bold: 1 if the text is bold, 0 otherwise
  • italic: 1 if the text is italic, 0 otherwise
  • font_size: size of the text in points
  • masked: text with numeric content replaced by #
  • frequency_hist: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
  • len_text: number of characters
  • n_tokens: number of words
  • tag: tag for key-value pair extractions, indicating keys or values based on simple heuristics
  • box: box extracted by pdfminer Layout Analysis
  • in_element_ids: IDs of surrounding visual elements such as rectangles or lists, stored as a list [left, right, top, bottom]. A value of -1 indicates that there is no adjacent visual element.
  • in_element: derived from in_element_ids; "rectangle" if the element lies inside a visual rectangle, "none" otherwise.
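
To make some of the derived columns concrete, the following sketch recomputes pos_x, pos_y, masked, len_text and n_tokens for a single element. It is not the project's implementation, and derive_features is a hypothetical name. It assumes the centre is the midpoint of the bounding box, that every digit is replaced one-for-one by #, and that tokens are whitespace-separated; the converter itself may compute these values differently.

```python
import re


def derive_features(text, x_0, x_1, y_0, y_1):
    """Recompute a few derived columns for one PDF element (illustrative only)."""
    return {
        # Assumption: centre coordinates are the midpoints of the box edges.
        "pos_x": (x_0 + x_1) / 2,
        "pos_y": (y_0 + y_1) / 2,
        # Assumption: each digit is replaced by exactly one '#'.
        "masked": re.sub(r"\d", "#", text),
        "len_text": len(text),
        # Assumption: words are separated by whitespace.
        "n_tokens": len(text.split()),
    }


row = derive_features("Invoice 2021-04", 10.0, 110.0, 700.0, 712.0)
print(row["masked"])  # Invoice ####-##
```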

Additionally, a dictionary is returned containing the following entries, which can be used to transform the absolute CSV coordinates:

  • x0: Left x page crop box coordinate
  • x1: Right x page crop box coordinate
  • y0: Top y page crop box coordinate
  • y1: Bottom y page crop box coordinate
  • x0page: Left x page coordinate
  • x1page: Right x page coordinate
  • y0page: Top y page coordinate
  • y1page: Bottom y page coordinate

When using convert(), the dataframe and the page characteristics are returned together in one dictionary. The dataframe is stored as "content", the page characteristics as "media_boxes", the textual content as "text" and the number of pages as "page_count".
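
One possible use of the page coordinates is to rescale an element position to a page-relative value. The sketch below is an assumption about how such a transformation could look, not a method provided by the converter. normalize_point is a hypothetical helper, and box stands for one dictionary with the x0page, x1page, y0page and y1page entries listed above; how "media_boxes" is organised across multiple pages is not specified here.

```python
def normalize_point(pos_x, pos_y, box):
    """Map a point to coordinates relative to the page box (illustrative only).

    `box` is assumed to hold the x0page, x1page, y0page and y1page entries
    of a single page. Along each axis, 0.0 corresponds to the x0page/y0page
    edge and 1.0 to the x1page/y1page edge.
    """
    width = box["x1page"] - box["x0page"]
    height = box["y1page"] - box["y0page"]
    if width == 0 or height == 0:
        raise ValueError("page box has zero width or height")
    return (
        (pos_x - box["x0page"]) / width,
        (pos_y - box["y0page"]) / height,
    )


# Example values chosen for illustration (a 612 x 792 point page).
page_box = {"x0page": 0.0, "x1page": 612.0, "y0page": 0.0, "y1page": 792.0}
print(normalize_point(306.0, 198.0, page_box))  # (0.5, 0.25)
```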

Acknowledgements

Authors

  • Michael Benedikt Aigner
  • Florian Preis
