pdf-tables-excel's Introduction

PDF To Excel Table Extractor using API + Python

This is a modified version of the example API script from https://pdftables.com/.

I needed to extract loads of tables from a few years' worth of bank statements for a bank account I no longer have access to.

Instead of phoning up the customer help line (Zzzz) or having to manually copy + paste this info I thought I'd put my limited Python skillz to use (Shout out to Al Sweigart - https://automatetheboringstuff.com/).

Hopefully someone will find this script useful...

Quick Start Guide

  1. Download the convert.py script
  2. Sign up for an account at https://pdftables.com/join to get your API key (50 free credits)
  3. Add your API key to line 7, which is the API variable
  4. Update line 15 to reference the folder where you saved this file e.g. /home/user/desktop/convert-files/
  5. Run it from a terminal with "$ cd /home/user/desktop/convert-files/" followed by "$ python3 convert.py", or from the Python IDE of your choice
  6. Excel versions of the PDFs will appear in the same folder as the original files.
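
For orientation, the kind of loop the script runs can be sketched roughly as below. This is a minimal illustration rather than convert.py itself: convert_folder and its parameters are made-up names, and it assumes the Client(...).xlsx(input, output) call shown in the PDFTables example script.

```python
import os


def convert_folder(client, folder):
    """Convert every file ending in .PDF inside `folder` to .xlsx.

    `client` is anything with an xlsx(input_path, output_path) method,
    for example pdftables_api.Client("YOUR-API-KEY").
    Returns the list of output paths that were requested.
    """
    converted = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".PDF"):  # capitalised extension only
            continue
        pdf_path = os.path.join(folder, name)
        xlsx_path = os.path.splitext(pdf_path)[0] + ".xlsx"
        client.xlsx(pdf_path, xlsx_path)  # costs one credit per page
        converted.append(xlsx_path)
    return converted


# Hypothetical usage (needs pdftables_api installed and a real key):
#   import pdftables_api
#   client = pdftables_api.Client("YOUR-API-KEY")
#   convert_folder(client, "/home/user/desktop/convert-files/")
```

Passing the client in as an argument is just a convenience for this sketch, so the loop can be tried with a stand-in object without spending credits.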

FAQs

Will this work on Windows/Mac/Linux?

The underlying Python code will work for sure; my bash script wrapper perhaps not. I currently use Ubuntu (and had to edit how .sh files execute via double click) so I'm not sure how other systems and Python work together. My limited experience is that Mac is very similar as it's UNIX-based, whereas the Windows command line and installation are slightly different.

How many credits does 1 PDF cost?

It's actually done on a page by page basis, so 1 page = 1 credit. I thought this didn't seem particularly fair to be honest, but then realised someone could concatenate a massive PDF with hundreds of pages and potentially rinse their servers at the cost of 1 credit if it was done per file...

What dependencies do I need to install?

The script needs the os, requests and pdftables_api modules. os is part of the Python standard library, so there is nothing to install for it. requests might already be installed depending on your Python version/installation; if not, install it from the Python Package Index (PyPI) using pip3. pdftables_api is slightly different...
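
A quick way to see which of these modules are missing before running the script is sketched below. It only uses importlib from the standard library; missing_modules is a made-up helper name, not part of convert.py.

```python
import importlib.util


def missing_modules(names=("os", "requests", "pdftables_api")):
    """Return the module names from `names` that Python cannot find."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Prints the modules that still need installing, or "nothing".
print("Missing:", missing_modules() or "nothing")
```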

I can't install pdftables_api using PIP / the Python Package Index (PyPI) repository...

I had to download it and then install it from https://github.com/pdftables/python-pdftables-api/tree/master/pdftables_api.

The script encounters an error; do you have any other troubleshooting advice?

Four tips I found when creating this:

  1. Check the file extension name. The current script matches capitalised PDF. Try changing it to lowercase or updating the files.

  2. Check which version of Python and PIP you are running. Assuming you also have an old version of Python installed, you should specify PIP3 when installing anything, and execute Python3 just to be sure.

  3. Check the trailing slashes on folder locations and make sure they are correct. It's easy to get confused between forward and backward slashes, and which folder path you are trying to reference (not file!).

  4. Lastly, check to see the number of credits left and make sure your PDF actually contains tables/actually exists in the directory. Testing a few different PDF examples is a good way of isolating whether the problem is with your code or file.
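
Some of these checks can be automated. The sketch below is one possible pre-flight check, assuming the script looks for files ending in .PDF inside a single folder; preflight is a hypothetical helper, not something convert.py currently contains.

```python
import os
import sys


def preflight(folder, extension=".PDF"):
    """Return a list of problems found; an empty list means nothing obvious is wrong."""
    problems = []
    if sys.version_info < (3,):
        problems.append("Running under Python 2; use python3.")
    if not os.path.exists(folder):
        problems.append("Folder does not exist: " + folder)
    elif not os.path.isdir(folder):
        problems.append("Path is a file, not a folder: " + folder)
    elif not any(n.endswith(extension) for n in os.listdir(folder)):
        problems.append("No files ending in " + extension + " in " + folder)
    return problems


# Hypothetical usage:
#   for problem in preflight("/home/user/desktop/convert-files/"):
#       print(problem)
```

Building file paths with os.path.join(folder, name) instead of gluing strings together also means a missing trailing slash on the folder stops mattering.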

Is there a script for just converting one file?

Yeah, there is an example script from the official site (I've uploaded it as single-pdf-example.py) but I'd just include one PDF within the folder to achieve the same thing. The blog post is here - https://pdftables.com/blog/pdf-to-excel-with-python

Are there any other (free) Python libraries or methods to extract tables from PDFs to Excel?

I was able to find a few examples which were open source/free, however I couldn't get them to work out of the box very easily: http://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/.

For me tabula-py appeared to be the most promising, however I had issues installing and running the Java layer that it depends on: https://github.com/chezou/tabula-py

How else could this script be improved?

I'm done with it for now so don't have the motivation, but I think it could be improved by making the PDF file extension match work for both lower and upper case for ease of use.
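
A case-insensitive match could look something like this (is_pdf is a made-up name for illustration, not a function in the current script):

```python
def is_pdf(filename):
    """True for names ending in .pdf, .PDF, .Pdf and so on."""
    return filename.lower().endswith(".pdf")


# Example: keep only the PDFs from a list of file names.
names = ["jan.PDF", "feb.pdf", "notes.txt"]
pdfs = [n for n in names if is_pdf(n)]
```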

Also there could be more done in terms of credits usage e.g. before and after, how much running it will cost etc. and maybe a confirmation prompt. And of course specifying inputs and outputs.
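
A rough sketch of the before/after credit report and confirmation prompt is below. It assumes the client object offers remaining() (which, as far as I can tell from the pdftables_api README, returns the pages left on the account) alongside xlsx(); convert_with_report and the confirm parameter are hypothetical names. It does not estimate the cost up front, since that would need the page count of each PDF.

```python
def convert_with_report(client, jobs, confirm=input):
    """Convert (pdf_path, xlsx_path) pairs after a yes/no prompt.

    Returns the number of credits used, or None if the user declined.
    """
    before = client.remaining()
    question = "%d credits left, convert %d file(s)? [y/N] " % (before, len(jobs))
    if confirm(question).strip().lower() != "y":
        return None
    for pdf_path, xlsx_path in jobs:
        client.xlsx(pdf_path, xlsx_path)
    used = before - client.remaining()
    print("Used %d credits, %d left." % (used, before - used))
    return used


# Hypothetical usage (needs pdftables_api installed and a real key):
#   import pdftables_api
#   client = pdftables_api.Client("YOUR-API-KEY")
#   convert_with_report(client, [("statement.PDF", "statement.xlsx")])
```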
