camelot-dev / excalibur

A web interface to extract tabular data from PDFs

Home Page: https://excalibur-py.readthedocs.io

License: MIT License

Python 28.62% CSS 12.35% JavaScript 14.78% HTML 32.56% Makefile 0.77% SCSS 10.92%

pdf table extract for-humans

excalibur's Issues

close_fds is not supported on Windows platforms if you redirect stdin/stdout/stderr

'ascii' codec can't encode character

I tried to convert some tables in a Polish government document containing Polish accented characters.
This failed with the error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u015a' in position 346: ordinal not in range(128)
The character in question is Ś, but there are many more such characters in that file.

If something needs to be reconfigured to parse such characters, I believe Excalibur shouldn't raise an error but rather suggest the change :)

I can share the file if needed :)

UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 9239: character maps to <undefined>

OS: Win 10
I keep on getting this error when I'm attempting to view data:
https://i.imgur.com/waFbLPJ.png

Add link to installation of external deps in the contributor's guide

Support to parallel process multiple PDFs from CLI using a rule.json

Possible command: $ excalibur distribute --rule rule.json directory/
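
A rough sketch of what such a command might do internally. Neither `distribute` nor the helper names below exist in Excalibur; `extract_one` is a hypothetical callable supplied by the caller that would wrap the real table extraction for one PDF.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def load_rule(rule_path):
    """Load an extraction rule that was saved as JSON."""
    with open(rule_path, encoding="utf-8") as fh:
        return json.load(fh)


def distribute(directory, rule, extract_one, max_workers=4):
    """Apply the hypothetical `extract_one(pdf_path, rule)` to every PDF.

    Returns a dict mapping each file name to the worker's result.
    """
    pdfs = find_pdfs(directory)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda p: extract_one(p, rule), pdfs))
    return {p.name: r for p, r in zip(pdfs, results)}
```

Threads keep the sketch short; since table extraction is likely CPU-bound, a process pool may be the better fit in practice.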

version `GLIBC_2.25' not found

While installing from the Linux executable on Ubuntu 14.04, the above error message is reported. I have also tried to update the library, but I understand that an upgrade to Ubuntu 16 is required. It would be great to get a fix for Ubuntu 14.04.

ERROR:root:file has not been decrypted

While trying to upload a PDF file, the following error is thrown.

ERROR:root:file has not been decrypted
Traceback (most recent call last):
File "c:\py3\lib\site-packages\excalibur\tasks.py", line 57, in split
save_page(file.filepath, file.page_number)
File "c:\py3\lib\site-packages\excalibur\utils\task.py", line 10, in save_page
page = infile.getPage(page_number - 1)
File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1176, in getPage
self._flatten()
File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1505, in _flatten
catalog = self.trailer["/Root"].getObject()
File "c:\py3\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
return dict.__getitem__(self, key).getObject()
File "c:\py3\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
return self.pdf.getObject(self).getObject()
File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1617, in getObject
raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
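
The traceback suggests the uploaded PDF is encrypted and the page was requested before the reader was decrypted. A minimal sketch of the usual guard, assuming a reader that behaves like PyPDF2 1.x's `PdfFileReader` (`isEncrypted`, `decrypt`, `getPage`); the helper name `get_page` is made up for illustration and is not part of Excalibur.

```python
def get_page(reader, page_number, password=""):
    """Return page `page_number` (1-based), decrypting the document first if needed.

    `reader` is expected to behave like PyPDF2 1.x's PdfFileReader.
    """
    if reader.isEncrypted:
        # decrypt() returns 0 when the password does not match.
        if reader.decrypt(password) == 0:
            raise ValueError("wrong password: the PDF could not be decrypted")
    return reader.getPage(page_number - 1)


# Possible usage with PyPDF2 1.x installed (not executed here):
#   from PyPDF2 import PdfFileReader
#   with open("document.pdf", "rb") as fh:
#       page = get_page(PdfFileReader(fh), 1)
```

Many PDFs are encrypted with an empty user password, so trying `""` first is often enough; files using an encryption scheme the library does not support would still fail.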

Refactor workspace.js into separate files

Is there a way to run excalibur from command line or batch mode?

I am trying to run it on Mac. I am ok to also set it up on my linux box if that works better.

Please wait while the pages are converted to images. Refresh again in some time

AttributeError Nonetype for 'job_id'

Here's the print out of the problem. I'm getting a 500 internal server error. I'm in python 3.7. Camelot works fine for me (I can parse, read, export no problems). Just have a problem with running Excalibur.

Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [06/Nov/2018 21:30:22] "GET / HTTP/1.1" 302 -
[2018-11-06 21:30:22,663] ERROR in app: Exception on /files [GET]
Traceback (most recent call last):
File "d:\python3.7\lib\site-packages\flask\app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "d:\python3.7\lib\site-packages\flask\app.py", line 1815, in full_dispatch_request
rv = self.handle_user_exception(e)
File "d:\python3.7\lib\site-packages\flask\app.py", line 1718, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "d:\python3.7\lib\site-packages\flask\_compat.py", line 35, in reraise
raise value
File "d:\python3.7\lib\site-packages\flask\app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "d:\python3.7\lib\site-packages\flask\app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "d:\python3.7\lib\site-packages\excalibur\www\views.py", line 39, in files
'job_id': job.job_id,
AttributeError: 'NoneType' object has no attribute 'job_id'

Here's the code that contains the 'job_id' line 39 from the www\views.py file

@views.route('/files', methods=['GET', 'POST'])
def files():
    if request.method == 'GET':
        files_response = []
        session = Session()
        for file in session.query(File).order_by(File.uploaded_at.desc()).all():
            job = session.query(Job).filter(Job.file_id == file.file_id).order_by(Job.started_at.desc()).first()
            files_response.append({
                'file_id': file.file_id,
                'job_id': job.job_id,
                'uploaded_at': file.uploaded_at.strftime('%Y-%m-%dT%H:%M:%S'),
                'filename': file.filename
            })

Any thoughts or am I making some stupid mistakes here?
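
The error means that `.first()` returned `None`, which SQLAlchemy does when no `Job` row matches the file. A defensive sketch of building one response entry, assuming that a file without a job should simply report `job_id` as `None`; the helper name `file_entry` is hypothetical and this is not the project's actual fix.

```python
def file_entry(file, job):
    """Build one entry of the /files response; `job` may be None."""
    return {
        'file_id': file.file_id,
        # A file whose job row was never written yields job=None here.
        'job_id': job.job_id if job is not None else None,
        'uploaded_at': file.uploaded_at.strftime('%Y-%m-%dT%H:%M:%S'),
        'filename': file.filename,
    }
```

Inside the view, `files_response.append(file_entry(file, job))` would replace the inline dictionary.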

Failure when "all" or 1-end is selected

Excalibur struggles on large PDFs (20 pages or more) when I select the "all" or "1-end" options.
I get the following warnings:
UserWarning: No tables found on page-144 [lattice.py:399]
UserWarning: No tables found on page-144 [stream.py:447]
UserWarning: No tables found in table area 1 [stream.py:361]
UserWarning: No tables found in table area 1 [stream.py:361]
UserWarning: No tables found in table area 2 [stream.py:361]

However if I manually select the pages it works fine. Is there a way to solve this?
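
Since manually selected pages work, one possible workaround is to submit the document as several smaller explicit ranges. This is an untested assumption, not a diagnosed fix. The hypothetical helper below only builds the range strings, in the same `start-end` form used by the page selector.

```python
def page_ranges(first, last, chunk_size=20):
    """Split pages first..last (inclusive) into 'a-b' strings of at most chunk_size pages."""
    if first < 1 or last < first or chunk_size < 1:
        raise ValueError("expected 1 <= first <= last and chunk_size >= 1")
    ranges = []
    start = first
    while start <= last:
        end = min(start + chunk_size - 1, last)
        # A chunk holding a single page is written as just that page number.
        ranges.append(f"{start}-{end}" if end > start else str(start))
        start = end + 1
    return ranges


print(page_ranges(1, 45))  # ['1-20', '21-40', '41-45']
```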

Please make sure that Ghostscript is installed

Everything is installed, but the error still occurs.

Add tk along with ghostscript in the "using pip" section of installation

Add user defined excalibur.cfg and PDFS_FOLDER

Update PyPI description

To "Excalibur: A web interface to extract tabular data from PDFs".

can't change web port

I want to change the web port from 5000 to another port, so I modified the web port in excalibur.cfg and then reset the database, but when I start the web server, the port is still 5000.

ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>

I am unable to share the pdf that is causing this issue. I would like to know what I can do to bypass this error.
Even if it means dropping the "offending char". Getting some of the data is better than getting none of the data.
I'd be ecstatic if this is a PEBKAC issue, so please don't discount that.

Using the latest download of Excalibur and Python 3.7.3 (I think). I'm only using the web UI to do this. I don't think I could handle coding it without some hand-holding.

This is happening on several pages of a very large pdf (700+ pages). But not all of them. So the file can be parsed. Just not the important portion, which is most of the file.

I DID just realize that it is creating some of the output files (Excel, CSV, and JSON), but not HTML. Since I only really need the CSV or Excel, I might be good. I will keep pushing on the remainder of the file (it's slow to handle 100 pages at a time).

127.0.0.1 - - [05/Aug/2019 16:55:09] "GET /jobs/fbfeb974-5f3d-4991-b26c-983560640de5 HTTP/1.1" 200 -
ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>
Traceback (most recent call last):
  File "excalibur\executors\sequential_executor.py", line 12, in execute_command

  File "subprocess.py", line 336, in check_call
  File "subprocess.py", line 317, in call
  File "subprocess.py", line 769, in __init__
  File "subprocess.py", line 1172, in _execute_child
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "excalibur\tasks.py", line 161, in extract
  File "lib\site-packages\camelot\core.py", line 479, in export
  File "lib\site-packages\camelot\core.py", line 437, in _write_file
  File "lib\site-packages\camelot\core.py", line 394, in to_html
  File "C:\Python37\lib\encodings\cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>
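
The last frames show the HTML export being encoded with cp1252, the default text encoding on many Windows setups, which cannot represent '\ued6f'. Two general options are sketched below. This is not a patch to Camelot, and both helper names are made up for illustration.

```python
def write_text(path, text, encoding="utf-8"):
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding=encoding) as fh:
        fh.write(text)


def to_legacy_bytes(text, encoding="cp1252"):
    """Lossy fallback: characters the codec cannot represent become '?'."""
    return text.encode(encoding, errors="replace")


print(to_legacy_bytes("a\ued6fb"))  # b'a?b'
```

Writing UTF-8 keeps every character. The lossy variant matches the "drop the offending character" request, at the cost of silently altering the data.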

Support listening on all interfaces

As far as I can tell, Excalibur listens on 127.0.0.1 and that can't be changed.

I am attempting to create a Docker container for Excalibur and I would like to have the web server listen on all interfaces to make it work.

I am not completely familiar with the code base, but I accomplished this myself by adding a "web_server_host" value to the [webserver] section of the config file, and changing the call to app.run() in cli.py:

app.run(host=conf.get('webserver', 'web_server_host'), use_reloader=False)

Is this a change you would be interested in adding to the project?
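
A small sketch of how both settings could be read with fallbacks. The `[webserver]` section and `web_server_host` come from the description above; the `web_server_port` key name and the default values are assumptions made for illustration.

```python
import configparser

# Assumed defaults; only the host key is confirmed by the issue text.
DEFAULTS = {"web_server_host": "127.0.0.1", "web_server_port": "5000"}


def webserver_settings(cfg_text):
    """Parse config text and return (host, port), falling back to DEFAULTS."""
    conf = configparser.ConfigParser()
    conf.read_string(cfg_text)
    host = conf.get("webserver", "web_server_host",
                    fallback=DEFAULTS["web_server_host"])
    port = conf.getint("webserver", "web_server_port",
                       fallback=int(DEFAULTS["web_server_port"]))
    return host, port


host, port = webserver_settings("[webserver]\nweb_server_host = 0.0.0.0\n")
print(host, port)  # 0.0.0.0 5000
# The values could then be passed on, for example:
#   app.run(host=host, port=port, use_reloader=False)
```

Binding to `0.0.0.0` listens on all IPv4 interfaces, which is what a Docker container usually needs.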

Please make sure that Ghostscript is installed

I am on mac os and have installed Ghostscript by brew. The error is below

File "/Users/.../anaconda3/envs/py37/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
from . import _gsprint as gs
File "/Users/.../anaconda3/envs/py37/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 258, in <module>
raise RuntimeError("Please make sure that Ghostscript is installed")
RuntimeError: Please make sure that Ghostscript is installed

I guess the problem might be that my brew and pip use different paths: excalibur is installed in '/Users/.../anaconda3/envs/py37/bin/excalibur' while gs is installed in '/usr/local/bin/gs'.
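
A quick diagnostic sketch. The error is raised from a module that wraps Ghostscript, so the shared library probably has to be discoverable, not only the `gs` executable. That the wrapper looks the library up through `ctypes` is an assumption here, and `locate_ghostscript` is a made-up helper name.

```python
import ctypes.util
import shutil


def locate_ghostscript():
    """Report where (if anywhere) the Ghostscript executable and shared library are found."""
    return {
        "executable": shutil.which("gs"),
        "library": ctypes.util.find_library("gs"),
    }


if __name__ == "__main__":
    for kind, location in locate_ghostscript().items():
        print(f"{kind}: {location or 'not found'}")
```

If the executable is found but the library is not, the Python environment in use (here a conda environment) may not be searching the directory where brew placed the library.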

Extract tables from webpage

Tables can be extracted from a webpage using pandas.read_html. We can create an interface 1) simple: where user can submit the link of the webpage and download extracted tables or 2) fancy: where user can submit the link of the webpage, see detected tables (on an image of the webpage?), un-select the tables they don't want and then download extracted tables.
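
To illustrate the "simple" option without extra dependencies, here is a dependency-free stand-in for `pandas.read_html` built on the standard library's `html.parser`. It is only a sketch: it ignores nested tables, `colspan` and `rowspan`, and the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is only kept while a cell is open.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

For real pages, `pandas.read_html` returns a list of DataFrames and handles far more edge cases; the sketch only shows the shape of the data the interface would offer for download.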

Hosting excalibur on the web

Have you thought about hosting Excalibur for ease of use by non-technical people?

Was expecting tryexcalibur.com to be a hosted version but found a landing page instead.

Let me know if I can help.

Convert "grant all on" in the mysql setup section to uppercase

Unable to process PDFs, throws GhostscriptError: -100 on refresh page (mac OS)

I've been trying to work in a private repository first. I forked the repository by duplicating it and setting it to private, following the top answer here (https://stackoverflow.com/questions/10065526/github-how-to-make-a-fork-of-public-repository-private). Whenever I try to process PDFs, I get a Ghostscript error on macOS.

I am using:
Python 3.7
GPL Ghostscript 9.27 [via homebrew]
opencv: stable 4.1.0 (bottled) [via homebrew]
numpy 1.14.5
excalibur 0.4.2

I'm simply trying to run it locally via flask run and have already set up my own MySQL database as per the Excalibur documentation. It throws the error after I upload the PDF and go to the next page (it makes it to the refresh page, then throws the error there). I've tried many different PDFs, including empty pages, with the same result.

127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/css/vendor/jquery-ui.structure.min.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/css/workspace.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/vendor/jquery.selectareas.min.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/workspace.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/vendor/jquery-ui.min.js HTTP/1.1" 200 -
*GPL Ghostscript 9.27: Unrecoverable error, exit code 1
ERROR:root:-100
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/excalibur/tasks.py", line 44, in split
with Ghostscript(*gs_call, stdout=null) as gs:
File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 93, in Ghostscript
stderr=kwargs.get('stderr', None))
File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 39, in __init__
rc = gs.init_with_args(instance, args)
File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 169, in init_with_args
raise GhostscriptError(rc)
camelot.ext.ghostscript._gsprint.GhostscriptError: -100

Any help solving this would be appreciated.

Row separator like column separator

The column separator is an amazing feature!

Taking a logical step forward, how about horizontal row separators too? To help avoid consecutive table entries mixing in with each other.

With both column and row separators at hand, this will help the user parse their table perfectly.

Ref: A similar request I'd posted on Tabula that had garnered a lot of attention: https://github.com/tabulapdf/tabula/issues/409

Also, I understand Excalibur is a frontend to Camelot. So the column separator feature might be there in Camelot and you've brought it to user interface here. Perhaps there's no row separator feature in Camelot itself yet? Can someone clarify? Then I'll go post this request there.

Feature request: Option to Link PDF URL, refreshes each time download page is accessed

I'd like to request an option to link PDFs (since PDF data often updates, it's much easier to keep data updated by linking than by manually uploading each time).

When a PDF is linked and the rules have been set for that particular PDF, whenever the download page is accessed (example download page link: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f), Excalibur automatically downloads the PDF from the URL, re-processes it with the predefined rule, and displays the tables of extracted data. This should work just by accessing the download page.

I can perhaps hire someone to get this done if you're willing to add it to the main project.

So steps are:

1. Link to PDF (example: https://www.lcfcu.org/home/fiFiles/static/documents/rates.pdf)
2. Set rules for the PDF and save
3. Access the download page for that PDF (example: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f)
4. Excalibur automatically fetches the PDF from the link
5. Extracts data from the PDF based on the predefined rule
6. Displays like so:

So in the future, whenever I detect that the PDF has changed, I can access the download page link and it will repeat the entire process (steps 3-6).

Again, please let me know if you're open to having this change contributed to the main source; if so, I can get it coded. I feel this change is extremely important, since many PDFs on the web change frequently, which makes this feature very useful.

Download is not working

S05MoldedCaseCircuitBreakers.pdf

Hi,

The file will not download.

The upload & Table extraction seem to be working. When I press the download button, it goes to the refresh screen. The file never downloads.

Attached is the PDF I was testing with.

I hope this helps.

Cheers,
Mitch Hunt

windows install threw an unhandled error

running as administrator on Windows Version 10.0.17763 Build 17763 after installing python-3.7.1-amd64

PS C:\excalibur> pip install excalibur-py
PS C:\excalibur> excalibur initdb
Error ModuleNotFoundError: No module named 'chardet'

full output below

PS C:\WINDOWS\system32> cd C:\excalibur
PS C:\excalibur> pip install excalibur-py
Collecting excalibur-py
Downloading https://files.pythonhosted.org/packages/9e/fe/6d60ad37075c89136e614cefa494aff4701e97e5121193c09a70d3c827b0/excalibur_py-0.4.0-py2.py3-none-any.whl (1.5MB)
100% |████████████████████████████████| 1.5MB 1.9MB/s
Collecting camelot-py[cv]>=0.2.3 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/65/e3/75842357e53f675d60b093c182d254c37db5b1d6144d12703af0a433f7f5/camelot-py-0.4.0.tar.gz
Collecting configparser<3.6.0,>=3.5.0 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/7c/69/c2ce7e91c89dc073eb1aa74c0621c3eefbffe8216b3f9af9d3885265c01c/configparser-3.5.0.tar.gz
Collecting Flask>=1.0.2 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl (91kB)
100% |████████████████████████████████| 92kB 3.8MB/s
Collecting SQLAlchemy>=1.2.12 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/e2/0a/05b7d13618ad41c108a6c2b886af83bf9bb7e35f8951227abb18b1330745/SQLAlchemy-1.2.14.tar.gz (5.7MB)
100% |████████████████████████████████| 5.7MB 1.7MB/s
Collecting Click>=7.0 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
100% |████████████████████████████████| 81kB 4.1MB/s
Collecting celery>=4.1.1 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/e8/58/2a0b1067ab2c12131b5c089dfc579467c76402475c5231095e36a43b749c/celery-4.2.1-py2.py3-none-any.whl (401kB)
100% |████████████████████████████████| 409kB 1.1MB/s
Collecting numpy>=1.13.3 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/00/0e/5a8c34adb97fc1cd6636d78050e575945e874c8516d501421d5a0f377a6c/numpy-1.15.4-cp37-none-win_amd64.whl (13.5MB)
100% |████████████████████████████████| 13.5MB 1.3MB/s
Collecting openpyxl>=2.5.8 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/08/8a/509eb6f58672288da9a5884e1cc7e90819bc8dbef501161c4b40a6a4e46b/openpyxl-2.5.12.tar.gz (173kB)
100% |████████████████████████████████| 174kB 1.8MB/s
Collecting pandas>=0.23.4 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/58/a8/03e5fe0edbc522e46cb27df2abfb4266814129253d8462f38bc704a76a2a/pandas-0.23.4-cp37-cp37m-win_amd64.whl (7.9MB)
100% |████████████████████████████████| 7.9MB 1.6MB/s
Collecting pdfminer.six>=20170720 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/8a/fd/6e8746e6965d1a7ea8e97253e3d79e625da5547e8f376f88de5d024bacb9/pdfminer.six-20181108-py2.py3-none-any.whl (5.6MB)
100% |████████████████████████████████| 5.6MB 1.7MB/s
Collecting PyPDF2>=1.26.0 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
100% |████████████████████████████████| 81kB 4.8MB/s
Collecting opencv-python>=3.4.2.17 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/65/b0/1b098827a7a879546363e5c976418850e6d9bebf7662f32ddefd30ae9c2c/opencv_python-3.4.4.19-cp37-cp37m-win_amd64.whl (38.3MB)
100% |████████████████████████████████| 38.3MB 302kB/s
Collecting Jinja2>=2.10 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl (126kB)
100% |████████████████████████████████| 133kB 2.8MB/s
Collecting Werkzeug>=0.14 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl (322kB)
100% |████████████████████████████████| 327kB 1.8MB/s
Collecting itsdangerous>=0.24 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting billiard<3.6.0,>=3.5.0.2 (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/87/ac/9b3cc065557ad5769d0626fd5dba0ad1cb40e3a72fe6acd3d081b4ad864e/billiard-3.5.0.4.tar.gz (150kB)
100% |████████████████████████████████| 153kB 2.2MB/s
Collecting kombu<5.0,>=4.2.0 (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/97/61/65838c7da048e56d549e358ac19c0979c892e17dc6186610c49531d35b70/kombu-4.2.1-py2.py3-none-any.whl (177kB)
100% |████████████████████████████████| 184kB 2.2MB/s
Collecting pytz>dev (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/f8/0e/2365ddc010afb3d79147f1dd544e5ee24bf4ece58ab99b16fbb465ce6dc0/pytz-2018.7-py2.py3-none-any.whl (506kB)
100% |████████████████████████████████| 512kB 1.4MB/s
Collecting jdcal (from openpyxl>=2.5.8->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/a0/38/dcf83532480f25284f3ef13f8ed63e03c58a65c9d3ba2a6a894ed9497207/jdcal-1.4-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl>=2.5.8->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Collecting python-dateutil>=2.5.0 (from pandas>=0.23.4->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/74/68/d87d9b36af36f44254a8d512cbfc48369103a3b9e474be9bdfe536abfc45/python_dateutil-2.7.5-py2.py3-none-any.whl (225kB)
100% |████████████████████████████████| 235kB 2.2MB/s
Collecting sortedcontainers (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/13/f3/cf85f7c3a2dbd1a515d51e1f1676d971abe41bba6f4ab5443240d9a78e5b/sortedcontainers-2.1.0-py2.py3-none-any.whl
Collecting pycryptodome (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/fc/99/ed80fd36eebe26914bd8aae4ac70fcef2d4ad94453981c171fe791629146/pycryptodome-3.7.2-cp37-cp37m-win_amd64.whl (8.0MB)
100% |████████████████████████████████| 8.0MB 1.5MB/s
Collecting six (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/44/6e/41ac9266e3db762dfd9089f6b0d2298c84160f54ef2a7257c17b0e7ec2ec/MarkupSafe-1.1.0-cp37-cp37m-win_amd64.whl
Collecting amqp<3.0,>=2.1.4 (from kombu<5.0,>=4.2.0->celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/cf/12d4611fc67babd4ae250c9e8249c5650ae1933395488e9e7e3562b4ff24/amqp-2.3.2-py2.py3-none-any.whl (48kB)
100% |████████████████████████████████| 51kB 2.8MB/s
Collecting vine>=1.1.3 (from amqp<3.0,>=2.1.4->kombu<5.0,>=4.2.0->celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/10/50/5b1ebe42843c19f35edb15022ecae339fbec6db5b241a7a13c924dabf2a3/vine-1.1.4-py2.py3-none-any.whl
Installing collected packages: Click, numpy, jdcal, et-xmlfile, openpyxl, six, python-dateutil, pytz, pandas, sortedcontainers, pycryptodome, pdfminer.six, PyPDF2, opencv-python, camelot-py, configparser, MarkupSafe, Jinja2, Werkzeug, itsdangerous, Flask, SQLAlchemy, billiard, vine, amqp, kombu, celery, excalibur-py
Running setup.py install for et-xmlfile ... done
Running setup.py install for openpyxl ... done
Running setup.py install for PyPDF2 ... done
Running setup.py install for camelot-py ... done
Running setup.py install for configparser ... done
The script flask.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Running setup.py install for SQLAlchemy ... done
Running setup.py install for billiard ... done
The script celery.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
The script excalibur.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed Click-7.0 Flask-1.0.2 Jinja2-2.10 MarkupSafe-1.1.0 PyPDF2-1.26.0 SQLAlchemy-1.2.14 Werkzeug-0.14.1 amqp-2.3.2 billiard-3.5.0.4 camelot-py-0.4.0 celery-4.2.1 configparser-3.5.0 et-xmlfile-1.0.1 excalibur-py-0.4.0 itsdangerous-1.1.0 jdcal-1.4 kombu-4.2.1 numpy-1.15.4 opencv-python-3.4.4.19 openpyxl-2.5.12 pandas-0.23.4 pdfminer.six-20181108 pycryptodome-3.7.2 python-dateutil-2.7.5 pytz-2018.7 six-1.11.0 sortedcontainers-2.1.0 vine-1.1.4
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
PS C:\excalibur> excalibur initdb
Creating new Excalibur configuration file in: C:\Users\user1/excalibur/excalibur.cfg
Traceback (most recent call last):
File "c:\users\user1\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\user1\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\user1\AppData\Local\Programs\Python\Python37\Scripts\excalibur.exe\__main__.py", line 5, in <module>
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\excalibur\cli.py", line 10, in <module>
from .tasks import split, extract
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\excalibur\tasks.py", line 10, in <module>
import camelot
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\__init__.py", line 8, in <module>
from .io import read_pdf
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\io.py", line 4, in <module>
from .handlers import PDFHandler
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\handlers.py", line 9, in <module>
from .parsers import Stream, Lattice
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\__init__.py", line 3, in <module>
from .stream import Stream
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\stream.py", line 11, in <module>
from .base import BaseParser
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\base.py", line 5, in <module>
from ..utils import get_page_layout, get_text_objects
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\utils.py", line 10, in <module>
from pdfminer.pdfparser import PDFParser
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\pdfparser.py", line 4, in <module>
from .psparser import PSStackParser
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\psparser.py", line 11, in <module>
from .utils import choplist
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\utils.py", line 13, in <module>
import chardet # For str encoding detection in Py3
ModuleNotFoundError: No module named 'chardet'

RuntimeError: Please make sure that Ghostscript is installed (Tried all closed issues but no use)

Python parameters to get the result found using the UI

Is it possible to get the input parameters obtained by using the UI to later use it on a camelot python script to get the same result?

Update FAQ?

With questions that come up for first time users.

Error on Windows: OSError: exception: access violation writing 0x0967BC48 while running python-Excalibur code

Camelot/Excalibur throws an OSError: access violation writing 0x0967BC48
OS: Windows 10
Python version: 3.7

Below is the output screen:

Debug mode: off
Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [02/Jul/2019 19:03:54] "GET /files HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:10] "POST /files HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:10] "GET /workspaces/59f1c984-31fa-4ade-b944-770072f82827 HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:19] "GET /workspaces/59f1c984-31fa-4ade-b944-770072f82827 HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:20] "GET /static/favicon.ico HTTP/1.1" 200 -
ERROR:root:exception: access violation writing 0x0967BC48
Traceback (most recent call last):
File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\excalibur\tasks.py", line 44, in split
with Ghostscript(*gs_call, stdout=null) as gs:
File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 93, in Ghostscript
stderr=kwargs.get('stderr', None))
File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 39, in __init__
rc = gs.init_with_args(instance, args)
File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 167, in init_with_args
rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x0967BC48

Staying in refresh page and not displaying document

127.0.0.1 - - [08/May/2019 17:29:26] "GET /workspaces/35dd602d-2eec-44e7-a51f-36f6b4e4862b HTTP/1.1" 200 -
ERROR:root:exception: access violation writing 0x0E80C1B0
Traceback (most recent call last):
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\excalibur\tasks.py", line 44, in split
with Ghostscript(*gs_call, stdout=null) as gs:
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 93, in Ghostscript
stderr=kwargs.get('stderr', None))
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 39, in __init__
rc = gs.init_with_args(instance, args)
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 167, in init_with_args
rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x0E80C1B0

Command not found error on ubuntu

Unable to use the camelot and excalibur commands from the command line on Ubuntu.

Processing PDF - error message

I am on Ubuntu 14.04 and installed excalibur-py using pip. While processing the following PDF (which is also used in camelot-py, where it works well), the system returns the following message:

ERROR:root:'Table' object has no attribute '_bbox'
Traceback (most recent call last):
File "/home/sandeep/anaconda3/lib/python3.6/site-packages/excalibur/tasks.py", line 96, in split
x1, y1, x2, y2 = tables[0]._bbox
AttributeError: 'Table' object has no attribute '_bbox'
Refreshing does not change anything. If I click on Excalibur, I get this message back:
"The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."
background_lines.pdf

pdf file used is attached

GhostscriptNotFound: Please make sure that Ghostscript is installed and available on the PATH environment variable

OS: Windows 10

After downloading excalibur, I had to download/install ghostscript (this should be stated in instructions).

After installing ghostscript, the PATH needs to be set. After setting PATH, the error still exists:

PATH set: C:\Program Files\gs\gs9.26\bin\gswin64c.exe (I've restarted)
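
One thing worth checking: the value shown ends in gswin64c.exe, which is a file, while PATH is a list of directories, so the entry would normally be `C:\Program Files\gs\gs9.26\bin`. Whether that alone explains the error is not established. The sketch below, with a made-up helper name, lists PATH entries that point at files.

```python
import os


def path_entries_that_are_files(path_value=None):
    """Return PATH entries that point at a file; PATH entries should be directories."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Entries are sometimes wrapped in double quotes on Windows.
    entries = [e.strip('"') for e in path_value.split(os.pathsep) if e]
    return [e for e in entries if os.path.isfile(e)]


if __name__ == "__main__":
    for entry in path_entries_that_are_files():
        print("not a directory:", entry)
```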

Table area selection shows up as white instead of transparent on Windows

Add celery support

Add option to specify password during upload

#22

GhostScript error only for lattice when run through command line

This is my code:

import camelot
import pandas as pd
import string

class CommandLine:
    def __init__(self):
        # Raw string so the backslashes in the Windows path are not read as escape sequences.
        tables = camelot.read_pdf(filepath=r"C:\Users\Ikhanna\OneDrive - STEPSTONE GROUP LP\Documents\PortCo Ari\Adam\Q1.2018.CAB.FS.Rpt.Orion Euro RE IV.pdf", pages="1")

if __name__ == '__main__':
    app = CommandLine()

This runs fine in a Jupyter notebook, but when I convert this file to a .ipynb or .py and run it through cmd, it throws the error RuntimeError: Please make sure that Ghostscript is installed. It doesn't give this error if I use the stream flavour. I am not sure why this is happening.

Create Windows executable

Excalibur's data directory is created in HOME

I consider it bad form for Excalibur to create a user-visible folder in the home folder (/Users/akx/excalibur on my Mac, for instance).

It'd be better to use e.g. appdirs to figure out the "user data" directory, and create the Excalibur directory there.

ImportError: cannot import name 'TableList'

C:\Program Files\gs\gs9.26\bin>excalibur initdb
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\excalibur.exe\__main__.py", line 5, in <module>
File "c:\programdata\anaconda3\lib\site-packages\excalibur\cli.py", line 10, in <module>
from .tasks import split, extract
File "c:\programdata\anaconda3\lib\site-packages\excalibur\tasks.py", line 11, in <module>
from camelot.core import TableList
ImportError: cannot import name 'TableList'

Show errors on webpage

A lot of users are stuck on the Refresh page without any clue about what's happening. We need to show any errors that happen on the webpage itself.

page : Refresh !!!

Refresh!
Please wait while the pages are converted to images. Refresh again in some time.

data extracted in 2nd,3rd... page is not aligned properly as in 1st page.

When the PDF has 3 pages, the table on the first page is extracted and aligned properly, but the data on the second page is not. Can you please let us know what needs to be modified?

import camelot
# Raw strings keep the backslashes in Windows paths from being read as escape sequences.
tables = camelot.read_pdf(r"D:\drive-download-20180907T081608Z-001\047238.pdf", flavor='stream', pages='1-end')
tables
tables[0].df
tables.export(r"C:\Users\fooo.htm", f='html', compress=True)

Clicking on Clear doesn't remove selections

Excalibur python error

When I downloaded the extracted file, I got a result like the one below. How do I get the actual values?
(cid:40)(cid:90)(cid:3)(cid:86)(cid:85)(cid:3)(cid:91)(cid:79)
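
Output of the form `(cid:N)` usually means the text-extraction layer could not map glyph IDs to Unicode characters, often because the PDF's embedded font lacks that mapping, so the original values generally cannot be recovered from the exported file. A small sketch for flagging affected text, so those pages can be routed to another approach such as OCR; the helper name is made up.

```python
import re

# Matches tokens such as "(cid:40)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_ratio(text):
    """Return the fraction of `text` made up of '(cid:N)' tokens (0.0 to 1.0)."""
    if not text:
        return 0.0
    covered = sum(len(m.group(0)) for m in CID_TOKEN.finditer(text))
    return covered / len(text)


print(cid_ratio("(cid:40)(cid:90)(cid:3)"))  # 1.0
```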

undefined character maps

I'm getting a similar charmap error.

File "c:\users\sebastien.cote\appdata\local\continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2776' in position 246: character maps to <undefined>

Is there any way the program can proceed by skipping the characters it cannot encode?