camelot-dev / excalibur
A web interface to extract tabular data from PDFs
Home Page: https://excalibur-py.readthedocs.io
License: MIT License
close_fds is not supported on Windows platforms if you redirect stdin/stdout/stderr
I tried to convert some tables in a Polish government document containing Polish accented characters.
This failed with error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u015a' in position 346: ordinal not in range(128)
The character in question is ś, but there are many more such characters in that file.
If something needs to be reconfigured to parse such characters, I believe Excalibur shouldn't raise an error, but rather suggest the change :)
I can share the file if needed :)
OS: Win 10
I keep on getting this error when I'm attempting to view data:
https://i.imgur.com/waFbLPJ.png
Possible command: $ excalibur distribute --rule rule.json directory/
When installing from the Linux executable on Ubuntu 14.04, the above error message is reported. I have also tried to update the library, but I understand that an upgrade to Ubuntu 16 is required. It would be great if we could get a fix for Ubuntu 14.04.
While trying to upload a PDF file, the following error is thrown.
ERROR:root:file has not been decrypted
Traceback (most recent call last):
File "c:\py3\lib\site-packages\excalibur\tasks.py", line 57, in split
save_page(file.filepath, file.page_number)
File "c:\py3\lib\site-packages\excalibur\utils\task.py", line 10, in save_page
page = infile.getPage(page_number - 1)
File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1176, in getPage
self._flatten()
File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1505, in _flatten
catalog = self.trailer["/Root"].getObject()
File "c:\py3\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
return dict.__getitem__(self, key).getObject()
File "c:\py3\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
return self.pdf.getObject(self).getObject()
File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1617, in getObject
raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
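The traceback above is what PyPDF2 raises when an encrypted PDF is read without being decrypted first. PyPDF2 1.x exposes `isEncrypted` and `decrypt()` on its reader for this case. As a rough pre-upload check that needs no third-party library, the sketch below looks for an `/Encrypt` entry in the raw bytes. The helper name `looks_encrypted` and the sample byte strings are hypothetical, and the check is only a heuristic: it can report a false positive if that literal appears elsewhere in the file.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: encrypted PDFs carry an /Encrypt entry in their trailer."""
    return b"/Encrypt" in pdf_bytes


# Hypothetical, heavily trimmed trailers used only to exercise the helper.
plain = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Size 4 >>\n%%EOF"
locked = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R /Size 6 >>\n%%EOF"

print(looks_encrypted(plain), looks_encrypted(locked))
```

If the check reports an encrypted file, removing the protection before uploading is probably the simplest workaround.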
I am trying to run it on Mac. I am ok to also set it up on my linux box if that works better.
Here's the print out of the problem. I'm getting a 500 internal server error. I'm in python 3.7. Camelot works fine for me (I can parse, read, export no problems). Just have a problem with running Excalibur.
Here's the code that contains the 'job_id' line 39 from the www\views.py file
@views.route('/files', methods=['GET', 'POST'])
def files():
    if request.method == 'GET':
        files_response = []
        session = Session()
        for file in session.query(File).order_by(File.uploaded_at.desc()).all():
            job = session.query(Job).filter(Job.file_id == file.file_id).order_by(Job.started_at.desc()).first()
            files_response.append({
                'file_id': file.file_id,
                'job_id': job.job_id,
                'uploaded_at': file.uploaded_at.strftime('%Y-%m-%dT%H:%M:%S'),
                'filename': file.filename
            })
Any thoughts or am I making some stupid mistakes here?
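One possible cause, assuming the 500 error is an AttributeError on `job.job_id` (the traceback itself is not included here): `.first()` returns `None` when no Job row exists for a file. The sketch below shows a guard for that case with stand-in objects. `build_file_entry` is a hypothetical helper, not part of Excalibur.

```python
from types import SimpleNamespace


def build_file_entry(file, job):
    """Build one entry of the /files response, tolerating a missing job."""
    return {
        "file_id": file.file_id,
        # job is None when the query finds no Job row for this file
        "job_id": job.job_id if job is not None else None,
        "filename": file.filename,
    }


# Stand-ins for the SQLAlchemy File and Job rows.
file = SimpleNamespace(file_id="f1", filename="report.pdf")
entry_without_job = build_file_entry(file, None)
entry_with_job = build_file_entry(file, SimpleNamespace(job_id="j1"))
print(entry_without_job, entry_with_job)
```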
Excalibur struggles on large PDFs (20 pages or more) when I select the "all" or "1-end" options.
I get the following warning:
UserWarning: No tables found on page-144 [lattice.py:399]
UserWarning: No tables found on page-144 [stream.py:447]
UserWarning: No tables found in table area 1 [stream.py:361]
UserWarning: No tables found in table area 1 [stream.py:361]
UserWarning: No tables found in table area 2 [stream.py:361]
However, if I manually select the pages, it works fine. Is there a way to solve this?
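Since selecting pages manually works, one workaround is to process the document in smaller page ranges. The sketch below only generates the range strings (for example "1-20"), which could then be entered one batch at a time. `page_batches` is a hypothetical helper, and the batch size of 20 is an arbitrary choice.

```python
def page_batches(total_pages, batch_size):
    """Split pages 1..total_pages into range strings such as '1-20'."""
    batches = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A batch with a single page is written as just that page number.
        batches.append(f"{start}-{end}" if end > start else str(start))
    return batches


batches = page_batches(45, 20)
print(batches)
```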
Please make sure that Ghostscript is installed
Everything is installed, but the error still occurs!
To "Excalibur: A web interface to extract tabular data from PDFs".
I want to change the web port from 5000 to another port, so I modified the web port in excalibur.cfg and then reset the database, but when I start the web server, the port is still 5000.
I am unable to share the pdf that is causing this issue. I would like to know what I can do to bypass this error.
Even if it means dropping the "offending char". Getting some of the data is better than getting none of the data.
I'd be ecstatic if this is a PEBKAC issue, so please don't discount that.
Using the latest download of Excalibur and Python 3.7.3 (I think). Only using the web UI to do this. I don't think I could handle coding it without some hand-holding.
This is happening on several pages of a very large PDF (700+ pages), but not all of them. So the file can be parsed, just not the important portion, which is most of the file.
I DID just realize that it is creating some of the output files (Excel, CSV, and JSON), but not HTML. Since I only really need the CSV or Excel, I might be good. Will keep pushing on the remainder of the file (it's slow to handle 100 pages at a time).
127.0.0.1 - - [05/Aug/2019 16:55:09] "GET /jobs/fbfeb974-5f3d-4991-b26c-983560640de5 HTTP/1.1" 200 -
ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>
Traceback (most recent call last):
File "excalibur\executors\sequential_executor.py", line 12, in execute_command
File "subprocess.py", line 336, in check_call
File "subprocess.py", line 317, in call
File "subprocess.py", line 769, in __init__
File "subprocess.py", line 1172, in _execute_child
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "excalibur\tasks.py", line 161, in extract
File "lib\site-packages\camelot\core.py", line 479, in export
File "lib\site-packages\camelot\core.py", line 437, in _write_file
File "lib\site-packages\camelot\core.py", line 394, in to_html
File "C:\Python37\lib\encodings\cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>
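The last frames show the HTML export being encoded with cp1252, the default text encoding on many Windows setups, which has no mapping for '\ued6f'. The sketch below reproduces the failure and shows that writing the same text with an explicit UTF-8 encoding succeeds. It is a standalone illustration, not Camelot's code; the sample string and the temporary file name are made up. Python 3.7 also has a UTF-8 mode (`PYTHONUTF8=1`) that changes the default encoding used by `open()`, which may be worth trying.

```python
import os
import tempfile

html = "<td>\ued6f</td>"  # contains the character from the error message

# cp1252 cannot represent U+ED6F, so a strict encode fails.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding avoids the platform default.
path = os.path.join(tempfile.mkdtemp(), "table.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == html)
```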
As far as I can tell, Excalibur listens on 127.0.0.1 and that can't be changed.
I am attempting to create a Docker container for Excalibur and I would like to have the web server listen on all interfaces to make it work.
I am not completely familiar with the code base, but I accomplished this myself by adding a "web_server_host" value to the [webserver] section of the config file, and changing the call to app.run() in cli.py:
app.run(host=conf.get('webserver', 'web_server_host'), use_reloader=False)
Is this a change you would be interested in adding to the project?
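A minimal sketch of how such a setting could be read with a safe default, using the standard library's configparser. The [webserver] section and the web_server_host key come from the proposal above; the web_server_port key, the fallback values, and the `read_bind_address` helper are assumptions for illustration and may not match Excalibur's actual configuration code.

```python
import configparser

# Hypothetical config contents for the demonstration.
SAMPLE_CFG = """
[webserver]
web_server_host = 0.0.0.0
web_server_port = 8080
"""


def read_bind_address(cfg_text):
    """Return (host, port), falling back to localhost:5000 when unset."""
    conf = configparser.ConfigParser()
    conf.read_string(cfg_text)
    host = conf.get("webserver", "web_server_host", fallback="127.0.0.1")
    port = conf.getint("webserver", "web_server_port", fallback=5000)
    return host, port


host, port = read_bind_address(SAMPLE_CFG)
default_host, default_port = read_bind_address("[webserver]\n")
print(host, port, default_host, default_port)
```

Falling back to 127.0.0.1 keeps the current behaviour for existing config files, so only users who opt in (for example inside a Docker container) would listen on all interfaces.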
I am on macOS and have installed Ghostscript via brew. The error is below:
File "/Users/.../anaconda3/envs/py37/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
from . import _gsprint as gs
File "/Users/.../anaconda3/envs/py37/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 258, in <module>
raise RuntimeError("Please make sure that Ghostscript is installed")
RuntimeError: Please make sure that Ghostscript is installed
I guess the problem might be that my brew and pip use different paths, so Excalibur is installed in '/Users/.../anaconda3/envs/py37/bin/excalibur' while gs is installed in '/usr/local/bin/gs'.
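A quick diagnostic sketch for this kind of report, assuming the error is raised when the Ghostscript shared library cannot be located rather than the `gs` executable. It checks both from the Python environment in use. `ghostscript_status` is a hypothetical helper, and the library name "gs" applies to macOS and Linux (on Windows the DLL has a different name).

```python
import shutil
from ctypes.util import find_library


def ghostscript_status():
    """Report where the gs executable and the libgs library are found, if at all."""
    return {
        "gs_executable": shutil.which("gs"),  # searched on PATH
        "libgs": find_library("gs"),          # searched by the dynamic loader rules
    }


status = ghostscript_status()
print(status)
```

If `gs_executable` is set but `libgs` is `None`, the interpreter can run the binary but cannot load the library, which would be consistent with the error appearing in one environment and not in another.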
Tables can be extracted from a webpage using pandas.read_html. We can create an interface that is 1) simple: the user submits the link of the webpage and downloads the extracted tables, or 2) fancy: the user submits the link of the webpage, sees the detected tables (on an image of the webpage?), un-selects the tables they don't want, and then downloads the extracted tables.
Have you thought about hosting Excalibur for ease of use by non-technical people?
Was expecting tryexcalibur.com to be a hosted version but found a landing page instead.
Let me know if I can help.
I've been trying to work in a private repository first. I forked the repository by first duplicating it and setting it to private, following the instructions from the top answer here (https://stackoverflow.com/questions/10065526/github-how-to-make-a-fork-of-public-repository-private). Whenever I try to process PDFs, I get a Ghostscript error on macOS.
I am using:
Python 3.7
GPL Ghostscript 9.27 [via homebrew]
opencv: stable 4.1.0 (bottled) [via homebrew]
numpy 1.14.5
excalibur 0.4.2
I'm simply trying to run it locally via flask run and have already set up my own MySQL database as per the Excalibur documentation. It throws the error after I upload the PDF and go to the next page (it makes it to the refresh page, then throws the error there). I've tried many different PDFs, including ones with empty pages, with the same result.
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/css/vendor/jquery-ui.structure.min.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/css/workspace.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/vendor/jquery.selectareas.min.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/workspace.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/vendor/jquery-ui.min.js HTTP/1.1" 200 -
*GPL Ghostscript 9.27: Unrecoverable error, exit code 1
ERROR:root:-100
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/excalibur/tasks.py", line 44, in split
with Ghostscript(*gs_call, stdout=null) as gs:
File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 93, in Ghostscript
stderr=kwargs.get('stderr', None))
File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 39, in __init__
rc = gs.init_with_args(instance, args)
File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 169, in init_with_args
raise GhostscriptError(rc)
camelot.ext.ghostscript._gsprint.GhostscriptError: -100
Any help solving this would be appreciated.
The column separator is an amazing feature!
Taking a logical step forward, how about horizontal row separators too? To help avoid consecutive table entries mixing in with each other.
With both column and row separators at hand, this will help the user parse their table perfectly.
Ref: A similar request I'd posted on Tabula that had garnered a lot of attention: https://github.com/tabulapdf/tabula/issues/409
Also, I understand Excalibur is a frontend to Camelot. So the column separator feature might be there in Camelot and you've brought it to user interface here. Perhaps there's no row separator feature in Camelot itself yet? Can someone clarify? Then I'll go post this request there.
I'd like to request an option to link PDFs (since PDF data often updates, it's much easier to keep data updated by linking than by manually uploading each time).
When a PDF is linked and the rules have been set for that particular PDF, then whenever the download page is accessed (example download page link: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f), Excalibur automatically downloads the PDF from the URL, re-processes it with the predefined rule, and displays the tables of extracted data. This should work by just accessing the download page.
I can perhaps hire someone to get this done if you're willing to add it to the main project.
So steps are:
So in the future, whenever I detect that the PDF has changed, I can access the download page link and it'll repeat the entire process again (steps 3-6).
Again, please let me know if you're open to having this change contributed to the main source; if so, I can get it coded. I feel this change is extremely important, since many PDFs on the web change frequently, which makes this feature very useful.
S05MoldedCaseCircuitBreakers.pdf
Hi,
The file will not download.
The upload & Table extraction seem to be working. When I press the download button, it goes to the refresh screen. The file never downloads.
Attached is the PDF I was testing with.
I hope this helps.
Cheers,
Mitch Hunt
running as administrator on Windows Version 10.0.17763 Build 17763 after installing python-3.7.1-amd64
PS C:\excalibur> pip install excalibur-py
PS C:\excalibur> excalibur initdb
Error ModuleNotFoundError: No module named 'chardet'
full output below
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
PS C:\WINDOWS\system32> cd C:\excalibur
PS C:\excalibur> pip install excalibur-py
Collecting excalibur-py
Downloading https://files.pythonhosted.org/packages/9e/fe/6d60ad37075c89136e614cefa494aff4701e97e5121193c09a70d3c827b0/excalibur_py-0.4.0-py2.py3-none-any.whl (1.5MB)
100% |████████████████████████████████| 1.5MB 1.9MB/s
Collecting camelot-py[cv]>=0.2.3 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/65/e3/75842357e53f675d60b093c182d254c37db5b1d6144d12703af0a433f7f5/camelot-py-0.4.0.tar.gz
Collecting configparser<3.6.0,>=3.5.0 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/7c/69/c2ce7e91c89dc073eb1aa74c0621c3eefbffe8216b3f9af9d3885265c01c/configparser-3.5.0.tar.gz
Collecting Flask>=1.0.2 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl (91kB)
100% |████████████████████████████████| 92kB 3.8MB/s
Collecting SQLAlchemy>=1.2.12 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/e2/0a/05b7d13618ad41c108a6c2b886af83bf9bb7e35f8951227abb18b1330745/SQLAlchemy-1.2.14.tar.gz (5.7MB)
100% |████████████████████████████████| 5.7MB 1.7MB/s
Collecting Click>=7.0 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
100% |████████████████████████████████| 81kB 4.1MB/s
Collecting celery>=4.1.1 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/e8/58/2a0b1067ab2c12131b5c089dfc579467c76402475c5231095e36a43b749c/celery-4.2.1-py2.py3-none-any.whl (401kB)
100% |████████████████████████████████| 409kB 1.1MB/s
Collecting numpy>=1.13.3 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/00/0e/5a8c34adb97fc1cd6636d78050e575945e874c8516d501421d5a0f377a6c/numpy-1.15.4-cp37-none-win_amd64.whl (13.5MB)
100% |████████████████████████████████| 13.5MB 1.3MB/s
Collecting openpyxl>=2.5.8 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/08/8a/509eb6f58672288da9a5884e1cc7e90819bc8dbef501161c4b40a6a4e46b/openpyxl-2.5.12.tar.gz (173kB)
100% |████████████████████████████████| 174kB 1.8MB/s
Collecting pandas>=0.23.4 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/58/a8/03e5fe0edbc522e46cb27df2abfb4266814129253d8462f38bc704a76a2a/pandas-0.23.4-cp37-cp37m-win_amd64.whl (7.9MB)
100% |████████████████████████████████| 7.9MB 1.6MB/s
Collecting pdfminer.six>=20170720 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/8a/fd/6e8746e6965d1a7ea8e97253e3d79e625da5547e8f376f88de5d024bacb9/pdfminer.six-20181108-py2.py3-none-any.whl (5.6MB)
100% |████████████████████████████████| 5.6MB 1.7MB/s
Collecting PyPDF2>=1.26.0 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
100% |████████████████████████████████| 81kB 4.8MB/s
Collecting opencv-python>=3.4.2.17 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/65/b0/1b098827a7a879546363e5c976418850e6d9bebf7662f32ddefd30ae9c2c/opencv_python-3.4.4.19-cp37-cp37m-win_amd64.whl (38.3MB)
100% |████████████████████████████████| 38.3MB 302kB/s
Collecting Jinja2>=2.10 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl (126kB)
100% |████████████████████████████████| 133kB 2.8MB/s
Collecting Werkzeug>=0.14 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl (322kB)
100% |████████████████████████████████| 327kB 1.8MB/s
Collecting itsdangerous>=0.24 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting billiard<3.6.0,>=3.5.0.2 (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/87/ac/9b3cc065557ad5769d0626fd5dba0ad1cb40e3a72fe6acd3d081b4ad864e/billiard-3.5.0.4.tar.gz (150kB)
100% |████████████████████████████████| 153kB 2.2MB/s
Collecting kombu<5.0,>=4.2.0 (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/97/61/65838c7da048e56d549e358ac19c0979c892e17dc6186610c49531d35b70/kombu-4.2.1-py2.py3-none-any.whl (177kB)
100% |████████████████████████████████| 184kB 2.2MB/s
Collecting pytz>dev (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/f8/0e/2365ddc010afb3d79147f1dd544e5ee24bf4ece58ab99b16fbb465ce6dc0/pytz-2018.7-py2.py3-none-any.whl (506kB)
100% |████████████████████████████████| 512kB 1.4MB/s
Collecting jdcal (from openpyxl>=2.5.8->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/a0/38/dcf83532480f25284f3ef13f8ed63e03c58a65c9d3ba2a6a894ed9497207/jdcal-1.4-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl>=2.5.8->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Collecting python-dateutil>=2.5.0 (from pandas>=0.23.4->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/74/68/d87d9b36af36f44254a8d512cbfc48369103a3b9e474be9bdfe536abfc45/python_dateutil-2.7.5-py2.py3-none-any.whl (225kB)
100% |████████████████████████████████| 235kB 2.2MB/s
Collecting sortedcontainers (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/13/f3/cf85f7c3a2dbd1a515d51e1f1676d971abe41bba6f4ab5443240d9a78e5b/sortedcontainers-2.1.0-py2.py3-none-any.whl
Collecting pycryptodome (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/fc/99/ed80fd36eebe26914bd8aae4ac70fcef2d4ad94453981c171fe791629146/pycryptodome-3.7.2-cp37-cp37m-win_amd64.whl (8.0MB)
100% |████████████████████████████████| 8.0MB 1.5MB/s
Collecting six (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/44/6e/41ac9266e3db762dfd9089f6b0d2298c84160f54ef2a7257c17b0e7ec2ec/MarkupSafe-1.1.0-cp37-cp37m-win_amd64.whl
Collecting amqp<3.0,>=2.1.4 (from kombu<5.0,>=4.2.0->celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/cf/12d4611fc67babd4ae250c9e8249c5650ae1933395488e9e7e3562b4ff24/amqp-2.3.2-py2.py3-none-any.whl (48kB)
100% |████████████████████████████████| 51kB 2.8MB/s
Collecting vine>=1.1.3 (from amqp<3.0,>=2.1.4->kombu<5.0,>=4.2.0->celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/10/50/5b1ebe42843c19f35edb15022ecae339fbec6db5b241a7a13c924dabf2a3/vine-1.1.4-py2.py3-none-any.whl
Installing collected packages: Click, numpy, jdcal, et-xmlfile, openpyxl, six, python-dateutil, pytz, pandas, sortedcontainers, pycryptodome, pdfminer.six, PyPDF2, opencv-python, camelot-py, configparser, MarkupSafe, Jinja2, Werkzeug, itsdangerous, Flask, SQLAlchemy, billiard, vine, amqp, kombu, celery, excalibur-py
Running setup.py install for et-xmlfile ... done
Running setup.py install for openpyxl ... done
Running setup.py install for PyPDF2 ... done
Running setup.py install for camelot-py ... done
Running setup.py install for configparser ... done
The script flask.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Running setup.py install for SQLAlchemy ... done
Running setup.py install for billiard ... done
The script celery.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
The script excalibur.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed Click-7.0 Flask-1.0.2 Jinja2-2.10 MarkupSafe-1.1.0 PyPDF2-1.26.0 SQLAlchemy-1.2.14 Werkzeug-0.14.1 amqp-2.3.2 billiard-3.5.0.4 camelot-py-0.4.0 celery-4.2.1 configparser-3.5.0 et-xmlfile-1.0.1 excalibur-py-0.4.0 itsdangerous-1.1.0 jdcal-1.4 kombu-4.2.1 numpy-1.15.4 opencv-python-3.4.4.19 openpyxl-2.5.12 pandas-0.23.4 pdfminer.six-20181108 pycryptodome-3.7.2 python-dateutil-2.7.5 pytz-2018.7 six-1.11.0 sortedcontainers-2.1.0 vine-1.1.4
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
PS C:\excalibur> excalibur initdb
Creating new Excalibur configuration file in: C:\Users\user1/excalibur/excalibur.cfg
Traceback (most recent call last):
File "c:\users\user1\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\user1\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\user1\AppData\Local\Programs\Python\Python37\Scripts\excalibur.exe\__main__.py", line 5, in <module>
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\excalibur\cli.py", line 10, in <module>
from .tasks import split, extract
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\excalibur\tasks.py", line 10, in <module>
import camelot
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\__init__.py", line 8, in <module>
from .io import read_pdf
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\io.py", line 4, in <module>
from .handlers import PDFHandler
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\handlers.py", line 9, in <module>
from .parsers import Stream, Lattice
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\__init__.py", line 3, in <module>
from .stream import Stream
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\stream.py", line 11, in <module>
from .base import BaseParser
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\base.py", line 5, in <module>
from ..utils import get_page_layout, get_text_objects
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\utils.py", line 10, in <module>
from pdfminer.pdfparser import PDFParser
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\pdfparser.py", line 4, in <module>
from .psparser import PSStackParser
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\psparser.py", line 11, in <module>
from .utils import choplist
File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\utils.py", line 13, in <module>
import chardet # For str encoding detection in Py3
ModuleNotFoundError: No module named 'chardet'
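The pip log above does not list chardet among the installed packages, yet pdfminer's utils.py imports it, which is consistent with this error. Installing it manually (pip install chardet) is the likely workaround. The sketch below shows one way to check which modules are importable before running excalibur initdb. `missing_modules` is a hypothetical helper, and the module names in the demonstration are placeholders.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately nonexistent.
missing = missing_modules(["json", "module_that_does_not_exist_xyz"])
print(missing)
```

In practice the list passed in would contain names such as "chardet" or "camelot".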
Is it possible to get the input parameters chosen in the UI, so that I can later use them in a Camelot Python script and get the same result?
With questions that come up for first time users.
Camelot/Excalibur throws an OSError: access violation writing 0x0967BC48
os - Windows 10
python version - 3.7
below is the output screen
127.0.0.1 - - [08/May/2019 17:29:26] "GET /workspaces/35dd602d-2eec-44e7-a51f-36f6b4e4862b HTTP/1.1" 200 -
ERROR:root:exception: access violation writing 0x0E80C1B0
Traceback (most recent call last):
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\excalibur\tasks.py", line 44, in split
with Ghostscript(*gs_call, stdout=null) as gs:
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 93, in Ghostscript
stderr=kwargs.get('stderr', None))
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 39, in __init__
rc = gs.init_with_args(instance, args)
File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 167, in init_with_args
rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x0E80C1B0
unable to use camelot and excalibur commands from commandline in ubuntu
I am on Ubuntu 14.04 and installed excalibur-py using pip. The following PDF (which is also used in camelot-py, where it works well) fails during processing, and the system returns the following message:
ERROR:root:'Table' object has no attribute '_bbox'
Traceback (most recent call last):
File "/home/sandeep/anaconda3/lib/python3.6/site-packages/excalibur/tasks.py", line 96, in split
x1, y1, x2, y2 = tables[0]._bbox
AttributeError: 'Table' object has no attribute '_bbox'
Refreshing does not change anything. If I click on Excalibur, I get this message back:
"The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."
background_lines.pdf
pdf file used is attached
This is my code:
import camelot
import pandas as pd
import string

class CommandLine:
    def __init__(self):
        # Raw string so the backslashes in the Windows path are not read as escape sequences
        tables = camelot.read_pdf(filepath=r"C:\Users\Ikhanna\OneDrive - STEPSTONE GROUP LP\Documents\PortCo Ari\Adam\Q1.2018.CAB.FS.Rpt.Orion Euro RE IV.pdf", pages="1")

if __name__ == '__main__':
    app = CommandLine()
This runs fine when I run in Jupyter notebook but when I convert this file to a .ipynb or .py and run through cmd it throws the error RuntimeError: Please make sure that Ghostscript is installed. It doesn't give this error if I use the stream flavour. I am not sure why this is happening
I consider it bad form for Excalibur to create a user-visible folder in the home folder (/Users/akx/excalibur on my Mac, for instance).
It'd be better to use e.g. appdirs to figure out the "user data" directory, and create the Excalibur directory there.
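To make the suggestion concrete, here is a simplified standard-library sketch of what a per-platform "user data" lookup does. It is not the appdirs implementation, and the `user_data_dir` function below is a hypothetical stand-in that only covers the common conventions (LOCALAPPDATA on Windows, ~/Library/Application Support on macOS, XDG_DATA_HOME or ~/.local/share elsewhere).

```python
import os
import sys


def user_data_dir(app_name, platform=sys.platform, env=os.environ, home=None):
    """Return a conventional per-user data directory for app_name."""
    home = home or os.path.expanduser("~")
    if platform.startswith("win"):
        base = env.get("LOCALAPPDATA") or os.path.join(home, "AppData", "Local")
    elif platform == "darwin":
        base = os.path.join(home, "Library", "Application Support")
    else:
        base = env.get("XDG_DATA_HOME") or os.path.join(home, ".local", "share")
    return os.path.join(base, app_name)


# Explicit arguments make the result independent of the machine running this.
mac_dir = user_data_dir("excalibur", platform="darwin", env={}, home="/Users/akx")
linux_dir = user_data_dir("excalibur", platform="linux", env={}, home="/home/akx")
print(mac_dir, linux_dir)
```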
C:\Program Files\gs\gs9.26\bin>excalibur initdb
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\excalibur.exe\__main__.py", line 5, in <module>
File "c:\programdata\anaconda3\lib\site-packages\excalibur\cli.py", line 10, in <module>
from .tasks import split, extract
File "c:\programdata\anaconda3\lib\site-packages\excalibur\tasks.py", line 11, in <module>
from camelot.core import TableList
ImportError: cannot import name 'TableList'
A lot of users are stuck on the Refresh page without any clue about what's happening. We need to show any errors that happen on the webpage itself.
When the PDF has 3 pages, the table on the first page is extracted and aligned properly, but the data on the second page is not aligned properly. Can you please let us know what needs to be modified?
import camelot
# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
tables = camelot.read_pdf(r"D:\drive-download-20180907T081608Z-001\047238.pdf", flavor='stream', pages='1-end')
tables
tables[0].df
tables.export(r"C:\Users\fooo.htm", f='html', compress=True)
When I downloaded the extracted file, I got a result like the one below. How do I get the actual values?
(cid:40)(cid:90)(cid:3)(cid:86)(cid:85)(cid:3)(cid:91)(cid:79)
I'm getting a similar charmap error.
File "c:\users\sebastien.cote\appdata\local\continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2776' in position 246: character maps to <undefined>
Is there any way the program can proceed by skipping the problematic character that it cannot encode?
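Python's codecs do support this through the `errors` argument of `str.encode` (and of `open`). The sketch below shows three standard handlers applied to a made-up string containing '\u2776', the character from the error above. Whether Excalibur or Camelot expose a way to pass such an option is not something this thread confirms.

```python
text = "Step \u2776 total"  # sample text; U+2776 has no cp1252 mapping

# Default (strict) behaviour: the encode call raises.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Drop, substitute, or escape the characters that cannot be encoded.
dropped = text.encode("cp1252", errors="ignore").decode("cp1252")
replaced = text.encode("cp1252", errors="replace").decode("cp1252")
escaped = text.encode("cp1252", errors="backslashreplace").decode("cp1252")

print(strict_ok, repr(dropped), repr(replaced), repr(escaped))
```

"ignore" silently loses data, so "replace" or "backslashreplace" is usually the safer choice when some record of the skipped character is wanted.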