excalibur's People

Contributors

arky, dependabot[bot], foarsitter, lamaun, lazydancer, martinthoma, monkeywithacupcake, n-sikka, sangarshanan, vinayak-mehta, williamjacksn

excalibur's Issues

'ascii' codec can't encode character

I tried to convert some tables in a Polish government document containing Polish accented characters.
This failed with the error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u015a' in position 346: ordinal not in range(128)
The character in question is Ś (U+015A), but there are many more such characters in that file.

If something needs to be reconfigured to parse such characters, I believe Excalibur shouldn't raise an error but rather suggest the change :)

I can share the file if needed :)
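For anyone hitting the same message: the error usually means text was encoded with an implicit ASCII default. The following is only a minimal sketch of the general workaround (writing with an explicit UTF-8 encoding), not Excalibur's actual export code; the helper names, the sample string, and the file name are made up for illustration.

```python
import io
import os
import tempfile

def write_text_utf8(path, text):
    """Write text with an explicit UTF-8 encoding instead of the platform default."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

def read_text_utf8(path):
    """Read the text back using the same explicit encoding."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()

# Hypothetical sample containing U+015A, the character from the error message.
sample = u"\u015awi\u0119to"
target = os.path.join(tempfile.mkdtemp(), "table.csv")
write_text_utf8(target, sample)
round_trip = read_text_utf8(target)
```

Encoding the same sample with the 'ascii' codec raises UnicodeEncodeError, which matches the report above.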

version `GLIBC_2.25' not found

While installing from the Linux executable on Ubuntu 14.04, the above error message is reported. I have also tried to update the library, but I understand that an upgrade to Ubuntu 16 is required. It would be great to get a fix for Ubuntu 14.04.

ERROR:root:file has not been decrypted

While trying to upload a PDF file, the following error is thrown.

ERROR:root:file has not been decrypted
Traceback (most recent call last):
  File "c:\py3\lib\site-packages\excalibur\tasks.py", line 57, in split
    save_page(file.filepath, file.page_number)
  File "c:\py3\lib\site-packages\excalibur\utils\task.py", line 10, in save_page
    page = infile.getPage(page_number - 1)
  File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1176, in getPage
    self._flatten()
  File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1505, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "c:\py3\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "c:\py3\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "c:\py3\lib\site-packages\PyPDF2\pdf.py", line 1617, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted

AttributeError Nonetype for 'job_id'

Here's the print out of the problem. I'm getting a 500 internal server error. I'm in python 3.7. Camelot works fine for me (I can parse, read, export no problems). Just have a problem with running Excalibur.

 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [06/Nov/2018 21:30:22] "GET / HTTP/1.1" 302 -
[2018-11-06 21:30:22,663] ERROR in app: Exception on /files [GET]
Traceback (most recent call last):
  File "d:\python3.7\lib\site-packages\flask\app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "d:\python3.7\lib\site-packages\flask\app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "d:\python3.7\lib\site-packages\flask\app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "d:\python3.7\lib\site-packages\flask\_compat.py", line 35, in reraise
    raise value
  File "d:\python3.7\lib\site-packages\flask\app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "d:\python3.7\lib\site-packages\flask\app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "d:\python3.7\lib\site-packages\excalibur\www\views.py", line 39, in files
    'job_id': job.job_id,
AttributeError: 'NoneType' object has no attribute 'job_id'

Here's the code that contains the 'job_id' line 39 from the www\views.py file

@views.route('/files', methods=['GET', 'POST'])
def files():
    if request.method == 'GET':
        files_response = []
        session = Session()
        for file in session.query(File).order_by(File.uploaded_at.desc()).all():
            job = session.query(Job).filter(Job.file_id == file.file_id).order_by(Job.started_at.desc()).first()
            files_response.append({
                'file_id': file.file_id,
                'job_id': job.job_id,
                'uploaded_at': file.uploaded_at.strftime('%Y-%m-%dT%H:%M:%S'),
                'filename': file.filename
            })


Any thoughts or am I making some stupid mistakes here?
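The traceback suggests that the job lookup returned None for at least one file, i.e. a file row exists without any job row. A minimal sketch of a defensive version of the dictionary-building step follows; it is not a patch from the maintainers, and the namedtuples are hypothetical stand-ins for the real File and Job models.

```python
from collections import namedtuple

# Hypothetical stand-ins for Excalibur's File and Job rows.
File = namedtuple("File", ["file_id", "filename"])
Job = namedtuple("Job", ["job_id", "file_id"])

def file_entry(file, job):
    """Build one response entry, tolerating a file that has no job yet."""
    return {
        "file_id": file.file_id,
        # job is None when the query found no job for this file
        "job_id": job.job_id if job is not None else None,
        "filename": file.filename,
    }

orphan = file_entry(File("f1", "report.pdf"), None)
linked = file_entry(File("f2", "rates.pdf"), Job("j9", "f2"))
```

Whether a missing job should be shown as empty or the file skipped entirely is a design decision this sketch leaves open.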

Failure when "all" or 1-end is selected

Excalibur struggles on large PDFs (20 pages or more) when I select the "all" or "1-end" options.
I get the following warnings:
UserWarning: No tables found on page-144 [lattice.py:399]
UserWarning: No tables found on page-144 [stream.py:447]
UserWarning: No tables found in table area 1 [stream.py:361]
UserWarning: No tables found in table area 1 [stream.py:361]
UserWarning: No tables found in table area 2 [stream.py:361]

However if I manually select the pages it works fine. Is there a way to solve this?

can't change web port

I want to change the web port from 5000 to another port, so I modified the web port in excalibur.cfg and then reset the database, but when I start the web server it still listens on port 5000.

ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>

I am unable to share the pdf that is causing this issue. I would like to know what I can do to bypass this error.
Even if it means dropping the "offending char". Getting some of the data is better than getting none of the data.
I'd be ecstatic if this is a PEBKAC issue, so please don't discount that.

Using the latest download of Excalibur and Python 3.7.3 (I think). I am only using the web UI to do this; I don't think I could handle coding it without some hand-holding.

This is happening on several pages of a very large PDF (700+ pages), but not all of them. So the file can be parsed, just not the important portion, which is most of the file.

I DID just realize that it is creating some of the output files (Excel, CSV, and JSON), but not HTML. Since I only really need the CSV or Excel output, I might be good. I will keep pushing on the remainder of the file (it's slow to handle 100 pages at a time).

127.0.0.1 - - [05/Aug/2019 16:55:09] "GET /jobs/fbfeb974-5f3d-4991-b26c-983560640de5 HTTP/1.1" 200 -
ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>
Traceback (most recent call last):
  File "excalibur\executors\sequential_executor.py", line 12, in execute_command

  File "subprocess.py", line 336, in check_call
  File "subprocess.py", line 317, in call
  File "subprocess.py", line 769, in __init__
  File "subprocess.py", line 1172, in _execute_child
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "excalibur\tasks.py", line 161, in extract
  File "lib\site-packages\camelot\core.py", line 479, in export
  File "lib\site-packages\camelot\core.py", line 437, in _write_file
  File "lib\site-packages\camelot\core.py", line 394, in to_html
  File "C:\Python37\lib\encodings\cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>

Support listening on all interfaces

As far as I can tell, Excalibur listens on 127.0.0.1 and that can't be changed.

I am attempting to create a Docker container for Excalibur and I would like to have the web server listen on all interfaces to make it work.

I am not completely familiar with the code base, but I accomplished this myself by adding a "web_server_host" value to the [webserver] section of the config file, and changing the call to app.run() in cli.py:

app.run(host=conf.get('webserver', 'web_server_host'), use_reloader=False)

Is this a change you would be interested in adding to the project?
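To make the proposal concrete, here is a minimal sketch of reading the bind address from a config file with fallbacks. The "web_server_host" key is the one proposed above; the "web_server_port" key and the default values are assumptions for illustration and may not match the real excalibur.cfg.

```python
import configparser

# Assumed defaults; the port key name is hypothetical.
DEFAULT_CONFIG = """
[webserver]
web_server_host = 127.0.0.1
web_server_port = 5000
"""

def load_bind_address(config_text):
    """Return (host, port), letting user config override the defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(DEFAULT_CONFIG)
    parser.read_string(config_text)  # later values override earlier ones
    host = parser.get("webserver", "web_server_host")
    port = parser.getint("webserver", "web_server_port")
    return host, port

docker_host, docker_port = load_bind_address("[webserver]\nweb_server_host = 0.0.0.0\n")
```

The resulting pair could then be passed to Flask as app.run(host=host, port=port, use_reloader=False), since app.run accepts both arguments.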

Please make sure that Ghostscript is installed

I am on macOS and have installed Ghostscript via Homebrew. The error is below:

  File "/Users/.../anaconda3/envs/py37/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
    from . import _gsprint as gs
  File "/Users/.../anaconda3/envs/py37/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 258, in <module>
    raise RuntimeError("Please make sure that Ghostscript is installed")
RuntimeError: Please make sure that Ghostscript is installed

I guess the problem might be that brew and pip use different paths, so that excalibur is installed in '/Users/.../anaconda3/envs/py37/bin/excalibur' while gs is installed in '/usr/local/bin/gs'.
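One way to test that guess is to check what the Python environment itself can see. The sketch below assumes (this is not confirmed in the thread) that the bundled wrapper locates the Ghostscript shared library through ctypes rather than through the gs executable on PATH; the function name is made up.

```python
import shutil
from ctypes.util import find_library

def ghostscript_lookup():
    """Report what this interpreter can find: the gs executable and the gs shared library."""
    return {
        "executable": shutil.which("gs"),  # None if gs is not on PATH
        "library": find_library("gs"),     # None if no shared library is found
    }

lookup = ghostscript_lookup()
```

If "executable" is set but "library" is None, the command-line tool is installed but the shared library is not visible from inside the environment, which would be consistent with the error above.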

Extract tables from webpage

Tables can be extracted from a webpage using pandas.read_html. We can create an interface 1) simple: where user can submit the link of the webpage and download extracted tables or 2) fancy: where user can submit the link of the webpage, see detected tables (on an image of the webpage?), un-select the tables they don't want and then download extracted tables.
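As a rough illustration of the "simple" variant, here is a sketch that pulls tables out of an HTML string. It deliberately swaps pandas.read_html for the standard library's html.parser so it runs without extra dependencies; it assumes well-formed markup, does not handle nested tables, and the class and function names are invented for this example.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

def extract_tables(html):
    """Return all tables found in the given HTML text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables

page = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
tables = extract_tables(page)
```

The "fancy" variant would additionally need the position of each table on the rendered page, which this sketch does not attempt.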

Hosting excalibur on the web

Have you thought about hosting Excalibur for ease of use by non-technical people?

Was expecting tryexcalibur.com to be a hosted version but found a landing page instead.

Let me know if I can help.

Unable to process PDFs, throws GhostscriptError: -100 on refresh page (mac OS)

I've been trying to work in a private repository first. I forked the repository by duplicating it and setting it to private, following the instructions in the top answer here (https://stackoverflow.com/questions/10065526/github-how-to-make-a-fork-of-public-repository-private). Whenever I try to process PDFs, I get a Ghostscript error on macOS.

I am using:
Python 3.7
GPL Ghostscript 9.27 [via homebrew]
opencv: stable 4.1.0 (bottled) [via homebrew]
numpy 1.14.5
excalibur 0.4.2

I'm simply trying to run it locally via flask run and have already set up my own MySQL database as per the Excalibur documentation. It throws the error after I upload the PDF and go to the next page (it makes it to the refresh page, then throws the error there). I've tried many different PDFs, including ones with empty pages, with the same result.

127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/css/vendor/jquery-ui.structure.min.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/css/workspace.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/vendor/jquery.selectareas.min.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/workspace.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Jul/2019 09:14:01] "GET /static/js/vendor/jquery-ui.min.js HTTP/1.1" 200 -
*GPL Ghostscript 9.27: Unrecoverable error, exit code 1
ERROR:root:-100
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/excalibur/tasks.py", line 44, in split
    with Ghostscript(*gs_call, stdout=null) as gs:
  File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 93, in Ghostscript
    stderr=kwargs.get('stderr', None))
  File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 39, in __init__
    rc = gs.init_with_args(instance, args)
  File "/usr/local/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 169, in init_with_args
    raise GhostscriptError(rc)
camelot.ext.ghostscript._gsprint.GhostscriptError: -100

Any help solving this would be appreciated.

Row separator like column separator

The column separator is an amazing feature!

Taking a logical step forward, how about horizontal row separators too? To help avoid consecutive table entries mixing in with each other.

With both column and row separators at hand, this will help the user parse their table perfectly.

Ref: A similar request I'd posted on Tabula that had garnered a lot of attention: https://github.com/tabulapdf/tabula/issues/409

Also, I understand Excalibur is a frontend to Camelot. So the column separator feature might be there in Camelot and you've brought it to user interface here. Perhaps there's no row separator feature in Camelot itself yet? Can someone clarify? Then I'll go post this request there.

Feature request: Option to Link PDF URL, refreshes each time download page is accessed

I'd like to request an option to link PDFs (since PDF data often updates, it's much easier to keep data updated by linking than by manually uploading each time).

When a PDF is linked and the rules have been set for that particular PDF, whenever the download page is accessed (example download page link: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f), Excalibur automatically downloads the PDF from the URL, re-processes it with the predefined rules, and displays the tables of extracted data. This should work by just accessing the download page.

I can perhaps hire someone to get this done if you're willing to add it to the main project.

So steps are:

  1. Link to PDF (example: https://www.lcfcu.org/home/fiFiles/static/documents/rates.pdf)
  2. Set Rules for the PDF and save
  3. Access the download page for that pdf (example: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f)
  4. Excalibur automatically fetches the PDF from link
  5. Extracts data from PDF based on predefined rule
  6. Displays like so:
    image

So in the future, whenever I detect the PDF has changed, I can access the download page link and it'll repeat the entire process again (steps 3-6).

Again, please let me know if you're open to having this change contributed to the main source; if so, I can get it coded. I feel this change is extremely important: many PDFs on the web change frequently, which makes this feature very useful.
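Steps 3-6 could be sketched roughly as below. The fetch and extract callables are hypothetical placeholders for "download the linked PDF" and "run the saved rules", and the digest check that skips re-extraction when the bytes are unchanged is an optional addition of this sketch, not part of the request itself.

```python
import hashlib

def content_digest(pdf_bytes):
    """Fingerprint the downloaded PDF so changes can be detected."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def refresh_job(fetch, extract, state):
    """Fetch the linked PDF and re-extract only if its content changed.

    fetch() returns the PDF bytes, extract(pdf_bytes) returns the tables,
    and state is a dict that remembers the last digest and tables.
    """
    pdf_bytes = fetch()
    digest = content_digest(pdf_bytes)
    if digest != state.get("digest"):
        state["tables"] = extract(pdf_bytes)
        state["digest"] = digest
    return state["tables"]

# Usage with stand-in callables that count how often extraction runs.
calls = []
def fake_extract(data):
    calls.append(len(data))
    return ["table from %d bytes" % len(data)]

state = {}
first = refresh_job(lambda: b"%PDF-v1", fake_extract, state)
second = refresh_job(lambda: b"%PDF-v1", fake_extract, state)
third = refresh_job(lambda: b"%PDF-v2 longer", fake_extract, state)
```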

Download is not working

S05MoldedCaseCircuitBreakers.pdf

Hi,

The file will not download.

The upload & Table extraction seem to be working. When I press the download button, it goes to the refresh screen. The file never downloads.

Attached is the PDF I was testing with.

I hope this helps.

Cheers,
Mitch Hunt

windows install threw an unhandled error

running as administrator on Windows Version 10.0.17763 Build 17763 after installing python-3.7.1-amd64

PS C:\excalibur> pip install excalibur-py
PS C:\excalibur> excalibur initdb
Error ModuleNotFoundError: No module named 'chardet'

full output below

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\WINDOWS\system32> cd C:\excalibur
PS C:\excalibur> pip install excalibur-py
Collecting excalibur-py
Downloading https://files.pythonhosted.org/packages/9e/fe/6d60ad37075c89136e614cefa494aff4701e97e5121193c09a70d3c827b0/excalibur_py-0.4.0-py2.py3-none-any.whl (1.5MB)
100% |████████████████████████████████| 1.5MB 1.9MB/s
Collecting camelot-py[cv]>=0.2.3 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/65/e3/75842357e53f675d60b093c182d254c37db5b1d6144d12703af0a433f7f5/camelot-py-0.4.0.tar.gz
Collecting configparser<3.6.0,>=3.5.0 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/7c/69/c2ce7e91c89dc073eb1aa74c0621c3eefbffe8216b3f9af9d3885265c01c/configparser-3.5.0.tar.gz
Collecting Flask>=1.0.2 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl (91kB)
100% |████████████████████████████████| 92kB 3.8MB/s
Collecting SQLAlchemy>=1.2.12 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/e2/0a/05b7d13618ad41c108a6c2b886af83bf9bb7e35f8951227abb18b1330745/SQLAlchemy-1.2.14.tar.gz (5.7MB)
100% |████████████████████████████████| 5.7MB 1.7MB/s
Collecting Click>=7.0 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
100% |████████████████████████████████| 81kB 4.1MB/s
Collecting celery>=4.1.1 (from excalibur-py)
Downloading https://files.pythonhosted.org/packages/e8/58/2a0b1067ab2c12131b5c089dfc579467c76402475c5231095e36a43b749c/celery-4.2.1-py2.py3-none-any.whl (401kB)
100% |████████████████████████████████| 409kB 1.1MB/s
Collecting numpy>=1.13.3 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/00/0e/5a8c34adb97fc1cd6636d78050e575945e874c8516d501421d5a0f377a6c/numpy-1.15.4-cp37-none-win_amd64.whl (13.5MB)
100% |████████████████████████████████| 13.5MB 1.3MB/s
Collecting openpyxl>=2.5.8 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/08/8a/509eb6f58672288da9a5884e1cc7e90819bc8dbef501161c4b40a6a4e46b/openpyxl-2.5.12.tar.gz (173kB)
100% |████████████████████████████████| 174kB 1.8MB/s
Collecting pandas>=0.23.4 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/58/a8/03e5fe0edbc522e46cb27df2abfb4266814129253d8462f38bc704a76a2a/pandas-0.23.4-cp37-cp37m-win_amd64.whl (7.9MB)
100% |████████████████████████████████| 7.9MB 1.6MB/s
Collecting pdfminer.six>=20170720 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/8a/fd/6e8746e6965d1a7ea8e97253e3d79e625da5547e8f376f88de5d024bacb9/pdfminer.six-20181108-py2.py3-none-any.whl (5.6MB)
100% |████████████████████████████████| 5.6MB 1.7MB/s
Collecting PyPDF2>=1.26.0 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
100% |████████████████████████████████| 81kB 4.8MB/s
Collecting opencv-python>=3.4.2.17 (from camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/65/b0/1b098827a7a879546363e5c976418850e6d9bebf7662f32ddefd30ae9c2c/opencv_python-3.4.4.19-cp37-cp37m-win_amd64.whl (38.3MB)
100% |████████████████████████████████| 38.3MB 302kB/s
Collecting Jinja2>=2.10 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl (126kB)
100% |████████████████████████████████| 133kB 2.8MB/s
Collecting Werkzeug>=0.14 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl (322kB)
100% |████████████████████████████████| 327kB 1.8MB/s
Collecting itsdangerous>=0.24 (from Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting billiard<3.6.0,>=3.5.0.2 (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/87/ac/9b3cc065557ad5769d0626fd5dba0ad1cb40e3a72fe6acd3d081b4ad864e/billiard-3.5.0.4.tar.gz (150kB)
100% |████████████████████████████████| 153kB 2.2MB/s
Collecting kombu<5.0,>=4.2.0 (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/97/61/65838c7da048e56d549e358ac19c0979c892e17dc6186610c49531d35b70/kombu-4.2.1-py2.py3-none-any.whl (177kB)
100% |████████████████████████████████| 184kB 2.2MB/s
Collecting pytz>dev (from celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/f8/0e/2365ddc010afb3d79147f1dd544e5ee24bf4ece58ab99b16fbb465ce6dc0/pytz-2018.7-py2.py3-none-any.whl (506kB)
100% |████████████████████████████████| 512kB 1.4MB/s
Collecting jdcal (from openpyxl>=2.5.8->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/a0/38/dcf83532480f25284f3ef13f8ed63e03c58a65c9d3ba2a6a894ed9497207/jdcal-1.4-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl>=2.5.8->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Collecting python-dateutil>=2.5.0 (from pandas>=0.23.4->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/74/68/d87d9b36af36f44254a8d512cbfc48369103a3b9e474be9bdfe536abfc45/python_dateutil-2.7.5-py2.py3-none-any.whl (225kB)
100% |████████████████████████████████| 235kB 2.2MB/s
Collecting sortedcontainers (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/13/f3/cf85f7c3a2dbd1a515d51e1f1676d971abe41bba6f4ab5443240d9a78e5b/sortedcontainers-2.1.0-py2.py3-none-any.whl
Collecting pycryptodome (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/fc/99/ed80fd36eebe26914bd8aae4ac70fcef2d4ad94453981c171fe791629146/pycryptodome-3.7.2-cp37-cp37m-win_amd64.whl (8.0MB)
100% |████████████████████████████████| 8.0MB 1.5MB/s
Collecting six (from pdfminer.six>=20170720->camelot-py[cv]>=0.2.3->excalibur-py)
Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->Flask>=1.0.2->excalibur-py)
Downloading https://files.pythonhosted.org/packages/44/6e/41ac9266e3db762dfd9089f6b0d2298c84160f54ef2a7257c17b0e7ec2ec/MarkupSafe-1.1.0-cp37-cp37m-win_amd64.whl
Collecting amqp<3.0,>=2.1.4 (from kombu<5.0,>=4.2.0->celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/7f/cf/12d4611fc67babd4ae250c9e8249c5650ae1933395488e9e7e3562b4ff24/amqp-2.3.2-py2.py3-none-any.whl (48kB)
100% |████████████████████████████████| 51kB 2.8MB/s
Collecting vine>=1.1.3 (from amqp<3.0,>=2.1.4->kombu<5.0,>=4.2.0->celery>=4.1.1->excalibur-py)
Downloading https://files.pythonhosted.org/packages/10/50/5b1ebe42843c19f35edb15022ecae339fbec6db5b241a7a13c924dabf2a3/vine-1.1.4-py2.py3-none-any.whl
Installing collected packages: Click, numpy, jdcal, et-xmlfile, openpyxl, six, python-dateutil, pytz, pandas, sortedcontainers, pycryptodome, pdfminer.six, PyPDF2, opencv-python, camelot-py, configparser, MarkupSafe, Jinja2, Werkzeug, itsdangerous, Flask, SQLAlchemy, billiard, vine, amqp, kombu, celery, excalibur-py
Running setup.py install for et-xmlfile ... done
Running setup.py install for openpyxl ... done
Running setup.py install for PyPDF2 ... done
Running setup.py install for camelot-py ... done
Running setup.py install for configparser ... done
The script flask.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Running setup.py install for SQLAlchemy ... done
Running setup.py install for billiard ... done
The script celery.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
The script excalibur.exe is installed in 'c:\users\user1\appdata\local\programs\python\python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed Click-7.0 Flask-1.0.2 Jinja2-2.10 MarkupSafe-1.1.0 PyPDF2-1.26.0 SQLAlchemy-1.2.14 Werkzeug-0.14.1 amqp-2.3.2 billiard-3.5.0.4 camelot-py-0.4.0 celery-4.2.1 configparser-3.5.0 et-xmlfile-1.0.1 excalibur-py-0.4.0 itsdangerous-1.1.0 jdcal-1.4 kombu-4.2.1 numpy-1.15.4 opencv-python-3.4.4.19 openpyxl-2.5.12 pandas-0.23.4 pdfminer.six-20181108 pycryptodome-3.7.2 python-dateutil-2.7.5 pytz-2018.7 six-1.11.0 sortedcontainers-2.1.0 vine-1.1.4
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
PS C:\excalibur> excalibur initdb
Creating new Excalibur configuration file in: C:\Users\user1/excalibur/excalibur.cfg
Traceback (most recent call last):
  File "c:\users\user1\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\user1\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\user1\AppData\Local\Programs\Python\Python37\Scripts\excalibur.exe\__main__.py", line 5, in <module>
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\excalibur\cli.py", line 10, in <module>
    from .tasks import split, extract
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\excalibur\tasks.py", line 10, in <module>
    import camelot
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\__init__.py", line 8, in <module>
    from .io import read_pdf
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\io.py", line 4, in <module>
    from .handlers import PDFHandler
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\handlers.py", line 9, in <module>
    from .parsers import Stream, Lattice
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\__init__.py", line 3, in <module>
    from .stream import Stream
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\stream.py", line 11, in <module>
    from .base import BaseParser
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\parsers\base.py", line 5, in <module>
    from ..utils import get_page_layout, get_text_objects
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\camelot\utils.py", line 10, in <module>
    from pdfminer.pdfparser import PDFParser
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\pdfparser.py", line 4, in <module>
    from .psparser import PSStackParser
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\psparser.py", line 11, in <module>
    from .utils import choplist
  File "c:\users\user1\appdata\local\programs\python\python37\lib\site-packages\pdfminer\utils.py", line 13, in <module>
    import chardet # For str encoding detection in Py3
ModuleNotFoundError: No module named 'chardet'
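A quick way to confirm which imports are missing before running excalibur initdb is sketched below. The helper name is invented; only top-level module names are supported, and the second name in the usage line is a deliberately nonexistent module for demonstration.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'json' ships with Python; the second name is deliberately nonexistent.
report = missing_modules(["json", "no_such_module_for_excalibur_check"])
```

Calling missing_modules(["chardet"]) in the affected environment should return ["chardet"], in which case installing that package with pip is the likely fix.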

Update FAQ?

With questions that come up for first time users.

Error on Windows: OSError: exception: access violation writing 0x0967BC48 while running python-Excalibur code

Camelot/Excalibur throws an OSError: access violation writing 0x0967BC48
OS: Windows 10
Python version: 3.7

Below is the console output:

 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [02/Jul/2019 19:03:54] "GET /files HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:10] "POST /files HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:10] "GET /workspaces/59f1c984-31fa-4ade-b944-770072f82827 HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:19] "GET /workspaces/59f1c984-31fa-4ade-b944-770072f82827 HTTP/1.1" 200 -
127.0.0.1 - - [02/Jul/2019 19:04:20] "GET /static/favicon.ico HTTP/1.1" 200 -
ERROR:root:exception: access violation writing 0x0967BC48
Traceback (most recent call last):
  File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\excalibur\tasks.py", line 44, in split
    with Ghostscript(*gs_call, stdout=null) as gs:
  File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 93, in Ghostscript
    stderr=kwargs.get('stderr', None))
  File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 39, in __init__
    rc = gs.init_with_args(instance, args)
  File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 167, in init_with_args
    rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x0967BC48
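The install path in the traceback contains python37-32, which suggests a 32-bit interpreter. One possible cause worth ruling out (an assumption, not something confirmed in this thread) is a bitness mismatch between Python and the installed Ghostscript library. A small sketch to check the interpreter side:

```python
import struct
import sys

def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8

bits = interpreter_bits()
```

The result can then be compared against the Ghostscript build that is installed (32-bit or 64-bit).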

Staying in refresh page and not displaying document

127.0.0.1 - - [08/May/2019 17:29:26] "GET /workspaces/35dd602d-2eec-44e7-a51f-36f6b4e4862b HTTP/1.1" 200 -
ERROR:root:exception: access violation writing 0x0E80C1B0
Traceback (most recent call last):
  File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\excalibur\tasks.py", line 44, in split
    with Ghostscript(*gs_call, stdout=null) as gs:
  File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 93, in Ghostscript
    stderr=kwargs.get('stderr', None))
  File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 39, in __init__
    rc = gs.init_with_args(instance, args)
  File "c:\users\naveen\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 167, in init_with_args
    rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x0E80C1B0

Processing PDF - error message

I am on Ubuntu 14.04 and installed excalibur-py using pip. While processing the attached PDF (it is also used in camelot-py, where it works well), the system returns the following message:

ERROR:root:'Table' object has no attribute '_bbox'
Traceback (most recent call last):
  File "/home/sandeep/anaconda3/lib/python3.6/site-packages/excalibur/tasks.py", line 96, in split
    x1, y1, x2, y2 = tables[0]._bbox
AttributeError: 'Table' object has no attribute '_bbox'

Refreshing does not change anything. If I click on Excalibur, I get this message back:
"The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."

background_lines.pdf

The PDF file used is attached.

GhostScript error only for lattice when run through command line

This is my code:

import camelot
import pandas as pd
import string

class CommandLine:
    def __init__(self):
        # Raw string so the backslashes in the Windows path are not read as escape sequences
        tables = camelot.read_pdf(filepath=r"C:\Users\Ikhanna\OneDrive - STEPSTONE GROUP LP\Documents\PortCo Ari\Adam\Q1.2018.CAB.FS.Rpt.Orion Euro RE IV.pdf", pages="1")

if __name__ == '__main__':
    app = CommandLine()

This runs fine when I run in Jupyter notebook but when I convert this file to a .ipynb or .py and run through cmd it throws the error RuntimeError: Please make sure that Ghostscript is installed. It doesn't give this error if I use the stream flavour. I am not sure why this is happening

Excalibur's data directory is created in HOME

I consider it bad form for Excalibur to create a user-visible folder in the home folder (/Users/akx/excalibur on my Mac, for instance).

It'd be better to use e.g. appdirs to figure out the "user data" directory, and create the Excalibur directory there.
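To illustrate the idea, here is a rough standard-library stand-in for what appdirs does. It is not appdirs itself and only approximates its conventions; the function name mirrors the concept, and the platform, environ, and home parameters exist only to make the sketch easy to try out.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath

def user_data_dir(app_name, platform=None, environ=None, home=None):
    """Return a per-user data directory for app_name following common platform conventions."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home
    if platform.startswith("win"):
        base = environ.get("LOCALAPPDATA") or str(PureWindowsPath(home) / "AppData" / "Local")
        return str(PureWindowsPath(base) / app_name)
    if platform == "darwin":
        return str(PurePosixPath(home) / "Library" / "Application Support" / app_name)
    # Other platforms: follow the XDG base directory convention
    base = environ.get("XDG_DATA_HOME") or str(PurePosixPath(home) / ".local" / "share")
    return str(PurePosixPath(base) / app_name)

mac_dir = user_data_dir("excalibur", platform="darwin", environ={}, home="/Users/akx")
```

On the Mac from the example above this yields a folder under Library/Application Support instead of a visible folder in the home directory.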

ImportError: cannot import name 'TableList'

C:\Program Files\gs\gs9.26\bin>excalibur initdb
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\excalibur.exe\__main__.py", line 5, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\excalibur\cli.py", line 10, in <module>
    from .tasks import split, extract
  File "c:\programdata\anaconda3\lib\site-packages\excalibur\tasks.py", line 11, in <module>
    from camelot.core import TableList
ImportError: cannot import name 'TableList'

Show errors on webpage

A lot of users are stuck on the Refresh page without any clue about what's happening. We need to show any errors that happen on the webpage itself.

page : Refresh !!!

Refresh!
Please wait while the pages are converted to images. Refresh again in some time.

data extracted in 2nd,3rd... page is not aligned properly as in 1st page.

When the PDF has 3 pages, the table on the first page is extracted and aligned properly, but the data on the 2nd page is not. Can you please let us know what needs to be modified?

import camelot
# Raw strings keep the backslashes in the Windows paths literal
tables = camelot.read_pdf(r"D:\drive-download-20180907T081608Z-001\047238.pdf", flavor='stream', pages='1-end')
tables
tables[0].df
tables.export(r"C:\Users\fooo.htm", f='html', compress=True)
2018

Excalibur python error

When I downloaded the extracted file, I got a result like the one below. How do I get the actual values?
(cid:40)(cid:90)(cid:3)(cid:86)(cid:85)(cid:3)(cid:91)(cid:79)
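As far as I know, placeholders of the form (cid:N) usually appear when the PDF's font provides no mapping from glyph IDs to Unicode, so the real characters cannot be recovered from the numbers alone. The sketch below only detects such placeholders so affected output can be flagged; the function name is made up and nothing here decodes them.

```python
import re

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")

def cid_codes(text):
    """Return the numeric glyph IDs of all (cid:N) placeholders found in text."""
    return [int(code) for code in CID_TOKEN.findall(text)]

codes = cid_codes("(cid:40)(cid:90)(cid:3)(cid:86)")
```

A cell whose text consists mostly of these placeholders is a sign that the PDF may need OCR or a different source file.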

undefined character maps

I'm getting a similar charmap error.

  File "c:\users\sebastien.cote\appdata\local\continuum\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2776' in position 246: character maps to <undefined>

Is there any way for the program to proceed by skipping a problematic character that cannot be encoded?
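In plain Python, skipping or substituting characters that a codec cannot represent is done through the errors argument of str.encode. A minimal sketch follows, with a made-up helper name; it shows the general mechanism and is not a description of what Excalibur or Camelot currently do.

```python
def encode_lossy(text, encoding="cp1252", errors="replace"):
    """Encode text, substituting or dropping characters the codec cannot represent."""
    return text.encode(encoding, errors=errors)

sample = "a\u2776b"  # U+2776 is the character from the error above
replaced = encode_lossy(sample)                   # unencodable character becomes '?'
dropped = encode_lossy(sample, errors="ignore")   # unencodable character is removed
utf8_bytes = sample.encode("utf-8")               # UTF-8 can represent it without loss
```

Writing the output as UTF-8 avoids the data loss entirely, so the lossy modes are best kept as a fallback.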
