atlanhq / camelot
Camelot: PDF Table Extraction for Humans
Home Page: https://camelot-py.readthedocs.io
License: Other
For example, imagine a PDF that is mostly text with one small table of interest. The score is affected only by how the table is extracted, whereas it would also be good to have stats on which LTText(s) were ignored, and so on.
To be updated.
A partial grid can be processed with the lattice method if the grid is generated using the Hough transform, but this needs some form of bounding box calculation, which is already done using contours.
The page will need to be divided before running the Hough transform.
For example,
camelot method pdf_dir/
or
camelot method *.pdf
Initially the files would be processed sequentially, but in the future, support for distributed processing should be implemented.
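A minimal sketch of the sequential case, assuming the command accepts either a directory or a glob pattern as shown above. The names `collect_pdfs` and `process_sequentially` are hypothetical, and `extract` stands in for whatever per-file extraction function is used.

```python
import glob
import os


def collect_pdfs(target):
    """Return a sorted list of PDF paths from a directory or a glob pattern."""
    if os.path.isdir(target):
        pattern = os.path.join(target, "*.pdf")
    else:
        pattern = target
    return sorted(glob.glob(pattern))


def process_sequentially(target, extract):
    """Call extract(path) on each PDF in turn and collect the results by path."""
    return {path: extract(path) for path in collect_pdfs(target)}
```

Because each file is handled independently, the loop in `process_sequentially` is the part that could later be handed to a pool of workers.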
There can be many variations, such as a horizontal table with vertical headers or a vertical table with vertical headers.
Will need to explore Python-based Tesseract libraries for OCR.
This thread is for discussion around improving tests.
jtol is used to take care of any errors that might arise while converting from the image coordinate space to the PDF coordinate space and converting line contours to lines. There is no need for it to be user-configurable.
Modify _merge_columns to account for a negative tolerance parameter and not merge x-axis column projections if they are within that tolerance.
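One possible reading of this, as a sketch only: projections are `(x0, x1)` intervals, a positive tolerance merges intervals separated by a small gap, and a negative tolerance refuses to merge intervals that overlap by no more than its absolute value. The function below is a hypothetical stand-in and not the existing `_merge_columns`.

```python
def merge_columns(projections, tol=0):
    """Merge (x0, x1) intervals left to right.

    tol > 0: also merge intervals separated by a gap of at most tol.
    tol < 0: only merge intervals that overlap by at least abs(tol).
    """
    merged = []
    for x0, x1 in sorted(projections):
        if merged and x0 <= merged[-1][1] + tol:
            # Extend the current column to cover this projection.
            merged[-1] = (merged[-1][0], max(merged[-1][1], x1))
        else:
            merged.append((x0, x1))
    return merged
```

With `tol=-2`, the projections `(0, 10)` and `(9, 20)` overlap by only one unit and stay separate, whereas `tol=0` merges them into `(0, 20)`.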
Currently, LTTextLH are assigned to cells based on their x0 coord. Additionally, the area overlap between the LTTextLH and the cell could be used for better assignment. A tolerance parameter for the minimum area overlap required for assignment to a cell can be added. For example (see image), it makes sense to add the third text box to column 2 instead of 1. @sharky93
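A minimal sketch of overlap-based assignment, assuming text boxes and cells are plain `(x0, y0, x1, y1)` tuples rather than pdfminer objects. The names `overlap_area`, `assign_cell` and `min_overlap` are hypothetical.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def assign_cell(text_box, cells, min_overlap=0.5):
    """Return the index of the cell covering the largest share of text_box.

    Returns None when no cell covers at least min_overlap of the text box area.
    """
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    if area <= 0:
        return None
    best_idx, best_frac = None, 0.0
    for i, cell in enumerate(cells):
        frac = overlap_area(text_box, cell) / area
        if frac > best_frac:
            best_idx, best_frac = i, frac
    return best_idx if best_frac >= min_overlap else None
```

For two cells `(0, 0, 10, 10)` and `(10, 0, 20, 10)`, a text box `(8, 2, 16, 6)` starts in the first cell by its `x0`, but three quarters of its area lies in the second, so this sketch assigns it to index 1.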
Currently, poor image quality sometimes causes extra lines to be detected, which leads to more or fewer joints contributing to table building. To deal with extra joints, a filter based on line length can be added. To deal with missing joints, a better quality image could be produced by tweaking Wand parameters, or heuristics such as merging close contours could be used.
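The line-length filter could look something like the sketch below, assuming detected lines are `(x0, y0, x1, y1)` tuples. `filter_short_lines` and `min_length` are hypothetical names, and a sensible threshold would depend on the image resolution.

```python
import math


def filter_short_lines(lines, min_length):
    """Keep only the lines whose Euclidean length is at least min_length."""
    kept = []
    for x0, y0, x1, y1 in lines:
        if math.hypot(x1 - x0, y1 - y0) >= min_length:
            kept.append((x0, y0, x1, y1))
    return kept
```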
When filling chars into cells after sorting them by their coordinates, 6th gets filled as th6. By replacing chars with textlines, which are made by grouping chars, we can get 6th as output. However, textlines need to be split if they span multiple cells. There should also be an option in both methods to specify whether textlines should be split if they span multiple cells.
This will help in cases where a PDF page has two or more tables with a box outline but without internal lines to demarcate cells. Need to think about how to integrate find_table_contours from imgproc with Stream column generation.
Modifications to the grid are needed after the mode calculation, for rows where data lies outside the grid formed using the mode number of columns.
Something like tabula-table-editor (code: tabulapdf/tabula-table-editor) could be made using matplotlib widgets and event handling.
Use-cases:
The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such pdfs. Examples: the RNTB pdfs from un-sdg.
In tables where there are insufficient lines to demarcate all cells, the relatively smaller lines should be shown in the output as rows or columns of cells filled with the word line, just to help with post-processing.
When loading a lot of pages (~42000) at once, memory fills up quickly and parsing stalls. One way to solve this would be to process pages in chunks (think of generators), or maybe to use multiprocessing. @sharky93 Would you like to add something?
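The generator idea can be sketched as follows. `chunked` is a hypothetical helper, and the chunk size is an arbitrary choice that would need tuning against available memory.

```python
def chunked(pages, size):
    """Yield successive lists of at most `size` items from any iterable of pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunk = []
    for page in pages:
        chunk.append(page)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk
```

Only one chunk is held in memory at a time, so the results for each chunk can be written out before the next one is loaded. The same chunks could also be handed to worker processes.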
Find some way to write the log buffer to a file if log is True.
See the pdfminer code and find a way to handle super and subscripts.
Have to figure out if the following dependencies work as expected, in Python 3:
Thinking of starting with using things from __future__ wherever possible (mostly print, division and unicode_literals), and creating a release for Python 2. That should make it easy for us to create a Python 3 release once the deps are sorted.
To keep track of which things are most immediately necessary and which can be added iteratively.
This thread is for discussion around documentation, what to add/improve, where to host etc.
do not want the screen to get blocked
Currently, the list of x-axis projections of text objects is sorted in (-y, x) style. These projections are then sequentially merged based on overlaps. This may fail when the projection of the last text object in a column extends into the projection of the first text object in the next column, as it will merge the two columns into one. A better way to do this would be to group text objects into columns based on their list index and only merge their boundaries, though more discussion is required.
Check if the LTObjects hierarchy is changed when modifying margins. Also, see how PDFMiner generates newlines and spaces.
This thread is for discussion about the new interface.
Current logs for such cases are of the form "WARNING:root:Text did not fit any column.", which provides no details.
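As an illustration of what a more detailed message might carry, here is a sketch that includes the text, its bounding box and the page number. The function name, the logger name and the message layout are all assumptions made for this example.

```python
import logging

logger = logging.getLogger("camelot_sketch")  # hypothetical logger name


def describe_unassigned(text, bbox, page):
    """Build a warning message that identifies the text that was dropped."""
    x0, y0, x1, y1 = bbox
    return (
        "Text {!r} at x0={:.1f}, y0={:.1f}, x1={:.1f}, y1={:.1f} "
        "on page {} did not fit any column"
    ).format(text, x0, y0, x1, y1, page)


def warn_unassigned(text, bbox, page):
    """Log the detailed message at WARNING level."""
    logger.warning(describe_unassigned(text, bbox, page))
```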
This thread is for discussion about what CI tool to use.
Again, there are two ways of arriving at the final grid for the stream method: either analyse the text to get to vertices, or analyse the blank spaces to get to vertices.
This will be parameterised; by default, the stream method should use both to get to a final set of vertices.
There should be a way to merge tables which span multiple pages.
In cases where the user knows which section of the PDF houses the interesting content (mostly tables here), a simple configuration option to guide the image processing will reduce run times further.
For example, when half of the PDF is blank, there is no need to process those pixels.
Documentation explaining how it works, installation instructions etc. should be there. The README should also be improved.
Will be used for generating better grid coordinates for stream. This happens within the set of rows with the mode number of columns, and the overlap will lead to an increase or decrease in the number of columns that stream guesses. We don't go beyond this one pass in the recursion.
The output log should contain more information like the number of tables found, display any tables that couldn't be parsed etc.
One way would be to give a dict keyed on page numbers with a list of table_bbox, subsequently changing how ncolumns and columns are passed to Stream.
This thread is for discussion around licenses.
Not all lines are useful. PDFs are often divided into sections by lines of different widths, so a parameter for tuning the line width at which something is considered a line would be useful.
In Stream, the table boundary is taken as (0, width) and (0, height), where width and height are the PDF dimensions. This should be changed to (x0 of first text object, x1 of last text object on x-axis) and (y0 of first text object, y1 of last text object on y-axis) respectively, which is the more logical choice.
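A minimal sketch of the proposed boundary, assuming each text object is reduced to an `(x0, y0, x1, y1)` tuple. Taking the minimum and maximum over all boxes is one way to read "first" and "last" text object on each axis, and `text_extent` is a hypothetical name.

```python
def text_extent(text_boxes):
    """Return (x0, y0, x1, y1) tightly enclosing all the given text boxes."""
    if not text_boxes:
        raise ValueError("no text objects on the page")
    x0 = min(box[0] for box in text_boxes)
    y0 = min(box[1] for box in text_boxes)
    x1 = max(box[2] for box in text_boxes)
    y1 = max(box[3] for box in text_boxes)
    return (x0, y0, x1, y1)
```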
ImageMagick uses Ghostscript for PDF->PNG conversion. However, it calls gs twice, first for PDF->PS and then PS->PNG. According to this SO answer, it doesn't preserve quality in the first step. gs can do PDF->PNG in one go and with better quality. (Checked this on some PDFs in which more joints were being detected, #51.)
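For reference, a single-pass conversion can be sketched by calling `gs` directly with its `png16m` output device. The wrapper names and the default resolution are assumptions for illustration, and the `gs` executable is assumed to be on the PATH.

```python
import subprocess


def gs_png_command(pdf_path, png_path, dpi=300):
    """Build a Ghostscript command that renders a PDF straight to PNG."""
    return [
        "gs",
        "-q",                   # suppress startup messages
        "-sDEVICE=png16m",      # 24-bit colour PNG output device
        "-r{}".format(dpi),     # rendering resolution in dots per inch
        "-o", png_path,         # output file; implies -dBATCH -dNOPAUSE
        pdf_path,
    ]


def pdf_to_png(pdf_path, png_path, dpi=300):
    """Run the conversion and raise CalledProcessError if gs fails."""
    subprocess.check_call(gs_png_command(pdf_path, png_path, dpi))
```

For multi-page PDFs the output name needs a page counter such as `page-%d.png`, otherwise each page overwrites the previous one.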
The UserWarnings raised in Lattice and Stream are not logged to a file right now.
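One way to get there with the standard library alone is `logging.captureWarnings`, which redirects everything issued through the `warnings` module to the `py.warnings` logger. The sketch below is an assumption about how this could be wired up, and `log_warnings_to_file` is a hypothetical helper.

```python
import logging
import warnings


def log_warnings_to_file(path):
    """Send warnings issued via the warnings module to a log file."""
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    # captureWarnings routes warnings to the logger named "py.warnings".
    logging.getLogger("py.warnings").addHandler(handler)
    logging.captureWarnings(True)
    return handler
```

After calling `log_warnings_to_file("camelot.log")`, a later `warnings.warn("No tables found", UserWarning)` ends up in the file instead of on stderr, subject to the active warning filters.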
Do it in pdf.py itself.