
py-pdf-parser's Introduction

py-pdf-parser


Py PDF Parser is a tool to help extract information from structured PDFs.

Full details and installation instructions can be found at: https://py-pdf-parser.readthedocs.io/en/latest/

This project is based on an original design and prototype by Sam Whitehall (github.com/samwhitehall).

py-pdf-parser's People

Contributors

aceto1, barraponto, dantehemerson, dependabot[bot], franga2000, fynxiu, hexarobi, jean-garret, jstockwin, paulopaixaoamaral, pre-commit-ci[bot]


py-pdf-parser's Issues

Finish the info screen on visualise tool

You can pass show_info=True to the visualise tool, and this allows you to click on elements and see details etc.

It is unfinished and needs work.

  • The visuals need improving -> we currently just throw loads of text into a figure. Perhaps this can be better formatted?
  • Some of the fields could be improved, and we're missing e.g. character margins etc. It would be good if it enabled you to completely understand the LAParams, since these can be quite confusing.
  • Additionally, once this is done it would be good to add an example to the documentation.

Add code coverage checks to CI

We should add some code coverage checks to the CI.

Initially this will help us find untested areas (note the visualise tool is currently untested as we're not really sure how to go about testing it...).

Moving forwards, it will help catch any PRs which don't add sufficient tests.

Use of Visualize

Bug Report

Please also check that your bug is not actually caused by pdfminer.six, and is really an issue with this project.

example:

from py_pdf_parser.loaders import load_file
from py_pdf_parser.visualise import visualise

document = load_file("1907-1912_RESULTS.pdf")
visualise(document)

Error:

Traceback (most recent call last):
  File "C:\Users\andrewp\AppData\Roaming\Python\Python38\site-packages\wand\api.py", line 180, in <module>
    libraries = load_library()
  File "C:\Users\andrewp\AppData\Roaming\Python\Python38\site-packages\wand\api.py", line 135, in load_library
    raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
OSError: cannot find library; tried paths: [~/CORE_RL_wand_.dll, etc.]

Problem: ImageMagick v7+ no longer includes the searched-for DLL files.

Resolution:
Uninstall the most recent ImageMagick (7.0.10-34) and install the latest legacy ImageMagick (6.9.11-34).

Solution:
Provide compatibility with v7+, or note that one should install the legacy version.

Consider using sorted sets?

Our filtering currently uses Python's frozensets. This is mainly to be 100% sure we're not editing old element lists etc by applying additional operations. Performance-wise, frozensets are the same as sets.

Calling __getitem__ on an element list means we need to sort the frozenset, which we do by calling sorted. Additionally, #113 adds some new functions which require sorting.

We could use sorted sets (e.g. http://www.grantjenks.com/docs/sortedcontainers/introduction.html#sorted-set). I believe we could just override the functions which allow mutation, to be sure we never mutate. This should give us some performance gain, since we won't need to keep explicitly sorting the elements. This should be checked.

We should also check that there is no performance loss when doing set operations on a sorted set vs a set. We also need to e.g. add and remove elements from the set, but not by mutating the set. This might require some copies, which could slow things down?

Pros:

  • Performance gain (needs verification)
  • Some parts of the code become simpler

Cons:

  • There'd be an additional dependency
  • We'd probably want to modify the sorted set class slightly to ensure we don't allow mutations
  • Possible (but I think unlikely) performance drop when doing set operations?

Thoughts/ideas on this appreciated.

Strip element text by default

Instead of a property element.text we should have a method element.text(strip=True). This means text gets stripped by default, but you can stop this behaviour if you want.
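A minimal sketch of the proposed interface, using a stand-in class (not py-pdf-parser's real PDFElement):

```python
class FakeElement:
    """Minimal stand-in for a PDFElement, just to illustrate the proposed
    API change (not py-pdf-parser's real class)."""

    def __init__(self, raw_text):
        self._raw_text = raw_text

    def text(self, strip=True):
        # Proposed behaviour: strip surrounding whitespace by default,
        # but allow opting out.
        return self._raw_text.strip() if strip else self._raw_text


element = FakeElement("  Hello world \n")
print(element.text())                      # "Hello world"
print(repr(element.text(strip=False)))     # "'  Hello world \n'"
```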

Section visualisations can be made simpler in some cases

#69 added code to show section outlines. Because we wanted to be 100% correct about which elements were within the outline, the construction is quite complicated and can result in strangely shaped outlines. The code isn't too slow, but it's also not super fast.

In the case where a simple rectangle can be drawn around the section (which should be reasonably trivial to check), this should be done.

Add `include_last_element` to `create_section`

It seems to be a common case that you know the element at which you want to end a section, but you don't want that element to be included (e.g. because it is the start of the next section). We should add an include_last_element=True argument to create_section. Note that this can simply subtract one from the index of the element and then get the new element from the document.

We could also add include_start_element, but I think this is less useful so maybe we won't bother until it feels like a necessary feature.
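The index arithmetic described above can be sketched like this (the function and argument names follow the proposal; a plain list stands in for the document's ordered elements):

```python
def end_element_for_section(elements, last_element, include_last_element=True):
    """Sketch only: pick the element a section should end at, optionally
    stepping back one position so the given element is excluded."""
    if include_last_element:
        return last_element
    index = elements.index(last_element)
    if index == 0:
        raise ValueError("No element before the given last element")
    # Step back one position in the document ordering.
    return elements[index - 1]


elements = ["heading", "body-1", "body-2", "next-heading"]
print(end_element_for_section(elements, "next-heading", include_last_element=False))
# -> "body-2"
```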

Change font sizes to floats

PDFMiner has changed the way it gets the font size since this code was written. I think initially the heights had lots of decimal places etc.

We've just seen a PDF where the heights are now e.g. 7.5. Due to some precision issue (?) this is sometimes rounded up to 8 or down to 7, for different elements with exactly the same font.

We should switch from using int for the font size to using float instead. We should also check multiple PDFs to decide if we want to add any rounding, or whether they all seem reasonable.

Extract simple table could be more efficient

At the moment we're looping through the reference row and then the reference column, and computing the element which is in line with both the reference row element and the reference col element.

This means we're doing the geometry checks to work out each row len(cols) times, and the checks to work out each col len(rows) times.

We should do the geometry filtering at the start to create a list of rows and a list of cols. We can then just & the two to get the element.

This should require much less processing, as you only do geometry checks len(rows) + len(cols) times, rather than len(rows) * len(cols) times.
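The counting argument can be sketched with plain sets standing in for elements and geometry checks:

```python
# Elements represented as (row, col) coordinates standing in for PDFElements.
elements = {(r, c) for r in range(3) for c in range(2)}

# One "geometry check" pass per reference row and per reference column:
# len(rows) + len(cols) passes in total.
row_sets = [{e for e in elements if e[0] == r} for r in range(3)]
col_sets = [{e for e in elements if e[1] == c} for c in range(2)]

# Each cell is then a plain set intersection, with no further geometry checks.
table = [[(rs & cs).pop() for cs in col_sets] for rs in row_sets]
print(table)  # [[(0, 0), (0, 1)], [(1, 0), (1, 1)], [(2, 0), (2, 1)]]
```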

Too large a tolerance causes an error

We allow passing a tolerance parameter to the horizontally_in_line_with and vertically_in_line_with functions. However, if you specify a tolerance which is larger than the width/height of the element, this instantiates an invalid bounding box with either x0 > x1 or y0 > y1.

We should cap the tolerance at the width/height of the element.
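A minimal sketch of the cap, assuming the tolerance is applied symmetrically to both sides of the bounding box (in which case half the width/height is the safe limit; names are illustrative, not library internals):

```python
def capped_tolerance(tolerance, low, high):
    """Cap the tolerance so that shrinking the box by `tolerance` on each
    side can never produce low > high (illustrative, not library code)."""
    return min(tolerance, (high - low) / 2)


x0, x1 = 10.0, 14.0
tol = capped_tolerance(5.0, x0, x1)   # requested 5.0, element only 4.0 wide
print(tol, (x0 + tol, x1 - tol))      # 2.0 (12.0, 12.0) -- still a valid box
```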

What is the prefix 'PCAGML' in the font "PCAGML+SourceHanSerifCN-Regular,16.0"?

Hi @jstockwin ,

I am trying to load three similar files which are auto-generated from the same system, but I get three different font prefixes.

>>> set(e.font for e in doc.elements)
{'OHKPGR+SourceHanSerifCN-Regular,10.5', 'Helvetica,12.0', 'OHKPGR+SourceHanSerifCN-Regular,26.0', 'OHKPGR+SourceHanSerifCN-Regular,9.0', 'OHKPGR+SourceHanSerifCN-Regular,12.0', 'OHKPGR+SourceHanSerifCN-Regular,8.5', 'OHKPGR+SourceHanSerifCN-Regular,16.0', 'OHKPGR+SourceHanSerifCN-Regular,14.0'}

>>> set(e.font for e in doc2.elements)
{'Helvetica,12.0', 'FZVVCB+SourceHanSerifCN-Regular,8.5', 'FZVVCB+SourceHanSerifCN-Regular,12.0', 'FZVVCB+SourceHanSerifCN-Regular,26.0', 'FZVVCB+SourceHanSerifCN-Regular,14.0', 'FZVVCB+SourceHanSerifCN-Regular,10.5', 'FZVVCB+SourceHanSerifCN-Regular,16.0', 'FZVVCB+SourceHanSerifCN-Regular,9.0'}

>>> set(e.font for e in doc3.elements)
{'PCAGML+SourceHanSerifCN-Regular,16.0', 'PCAGML+SourceHanSerifCN-Regular,14.0', 'PCAGML+SourceHanSerifCN-Regular,26.0', 'PCAGML+SourceHanSerifCN-Regular,12.0', 'PCAGML+SourceHanSerifCN-Regular,8.5', 'PCAGML+SourceHanSerifCN-Regular,9.0', 'Helvetica,12.0', 'PCAGML+SourceHanSerifCN-Regular,10.5'}

What is the prefix 'PCAGML' in the font "PCAGML+SourceHanSerifCN-Regular,16.0"?
And how can I use FONT_MAPPING to handle these differences?

Thanks in advance.
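For context: a six-letter tag followed by "+" marks a subset-embedded font. The PDF specification requires a random tag in front of the base font name when only a subset of the glyphs is embedded, which is why otherwise identical files get different prefixes. One way to compare fonts across files is to strip the tag, sketched here in plain Python (a regex-based font mapping passed to py-pdf-parser's load_file could express a similar normalisation, but the helper below is not part of the library):

```python
import re

# Subset tags are six uppercase letters followed by "+", per the PDF spec.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def normalise_font(font):
    """Strip the random subset tag so the same logical font compares equal
    across files (illustrative helper, not part of py-pdf-parser)."""
    return SUBSET_PREFIX.sub("", font)


print(normalise_font("PCAGML+SourceHanSerifCN-Regular,16.0"))
print(normalise_font("OHKPGR+SourceHanSerifCN-Regular,16.0"))
# Both print: SourceHanSerifCN-Regular,16.0
```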

Some suggestions to enhance locating logics -'offset()' and 'resize()'

Hi @jstockwin,

I am coming with some suggestions from my recent practice, would you please take a look? Thanks in advance.

  1. Add 'offset()' and 'resize()' shortcut methods to ElementList and also Element:
    With the inclusive param of the ElementList.between() method, users can only select elements like (e_1, ..., e_n) or [e_1, ..., e_n]. In the first case, I sometimes want to select elements like (e_1, ..., e_n] or [e_1, ..., e_n). Furthermore, I sometimes want to select elements like (e_1-3, ..., e_n+4), which requires elements outside the 'between' elements, and maybe also before and after slices.
    So I suggest giving users the freedom to easily choose elements:
    'offset()': given e_0, I am able to choose e_-3 or e_3, so that I can easily locate elements between two offset elements.
    'resize()': given an element e_0, I am able to choose [e_0, ..., e_4].

The following is a demo of my actual usage without offset and resize, which troubles me.

    def locate_contents_between_locs(
        self,
        loc_pair_text,
        loc_pair_equal_or_contain=('equal', 'equal'),
        loc_pair_idx=(0, 0),
        loc_pair_offset=(0, 0),
        elements_scope=None,
        output_type=list,
        if_print_contents=False,
    ):
        # Fall back to the whole document if no scope is given.
        if elements_scope is None:
            elements = self.doc.elements
        else:
            elements = elements_scope

        loc_pair = [None, None]
        for i in range(2):
            if loc_pair_equal_or_contain[i] == 'equal':
                loc_pair[i] = elements.filter_by_text_equal(loc_pair_text[i])[loc_pair_idx[i]]
            elif loc_pair_equal_or_contain[i] == 'contain':
                loc_pair[i] = elements.filter_by_text_contains(loc_pair_text[i])[loc_pair_idx[i]]

            if loc_pair_offset[i] > 0:  # inclusive: [0, 1, 2, ...
                loc_pair[i] = elements.after(loc_pair[i], inclusive=True)[loc_pair_offset[i]]
            elif loc_pair_offset[i] < 0:  # exclusive: ..., -2, -1]
                loc_pair[i] = elements.before(loc_pair[i], inclusive=False)[loc_pair_offset[i]]

        contents = elements.between(loc_pair[0], loc_pair[1])
        if if_print_contents:
            print(f'{contents}:{[c.text() for c in contents]}')

        return contents

Add feature to remove duplicate header rows

It is often the case that if a table goes over a page break then the header is repeated on the new page.

Even though it's not strictly a pdf parsing thing, it might be nice to add a util to handle this case (similar to the fact we handle adding the header to the table even though that's not strictly pdf parsing).

Essentially, if the header row repeats then it should be removed. I think all rows would be checked against the header row, and if both the text and the font match, the row should be removed from the table. We'd sometimes have to keep track of the removed rows so that the checks pass (we have checks to ensure the correct number of elements were detected).

There should be a parameter to enable this behaviour and it should default to False.

Add tolerance to geometric filtering functions

We have some cases (in this case in a table) where elements only just overlap along a certain axis, and we don't want to include these in e.g. "vertically in line with", since they barely touch.

We've decided to add a tolerance parameter which defaults to 0, but allows you to specify the extent to which elements must overlap along the relevant axis to be considered in line with each other.

Change handling of ignore

  • The ignore property should be renamed to ignored.
  • We should implement an ignore() method, which sets ignore to True.
  • We should implement ignore_elements on an ElementList.

For performance, we should keep a set of ignored_indexes on the document. A PDFElement should have a reference to its PDFDocument so it can check if it's ignored.

Ignored elements will be excluded from all lists. This should be done on __init__ of the ElementLists, and can be done fast as you can just do - self.document.ignored_indexes. The ignore_elements method on an ElementList can also be efficient, rather than calling ignore() on each of its contained elements.
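A rough sketch of the bookkeeping described above, with minimal stand-in classes (not py-pdf-parser's internals):

```python
class Document:
    """Stand-in: holds all element indexes plus the shared ignored set."""

    def __init__(self, element_indexes):
        self.element_indexes = frozenset(element_indexes)
        self.ignored_indexes = set()


class ElementList:
    """Stand-in: excludes ignored elements with a single set difference."""

    def __init__(self, document, indexes=None):
        if indexes is None:
            indexes = document.element_indexes
        self.document = document
        self.indexes = frozenset(indexes) - document.ignored_indexes

    def ignore_elements(self):
        # Bulk-ignore is one set update, no per-element ignore() calls.
        self.document.ignored_indexes.update(self.indexes)


doc = Document(range(5))
doc.ignored_indexes.update({1, 3})
print(sorted(ElementList(doc).indexes))  # [0, 2, 4]
```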

[performance] Disable advanced layout analysis

I noticed that by setting boxes_flow outside the documented range, you can actually disable PDFMiner's advanced layout analysis.

We don't need the advanced analysis since we have no hierarchy of text boxes and we order them ourselves, and it's quite a performance gain to leave these out.

I've filed an issue (and fix) to update the documentation and also allow boxes_flow to be passed as None to explicitly disable this: pdfminer/pdfminer.six#395

Once that's merged, we should either default or hard-code our boxes_flow la param to None. It feels like we should allow it to be overridden, but equally since we ignore the resulting analysis perhaps there's no point and we should hard-code it to None.
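Assuming the upstream change lands, the override might look something like this (la_params is the dict py-pdf-parser forwards to pdfminer's LAParams; the file name is hypothetical):

```python
# Disabling pdfminer's advanced layout analysis by passing boxes_flow=None.
la_params = {"boxes_flow": None}

# Hypothetical usage -- requires a real PDF on disk:
# from py_pdf_parser.loaders import load_file
# document = load_file("my_file.pdf", la_params=la_params)
print(la_params)
```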

Publish to PyPI

We should publish to PyPI to allow for pip3 install py-pdf-parser.

Once this is done, we should update the installation instructions on the documentation.

Use pdfminer high level functions in loaders

There are some new "high level" functions in PDFMiner, for example extract_pages.

We should be able to use these in our loaders, which should save a few lines of code interacting with PDFMiner.

There is one issue, which is that the high level functions require the file to be on your device, and so whilst we could change load_file(path_to_file: str, ...), we will not yet be able to change load(pdf_file: IO, ...). I've created an upstream issue for this here, which I'll also submit a fix for. This issue should probably wait until I've closed the upstream issue.

extract_table ignores ordering defined while loading the document

Bug Report

extract_table re-orders the table rows by the y axis (top to bottom), which works for most cases.

The issue comes if we have a table with a header which is below any of the other elements of the table, when we have a table in a page split by 2 columns for example:
extract_table_bug

In the above case, even if element_ordering is properly set in load to adjust to the page split, extract_table would return:

[["C", "D"], ["E", "F"],["HEADER 1", "HEADER 2"], ["A", "B"]]

Should we make extract_table obey the ordering on which the document elements are defined? Or should we add some sort of rows_sort and columns_sort options to the function?

Add tests for loaders

Currently the loaders aren't tested. We should add a (small) real PDF document to the tests and check that it can be loaded using our loaders.

Add documentation

We have docstrings, but it would be good to have some proper (mostly auto-generated) documentation.

Cache filtering by font

Filtering by font occurs frequently, and involves checking the font of each element.

It could be worth caching the filtered elements for each font somewhere on the document. This shouldn't be done at load, only when the function for each font is called.
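One way the lazy cache could look, sketched with a plain mapping from element index to font name (illustrative, not the library's structure):

```python
class FontFilterCache:
    """Illustrative lazy cache: compute the per-font element sets on first
    request, not at load time (not py-pdf-parser's actual implementation)."""

    def __init__(self, fonts_by_index):
        self._fonts = fonts_by_index  # element index -> font name
        self._cache = {}

    def indexes_with_font(self, font):
        if font not in self._cache:
            self._cache[font] = frozenset(
                idx for idx, f in self._fonts.items() if f == font
            )
        return self._cache[font]


cache = FontFilterCache({0: "Helvetica,12.0", 1: "Times,10.0", 2: "Helvetica,12.0"})
print(sorted(cache.indexes_with_font("Helvetica,12.0")))  # [0, 2]
```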

Allow some gaps in the table for extract_simple_table

A lot of tables simply have a few gaps, but have at least one row and one column which are complete.

Currently in this case we have to use extract_table, which is much slower than extract_simple_table (I've not done the maths, but extract_simple_table should be ~n^2, whereas extract_table is at least n^3).

We should allow gaps in the table for extract_simple_table, provided there is a complete row and column.

Rename table/text extraction functions

Currently we have

  • extract_table which takes an ElementList and returns a list of list of PDFElements.
  • extract_text_from_table which takes an ElementList and returns a list of list of strings.
  • _extract_text_from_table which takes a list of list of PDFElements and returns a list of list of strings.

We have a use case where we want to play with the "table" of PDFElements and then extract the text (i.e. we want to call _extract_text_from_table). Additionally, the extract_text_from_table name is confusing because it reads as though it expects a table whereas it actually wants an ElementList.

We should have:

  • extract_table(as_text=True) which takes an ElementList and returns a table (list of list) of either PDFElements or strings depending on the as_text argument. It will also have to take strip=False after #21 is implemented.
  • get_text_from_table which takes a table (i.e. list of list of PDFElements) and returns a list of list of strings

Add __repr__ to section class

We don't currently have a nice __repr__ for the section class. It would be nice if this e.g. said the name and the number of elements or something similar.
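Something along these lines, using a stand-in class (the exact format is up for discussion):

```python
class Section:
    """Minimal stand-in to illustrate the suggested __repr__."""

    def __init__(self, name, elements):
        self.name = name
        self.elements = elements

    def __repr__(self):
        return f"<Section name: '{self.name}', elements: {len(self.elements)}>"


print(repr(Section("header", ["el1", "el2", "el3"])))
# <Section name: 'header', elements: 3>
```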

Include text which is within figures

PDFMiner.six has a layout parameter, all_texts, which, if set to True, will also perform layout analysis on text within figures.

Doing this in py-pdf-parser does nothing, since we only look at text boxes. We should also include text from figures when all_texts=True.

Better visualisations of sections

It would be nice if we could draw large box outlines around elements with each section, to make it easy to see where each section is.

This is a little bit tricky, as sometimes the shape will have to be more complicated than a square, for example:
image

The problem should be something similar to finding the minimum shape containing all the elements in the section. The problem could be slightly more complicated than the above example if the elements are all over the place, but sections always contain continuous groups of elements so it should be okay.

One issue could be that I think at the moment we order based on the centre of an element. Thus if one element is much smaller than another there could be some overlap which shouldn't be there. I think we will have to cope with it not being perfect, but we should try to do pretty well as we don't want it to be misleading.

Note this is the visualise tool so we're not too interested in performance.

Allow different element orderings

At the moment, elements are ordered left to right, top to bottom.

This should be configurable, which would allow us to handle:
(1) right to left, top to bottom,
(2) pages which are landscape and so come through as vertical,
(3) PDFs with e.g. two columns of content (slightly more complicated).

We should have an argument which has some presets, but also can be a function so you can pass in your own.
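A sketch of what a custom ordering callable could look like, using dicts with pdfminer-style coordinates as stand-ins for elements (the interface details here are assumptions, not the final design):

```python
# Custom ordering: right to left, top to bottom. In pdfminer's coordinate
# system y increases upwards, so sort by descending y1 then descending x0.
def right_to_left_top_to_bottom(elements):
    return sorted(elements, key=lambda el: (-el["y1"], -el["x0"]))


elements = [
    {"name": "left", "x0": 0, "y1": 100},
    {"name": "right", "x0": 50, "y1": 100},
    {"name": "lower", "x0": 0, "y1": 40},
]
print([el["name"] for el in right_to_left_top_to_bottom(elements)])
# ['right', 'left', 'lower']
```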

Allow filtering by regular expressions

The most basic thing to add would be filter_by_regular_expression (or maybe simply filter_by_regex). I don't think there's any need to add anything which allows you to provide multiple regular expressions as you can just construct a longer single regex to do this.
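A sketch of what the method could do, operating on plain strings instead of PDFElements (the name and signature are just the suggestion above):

```python
import re


def filter_by_regex(elements, pattern, flags=0):
    """Illustrative sketch of the proposed method: keep elements whose
    text matches the regular expression."""
    compiled = re.compile(pattern, flags)
    return [el for el in elements if compiled.search(el)]


print(filter_by_regex(["Total: 42", "Notes", "Total: 7"], r"^Total: \d+$"))
# ['Total: 42', 'Total: 7']
```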

Unable to install with pip3

Bug Report


Unable to install via pip as shown in the documentation. Running pip3 install py-pdf-parser I get the following error:

    ERROR: Command errored out with exit status 1:
     command: /home/hank/Development/pdf_parse/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-j1z0oskj/wand/setup.py'"'"'; __file__='"'"'/tmp/pip-install-j1z0oskj/wand/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-j1z0oskj/wand/pip-egg-info
         cwd: /tmp/pip-install-j1z0oskj/wand/
    Complete output (18 lines):
    Traceback (most recent call last):
      File "/tmp/pip-install-j1z0oskj/wand/wand/api.py", line 180, in <module>
        libraries = load_library()
      File "/tmp/pip-install-j1z0oskj/wand/wand/api.py", line 135, in load_library
        raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
    OSError: cannot find library; tried paths: []
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-j1z0oskj/wand/setup.py", line 8, in <module>
        from wand.version import VERSION
      File "/tmp/pip-install-j1z0oskj/wand/wand/version.py", line 45, in <module>
        from .api import libmagick, library
      File "/tmp/pip-install-j1z0oskj/wand/wand/api.py", line 198, in <module>
        distname, _, __ = platform.linux_distribution()
    AttributeError: module 'platform' has no attribute 'linux_distribution'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I am trying to install on KDE Neon and using that and the dev version of py-pdf-parser give me the same error. Though it seems to be working on a Windows boot.

Run tests on GitHub Actions

Currently tests are running on our private Jenkins, which isn't good if we want to open source (we do!), so let's use GitHub Actions instead!

Filtering for a non-existent section raises a KeyError

This is implemented as a dictionary access on the sections dict, which means that if you filter for a (unique) section name (i.e. including the _idx) which doesn't exist, you get a KeyError. Filtering for a non-existent section should probably handle this exception and return an empty section list.

It might be worth adding a get_section method which raises a SectionNotFoundError instead of a KeyError when given a unique section name which doesn't exist.
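A sketch of the suggested method with stand-in types (the exception name comes from the proposal above):

```python
class SectionNotFoundError(Exception):
    pass


def get_section(sections, unique_name):
    """Sketch only: translate the sections dict's KeyError into a
    domain-specific exception (stand-in types, not the library's)."""
    try:
        return sections[unique_name]
    except KeyError:
        raise SectionNotFoundError(f"No section named {unique_name!r}") from None


sections = {"header_0": "header-section"}
print(get_section(sections, "header_0"))  # header-section
```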
