
py-pdf-parser's Introduction

py-pdf-parser


Py PDF Parser is a tool to help extract information from structured PDFs.

Full details and installation instructions can be found at: https://py-pdf-parser.readthedocs.io/en/latest/

This project is based on an original design and prototype by Sam Whitehall (github.com/samwhitehall).

py-pdf-parser's People

Contributors

aceto1, barraponto, dantehemerson, dependabot[bot], franga2000, fynxiu, hexarobi, jean-garret, jstockwin, paulopaixaoamaral, pre-commit-ci[bot]


py-pdf-parser's Issues

Finish the info screen on visualise tool

You can pass show_info=True to the visualise tool, and this allows you to click on elements and see details etc.

It is unfinished and needs work.

  • The visuals need improving -> we currently just throw loads of text into a figure. Perhaps this can be better formatted?
  • Some of the fields could be improved, and we're missing e.g. character margins etc. It would be good if it enabled you to completely understand the LAParams, since these can be quite confusing.
  • Additionally, once this is done it would be good to add an example to the documentation.

Add code coverage checks to CI

We should add some code coverage checks to the CI.

Initially this will help us find untested areas (note the visualise tool is currently untested as we're not really sure how to go about testing it...).

Moving forwards, it will help catch any PRs which don't add sufficient tests.

Use of Visualize

Bug Report

Please also check that your bug is not actually caused by pdfminer.six, and is really an issue with this project.

example:

from py_pdf_parser.loaders import load_file
from py_pdf_parser.visualise import visualise

document = load_file("1907-1912_RESULTS.pdf")
visualise(document)

Error:

Traceback (most recent call last):
  File "C:\Users\andrewp\AppData\Roaming\Python\Python38\site-packages\wand\api.py", line 180, in <module>
    libraries = load_library()
  File "C:\Users\andrewp\AppData\Roaming\Python\Python38\site-packages\wand\api.py", line 135, in load_library
    raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
OSError: cannot find library; tried paths: [~/CORE_RL_wand_.dll, etc.]

Problem: ImageMagick v7+ no longer includes the searched-for DLL files.

Resolution:
Uninstall the most recent ImageMagick (7.0.10-34) and install the latest legacy ImageMagick (6.9.11-34).

Solution:
Provide compatibility with v7+, or note that one should install the legacy version.

Consider using sorted sets?

Our filtering currently uses Python's frozensets. This is mainly to be 100% sure we're not editing old element lists etc by applying additional operations. Performance-wise, frozensets are the same as sets.

Calling __getitem__ on an element list means we need to sort the frozenset, which we do by calling sorted. Additionally, #113 adds some new functions which require sorting.

We could use sorted sets (e.g. http://www.grantjenks.com/docs/sortedcontainers/introduction.html#sorted-set). I believe we could just override the functions which allow mutation, to be sure we never mutate. This should give us some performance gain, since we won't need to keep explicitly sorting the elements. This should be checked.

We should also check that there is no performance loss when doing set operations on a sorted set vs a set. We also need to e.g. add and remove elements from the set, but not by mutating the set. This might require some copies, which could slow things down?

Pros:

  • Performance gain (needs verification)
  • Some parts of the code become simpler

Cons:

  • There'd be an additional dependency
  • We'd probably want to modify the sorted set class slightly to ensure we don't allow mutations
  • Possible (but I think unlikely) performance drop when doing set operations?

Thoughts/ideas on this appreciated.

Strip element text by default

Instead of a property element.text we should have a method element.text(strip=True). This means text gets stripped by default, but you can stop this behaviour if you want.
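A minimal sketch of the proposed interface, using a stand-in class (not py-pdf-parser's real PDFElement):

```python
class FakeElement:
    """Minimal stand-in for a PDFElement, just to illustrate the proposed
    API change (not py-pdf-parser's real class)."""

    def __init__(self, raw_text):
        self._raw_text = raw_text

    def text(self, strip=True):
        # Proposed behaviour: strip surrounding whitespace by default,
        # but allow opting out.
        return self._raw_text.strip() if strip else self._raw_text


element = FakeElement("  Hello world \n")
print(element.text())                      # "Hello world"
print(repr(element.text(strip=False)))     # "'  Hello world \n'"
```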

Section visualisations can be made simpler in some cases

#69 added code to show section outlines. Because we wanted to be 100% correct about which elements were within the outline, the construction is quite complicated and can result in strangely shaped outlines. The code isn't too slow, but it's also not super fast.

In the case where a simple rectangle can be drawn around the section (which should be reasonably trivial to check), this should be done.

Add `include_last_element` to `create_section`

It seems to be a common case that you know the element at which you want to end a section, but you don't want that element to be included (e.g. because it is the start of the next section). We should add an include_last_element=True argument to create_section. Note that this can simply subtract one from the index of the element and then get the new element from the document.

We could also add include_start_element, but I think this is less useful so maybe we won't bother until it feels like a necessary feature.
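The index arithmetic described above can be sketched like this (the function and argument names follow the proposal; a plain list stands in for the document's ordered elements):

```python
def end_element_for_section(elements, last_element, include_last_element=True):
    """Sketch only: pick the element a section should end at, optionally
    stepping back one position so the given element is excluded."""
    if include_last_element:
        return last_element
    index = elements.index(last_element)
    if index == 0:
        raise ValueError("No element before the given last element")
    # Step back one position in the document ordering.
    return elements[index - 1]


elements = ["heading", "body-1", "body-2", "next-heading"]
print(end_element_for_section(elements, "next-heading", include_last_element=False))
# -> "body-2"
```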

Change font sizes to floats

PDFMiner has changed the way it gets the font size since this code was written. I think initially the heights had lots of decimal places etc.

We've just seen a PDF where the heights are now e.g. 7.5. Due to some precision issue (?) this is sometimes rounded up to 8 or down to 7, for different elements with exactly the same font.

We should switch from using int for the font size to using float instead. We should also check multiple PDFs to decide if we want to add any rounding, or whether they all seem reasonable.

Extract simple table could be more efficient

At the moment we're looping through the reference row and then the reference column, and computing the element which is in line with both the reference row element and the reference col element.

This means we're doing the geometry checks to work out each row len(cols) times, and the checks to work out each col len(rows) times.

We should do the geometry filtering at the start to create a list of rows and a list of cols. We can then just & the two to get the element.

This should require much less processing, as you only do geometry checks len(rows) + len(cols) times, rather than len(rows) * len(cols) times.
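The counting argument can be sketched with plain sets standing in for elements and geometry checks:

```python
# Elements represented as (row, col) coordinates standing in for PDFElements.
elements = {(r, c) for r in range(3) for c in range(2)}

# One "geometry check" pass per reference row and per reference column:
# len(rows) + len(cols) passes in total.
row_sets = [{e for e in elements if e[0] == r} for r in range(3)]
col_sets = [{e for e in elements if e[1] == c} for c in range(2)]

# Each cell is then a plain set intersection, with no further geometry checks.
table = [[(rs & cs).pop() for cs in col_sets] for rs in row_sets]
print(table)  # [[(0, 0), (0, 1)], [(1, 0), (1, 1)], [(2, 0), (2, 1)]]
```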

Too large a tolerance causes an error

We allow passing a tolerance parameter to the horizontally_in_line_with and vertically_in_line_with functions. However, if you specify a tolerance which is larger than the width/height of the element, this instantiates an invalid bounding box with either x0 > x1 or y0 > y1.

We should cap the tolerance at the width/height of the element.
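A minimal sketch of the cap, assuming the tolerance is applied symmetrically to both sides of the bounding box (in which case half the width/height is the safe limit; names are illustrative, not library internals):

```python
def capped_tolerance(tolerance, low, high):
    """Cap the tolerance so that shrinking the box by `tolerance` on each
    side can never produce low > high (illustrative, not library code)."""
    return min(tolerance, (high - low) / 2)


x0, x1 = 10.0, 14.0
tol = capped_tolerance(5.0, x0, x1)   # requested 5.0, element only 4.0 wide
print(tol, (x0 + tol, x1 - tol))      # 2.0 (12.0, 12.0) -- still a valid box
```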

What is the prefix 'PCAGML' in the font "PCAGML+SourceHanSerifCN-Regular,16.0"?

Hi @jstockwin ,

I am trying to load three similar files which are auto-generated from the same system, but I get three different font prefixes.

>>> set(e.font for e in doc.elements)
{'OHKPGR+SourceHanSerifCN-Regular,10.5', 'Helvetica,12.0', 'OHKPGR+SourceHanSerifCN-Regular,26.0', 'OHKPGR+SourceHanSerifCN-Regular,9.0', 'OHKPGR+SourceHanSerifCN-Regular,12.0', 'OHKPGR+SourceHanSerifCN-Regular,8.5', 'OHKPGR+SourceHanSerifCN-Regular,16.0', 'OHKPGR+SourceHanSerifCN-Regular,14.0'}

>>> set(e.font for e in doc2.elements)
{'Helvetica,12.0', 'FZVVCB+SourceHanSerifCN-Regular,8.5', 'FZVVCB+SourceHanSerifCN-Regular,12.0', 'FZVVCB+SourceHanSerifCN-Regular,26.0', 'FZVVCB+SourceHanSerifCN-Regular,14.0', 'FZVVCB+SourceHanSerifCN-Regular,10.5', 'FZVVCB+SourceHanSerifCN-Regular,16.0', 'FZVVCB+SourceHanSerifCN-Regular,9.0'}

>>> set(e.font for e in doc3.elements)
{'PCAGML+SourceHanSerifCN-Regular,16.0', 'PCAGML+SourceHanSerifCN-Regular,14.0', 'PCAGML+SourceHanSerifCN-Regular,26.0', 'PCAGML+SourceHanSerifCN-Regular,12.0', 'PCAGML+SourceHanSerifCN-Regular,8.5', 'PCAGML+SourceHanSerifCN-Regular,9.0', 'Helvetica,12.0', 'PCAGML+SourceHanSerifCN-Regular,10.5'}

What is the prefix 'PCAGML' in the font "PCAGML+SourceHanSerifCN-Regular,16.0"?
And how can I use FONT_MAPPING to handle these differences?

Thanks in advance.
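For context: a six-letter tag followed by "+" marks a subset-embedded font. The PDF specification requires a random tag in front of the base font name when only a subset of the glyphs is embedded, which is why otherwise identical files get different prefixes. One way to compare fonts across files is to strip the tag, sketched here in plain Python (a regex-based font mapping passed to py-pdf-parser's load_file could express a similar normalisation, but the helper below is not part of the library):

```python
import re

# Subset tags are six uppercase letters followed by "+", per the PDF spec.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def normalise_font(font):
    """Strip the random subset tag so the same logical font compares equal
    across files (illustrative helper, not part of py-pdf-parser)."""
    return SUBSET_PREFIX.sub("", font)


print(normalise_font("PCAGML+SourceHanSerifCN-Regular,16.0"))
print(normalise_font("OHKPGR+SourceHanSerifCN-Regular,16.0"))
# Both print: SourceHanSerifCN-Regular,16.0
```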

Some suggestions to enhance locating logics -'offset()' and 'resize()'

Hi @jstockwin,

I am coming with some suggestions from my recent practice, would you please take a look? Thanks in advance.

  1. Add 'offset()' and 'resize()' shortcut methods to ElementList and also Element:
    With the inclusive param of the ElementList.between() method, users can only select elements like (e_1, ..., e_n) or [e_1, ..., e_n]. In the first case, I sometimes want to select elements like (e_1, ..., e_n] or [e_1, ..., e_n). Furthermore, I sometimes want to select elements like (e_1-3, ..., e_n+4), which requires elements outside the 'between' elements, and maybe also before and after slices.
    So I suggest giving users the freedom to easily choose elements:
    'offset()': given e_0, I am able to choose e_-3 or e_3, so that I can easily locate elements between two offset elements.
    'resize()': given an element e_0, I am able to choose [e_0, ..., e_4].

The following is a demo of my actual usage without offset and resize, which troubles me.

    def locate_contents_between_locs(
        self,
        loc_pair_text,
        loc_pair_equal_or_contain=('equal', 'equal'),
        loc_pair_idx=(0, 0),
        loc_pair_offset=(0, 0),
        elements_scope=None,
        output_type=list,
        if_print_contents=False,
    ):
        # Fall back to the whole document if no scope is given.
        if elements_scope is None:
            elements = self.doc.elements
        else:
            elements = elements_scope

        loc_pair = [None, None]
        for i in range(2):
            if loc_pair_equal_or_contain[i] == 'equal':
                loc_pair[i] = elements.filter_by_text_equal(loc_pair_text[i])[loc_pair_idx[i]]
            elif loc_pair_equal_or_contain[i] == 'contain':
                loc_pair[i] = elements.filter_by_text_contains(loc_pair_text[i])[loc_pair_idx[i]]

            if loc_pair_offset[i] > 0:  # inclusive: [0, 1, 2, ...
                loc_pair[i] = elements.after(loc_pair[i], inclusive=True)[loc_pair_offset[i]]
            elif loc_pair_offset[i] < 0:  # exclusive: ..., -2, -1]
                loc_pair[i] = elements.before(loc_pair[i], inclusive=False)[loc_pair_offset[i]]

        contents = elements.between(loc_pair[0], loc_pair[1])
        if if_print_contents:
            print(f'{contents}:{[c.text() for c in contents]}')

        return contents

Add feature to remove duplicate header rows

It is often the case that if a table goes over a page break then the header is repeated on the new page.

Even though it's not strictly a pdf parsing thing, it might be nice to add a util to handle this case (similar to the fact we handle adding the header to the table even though that's not strictly pdf parsing).

Essentially, if the header row repeats then it should be removed. I think all rows would be checked against the header row, and if both the text and the font match, the row should be removed from the table. We'd sometimes have to keep track of the removed rows so that the checks pass (we have checks to ensure the correct number of elements were detected).

There should be a parameter to enable this behaviour and it should default to False.

Add tolerance to geometric filtering functions

We have some cases (in this case in a table) where elements only just overlap along a certain axis, and we don't want to include these in e.g. "vertically in line with", since they barely touch.

We've decided to add a tolerance parameter which defaults to 0, but allows you to specify the extent to which elements must overlap along the relevant axis to be considered in line with each other.

Change handling of ignore

  • The ignore property should be renamed to ignored.
  • We should implement an ignore() method, which sets ignore to True.
  • We should implement ignore_elements on an ElementList.

For performance, we should keep a set of ignored_indexes on the document. A PDFElement should have a reference to its PDFDocument so it can check if it's ignored.

Ignored elements will be excluded from all lists. This should be done on __init__ of the ElementLists, and can be done fast as you can just do - self.document.ignored_indexes. The ignore_elements method on an ElementList can also be efficient, rather than calling ignore() on each of its contained elements.
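A rough sketch of the bookkeeping described above, with minimal stand-in classes (not py-pdf-parser's internals):

```python
class Document:
    """Stand-in: holds all element indexes plus the shared ignored set."""

    def __init__(self, element_indexes):
        self.element_indexes = frozenset(element_indexes)
        self.ignored_indexes = set()


class ElementList:
    """Stand-in: excludes ignored elements with a single set difference."""

    def __init__(self, document, indexes=None):
        if indexes is None:
            indexes = document.element_indexes
        self.document = document
        self.indexes = frozenset(indexes) - document.ignored_indexes

    def ignore_elements(self):
        # Bulk-ignore is one set update, no per-element ignore() calls.
        self.document.ignored_indexes.update(self.indexes)


doc = Document(range(5))
doc.ignored_indexes.update({1, 3})
print(sorted(ElementList(doc).indexes))  # [0, 2, 4]
```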

[performance] Disable advanced layout analysis

I noticed that by setting boxes_flow outside the documented range, you can actually disable PDFMiner's advanced layout analysis.

We don't need the advanced analysis since we have no hierarchy of text boxes and we order them ourselves, and it's quite a performance gain to leave these out.

I've filed an issue (and fix) to update the documentation and also allow boxes_flow to be passed as None to explicitly disable this: pdfminer/pdfminer.six#395

Once that's merged, we should either default or hard-code our boxes_flow la param to None. It feels like we should allow it to be overridden, but equally since we ignore the resulting analysis perhaps there's no point and we should hard-code it to None.
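Assuming the upstream change lands, the override might look something like this (la_params is the dict py-pdf-parser forwards to pdfminer's LAParams; the file name is hypothetical):

```python
# Disabling pdfminer's advanced layout analysis by passing boxes_flow=None.
la_params = {"boxes_flow": None}

# Hypothetical usage -- requires a real PDF on disk:
# from py_pdf_parser.loaders import load_file
# document = load_file("my_file.pdf", la_params=la_params)
print(la_params)
```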

Publish to PyPI

We should publish to PyPI to allow for pip3 install py-pdf-parser.

Once this is done, we should update the installation instructions on the documentation.

Use pdfminer high level functions in loaders

There are some new "high level" functions in PDFMiner, for example extract_pages.

We should be able to use these in our loaders, which should save a few lines of code interacting with PDFMiner.

There is one issue, which is that the high level functions require the file to be on your device, and so whilst we could change load_file(path_to_file: str, ...), we will not yet be able to change load(pdf_file: IO, ...). I've created an upstream issue for this here, which I'll also submit a fix for. This issue should probably wait until I've closed the upstream issue.

extract_table ignores ordering defined while loading the document

Bug Report

extract_table re-orders the table rows by the y axis (top to bottom), which works for most cases.

The issue comes if we have a table with a header which is below any of the other elements of the table, when we have a table in a page split by 2 columns for example:
extract_table_bug

In the above case, even if element_ordering is properly set in load to adjust to the page split, extract_table would return:

[["C", "D"], ["E", "F"],["HEADER 1", "HEADER 2"], ["A", "B"]]

Should we make extract_table obey the ordering on which the document elements are defined? Or should we add some sort of rows_sort and columns_sort options to the function?

Add tests for loaders

Currently the loaders aren't tested. We should add a (small) real PDF document to the tests and check that it can be loaded using our loaders.

Add documentation

We have docstrings, but it would be good to have some proper (mostly auto-generated) documentation.

Cache filtering by font

Filtering by font occurs frequently, and involves checking the font of each element.

It could be worth caching the filtered elements for each font somewhere on the document. This shouldn't be done at load, only when the function for each font is called.
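One way the lazy cache could look, sketched with a plain mapping from element index to font name (illustrative, not the library's structure):

```python
class FontFilterCache:
    """Illustrative lazy cache: compute the per-font element sets on first
    request, not at load time (not py-pdf-parser's actual implementation)."""

    def __init__(self, fonts_by_index):
        self._fonts = fonts_by_index  # element index -> font name
        self._cache = {}

    def indexes_with_font(self, font):
        if font not in self._cache:
            self._cache[font] = frozenset(
                idx for idx, f in self._fonts.items() if f == font
            )
        return self._cache[font]


cache = FontFilterCache({0: "Helvetica,12.0", 1: "Times,10.0", 2: "Helvetica,12.0"})
print(sorted(cache.indexes_with_font("Helvetica,12.0")))  # [0, 2]
```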

Allow some gaps in the table for extract_simple_table

A lot of tables simply have a few gaps, but have at least one row and one column which are complete.

Currently in this case we have to use extract_table, which is much slower than extract_simple_table (I've not done the maths, but extract_simple_table should be ~n^2, whereas extract_table is at least n^3).

We should allow gaps in the table for extract_simple_table, provided there is a complete row and column.

Rename table/text extraction functions

Currently we have

  • extract_table which takes an ElementList and returns a list of list of PDFElements.
  • extract_text_from_table which takes an ElementList and returns a list of list of strings.
  • _extract_text_from_table which takes a list of list of PDFElements and returns a list of list of strings.

We have a use case where we want to play with the "table" of PDFElements and then extract the text (i.e. we want to call _extract_text_from_table). Additionally, the extract_text_from_table name is confusing because it reads as though it expects a table whereas it actually wants an ElementList.

We should have:

  • extract_table(as_text=True) which takes an ElementList and returns a table (list of list) of either PDFElements or strings depending on the as_text argument. It will also have to take strip=False after #21 is implemented.
  • get_text_from_table which takes a table (i.e. list of list of PDFElements) and returns a list of list of strings

Add __repr__ to section class

We don't currently have a nice __repr__ for the section class. It would be nice if this e.g. said the name and the number of elements or something similar.
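Something along these lines, using a stand-in class (the exact format is up for discussion):

```python
class Section:
    """Minimal stand-in to illustrate the suggested __repr__."""

    def __init__(self, name, elements):
        self.name = name
        self.elements = elements

    def __repr__(self):
        return f"<Section name: '{self.name}', elements: {len(self.elements)}>"


print(repr(Section("header", ["el1", "el2", "el3"])))
# <Section name: 'header', elements: 3>
```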

Include text which is within figures

PDFMiner.six has a layout parameter, all_texts, which, if set to True, will also perform layout analysis on text within figures.

Doing this in py-pdf-parser does nothing, since we only look at text boxes. We should also include text from figures when all_texts=True.

Better visualisations of sections

It would be nice if we could draw large box outlines around elements with each section, to make it easy to see where each section is.

This is a little bit tricky, as sometimes the shape will have to be more complicated than a square, for example:
image

The problem should be something similar to finding the minimum shape containing all the elements in the section. The problem could be slightly more complicated than the above example if the elements are all over the place, but sections always contain continuous groups of elements so it should be okay.

One issue could be that I think at the moment we order based on the centre of an element. Thus if one element is much smaller than another there could be some overlap which shouldn't be there. I think we will have to cope with it not being perfect, but we should try to do pretty well as we don't want it to be misleading.

Note this is the visualise tool so we're not too interested in performance.

Allow different element orderings

At the moment, elements are ordered left to right, top to bottom.

This should be configurable, which would allow us to handle:
(1) right to left, top to bottom,
(2) pages which are landscape and so come through as vertical,
(3) PDFs with e.g. two columns of content (slightly more complicated).

We should have an argument which has some presets, but also can be a function so you can pass in your own.
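A sketch of what a custom ordering callable could look like, using dicts with pdfminer-style coordinates as stand-ins for elements (the interface details here are assumptions, not the final design):

```python
# Custom ordering: right to left, top to bottom. In pdfminer's coordinate
# system y increases upwards, so sort by descending y1 then descending x0.
def right_to_left_top_to_bottom(elements):
    return sorted(elements, key=lambda el: (-el["y1"], -el["x0"]))


elements = [
    {"name": "left", "x0": 0, "y1": 100},
    {"name": "right", "x0": 50, "y1": 100},
    {"name": "lower", "x0": 0, "y1": 40},
]
print([el["name"] for el in right_to_left_top_to_bottom(elements)])
# ['right', 'left', 'lower']
```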

Allow filtering by regular expressions

The most basic thing to add would be filter_by_regular_expression (or maybe simply filter_by_regex). I don't think there's any need to add anything which allows you to provide multiple regular expressions as you can just construct a longer single regex to do this.
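A sketch of what the method could do, operating on plain strings instead of PDFElements (the name and signature are just the suggestion above):

```python
import re


def filter_by_regex(elements, pattern, flags=0):
    """Illustrative sketch of the proposed method: keep elements whose
    text matches the regular expression."""
    compiled = re.compile(pattern, flags)
    return [el for el in elements if compiled.search(el)]


print(filter_by_regex(["Total: 42", "Notes", "Total: 7"], r"^Total: \d+$"))
# ['Total: 42', 'Total: 7']
```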

Unable to install with pip3

Bug Report


Unable to install via pip as shown in the documentation. Running pip3 install py-pdf-parser I get the following error:

    ERROR: Command errored out with exit status 1:
     command: /home/hank/Development/pdf_parse/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-j1z0oskj/wand/setup.py'"'"'; __file__='"'"'/tmp/pip-install-j1z0oskj/wand/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-j1z0oskj/wand/pip-egg-info
         cwd: /tmp/pip-install-j1z0oskj/wand/
    Complete output (18 lines):
    Traceback (most recent call last):
      File "/tmp/pip-install-j1z0oskj/wand/wand/api.py", line 180, in <module>
        libraries = load_library()
      File "/tmp/pip-install-j1z0oskj/wand/wand/api.py", line 135, in load_library
        raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
    OSError: cannot find library; tried paths: []
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-j1z0oskj/wand/setup.py", line 8, in <module>
        from wand.version import VERSION
      File "/tmp/pip-install-j1z0oskj/wand/wand/version.py", line 45, in <module>
        from .api import libmagick, library
      File "/tmp/pip-install-j1z0oskj/wand/wand/api.py", line 198, in <module>
        distname, _, __ = platform.linux_distribution()
    AttributeError: module 'platform' has no attribute 'linux_distribution'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I am trying to install on KDE Neon and using that and the dev version of py-pdf-parser give me the same error. Though it seems to be working on a Windows boot.

Run tests on GitHub Actions

Currently tests are running on our private Jenkins, which isn't good if we want to open source (we do!), so let's use GitHub Actions instead!

Filtering for a non-existent section raises a KeyError

This is implemented as a dictionary access on the sections dict, which means that if you filter for a (unique) section name (i.e. including the _idx) which doesn't exist, you get a KeyError. Filtering for a non-existent section should probably handle this exception and return an empty section list.

It might be worth adding a get_section method which raises a SectionNotFoundError instead of a KeyError when given a unique section name which doesn't exist.
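A sketch of the suggested method with stand-in types (the exception name comes from the proposal above):

```python
class SectionNotFoundError(Exception):
    pass


def get_section(sections, unique_name):
    """Sketch only: translate the sections dict's KeyError into a
    domain-specific exception (stand-in types, not the library's)."""
    try:
        return sections[unique_name]
    except KeyError:
        raise SectionNotFoundError(f"No section named {unique_name!r}") from None


sections = {"header_0": "header-section"}
print(get_section(sections, "header_0"))  # header-section
```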
