timclicks / slate

The simplest way to extract text from PDFs in Python

License: GNU General Public License v3.0

Python 100.00%

slate's Introduction

======================================================
slate: the easiest way to get text from PDFs in Python
======================================================


Slate is a Python package that simplifies the process of extracting
text from PDF files. It depends on the PDFMiner package.

Slate provides one class, PDF. PDF takes a file-like object and
extracts all text from the document, presenting each page
as a string of text:

  >>> import slate
  >>> with open('example.pdf', 'rb') as f:
  ...    doc = slate.PDF(f)
  ...
  >>> doc
  [..., ..., ...]
  >>> doc[1]
  'Text from page 2...'

If your PDF is password protected, pass the password as the
second argument:

  >>> with open('secrets.pdf', 'rb') as f:
  ...     doc = slate.PDF(f, 'password')
  ...
  >>> doc[0]
  "My mother doesn't know this, but..."

More complex operations
-----------------------

If you would like access to the images, font files and other
information, then take some time to learn the PDFMiner API.


What is wrong with PDFMiner?
----------------------------

  1. Getting simple things done, like extracting the text,
     is quite complex. The program is not designed to return
     Python objects, which makes interfacing with it irritating.
  2. It's an extremely complete set of tools, with multiple
     moderately steep learning curves.
  3. It's not written with hackability in mind.


slate's Issues

Can we get it documented that this doesn't work with python 3.something+?

On python 3.7

>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gmauer/.local/share/virtualenvs/resume-parser-uEcoETqN/lib/python3.7/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
ImportError: cannot import name 'PDF' from 'slate' (/Users/gmauer/.local/share/virtualenvs/resume-parser-uEcoETqN/lib/python3.7/site-packages/slate/__init__.py)

Looking through the issues and unmerged PRs, it seems this doesn't work with some versions of Python (3.4+, maybe?). I get that maintainers have limited time; I am an OSS maintainer as well. But I'm now three PDF parsing libraries deep, each of which I set up and started working with only to hit an issue like this. It is incredibly frustrating.

Can we at least add a note near the top of the README saying that the main pip version doesn't work with more recent Python versions?

Make sure to Initialize fields

See: cancan101@bfd6336

ImportError: cannot import name 'handle_error' from 'pdfminer.psparser'

from pdfminer.psparser import PSStackParser, PSSyntaxError, PSEOF, literal_name, LIT, KWD, handle_error
ImportError: cannot import name 'handle_error' from 'pdfminer.psparser' (/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pdfminer/psparser.py)

PyPI version is out of date with the current version of PDFminer

When I did pip install with the master branch of this repo, it works.

The version on PyPI gives the error

>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/anadas/.virtualenvs/scratch/lib/python2.7/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
  File "/Users/anadas/.virtualenvs/scratch/lib/python2.7/site-packages/slate/slate.py", line 3, in <module>
    from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: No module named pdfminer.pdfparser

AttributeError when opening documents that contain chinese characters

When opening documents that contain Chinese or Thai characters, I get the following exception:

File "create_index.py", line 16, in <module>
    pdf_data = slate.PDF(f)
  File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 49, in __init__
    self._cleanup()
  File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 57, in _cleanup
    del self.device
AttributeError: device

Install error using python 3

$ pip install slate
Collecting slate
  Downloading slate-0.3.zip
Collecting distribute (from slate)
  Using cached distribute-0.7.3.zip
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/__init__.py", line 2, in <module>
        from setuptools.extension import Extension, Library
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/extension.py", line 5, in <module>
        from setuptools.dist import _get_unpatched
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/dist.py", line 7, in <module>
        from setuptools.command.install import install
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/command/__init__.py", line 8, in <module>
        from setuptools.command import install_scripts
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/command/install_scripts.py", line 3, in <module>
        from pkg_resources import Distribution, PathMetadata, ensure_directory
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/pkg_resources.py", line 1518, in <module>
        register_loader_type(importlib_bootstrap.SourceFileLoader, DefaultProvider)
    AttributeError: module 'importlib._bootstrap' has no attribute 'SourceFileLoader'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/

use cStringIO rather than StringIO

See: cancan101@c53fa28

PDFDocument() from pdfminer requires a parser argument

Using the fix from #24 and python 3.5.2, I call slate.PDF(file) but PDFDocument requires a parser. What should be put here? I tried self.parser but this didn't work.

Traceback (most recent call last):
  File "pdftotext.py", line 7, in <module>
    doc = slate.PDF(f)
  File "//anaconda/lib/python3.5/site-packages/slate/classes.py", line 56, in __init__
    self.doc = PDFDocument()
TypeError: __init__() missing 1 required positional argument: 'parser'

putting self.doc = PDFDocument(self.parser) leads to this error that I cannot fix either.

Traceback (most recent call last):
  File "pdftotext.py", line 7, in <module>
    doc = slate.PDF(f)
  File "//anaconda/lib/python3.5/site-packages/slate/classes.py", line 57, in __init__
    self.doc = PDFDocument(self.parser)
  File "//anaconda/lib/python3.5/site-packages/pdfminer/pdfdocument.py", line 559, in __init__
    pos = self.find_xref(parser)
  File "//anaconda/lib/python3.5/site-packages/pdfminer/pdfdocument.py", line 773, in find_xref
    for line in parser.revreadlines():
  File "//anaconda/lib/python3.5/site-packages/pdfminer/psparser.py", line 285, in revreadlines
    s = self.fp.read(prevpos-pos)
  File "//anaconda/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 2: invalid start byte
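
The last frames of this traceback pass through codecs.py during fp.read(), which suggests the file object was opened in text mode, so Python 3 tried to decode the PDF's binary data as UTF-8. That reading is an inference from the traceback, not something the issue confirms. A minimal sketch of the difference, using a throwaway file with made-up bytes rather than a real PDF:

```python
import os
import tempfile

# Made-up payload: a PDF-like header followed by bytes that are not valid UTF-8.
payload = b"%PDF-1.4\n\x88\x99 binary stream\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(payload)
    path = tmp.name

# Text mode: Python 3 decodes while reading and fails on byte 0x88.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    text_mode_ok = True
except UnicodeDecodeError:
    text_mode_ok = False

# Binary mode: the bytes are returned untouched, which is what a PDF parser needs.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)
```

If this is the cause, opening the file with open(path, 'rb') before passing it to slate.PDF may be enough to get past this particular error.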

Upload pypi package

Please publish the new version, which fixes compatibility with pdfminer.

import error for slate.py file

from pdfminer.pdfparser import PDFParser, PDFDocument

in slate.py generates an error

However, this

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

does not.

Slate is broken in PyPI

Something in pdfminer has changed and PDFDocument class is not in pdfminer.pdfparser anymore, it's now in pdfminer.pdfdocument. I figured out you have fixed this problem here in GitHub, but the version in PyPI still has this problem. Could you please reupload it?
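
For code that has to run against both old and new pdfminer releases, one option is to try each known location in turn. The helper below is a hypothetical sketch (import_first is not part of slate or pdfminer), and it is demonstrated with standard-library modules so that it runs without pdfminer installed:

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it.

    Hypothetical helper for illustration; not part of slate or pdfminer.
    """
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module missing in this installation: try the next candidate.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError("cannot import %r from any of %r" % (name, list(module_paths)))


# Demonstration with the standard library: the first candidate does not exist,
# so the lookup falls through to the second one.
OrderedDict = import_first("OrderedDict", ["no_such_module_xyz", "collections"])
```

With pdfminer installed, the same call would read import_first("PDFDocument", ["pdfminer.pdfdocument", "pdfminer.pdfparser"]), using the two module paths named in this issue.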

pip database incorrect?

I was going to report a bug with imports internal to slate, but then I noticed my version doesn't match the one on github, and in fact doesn't even match the version shown as latest on pypi:

(t) howie@local-linux-2:~/tmp$ pip install slate==0.5.2
Collecting slate==0.5.2
  Could not find a version that satisfies the requirement slate==0.5.2 (from versions: 0.2.3, 0.3)
No matching distribution found for slate==0.5.2

Seems that some metadata is wrong somewhere?

Can we get paragraphs from the PDF using slate?

Hi,
this is not an issue, but I want to know whether there is a way to extract paragraphs from a PDF using slate or its parent library, pdfminer.

Error parsing a pdf with cStringIO support

Slate: master head: ff61328
Ubuntu 14.04 x64
Python 2.7.6

Received the following error when cStringIO is available for import.

.../lib/python2.7/site-packages/slate/slate.pyc in __init__(self, file, password, just_text, check_extractable)
     50                self.resmgr, self.device)
     51             for page in PDFPage.create_pages(self.doc):
---> 52                 self.append(self.interpreter.process_page(page))
     53             self.metadata = self.doc.info
     54         if just_text:

.../lib/python2.7/site-packages/slate/slate.pyc in process_page(self, page)
     34             ctm = (1,0,0,1, -x0,-y0)
     35         self.device.outfp.seek(0)
---> 36         self.device.outfp.buf = ''
     37         self.device.begin_page(page, ctm)
     38         self.render_contents(page.resources, page.contents, ctm=ctm)

AttributeError: 'cStringIO.StringO' object has no attribute 'buf'

Forcing the use of StringIO instead of cStringIO in slate.py fixed the problem.
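
The failing line clears the output buffer by assigning to its buf attribute, which is an implementation detail of the pure-Python StringIO class. A sketch of an alternative that relies only on the file-object methods seek() and truncate() is shown below, using io.StringIO; whether it drops into slate's process_page unchanged is an assumption that has not been tested here:

```python
import io


def reset_buffer(buf):
    """Empty a file-like buffer using only standard file-object methods."""
    buf.seek(0)
    buf.truncate(0)


buf = io.StringIO()

buf.write("text from page one")
first = buf.getvalue()

# Clear the buffer before rendering the next page.
reset_buffer(buf)

buf.write("text from page two")
second = buf.getvalue()
```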

Is slate dead? What is an alternative?

Hi Tim, as the last commit was in 2017, I assume slate is dead. Maybe you can set the development status to Development Status :: 7 - Inactive?

Which libraries would you recommend for reading PDF files?

Import error

When trying to import slate, the following error message occurs. pdfminer and slate have been installed via pip.

Using Windows XP and Python 2.7

>>> import slate
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "c:\meins\programm\python\hier\lib\site-packages\slate\__init__.py", line 48, in <module>
from slate import PDF
File "c:\meins\programm\python\hier\lib\site-packages\slate\slate.py", line 3, in <module>
from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: cannot import name PDFDocument

Looking at the pdfminer website, I found the following import, which works:

from pdfminer.pdfdocument import PDFDocument

Maybe pdfminer changed their API?

Import Issue for pdfminer (python 2.7)

I am using this tool for converting some of my pdf files into a bare text file. I encountered the following issue.

>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ameya/anaconda/envs/webapp_pandas/lib/python2.7/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
  File "/Users/ameya/anaconda/envs/webapp_pandas/lib/python2.7/site-packages/slate/slate.py", line 3, in <module>
    from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: cannot import name PDFDocument
>>> with open(' SOME FILE NAME ') as fd:
...     doc = slate.PDF(fd)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
NameError: name 'slate' is not defined

I fixed it by downgrading the pdfminer to 20110515 version.
Successfully uninstalled pdfminer-20140328
Successfully installed pdfminer-20110515

I guess the import issue originates in the following file:
classes.py

Provide page and line hooks

Provide a hooks parameter to PDF, enabling users to provide a callback for each page and/or each line respectively.

cf requests.Request
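
One way such a parameter could look is sketched below, loosely modelled on the way requests accepts a dict of callbacks keyed by event name. Everything here is hypothetical: apply_hooks, the "page" and "line" keys, and the plain list of strings standing in for extracted pages are assumptions, not slate's API.

```python
def apply_hooks(pages, hooks=None):
    """Run optional per-line and per-page callbacks over a list of page strings.

    `hooks` is a dict that may contain a "line" and/or a "page" callable.
    Each callable takes a string and returns the string to keep.
    """
    hooks = hooks or {}
    line_hook = hooks.get("line")
    page_hook = hooks.get("page")

    processed = []
    for text in pages:
        if line_hook is not None:
            # Line hooks run first, on each line of the page.
            text = "\n".join(line_hook(line) for line in text.splitlines())
        if page_hook is not None:
            # The page hook then sees the whole (possibly modified) page.
            text = page_hook(text)
        processed.append(text)
    return processed


# Stand-in for extracted pages.
pages = ["  Title  \n  body text  ", "second page"]
cleaned = apply_hooks(pages, hooks={"line": str.strip, "page": str.upper})
```

Running line hooks before the page hook is a design choice made for this sketch; the proposal above does not specify an order.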

AttributeError: module 'slate' has no attribute 'PDF'

filename = "file.pdf"
import slate
with open('file.pdf') as f:
    doc = slate.PDF(f)
    print(doc)

gives

Traceback (most recent call last):
  File "C:\Users\.....\Desktop\interna\slate\slate.py", line 2, in <module>
    import slate
  File "C:\Users\.....\Desktop\interna\slate\slate.py", line 4, in <module>
    doc = slate.PDF(f)
AttributeError: module 'slate' has no attribute 'PDF'
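
Both frames of this traceback point at a script that is itself named slate.py, so "import slate" most likely loads the script rather than the installed package. A small sketch of how to check where a module name resolves is shown below; is_shadowed_by and the demo module name are made up for illustration.

```python
import importlib
import importlib.util
import os
import shutil
import sys
import tempfile


def is_shadowed_by(name, directory):
    """Return True if `name` resolves to a single-file module inside `directory`."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location:
        # Not importable, or built-in/frozen: nothing on disk can shadow it here.
        return False
    origin_dir = os.path.dirname(os.path.realpath(spec.origin))
    return origin_dir == os.path.realpath(directory)


# Demonstration: create a throwaway directory containing a module file and put
# it first on sys.path, the way a script's own directory is.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "shadow_demo_mod.py"), "w") as fh:
    fh.write("# stand-in for a script accidentally named like a package\n")
sys.path.insert(0, demo_dir)
importlib.invalidate_caches()

shadowed = is_shadowed_by("shadow_demo_mod", demo_dir)
not_shadowed = is_shadowed_by("shadow_demo_mod", os.path.dirname(demo_dir))

# Clean up the demonstration state.
sys.path.remove(demo_dir)
shutil.rmtree(demo_dir)
```

In the situation reported here, calling is_shadowed_by("slate", os.getcwd()) from the script's directory would be expected to return True; renaming the script should then resolve the error.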

running Python 3.6.4, installed the latest slate downloaded from github today

Convert newlines to whitespaces.

Missing newline separation :(

import error when using git version and python3.5

I have installed the git version using the setup script, and when trying to import slate I get the following error:

 >>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/slate-0.5.1-py3.5.egg/slate/__init__.py", line 66, in <module>
  File "/usr/local/lib/python3.5/site-packages/slate-0.5.1-py3.5.egg/slate/classes.py", line 25, in <module>
ImportError: No module named 'utils'

support extracting pdf as html using HTMLConverter

from pdfminer.converter import HTMLConverter

Text extraction fails on PDF with text watermark

Using slate installed with pip install slate==0.3 pdfminer==20110515

In [4]: pdf = slate.PDF(virtualFile)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-f5142e9f8ced> in <module>()
----> 1 pdf = slate.PDF(virtualFile)

/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in __init__(self, file, password, just_text)
     47             self.metadata = self.doc.info
     48         if just_text:
---> 49             self._cleanup()
     50 
     51     def _cleanup(self):

/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in _cleanup(self)
     55         PDF.
     56         """
---> 57         del self.device
     58         del self.doc
     59         del self.parser

AttributeError: device

Using slate installed from the repository, with pdfminer==20140328,
slate.PDF executes without errors
but returns an empty list, [].

One of the many consistently failing PDFs:
FailingDocument.pdf
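
The AttributeError in the first traceback is raised by "del self.device", which fails when the attribute was never set. Why it was not set for this document is not shown; one guess is that it is only assigned on some code paths. The sketch below shows a cleanup step that tolerates missing attributes. The Extractor class is a hypothetical stand-in for slate.PDF, and this only avoids the crash; it does not make the text extractable.

```python
class Extractor(object):
    """Hypothetical stand-in for slate.PDF that holds temporary parser state."""

    def __init__(self, extractable=True):
        self.parser = object()
        self.doc = object()
        if extractable:
            # Assumption for the sketch: the device only exists on this path.
            self.device = object()
        self._cleanup()

    def _cleanup(self):
        """Drop temporary attributes, ignoring any that were never created."""
        for name in ("device", "doc", "parser"):
            self.__dict__.pop(name, None)


# Neither call raises, whether or not `device` was ever assigned.
with_device = Extractor(extractable=True)
without_device = Extractor(extractable=False)
```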

How do I add tabs as delimiter, and new line

What should my code snippet look like in order to:

Use tabs instead of spaces as the delimiter.
Add new lines (currently all lines are merged).

import slate

with open('/path/to/file.pdf') as f:
     doc = slate.PDF(f)
print doc

cannot import 'slate' on Ubuntu Utopic & python 3.4.2

As it says on the tin - slate seems unimportable on Ubuntu Utopic, python 3.4.2 and what's currently released on pypi.

See full command output below:

thomi@peek$ virtualenv -p python3 ve3
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in ve3/bin/python3
Also creating executable in ve3/bin/python
Installing setuptools, pip...done.

thomi@peek$ . ve3/bin/activate

thomi@peek$ pip install slate
Downloading/unpacking slate
  Downloading slate-0.3.zip
  Running setup.py (path:/home/thomi/tmp/ve3/build/slate/setup.py) egg_info for package slate

Downloading/unpacking distribute (from slate)
  Downloading distribute-0.7.3.zip (145kB): 145kB downloaded
  Running setup.py (path:/home/thomi/tmp/ve3/build/distribute/setup.py) egg_info for package distribute

Requirement already satisfied (use --upgrade to upgrade): setuptools>=0.7 in ./ve3/lib/python3.4/site-packages (from distribute->slate)
Installing collected packages: slate, distribute
  Running setup.py install for slate

  Running setup.py install for distribute

Successfully installed slate distribute
Cleaning up...

thomi@peek$ python
Python 3.4.2 (default, Oct  8 2014, 13:08:17) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thomi/tmp/ve3/lib/python3.4/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
ImportError: cannot import name 'PDF'

Getting Assert Error on every PDF

Hey just trying to get started with slate using the tutorial and getting this error. I've tried with 3 pdfs now. Just running the basic code from the tutorial:

import slate
with open('testpdf3.pdf') as f:
    print type(f)
    doc = slate.PDF(f)

And the stack trace:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    doc = slate.PDF(f)
  File "C:\Users\S\Anaconda2\lib\site-packages\slate\slate.py", line 38, in __init__
    self.doc.set_parser(self.parser)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 318, in set_parser
    self.xrefs = parser.read_xref()
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 747, in read_xref
    self.read_xref_from(pos, xrefs)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 727, in read_xref_from
    xref.load(self, debug=self.debug)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 99, in load
    self.load_trailer(parser)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 107, in load_trailer
    assert kwd is self.KEYWORD_TRAILER
AssertionError

I've tried printing kwd and it's always an integer while KEYWORD_TRAILER is 'trailer'
Used this command to install:

pip install --upgrade --ignore-installed slate==0.3 pdfminer==20110515

I'm on Windows, Python 2.7.12

Not able to read the PDF and the repository doesn't work with Python3.5?

Combining two issues into one.
I installed the repository version successfully, but there was an issue while running it, related to the encoding of the file. Please take note of this.
I then uninstalled the library and installed it with Python 2, and got the following error while running the example:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "build\bdist.win-amd64\egg\slate\classes.py", line 61, in __init__
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\pdfdocument.py", line 315, in __init__
    xref.load(parser)
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\pdfdocument.py", line 175, in load
    (_, obj) = parser.nextobject()
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 557, in nextobject
    (pos, token) = self.nexttoken()
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 482, in nexttoken
    self.fillbuf()
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 215, in fillbuf
    raise PSEOF('Unexpected EOF')
pdfminer.psparser.PSEOF: Unexpected EOF

Kindly, help me.

Metadata extraction is poor

Sometimes metadata is extracted as raw bytes rather than decoded from a specific encoding.
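
PDF text strings, which is what metadata fields such as the title and author normally are, are stored either as UTF-16BE with a leading byte order mark or in PDFDocEncoding. A rough sketch of decoding such values follows. The function name is made up, and Latin-1 is used as an approximation of PDFDocEncoding, which differs from it for a number of byte values.

```python
import codecs


def decode_pdf_text(raw):
    """Best-effort decoding of a PDF text string given as bytes."""
    if isinstance(raw, str):
        # Already decoded: nothing to do.
        return raw
    if raw.startswith(codecs.BOM_UTF16_BE):
        # UTF-16BE text strings start with the byte order mark FE FF.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors="replace")
    # Approximation: treat everything else as Latin-1 instead of PDFDocEncoding.
    return raw.decode("latin-1")


title = decode_pdf_text(b"\xfe\xff\x00H\x00i")
author = decode_pdf_text(b"Caf\xe9")
```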

Provide support for iter(PDF) interface

We currently only send a whole PDF back as a list of strings. It would be nice to support something like:

with open('nice-info.pdf') as f:
    for page in slate.PDF(f):
        for line in page:
            ...

This would break backward compatibility, though, so perhaps this would be an acceptable compromise?

with open('nice-info.pdf') as f:
    for page in iter(slate.PDF(f)):
        # page would now be a `Page` object that yields lines,
        # rather than a plain str
        for line in page:
            ...
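
A rough sketch of how this could behave is given below. Page and PageList are hypothetical names, and PageList is a plain list subclass standing in for slate.PDF so that the example runs without a PDF file.

```python
class Page(object):
    """A page of text that yields its lines when iterated."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(self.text.splitlines())

    def __str__(self):
        return self.text


class PageList(list):
    """Stand-in for slate.PDF: indexing returns str, iteration yields Page objects."""

    def __iter__(self):
        for text in list.__iter__(self):
            yield Page(text)


doc = PageList(["first line\nsecond line", "only line"])

by_index = doc[0]                       # still a plain str
by_lines = [list(page) for page in doc]  # each page iterated line by line
```

One caveat: a for loop calls iter() on its target implicitly, so "for page in slate.PDF(f)" and "for page in iter(slate.PDF(f))" cannot behave differently. In this sketch only indexing keeps returning plain strings, and any existing code that loops over the pages would see Page objects.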