slate's Introduction

======================================================
slate: the easiest way to get text from PDFs in Python
======================================================


Slate is a Python package that simplifies the process of extracting
text from PDF files. It depends on the PDFMiner package.

Slate provides one class, PDF. PDF takes a file-like object and
will extract all text from the document, presenting each page
as a string of text:

  >>> import slate
  >>> with open('example.pdf', 'rb') as f:
  ...     doc = slate.PDF(f)
  ...
  >>> doc
  [..., ..., ...]
  >>> doc[1]
  'Text from page 2...'

If your PDF is password protected, pass the password as the
second argument:

  >>> with open('secrets.pdf', 'rb') as f:
  ...     doc = slate.PDF(f, 'password')
  ...
  >>> doc[0]
  "My mother doesn't know this, but..."

More complex operations
-----------------------

If you would like access to the images, font files and other
information, then take some time to learn the PDFMiner API.


What is wrong with PDFMiner?
----------------------------

  1. Getting simple things done, like extracting the text,
     is quite complex. The program is not designed to return
     Python objects, which makes interfacing with it irritating.
  2. It's an extremely complete set of tools, with multiple
     and moderately steep learning curves.
  3. It's not written with hackability in mind.

slate's People

Contributors

anthonygarvan, cancan101, chrispbailey, erickpeirson, silberschleier, stumpyfrostreaver, timclicks

slate's Issues

Can we get it documented that this doesn't work with Python 3.something+?

On Python 3.7:

>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gmauer/.local/share/virtualenvs/resume-parser-uEcoETqN/lib/python3.7/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
ImportError: cannot import name 'PDF' from 'slate' (/Users/gmauer/.local/share/virtualenvs/resume-parser-uEcoETqN/lib/python3.7/site-packages/slate/__init__.py)

Looking through the issues and unmerged PRs, it seems like this doesn't work with some versions of Python (3.4+ maybe?). I get that maintainers have limited amounts of time and so on; I am an OSS maintainer as well. But I'm now three PDF parsing libraries deep, each of which I set up and started working with only to find an issue like this. It is incredibly frustrating.

Can we at least add a note toward the top of the README saying that the main pip version doesn't work with more recent Python versions?

PyPI version is out of date with the current version of PDFminer

When I pip install the master branch of this repo, it works.

The version on PyPI gives this error:

import slate
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/Users/anadas/.virtualenvs/scratch/lib/python2.7/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
  File "/Users/anadas/.virtualenvs/scratch/lib/python2.7/site-packages/slate/slate.py", line 3, in <module>
    from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: No module named pdfminer.pdfparser

AttributeError when opening documents that contain Chinese characters

When opening documents that contain Chinese or Thai characters, I get this exception:

  File "create_index.py", line 16, in <module>
    pdf_data = slate.PDF(f)
  File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 49, in __init__
    self._cleanup()
  File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 57, in _cleanup
    del self.device
AttributeError: device

Install error using python 3

$ pip install slate
Collecting slate
  Downloading slate-0.3.zip
Collecting distribute (from slate)
  Using cached distribute-0.7.3.zip
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/__init__.py", line 2, in <module>
        from setuptools.extension import Extension, Library
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/extension.py", line 5, in <module>
        from setuptools.dist import _get_unpatched
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/dist.py", line 7, in <module>
        from setuptools.command.install import install
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/command/__init__.py", line 8, in <module>
        from setuptools.command import install_scripts
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/command/install_scripts.py", line 3, in <module>
        from pkg_resources import Distribution, PathMetadata, ensure_directory
      File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/pkg_resources.py", line 1518, in <module>
        register_loader_type(importlib_bootstrap.SourceFileLoader, DefaultProvider)
    AttributeError: module 'importlib._bootstrap' has no attribute 'SourceFileLoader'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/

PDFDocument() from pdfminer requires a parser argument

Using the fix from #24 and Python 3.5.2, I call slate.PDF(file), but PDFDocument requires a parser. What should be passed here? I tried self.parser, but this didn't work.

Traceback (most recent call last):
  File "pdftotext.py", line 7, in <module>
    doc = slate.PDF(f)
  File "//anaconda/lib/python3.5/site-packages/slate/classes.py", line 56, in __init__
    self.doc = PDFDocument()
TypeError: __init__() missing 1 required positional argument: 'parser'

Putting self.doc = PDFDocument(self.parser) leads to this error, which I cannot fix either.

Traceback (most recent call last):
  File "pdftotext.py", line 7, in <module>
    doc = slate.PDF(f)
  File "//anaconda/lib/python3.5/site-packages/slate/classes.py", line 57, in __init__
    self.doc = PDFDocument(self.parser)
  File "//anaconda/lib/python3.5/site-packages/pdfminer/pdfdocument.py", line 559, in __init__
    pos = self.find_xref(parser)
  File "//anaconda/lib/python3.5/site-packages/pdfminer/pdfdocument.py", line 773, in find_xref
    for line in parser.revreadlines():
  File "//anaconda/lib/python3.5/site-packages/pdfminer/psparser.py", line 285, in revreadlines
    s = self.fp.read(prevpos-pos)
  File "//anaconda/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 2: invalid start byte
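
The last frames of this traceback are inside codecs.py, which suggests the file object handed to slate was opened in text mode, so Python 3 tried to decode the raw PDF bytes as UTF-8. Assuming that is the cause, opening the file with 'rb' should avoid this particular error. The sketch below reproduces the effect with a small stand-in file; the payload is made up for illustration and is not a valid PDF.

```python
import os
import tempfile

# Stand-in for a real PDF: binary files routinely contain bytes such as 0x88
# that are not valid UTF-8. This payload is illustrative only.
payload = b"%PDF-1.4\n\x88\xe2\xe3\xcf\n"

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode: read() tries to decode the bytes as UTF-8 and fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: the bytes come back untouched, which is what a PDF parser needs.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)
```

With a real document the equivalent change would be `open('file.pdf', 'rb')` before calling `slate.PDF(f)`.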

import error for slate.py file

from pdfminer.pdfparser import PDFParser, PDFDocument

in slate.py generates an error

However, this

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

does not.
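
One way to tolerate both pdfminer layouts is to try the newer module path first and fall back to the older one. The helper below is a minimal sketch of that idea; `import_first` is a hypothetical name, not part of slate or pdfminer. Note that fixing the import alone may not be enough, since another report in this list shows that the PDFDocument constructor also expects a parser argument in newer pdfminer releases.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this installation; try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError("cannot import %s from any of %r" % (name, module_paths))


# Usage with pdfminer (requires pdfminer to be installed):
# PDFDocument = import_first(
#     "PDFDocument", ["pdfminer.pdfdocument", "pdfminer.pdfparser"])
```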

Slate is broken in PyPI

Something in pdfminer has changed: the PDFDocument class is no longer in pdfminer.pdfparser, it's now in pdfminer.pdfdocument. I see you have fixed this problem here on GitHub, but the version on PyPI still has it. Could you please re-upload it?

pip database incorrect?

I was going to report a bug with imports internal to slate, but then I noticed my version doesn't match the one on GitHub, and in fact doesn't even match the version shown as latest on PyPI:

(t) howie@local-linux-2:~/tmp$ pip install slate==0.5.2
Collecting slate==0.5.2
  Could not find a version that satisfies the requirement slate==0.5.2 (from versions: 0.2.3, 0.3)
No matching distribution found for slate==0.5.2

Seems that some metadata is wrong somewhere?

Error parsing a pdf with cStringIO support

Slate: master head: ff61328
Ubuntu 14.04 x64
Python 2.7.6

Received the following error when cStringIO is available for import.

.../lib/python2.7/site-packages/slate/slate.pyc in __init__(self, file, password, just_text, check_extractable)
     50                self.resmgr, self.device)
     51             for page in PDFPage.create_pages(self.doc):
---> 52                 self.append(self.interpreter.process_page(page))
     53             self.metadata = self.doc.info
     54         if just_text:

.../lib/python2.7/site-packages/slate/slate.pyc in process_page(self, page)
     34             ctm = (1,0,0,1, -x0,-y0)
     35         self.device.outfp.seek(0)
---> 36         self.device.outfp.buf = ''
     37         self.device.begin_page(page, ctm)
     38         self.render_contents(page.resources, page.contents, ctm=ctm)

AttributeError: 'cStringIO.StringO' object has no attribute 'buf'

Forcing the use of StringIO instead of cStringIO in slate.py fixed the problem.
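
The failing line assigns to a private `buf` attribute, which only the pure-Python StringIO class exposes. A more portable way to empty an in-memory buffer is to use the public `seek` and `truncate` methods. This is a sketch using the `io` module, not the change that was made in slate; `reset_buffer` is a hypothetical helper name.

```python
import io


def reset_buffer(buf):
    """Empty an in-memory buffer using only its public file API."""
    buf.seek(0)      # move the write position back to the start
    buf.truncate(0)  # discard everything that was written so far
    return buf


out = io.BytesIO()
out.write(b"text from page 1")
page_one = out.getvalue()

reset_buffer(out)
out.write(b"page 2")
page_two = out.getvalue()
```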

Import error

When trying to import slate, the following error message occurs. pdfminer and slate have been installed via pip.

Using Windows XP and Python 2.7

>>> import slate
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "c:\meins\programm\python\hier\lib\site-packages\slate\__init__.py", line 48, in <module>
    from slate import PDF
  File "c:\meins\programm\python\hier\lib\site-packages\slate\slate.py", line 3, in <module>
    from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: cannot import name PDFDocument

Looking at the pdfminer website, I found the following import, which works:

from pdfminer.pdfdocument import PDFDocument

Maybe pdfminer changed their API?

Import Issue for pdfminer (python 2.7)

I am using this tool for converting some of my pdf files into a bare text file. I encountered the following issue.

>>> import slate
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/Users/ameya/anaconda/envs/webapp_pandas/lib/python2.7/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
  File "/Users/ameya/anaconda/envs/webapp_pandas/lib/python2.7/site-packages/slate/slate.py", line 3, in <module>
    from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: cannot import name PDFDocument
>>> with open(' SOME FILE NAME ') as fd:
...     doc = slate.PDF(fd)
...
Traceback (most recent call last):
  File "", line 2, in <module>
NameError: name 'slate' is not defined

I fixed it by downgrading pdfminer to the 20110515 version:
Successfully uninstalled pdfminer-20140328
Successfully installed pdfminer-20110515

I guess the import issue occurs in the following piece of code:
classes.py

Provide page and line hooks

Provide a hooks parameter to PDF so that users can supply a callback for each page and/or each line.

Cf. requests.Request
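
A rough sketch of how such hooks might behave, modelled loosely on the dictionary-of-callables style that requests uses. Everything here is hypothetical: `extract_with_hooks`, the `"page"` and `"line"` keys, and the use of a plain list of strings in place of a real slate.PDF object are assumptions made for illustration.

```python
def extract_with_hooks(pages, hooks=None):
    """Apply optional per-line and per-page callbacks to a list of page strings."""
    hooks = hooks or {}
    page_hook = hooks.get("page")
    line_hook = hooks.get("line")

    result = []
    for page in pages:
        if line_hook is not None:
            # Run the line callback on each line, then reassemble the page.
            page = "\n".join(line_hook(line) for line in page.splitlines())
        if page_hook is not None:
            page = page_hook(page)
        result.append(page)
    return result


# Plain strings stand in for the pages slate would extract.
pages = ["Title \nbody text ", "Second page"]
cleaned = extract_with_hooks(pages, hooks={"line": str.strip, "page": str.upper})
```

Running the line hook before the page hook is an arbitrary choice in this sketch; the proposal does not specify an order.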

AttributeError: module 'slate' has no attribute 'PDF'

filename = "file.pdf"
import slate
with open('file.pdf') as f:
    doc = slate.PDF(f)
    print(doc)

gives

Traceback (most recent call last):
  File "C:\Users\.....\Desktop\interna\slate\slate.py", line 2, in <module>
    import slate
  File "C:\Users\.....\Desktop\interna\slate\slate.py", line 4, in <module>
    doc = slate.PDF(f)
AttributeError: module 'slate' has no attribute 'PDF'

Running Python 3.6.4, with the latest slate downloaded from GitHub today.
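
Both frames in this traceback point at the same file, a script that is itself named slate.py, so `import slate` appears to be loading the script rather than the installed package. If that reading is right, renaming the script should resolve the error. The helper below is a small sketch for spotting this situation; `find_shadowing_file` is a hypothetical name, and it only checks for a single .py file in one directory.

```python
import os
import tempfile


def find_shadowing_file(module_name, directory):
    """Return the path of a local file that would shadow `module_name`, or None."""
    candidate = os.path.join(directory, module_name + ".py")
    return candidate if os.path.isfile(candidate) else None


# Demonstration in a scratch directory: no shadowing at first, then a
# script called slate.py is created and gets detected.
with tempfile.TemporaryDirectory() as workdir:
    before = find_shadowing_file("slate", workdir)
    open(os.path.join(workdir, "slate.py"), "w").close()
    after = find_shadowing_file("slate", workdir)
```

In a real project the check would be `find_shadowing_file("slate", os.getcwd())`.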

import error when using git version and python3.5

I have installed the git version using the setup script, and when trying to import slate I get the following error:

>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/slate-0.5.1-py3.5.egg/slate/__init__.py", line 66, in <module>
  File "/usr/local/lib/python3.5/site-packages/slate-0.5.1-py3.5.egg/slate/classes.py", line 25, in <module>
ImportError: No module named 'utils'

Text extraction fails on PDF with text watermark

Using slate installed with pip install slate==0.3 pdfminer==20110515:

In [4]: pdf = slate.PDF(virtualFile)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-f5142e9f8ced> in <module>()
----> 1 pdf = slate.PDF(virtualFile)

/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in __init__(self, file, password, just_text)
     47             self.metadata = self.doc.info
     48         if just_text:
---> 49             self._cleanup()
     50 
     51     def _cleanup(self):

/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in _cleanup(self)
     55         PDF.
     56         """
---> 57         del self.device
     58         del self.doc
     59         del self.parser

AttributeError: device

Using slate installed from the repository, with pdfminer==20140328,
slate.PDF executes without errors
but returns an empty list, [].

One of the many consistently failing PDFs:
FailingDocument.pdf

How do I add tabs as delimiter, and new line

What should my code snippet look like in order to:

  • Use tabs instead of spaces as the delimiter.
  • Add new lines (all lines are merged).

import slate

with open('/path/to/file.pdf') as f:
    doc = slate.PDF(f)
print doc
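
Since each page comes back as a plain string, one option is to post-process the text after extraction. The sketch below assumes the page text still contains newlines and that columns are separated by runs of two or more spaces; if the lines have already been merged during extraction, this will not recover them. `retab` is a hypothetical helper, and the sample string stands in for `doc[0]`.

```python
import re


def retab(page_text):
    """Turn runs of two or more spaces into a single tab, line by line."""
    lines = page_text.splitlines()
    return "\n".join(re.sub(r" {2,}", "\t", line.strip()) for line in lines)


# A made-up page of column-aligned text, standing in for doc[0].
sample = "Name    Qty   Price\nWidget  2     9.99"
converted = retab(sample)
```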

cannot import 'slate' on Ubuntu Utopic & python 3.4.2

As it says on the tin: slate seems unimportable on Ubuntu Utopic with Python 3.4.2 and what's currently released on PyPI.

See full command output below:

thomi@peek$ virtualenv -p python3 ve3
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in ve3/bin/python3
Also creating executable in ve3/bin/python
Installing setuptools, pip...done.

thomi@peek$ . ve3/bin/activate

thomi@peek$ pip install slate
Downloading/unpacking slate
  Downloading slate-0.3.zip
  Running setup.py (path:/home/thomi/tmp/ve3/build/slate/setup.py) egg_info for package slate

Downloading/unpacking distribute (from slate)
  Downloading distribute-0.7.3.zip (145kB): 145kB downloaded
  Running setup.py (path:/home/thomi/tmp/ve3/build/distribute/setup.py) egg_info for package distribute

Requirement already satisfied (use --upgrade to upgrade): setuptools>=0.7 in ./ve3/lib/python3.4/site-packages (from distribute->slate)
Installing collected packages: slate, distribute
  Running setup.py install for slate

  Running setup.py install for distribute

Successfully installed slate distribute
Cleaning up...

thomi@peek$ python
Python 3.4.2 (default, Oct  8 2014, 13:08:17) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import slate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thomi/tmp/ve3/lib/python3.4/site-packages/slate/__init__.py", line 48, in <module>
    from slate import PDF
ImportError: cannot import name 'PDF'

Getting Assert Error on every PDF

Hey, I'm just trying to get started with slate using the tutorial and am getting this error. I've tried with 3 PDFs now. I'm just running the basic code from the tutorial:

import slate
with open('testpdf3.pdf') as f:
    print type(f)
    doc = slate.PDF(f)

And the stack trace:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    doc = slate.PDF(f)
  File "C:\Users\S\Anaconda2\lib\site-packages\slate\slate.py", line 38, in __init__
    self.doc.set_parser(self.parser)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 318, in set_parser
    self.xrefs = parser.read_xref()
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 747, in read_xref
    self.read_xref_from(pos, xrefs)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 727, in read_xref_from
    xref.load(self, debug=self.debug)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 99, in load
    self.load_trailer(parser)
  File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 107, in load_trailer
    assert kwd is self.KEYWORD_TRAILER
AssertionError

I've tried printing kwd and it's always an integer, while KEYWORD_TRAILER is 'trailer'.
I used this command to install:

pip install --upgrade --ignore-installed slate==0.3 pdfminer==20110515

I'm on Windows, Python 2.7.12

Not able to read the PDF and the repository doesn't work with Python 3.5?

Combining two issues into one.
I installed the repository version successfully, but there was an issue while running it, related to the encoding of the file. Please take note of this.
Then I uninstalled the library and installed it with Python 2, and got the following error while running the example:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "build\bdist.win-amd64\egg\slate\classes.py", line 61, in __init__
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\pdfdocument.py", line 315, in __init__
    xref.load(parser)
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\pdfdocument.py", line 175, in load
    (_, obj) = parser.nextobject()
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 557, in nextobject
    (pos, token) = self.nexttoken()
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 482, in nexttoken
    self.fillbuf()
  File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 215, in fillbuf
    raise PSEOF('Unexpected EOF')
pdfminer.psparser.PSEOF: Unexpected EOF

Kindly help me.

Provide support for iter(PDF) interface

We currently only send a whole PDF back as a list of strings. It would be nice to support something like:

with open('nice-info.pdf') as f:
    for page in slate.PDF(f):
        for line in page:
            ...

This would break backward compatibility, though, so perhaps this would be an okay compromise?

with open('nice-info.pdf') as f:
    for page in iter(slate.PDF(f)):
        # page would now be a `Page` object that yields lines,
        # rather than a plain str
        for line in page:
            ...
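
One caveat with this compromise: a `for` loop already calls `iter()` on its target, so `for page in slate.PDF(f)` and `for page in iter(slate.PDF(f))` would yield the same objects. A separate, opt-in method is one way to keep existing behaviour intact. The sketch below is hypothetical throughout: `Page`, `PagedDocument` and `pages()` are illustrative names, and a list of strings stands in for a real slate.PDF.

```python
class Page(object):
    """Hypothetical wrapper around one page of text that iterates over lines."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(self.text.splitlines())

    def __str__(self):
        return self.text


class PagedDocument(list):
    """Stand-in for slate.PDF: a list of page strings with an opt-in pages() view."""

    def pages(self):
        # Existing indexing and iteration still return plain strings;
        # only this method wraps each page in a Page object.
        for text in self:
            yield Page(text)


doc = PagedDocument(["line 1\nline 2", "only line"])
collected = [[line for line in page] for page in doc.pages()]
```

Here `doc[0]` is still the plain string `'line 1\nline 2'`, so code written against the list-of-strings interface keeps working.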
