timclicks / slate
The simplest way to extract text from PDFs in Python
Home Page: http://timmcnamara.co.nz/
License: GNU General Public License v3.0
======================================================
slate: the easiest way to get text from PDFs in Python
======================================================

Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.

Slate provides one class, PDF. PDF takes a file-like object and will extract all text from the document, presenting each page as a string of text:

  >>> with open('example.pdf', 'rb') as f:
  ...     doc = slate.PDF(f)
  ...
  >>> doc
  [..., ..., ...]
  >>> doc[1]
  'Text from page 2...'

If your PDF is password-protected, pass the password as the second argument:

  >>> with open('secrets.pdf', 'rb') as f:
  ...     doc = slate.PDF(f, 'password')
  ...
  >>> doc[0]
  "My mother doesn't know this, but..."

More complex operations
-----------------------

If you would like access to the images, font files and other information, then take some time to learn the PDFMiner API.

What is wrong with PDFMiner?
----------------------------

1. Getting simple things done, like extracting the text, is quite complex. The program is not designed to return Python objects, which makes interfacing things irritating.
2. It's an extremely complete set of tools, with multiple and moderately steep learning curves.
3. It's not written with hackability in mind.
On python 3.7
>>> import slate
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/gmauer/.local/share/virtualenvs/resume-parser-uEcoETqN/lib/python3.7/site-packages/slate/__init__.py", line 48, in <module>
from slate import PDF
ImportError: cannot import name 'PDF' from 'slate' (/Users/gmauer/.local/share/virtualenvs/resume-parser-uEcoETqN/lib/python3.7/site-packages/slate/__init__.py)
Looking through the issues and unmerged PRs, it seems this doesn't work with some versions of Python (3.4+, maybe?). I get that maintainers have limited time; I am an OSS maintainer as well. But I am now three PDF-parsing libraries deep, each of which I set up and started working with only to hit an issue like this. It is incredibly frustrating.
Can we at least add a note near the top of the README saying that the main pip version doesn't work with more recent Python versions?
See: cancan101@bfd6336
from pdfminer.psparser import PSStackParser, PSSyntaxError, PSEOF, literal_name, LIT, KWD, handle_error
ImportError: cannot import name 'handle_error' from 'pdfminer.psparser' (/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pdfminer/psparser.py)
When I did pip install with the master branch of this repo, it works.
The version on PyPI gives the error
import slate
Traceback (most recent call last):
File "", line 1, in
File "/Users/anadas/.virtualenvs/scratch/lib/python2.7/site-packages/slate/init.py", line 48, in
from slate import PDF
File "/Users/anadas/.virtualenvs/scratch/lib/python2.7/site-packages/slate/slate.py", line 3, in
from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: No module named pdfminer.pdfparser
When opening documents that contain Chinese or Thai characters, I get the following exception:
File "create_index.py", line 16, in <module>
pdf_data = slate.PDF(f)
File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 49, in __init__
self._cleanup()
File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 57, in _cleanup
del self.device
AttributeError: device
$ pip install slate
Collecting slate
Downloading slate-0.3.zip
Collecting distribute (from slate)
Using cached distribute-0.7.3.zip
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/__init__.py", line 2, in <module>
from setuptools.extension import Extension, Library
File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/extension.py", line 5, in <module>
from setuptools.dist import _get_unpatched
File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/dist.py", line 7, in <module>
from setuptools.command.install import install
File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/command/__init__.py", line 8, in <module>
from setuptools.command import install_scripts
File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/setuptools/command/install_scripts.py", line 3, in <module>
from pkg_resources import Distribution, PathMetadata, ensure_directory
File "/private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/pkg_resources.py", line 1518, in <module>
register_loader_type(importlib_bootstrap.SourceFileLoader, DefaultProvider)
AttributeError: module 'importlib._bootstrap' has no attribute 'SourceFileLoader'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/r9/kv1j05355r34570x2f5wpxpr0000gn/T/pip-build-l23narcj/distribute/
See: cancan101@c53fa28
Using the fix from #24 and python 3.5.2, I call slate.PDF(file) but PDFDocument requires a parser. What should be put here? I tried self.parser but this didn't work.
Traceback (most recent call last):
File "pdftotext.py", line 7, in
doc = slate.PDF(f)
File "//anaconda/lib/python3.5/site-packages/slate/classes.py", line 56, in init
self.doc = PDFDocument()
TypeError: init() missing 1 required positional argument: 'parser'
Putting self.doc = PDFDocument(self.parser) there instead leads to this error, which I cannot fix either:
Traceback (most recent call last):
File "pdftotext.py", line 7, in
doc = slate.PDF(f)
File "//anaconda/lib/python3.5/site-packages/slate/classes.py", line 57, in init
self.doc = PDFDocument(self.parser)
File "//anaconda/lib/python3.5/site-packages/pdfminer/pdfdocument.py", line 559, in init
pos = self.find_xref(parser)
File "//anaconda/lib/python3.5/site-packages/pdfminer/pdfdocument.py", line 773, in find_xref
for line in parser.revreadlines():
File "//anaconda/lib/python3.5/site-packages/pdfminer/psparser.py", line 285, in revreadlines
s = self.fp.read(prevpos-pos)
File "//anaconda/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 2: invalid start byte
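The `codecs.py` frame at the bottom of this traceback suggests the file was opened in text mode, so Python 3 tried to decode the raw PDF bytes as UTF-8. A minimal sketch of a guard against that, assuming the caller controls how the file is opened; `require_binary` is a hypothetical helper, not part of slate or pdfminer:

```python
import io


def require_binary(fp):
    """Return fp unchanged if it yields bytes; raise if it is a text-mode file.

    Hypothetical helper for illustration; not part of slate or pdfminer.
    """
    if isinstance(fp, io.TextIOBase):
        raise TypeError("PDF data must be read in binary mode, e.g. open(path, 'rb')")
    return fp


# Demonstration with an in-memory stream instead of a real PDF file.
binary_stream = require_binary(io.BytesIO(b"%PDF-1.4\n"))
header = binary_stream.read(5)  # bytes, not str
```

With a real file, the equivalent would be to open it with `open('example.pdf', 'rb')` before passing it to `slate.PDF`. Whether that alone resolves the error above has not been verified here.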
Please publish the new version, which fixes compatibility with pdfminer.
from pdfminer.pdfparser import PDFParser, PDFDocument
in slate.py generates an error
However, this
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
does not.
Something in pdfminer has changed: the PDFDocument class is no longer in pdfminer.pdfparser; it is now in pdfminer.pdfdocument. I see that you have fixed this problem here on GitHub, but the version on PyPI still has it. Could you please re-upload it?
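One way to tolerate both layouts is to try the newer import location first and fall back to the older one. The sketch below is only an illustration of that idea: `import_first` and `load_pdfdocument` are hypothetical names, and the two module paths are the ones quoted in the reports above.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this installation; try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError("cannot find %r in any of %r" % (name, list(module_paths)))


def load_pdfdocument():
    """Locate PDFDocument in either its newer or its older pdfminer location."""
    return import_first(
        "PDFDocument", ["pdfminer.pdfdocument", "pdfminer.pdfparser"]
    )
```

Locating the class is only part of the problem: the `missing 1 required positional argument: 'parser'` report suggests the constructor differs between pdfminer versions too, so calling code would still need to account for that.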
I was going to report a bug with imports internal to slate, but then I noticed my version doesn't match the one on github, and in fact doesn't even match the version shown as latest on pypi:
(t) howie@local-linux-2:~/tmp$ pip install slate==0.5.2
Collecting slate==0.5.2
Could not find a version that satisfies the requirement slate==0.5.2 (from versions: 0.2.3, 0.3)
No matching distribution found for slate==0.5.2
Seems that some metadata is wrong somewhere?
Hi,
this is not an issue, but I would like to know whether there is a way to extract paragraphs from a PDF using slate or its parent library, pdfminer.
Slate: master head: ff61328
Ubuntu 14.04 x64
Python 2.7.6
Received the following error when cStringIO is available for import.
.../lib/python2.7/site-packages/slate/slate.pyc in __init__(self, file, password, just_text, check_extractable)
50 self.resmgr, self.device)
51 for page in PDFPage.create_pages(self.doc):
---> 52 self.append(self.interpreter.process_page(page))
53 self.metadata = self.doc.info
54 if just_text:
.../lib/python2.7/site-packages/slate/slate.pyc in process_page(self, page)
34 ctm = (1,0,0,1, -x0,-y0)
35 self.device.outfp.seek(0)
---> 36 self.device.outfp.buf = ''
37 self.device.begin_page(page, ctm)
38 self.render_contents(page.resources, page.contents, ctm=ctm)
AttributeError: 'cStringIO.StringO' object has no attribute 'buf'
Forcing the use of StringIO instead of cStringIO in slate.py fixed the problem.
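The failing line assigns to a `buf` attribute, which the traceback shows is not available on the cStringIO object. A possible alternative, sketched here under the assumption that the buffer only needs to be emptied between pages, is to reset it through the public file API. `reset_buffer` is a hypothetical helper; it is shown with Python 3's `io.StringIO` and has not been tested against Python 2's cStringIO.

```python
import io


def reset_buffer(buf):
    """Empty a seekable in-memory buffer using only its public file methods."""
    buf.seek(0)
    buf.truncate(0)
    return buf


out = io.StringIO()
out.write("text from page 1")
reset_buffer(out)
out.write("text from page 2")
result = out.getvalue()  # only the second write remains
```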
Hi Tim, as the last commit was in 2017, I assume slate is dead. Maybe you could set the development status to "Development Status :: 7 - Inactive"?
Which libraries would you recommend for reading PDF files?
When trying to import slate, the following error message occurs. pdfminer and slate have been installed via pip.
Using Windows XP and Python 2.7
>>> import slate
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "c:\meins\programm\python\hier\lib\site-packages\slate\__init__.py", line 48, in <module>
from slate import PDF
File "c:\meins\programm\python\hier\lib\site-packages\slate\slate.py", line 3, in <module>
from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: cannot import name PDFDocument
Looking at the pdfminer website, I found the following import, which works:
from pdfminer.pdfdocument import PDFDocument
Maybe pdfminer changed their API?
I am using this tool to convert some of my PDF files into plain text files. I encountered the following issue:
import slate
Traceback (most recent call last):
File "", line 1, in
File "/Users/ameya/anaconda/envs/webapp_pandas/lib/python2.7/site-packages/slate/init.py", line 48, in
from slate import PDF
File "/Users/ameya/anaconda/envs/webapp_pandas/lib/python2.7/site-packages/slate/slate.py", line 3, in
from pdfminer.pdfparser import PDFParser, PDFDocument
ImportError: cannot import name PDFDocument
with open(' SOME FILE NAME ') as fd:
... doc = slate.PDF(fd)
...
Traceback (most recent call last):
File "", line 2, in
NameError: name 'slate' is not defined
I fixed it by downgrading pdfminer to version 20110515:
Successfully uninstalled pdfminer-20140328
Successfully installed pdfminer-20110515
I guess the import issue occurs in the following file:
classes.py
Provide a hooks parameter to PDF, enabling users to supply a callback for each page and/or each line. Cf. requests.Request.
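A rough sketch of what such a hooks mechanism might look like, operating on a plain list of page strings rather than on slate itself. The function name `run_hooks`, the `'page'` and `'line'` keys, and the callback signatures are all assumptions made for illustration; nothing here exists in slate.

```python
def run_hooks(pages, hooks=None):
    """Call optional 'page' and 'line' callbacks over a list of page strings.

    `hooks` is a dict of callables, loosely modelled on the hooks dict in
    requests; the keys 'page' and 'line' are assumptions of this sketch.
    """
    hooks = hooks or {}
    on_page = hooks.get("page")
    on_line = hooks.get("line")
    for number, text in enumerate(pages):
        if on_page is not None:
            on_page(number, text)
        if on_line is not None:
            for line in text.splitlines():
                on_line(number, line)


seen = []
run_hooks(
    ["first page\nsecond line", "last page"],
    hooks={"line": lambda number, line: seen.append((number, line))},
)
```

In a real implementation the callbacks would presumably be invoked while each page is being extracted, rather than after the whole document has been read.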
filename = "file.pdf"
import slate
with open('file.pdf') as f:
    doc = slate.PDF(f)
print(doc)
gives
Traceback (most recent call last):
File "C:\Users\.....\Desktop\interna\slate\slate.py", line 2, in <module>
import slate
File "C:\Users\.....\Desktop\interna\slate\slate.py", line 4, in <module>
doc = slate.PDF(f)
AttributeError: module 'slate' has no attribute 'PDF'
running Python 3.6.4, installed the latest slate downloaded from github today
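Both frames in this traceback point at `...\slate\slate.py`, which suggests the script itself is named `slate.py`. In that case `import slate` would load the script rather than the installed package, which would explain the missing `PDF` attribute. A small check for this kind of name clash, with `shadows_module` as a hypothetical helper and the paths below invented for the example:

```python
import os


def shadows_module(script_path, module_name):
    """Return True if a script's file name would shadow `module_name` on import."""
    stem = os.path.splitext(os.path.basename(script_path))[0]
    return stem == module_name


clash = shadows_module("interna/slate/slate.py", "slate")
no_clash = shadows_module("interna/slate/extract_text.py", "slate")
```

If that is the cause, renaming the script (and deleting any stale compiled copy next to it) would be the first thing to try.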
Missing newline separation :(
I have installed the git version using the setup script, and when trying to import slate I get the following error:
>>> import slate
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/slate-0.5.1-py3.5.egg/slate/__init__.py", line 66, in <module>
File "/usr/local/lib/python3.5/site-packages/slate-0.5.1-py3.5.egg/slate/classes.py", line 25, in <module>
ImportError: No module named 'utils'
from pdfminer.converter import HTMLConverter
Using slate installed with pip install slate==0.3 pdfminer==20110515
In [4]: pdf = slate.PDF(virtualFile)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-f5142e9f8ced> in <module>()
----> 1 pdf = slate.PDF(virtualFile)
/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in __init__(self, file, password, just_text)
47 self.metadata = self.doc.info
48 if just_text:
---> 49 self._cleanup()
50
51 def _cleanup(self):
/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in _cleanup(self)
55 PDF.
56 """
---> 57 del self.device
58 del self.doc
59 del self.parser
AttributeError: device
using slate installed from the repository, with pdfminer==20140328
slate.PDF executes without errors,
but returns the empty array, []
One of the many consistently failing PDFs:
FailingDocument.pdf
How should my code snippet look like to:
import slate
with open('/path/to/file.pdf') as f:
    doc = slate.PDF(f)
print doc
As it says on the tin - slate seems unimportable on Ubuntu Utopic, python 3.4.2 and what's currently released on pypi.
See full command output below:
thomi@peek$ virtualenv -p python3 ve3
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in ve3/bin/python3
Also creating executable in ve3/bin/python
Installing setuptools, pip...done.
thomi@peek$ . ve3/bin/activate
thomi@peek$ pip install slate
Downloading/unpacking slate
Downloading slate-0.3.zip
Running setup.py (path:/home/thomi/tmp/ve3/build/slate/setup.py) egg_info for package slate
Downloading/unpacking distribute (from slate)
Downloading distribute-0.7.3.zip (145kB): 145kB downloaded
Running setup.py (path:/home/thomi/tmp/ve3/build/distribute/setup.py) egg_info for package distribute
Requirement already satisfied (use --upgrade to upgrade): setuptools>=0.7 in ./ve3/lib/python3.4/site-packages (from distribute->slate)
Installing collected packages: slate, distribute
Running setup.py install for slate
Running setup.py install for distribute
Successfully installed slate distribute
Cleaning up...
thomi@peek$ python
Python 3.4.2 (default, Oct 8 2014, 13:08:17)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import slate
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/thomi/tmp/ve3/lib/python3.4/site-packages/slate/__init__.py", line 48, in <module>
from slate import PDF
ImportError: cannot import name 'PDF'
Hey just trying to get started with slate using the tutorial and getting this error. I've tried with 3 pdfs now. Just running the basic code from the tutorial:
import slate
with open('testpdf3.pdf') as f:
    print type(f)
    doc = slate.PDF(f)
And the stack trace:
Traceback (most recent call last):
File "test.py", line 7, in
doc = slate.PDF(f)
File "C:\Users\S\Anaconda2\lib\site-packages\slate\slate.py", line 38, in init
self.doc.set_parser(self.parser)
File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 318, in set_parser
self.xrefs = parser.read_xref()
File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 747, in read_xref
self.read_xref_from(pos, xrefs)
File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 727, in read_xref_from
xref.load(self, debug=self.debug)
File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 99, in load
self.load_trailer(parser)
File "C:\Users\S\Anaconda2\lib\site-packages\pdfminer\pdfparser.py", line 107, in load_trailer
assert kwd is self.KEYWORD_TRAILER
AssertionError
I've tried printing kwd and it's always an integer while KEYWORD_TRAILER is 'trailer'
Used this command to install:
pip install --upgrade --ignore-installed slate==0.3 pdfminer==20110515
I'm on Windows, Python 2.7.12
Combining two issues into one.
I installed the package from the repository successfully, but there was an issue when running it. The issue was related to the encoding of the file. Please take note of this.
Then I uninstalled the library and installed it with Python 2. I got the following error while running the example:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "build\bdist.win-amd64\egg\slate\classes.py", line 61, in __init__
File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\pdfdocument.py", line 315, in __init__
xref.load(parser)
File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\pdfdocument.py", line 175, in load
(_, obj) = parser.nextobject()
File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 557, in nextobject
(pos, token) = self.nexttoken()
File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 482, in nexttoken
self.fillbuf()
File "c:\Python27\lib\site-packages\pdfminer-20140328-py2.7.egg\pdfminer\psparser.py", line 215, in fillbuf
raise PSEOF('Unexpected EOF')
pdfminer.psparser.PSEOF: Unexpected EOF
Kindly help me.
Sometimes metadata is extracted as raw bytes, rather than in a specific encoding.
We currently only send a whole PDF back as a list of strings. It would be nice to support something like:
with open('nice-info.pdf') as f:
    for page in slate.PDF(f):
        for line in page:
            ...
This would break backward compatibility, though, so perhaps this would be an okay compromise?
with open('nice-info.pdf') as f:
    for page in iter(slate.PDF(f)):
        for line in page:
            # page would now be a `Page` object that yields lines,
            # rather than a plain str
            ...
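To make the proposal concrete, here is a minimal sketch of a `Page` object that yields lines when iterated and still gives the full text through `str()`. The class body is an assumption built only from the comment above; slate does not provide it.

```python
class Page(object):
    """Hypothetical page wrapper: iterating yields lines, str() gives the full text."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(self.text.splitlines())

    def __str__(self):
        return self.text


pages = [Page("line one\nline two"), Page("only line")]
lines = [line for page in pages for line in page]
```

Code that treats a page as a plain string (slicing, `in` checks, concatenation) would still break with this wrapper, which is the compatibility concern raised above.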