pdfminer / pdfminer.six


Community maintained fork of pdfminer - we fathom PDF

Home Page: https://pdfminersix.readthedocs.io

License: MIT License

Makefile 0.17% Python 99.78% Shell 0.05%

pdf parser python

pdfminer.six's Introduction

pdfminer.six

We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the source code of the PDF. It can also be used to get the exact location, font or color of the text.

It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for other purposes than text analysis.

Check out the full documentation on Read the Docs.

Features

Written entirely in Python.
Parse, analyze, and convert PDF documents.
Extract content as text, images, html or hOCR.
PDF-1.7 specification support (well, almost).
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Support for extracting images (JPG, JBIG2, Bitmaps).
Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode)
Support for RC4 and AES encryption.
Support for AcroForm interactive form extraction.
Table of contents extraction.
Tagged contents extraction.
Automatic layout analysis.

How to use

Install Python 3.8 or newer.
Install pdfminer.six.
```
pip install pdfminer.six
```
(Optionally) install extra dependencies for extracting images.
```
pip install 'pdfminer.six[image]'
```
Use the command-line interface to extract text from pdf.
```
pdf2txt.py example.pdf
```

Or use it with Python.

```
from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)
```

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.

pdfminer.six's Issues

UnicodeEncodeError: 'ascii' codec can't encode character

I get a UnicodeEncodeError when using pdfminer (version d79612c from git).

pdfminer_sample3.py

Download https://www.dropbox.com/s/khjfr63o82fa5yn/numbers-test-document.pdf?dl=0 and execute the following script:

#!/usr/bin/env python

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt("numbers-test-document.pdf"))

Error message

Traceback (most recent call last):
  File "pdfminer_sample3.py", line 32, in <module>
    print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
  File "pdfminer_sample3.py", line 14, in convert_pdf_to_txt
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/converter.py", line 186, in __init__
    PDFConverter.__init__(self, rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/converter.py", line 173, in __init__
    self.outfp.write(u"é")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
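
The last line of the traceback can be reproduced in isolation. The following is a minimal Python 3 sketch, not a diagnosis of why the output stream in the report falls back to ASCII: it only shows that the character "é" cannot be encoded as ASCII, while UTF-8 (the codec the script asks for) handles it. The variable names are illustrative.

```python
# The character from the traceback: u'\xe9' is "é".
text = u"\xe9"

# Encoding to ASCII fails because the code point is above 127.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds and yields a two-byte sequence.
utf8_bytes = text.encode("utf-8")

print(ascii_failed, utf8_bytes)
```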

Possible performance improvements

I looked at the psparser.py file and saw the bytesindex function, whose purpose is to replace byteobj[from:to], and more precisely byteobj[idx], so that they return the same result on Python 2 and 3.

From my understanding and experiments, the only difference between Python 2 and 3 when getting an element or a slice from a bytes object is that in Python 3 a single element is returned as an integer instead of a byte string. Slicing behaves the same in both Python 2 and 3.

However, the implementation of bytesindex differs from how slices work when to is a negative value. In the current implementation, if -1 (or any other negative value) is passed as to, the result runs to the end of the byte string instead of to the end minus one byte (or minus the given number of bytes). Because of that implementation detail, every use of bytesindex that needs all bytes up to the end of the byte string is misleading, because it passes -1 as the argument.

Using a proper slice instead of the function could improve performance by reducing function calls. It would also make it more obvious to the reader exactly which data is taken from the byte string.
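
The indexing and slicing behaviour described above can be checked directly. This is a small Python 3 sketch of plain bytes semantics only; it does not call bytesindex, and the sample value is made up for illustration.

```python
data = b"pdfminer"

# Indexing a bytes object in Python 3 returns an int, not a byte string.
first_as_int = data[0]

# A one-element slice returns a byte string on both Python 2 and 3.
first_as_bytes = data[0:1]

# An open-ended slice runs to the end of the byte string.
tail = data[3:]

# A slice ending at -1 stops one byte before the end.
all_but_last = data[3:-1]

print(first_as_int, first_as_bytes, tail, all_but_last)
```

Written this way, "everything to the end" and "everything except the last byte" are visibly different expressions.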

pdfminer.six removes strings from .pdf file

I ran into an issue where pdfminer.six replaces strings such as 'fi' and 'ff' in the text of my PDF file with a character that is displayed in the console as a question mark (?). I guess it is some non-ASCII character, since I cannot replace it by searching for the actual character '?'. I found that these strings ('fi', 'ff' and so on) appear in the file pdffont.py in a list called STANDARD_STRINGS. I tried commenting them out to see if that would fix my problem, but it did not.

The PKG_INFO file of pdfminer.six says:
Metadata-Version: 1.1
Name: pdfminer.six
Version: 20160614
Summary: PDF parser and analyzer

If more info is needed to fix the issue, let me know. I can also provide the PDF file that produces the issue. Other than that, keep up the good work; I really enjoy pdfminer.six!
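
One possible explanation, which the report does not confirm, is that the PDF stores these letter pairs as single Unicode ligature characters (for example U+FB01 for "fi" and U+FB00 for "ff") that the console cannot display. Under that assumption, a minimal sketch of a post-processing step is to apply Unicode NFKC normalization to the extracted text. The helper name expand_ligatures and the sample string are hypothetical.

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    # NFKC normalization decomposes ligature code points into their letters.
    return unicodedata.normalize("NFKC", text)


# Sample text containing the "fi" (U+FB01) and "ff" (U+FB00) ligatures.
sample = "of\ufb01ce and sta\ufb00"
expanded = expand_ligatures(sample)
print(expanded)
```

NFKC also rewrites other compatibility characters (superscripts, full-width forms), so it may change more than the ligatures.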

UnicodeDecodeError

I am using Python 2.7.10. When running the following code I get a UnicodeDecodeError.

This is the code:

```
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from StringIO import StringIO

rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
fp = file('c:\users\public\data\pdfs\policy.pdf', 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 1
caching = True
pagenos=set()

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

text = retstr.getvalue()

fp.close()
device.close()
```

And this is the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 790, in runfile
    execfile(filename, namespace)
  File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 77, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/Public/Public Software/WinPython32/python-2.7.10/Scripts/pdfextract.py", line 28, in <module>
    text = retstr.getvalue()
  File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\StringIO.py", line 272, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

After some digging, turns out the problem is that in PDFConverter the following thing happens:

```
class PDFConverter(PDFLayoutAnalyzer):

    def __init__(self, rsrcmgr, outfp, codec='utf-8', pageno=1, laparams=None):
        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
        self.outfp = outfp
        self.codec = codec
        if hasattr(self.outfp, 'mode'):
            if 'b' in self.outfp.mode:
                self.outfp_binary = True
            else:
                self.outfp_binary = False
        else:
            import io
            if isinstance(self.outfp, io.BytesIO):
                self.outfp_binary = True
            elif isinstance(self.outfp, io.StringIO):
                self.outfp_binary = False
            else:
                try:
                    self.outfp.write(u"é")
                    self.outfp_binary = False
                except TypeError:
                    self.outfp_binary = True
        return
```

Because I am using StringIO from the StringIO module, the buflist in my StringIO object ends up with the u'é' entry, which is of type unicode. Later, when the code writes the PDF content to this list, it writes str values. This mixing causes StringIO to raise a UnicodeDecodeError when it tries to join them all (in the getvalue() call).

I'm not that pro with Python, I managed to get it working by replacing the particular line by:

self.outfp.write(u"é".encode(codec,'ignore'))

But maybe this defeats the purpose of the line (?).

I found a post on StackOverflow with some information that I thought was relevant:

http://stackoverflow.com/questions/5701372/what-caused-this-traceback
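
The stream detection quoted above can be sketched on its own. The following is a hypothetical Python 3 variant, not the library's code and not a tested fix for the Python 2 report: it follows the same three steps (check mode, check the io types, then probe), but probes with an empty string so that no stray character is left in the output stream. The names is_binary_stream and BytesOnlySink are made up for this example.

```python
import io


def is_binary_stream(outfp):
    """Guess whether outfp expects bytes (True) or text (False)."""
    # Real file objects expose a mode string such as "w" or "wb".
    mode = getattr(outfp, "mode", None)
    if isinstance(mode, str):
        return "b" in mode
    # In-memory streams have no mode, so check their types directly.
    if isinstance(outfp, io.BytesIO):
        return True
    if isinstance(outfp, io.StringIO):
        return False
    # Unknown stream type: probe with an empty text string, which
    # writes nothing on success and raises TypeError on a bytes-only stream.
    try:
        outfp.write("")
        return False
    except TypeError:
        return True


class BytesOnlySink:
    """Illustrative stream that rejects anything other than bytes."""

    def write(self, data):
        if not isinstance(data, bytes):
            raise TypeError("bytes required")
        return len(data)


print(is_binary_stream(io.BytesIO()), is_binary_stream(io.StringIO()))
```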

Chunk by words or lines in XML

I'm trying to extract all the text from a PDF with this version of PDFMiner, but it chunks the text by letters even when I change the -M, -L or -W options.

I need to extract it in XML format. Is there any option to extract word by word or line by line?

Thanks

TypeError: ord() expected string of length 1, but int found in pdfminer.utils.decode_text()

  File "mycode.py", line 123, in foo
    for (level, title, destname, actionref, _) in doc.get_outlines():
  File "pdfminer/pdfdocument.py", line 703, in search
    for x in search(entry['First'], level+1):
  File "pdfminer/pdfdocument.py", line 697, in search
    title = decode_text(str_value(entry['Title']))
  File "pdfminer/utils.py", line 271, in decode_text
    return ''.join(PDFDocEncoding[ord(c)] for c in s)
  File "pdfminer/utils.py", line 271, in <genexpr>
    return ''.join(PDFDocEncoding[ord(c)] for c in s)
TypeError: ord() expected string of length 1, but int found

I believe the fix for this in Python3 is pretty simple; we shouldn't use ord():

--- a/pdfminer/utils.py
+++ b/pdfminer/utils.py
@@ -268,7 +268,7 @@ def decode_text(s):
     if s.startswith(b'\xfe\xff'):
         return six.text_type(s[2:], 'utf-16be', 'ignore')
     else:
-        return ''.join(PDFDocEncoding[ord(c)] for c in s)
+        return ''.join(PDFDocEncoding[c] for c in s)

# enc

... However, the reason this is a bug report and not a pull request is that I doubt it's correct for Py2, and don't really know what the correct portable thing to do is.

dumppdf for outline

I am confused about using dumppdf for the outline. I have read the notes and guidelines, but it doesn't work. https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167 It might have a similar cause to #74. Can you explain a bit more? @goulu, assume the miner has discovered a block of text: how can we use the bookmark command to locate which character in the block carries the bookmark?

ModuleNotFoundError

Hi,

I am using Python 3.6 and I cannot set up pdfminer.six.
When running pdf2txt.py samples/simple1.pdf, an error appears:
ModuleNotFoundError: No module named 'pdfminer.settings'

Has anyone run into the same problem?

Thank you very much in advance for your help!

Conversion to xml missing text information

@goulu: Thanks for this awesome package. It works like a charm. It actually resolves this issue which I was facing while using pdfminer3k.

I have run into an issue with this pdf file. I am trying to get an XML output from it by running pdf2txt.py -A -o output.xml -t xml 2b.pdf. But the output XML contains only the following and is missing all the text information:

Interestingly, when I convert this file to xml using pdfminer3k it gives a "list index out of range" error at this line. And if I change the code at that line to the following then it works.

if x:
  try:
    objid1 = x[-2]
    genno = x[-1]
  except:
    return None

Can you please help?

Error when dump all

When I run this command with this file

dumppdf -a invalid.pdf

I receive this error message:

$ dumppdf -a invalid.pdf 
<pdf>Traceback (most recent call last):
  File "/usr/bin/dumppdf", line 268, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/bin/dumppdf", line 265, in main
    dumpall=dumpall, codec=codec, extractdir=extractdir)
  File "/usr/bin/dumppdf", line 216, in dumppdf
    dumpallobjs(outfp, doc, codec=codec)
  File "/usr/bin/dumppdf", line 102, in dumpallobjs
    obj = doc.getobj(objid)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 658, in getobj
    assert objid != 0
AssertionError

Organization icon and members, spelling

This is not a technical issue. It's more about increasing public trust in this repo and its organization.

I've seen that no members are listed in the pdfminer organization. The discussion that led to the creation of this organization suggests that there are quite a few developers involved. Members, would you mind making your organization status publicly visible? This should be possible at https://github.com/orgs/pdfminer/people
There is no organization icon. That looks a bit sad. Can we come up with one and upload it to GitHub? e.g. a modified version of what you find in a web search, if the image license allows derivatives, or an actual free PDF icon
The description of the repository has French spelling (a space before the colon). Can this be fixed? e.g. replace "PDF Parser : fork with Python 2+3 support using six " by "Python PDF Parser -- fork with Python 2+3 support using six" on the repo home.

pycrypto install error

C:\Python36\Scripts>.\pip3 install pdfminer.six
Collecting pdfminer.six
Using cached pdfminer.six-20170419.tar.gz
Requirement already satisfied: six in c:\python36\lib\site-packages (from pdfminer.six)
Collecting pycrypto (from pdfminer.six)
Using cached pycrypto-2.6.1.tar.gz
Collecting chardet (from pdfminer.six)
Using cached chardet-3.0.3-py2.py3-none-any.whl
Installing collected packages: pycrypto, chardet, pdfminer.six
Running setup.py install for pycrypto ... error
Complete output from command c:\python36\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\john\AppData\Local\Temp\pip-d96gbthz-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\Crypto
copying lib\Crypto\pct_warnings.py -> build\lib.win-amd64-3.6\Crypto
copying lib\Crypto\__init__.py -> build\lib.win-amd64-3.6\Crypto
creating build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\hashalgo.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\HMAC.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD2.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD4.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD5.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\RIPEMD.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA224.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA256.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA384.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA512.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\__init__.py -> build\lib.win-amd64-3.6\Crypto\Hash
creating build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\AES.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\ARC2.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\ARC4.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\blockalgo.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\Blowfish.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\CAST.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\DES.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\DES3.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\PKCS1_OAEP.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\PKCS1_v1_5.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\XOR.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\__init__.py -> build\lib.win-amd64-3.6\Crypto\Cipher
creating build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\asn1.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\Counter.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\number.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\py3compat.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\randpool.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\RFC1751.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\winrandom.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\_number_new.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\__init__.py -> build\lib.win-amd64-3.6\Crypto\Util
creating build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\random.py -> build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\_UserFriendlyRNG.py -> build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\__init__.py -> build\lib.win-amd64-3.6\Crypto\Random
creating build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\FortunaAccumulator.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\FortunaGenerator.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\SHAd256.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\__init__.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
creating build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\fallback.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\nt.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\posix.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\rng_base.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\__init__.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
creating build\lib.win-amd64-3.6\Crypto\SelfTest
copying lib\Crypto\SelfTest\st_common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest
copying lib\Crypto\SelfTest\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_AES.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_ARC2.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_ARC4.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_Blowfish.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_CAST.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_DES.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_DES3.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_pkcs1_15.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_pkcs1_oaep.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_XOR.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_HMAC.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD2.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD4.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD5.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_RIPEMD.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA224.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA256.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA384.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA512.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_AllOrNothing.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_chaffing.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_KDF.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_rfc1751.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
creating build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_DSA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_ElGamal.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_importKey.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_RSA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test_random.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test_rpoolcompat.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test__UserFriendlyRNG.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_FortunaAccumulator.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_FortunaGenerator.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_SHAd256.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_fallback.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_generic.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_nt.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_posix.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_winrandom.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_asn1.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_Counter.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_number.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_winrandom.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\test_pkcs1_15.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\test_pkcs1_pss.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
creating build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\AllOrNothing.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\Chaffing.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\KDF.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\__init__.py -> build\lib.win-amd64-3.6\Crypto\Protocol
creating build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\DSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\ElGamal.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\pubkey.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\RSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\_DSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\_RSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\_slowmath.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\__init__.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
creating build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\PKCS1_PSS.py -> build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\PKCS1_v1_5.py -> build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\__init__.py -> build\lib.win-amd64-3.6\Crypto\Signature
Skipping optional fixer: buffer
Skipping optional fixer: idioms
Skipping optional fixer: set_literal
Skipping optional fixer: ws_comma
running build_ext
warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
building 'Crypto.Random.OSRNG.winrandom' extension
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
creating build\temp.win-amd64-3.6\Release\src
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Isrc/ -Isrc/inc-msvc/ -Ic:\python36\include -Ic:\python36\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /Tcsrc/winrand.c /Fobuild\temp.win-amd64-3.6\Release\src/winrand.obj
winrand.c
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(26): error C2061: syntax error: identifier 'intmax_t'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2061: syntax error: identifier 'rem'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(28): error C2059: syntax error: '}'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2061: syntax error: identifier 'imaxdiv_t'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(40): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2146: syntax error: missing ')' before identifier '_Number'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2061: syntax error: identifier '_Number'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(42): error C2059: syntax error: ')'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(45): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2146: syntax error: missing ')' before identifier '_Numerator'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2061: syntax error: identifier '_Numerator'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ','
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(48): error C2059: syntax error: ')'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(50): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(56): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(63): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(69): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(76): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(82): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(89): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(95): error C2143: syntax error: missing '{' before '__cdecl'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

----------------------------------------

Command "c:\python36\python.exe -u -c "import setuptools, tokenize;file='C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\john\AppData\Local\Temp\pip-d96gbthz-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\

AssertionError during layout analysis

I am using the excellent pdfminer.six package for analysis of text in PDFs that my clients receive from their clients.

I hit an assert failure while using the PDFPageAggregator converter. Here are the code, PDF, and stack trace:

https://github.com/hughsw/pdfminer.six/blob/master/tools/pdfstats.py

arm_ed_t_board_elektor_magazine_article.pdf

Traceback (most recent call last):
  File "./pdfstats.py", line 81, in <module>
    sys.exit(main(sys.argv[1:]))
  File "./pdfstats.py", line 71, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 851, in process_page
    self.device.end_page(page)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 677, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 724, in analyze
    LTLayoutContainer.analyze(self, laparams)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 684, in analyze
    textboxes = list(self.group_textlines(laparams, textlines))
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 579, in group_textlines
    neighbors = line.find_neighbors(plane, laparams.line_margin)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 387, in find_neighbors
    return [obj for obj in objs
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 387, in <listcomp>
    return [obj for obj in objs
  File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 373, in find
    for k in self._getrange(bbox):
  File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 335, in _getrange
    for y in drange(y0, y1, self.gridsize):
  File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 173, in drange
    assert v0 < v1
AssertionError: (807.874, 807.874, 50)
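The values in the assertion message show `v0 == v1` (807.874 for both), which suggests a zero-height bounding box. A minimal sketch of a range helper that tolerates this case, assuming `drange` maps a float interval onto grid cell indices. The name `drange_tolerant` and its body are illustrative and not taken from pdfminer.six:

```python
def drange_tolerant(v0, v1, d):
    """Return the grid cell indices covering [v0, v1] for cell size d.

    Hypothetical variant: accepts v0 == v1 (a degenerate, zero-size box)
    instead of asserting v0 < v1.
    """
    if v1 < v0:
        raise ValueError("expected v0 <= v1, got %r > %r" % (v0, v1))
    return range(int(v0) // d, int(v1 + d) // d)


# The values from the traceback fall into a single grid cell.
cells = list(drange_tolerant(807.874, 807.874, 50))
print(cells)
```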

PyCrypto and windows

I tried to install pdfminer with pip on Windows, but I got an error about the wheel bundle for PyCrypto. The problem was solved by changing one line in the setup file from requires = ['six', 'pycrypto'] to requires = ['six', 'pycryptodome'].

I think it would be nice to detect if it's on windows to set requires correctly.
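One way the platform check could look, as a rough sketch only. The helper name `crypto_requirement` is hypothetical, and whether the project would want this in setup.py is an open question:

```python
import sys


def crypto_requirement(platform=sys.platform):
    """Pick the crypto dependency by platform (illustrative helper).

    On Windows, sys.platform starts with "win" (for example "win32").
    """
    return "pycryptodome" if platform.startswith("win") else "pycrypto"


# What a setup.py could pass as its requirement list under this assumption.
requires = ["six", crypto_requirement()]
print(requires)
```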

Type error 'str' does not support the buffer interface

The attached PDF produces this error:

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 4, in <module>
    __import__('pkg_resources').run_script('pdfminer.six==20170119', 'pdf2txt.py')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 127, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 122, in main
    outfp = extract_text(**vars(A))
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/high_level.py", line 83, in extract_text_to_fp
    interpreter.process_page(page)    
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 862, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 362, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 212, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 203, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdffont.py", line 672, in __init__
    CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
TypeError: a bytes-like object is required, not 'str'

stamp-no.pdf

Problem with uploaded package on PyPi

When I install pdfminer from PyPi the source is different than downloaded from github for the same tag.

One example is logging, specifically the change 1d54ecd, which should be present since version 20160614 (one year ago). Two more versions have been released since then.

When I download the package from PyPi for version 20170419 (https://pypi.python.org/packages/43/71/b592b9b384c9bc4429e9a35cc9d61a5eb7fabef2208140c30550a474defe/pdfminer.six-20170419.tar.gz#md5=c43b443ad759441adb07fde5f1ca3435) this change is not there. But when I download the archive from Github for that tag (https://github.com/pdfminer/pdfminer.six/archive/20170419.zip) everything is there.

Now I'm forced to workaround the installation in requirements.txt by adding:

https://github.com/pdfminer/pdfminer.six/archive/20170419.zip#egg=pdfminer.six==20170419

Which is not ideal.

I'm wondering what may be the issue with the package in PyPi?

TypeError: '<' not supported between instances of 'tuple' and 'int'

I'm occasionally getting an error:

File "c:\Anaconda3\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\psparser.py", line 233, in fillbuf
if self.charpos < len(self.buf):
TypeError: '<' not supported between instances of 'tuple' and 'int'

As far as I can tell the only place this might be caused by is line 350 in psparser.py
return (self._parse_comment, len(s))
where a tuple is in fact returned from function _parse_comment(self, s, i)

I hope this is enough info.
Patrick
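To illustrate the suspected cause: if each parse state is expected to return the next character position as a plain integer, then returning a `(function, position)` tuple would end up in `charpos` and break the `<` comparison. The toy state machine below is a simplified sketch under that assumption. `TinyCommentParser` is made up for illustration and is not pdfminer's `PSBaseParser`:

```python
import re

EOL = re.compile(rb"[\r\n]")


class TinyCommentParser:
    """Toy parser that collects one comment that may span several buffers."""

    def __init__(self):
        self._curtoken = b""
        self._parse1 = self._parse_comment  # current state
        self.comments = []

    def _parse_main(self, s, i):
        # Placeholder state: consume the rest of the buffer.
        return len(s)

    def _parse_comment(self, s, i):
        m = EOL.search(s, i)
        if not m:
            # Comment continues in the next buffer: keep the current state
            # and return an int position, not a (function, position) tuple.
            self._curtoken += s[i:]
            return len(s)
        self._curtoken += s[i:m.start(0)]
        self.comments.append(self._curtoken)
        self._curtoken = b""
        self._parse1 = self._parse_main
        return m.end(0)

    def feed(self, buf):
        charpos = 0
        while charpos < len(buf):  # fails if a state returns a tuple
            charpos = self._parse1(buf, charpos)
        return charpos


parser = TinyCommentParser()
first_pos = parser.feed(b"% half a com")  # comment not terminated yet
parser.feed(b"ment\nrest")
print(parser.comments)
```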

XMLConverter error

I can't

pdf2txt.py -t xml something.pdf 
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,595.320,841.920" rotate="0">
<textbox id="0" bbox="262.250,792.168,339.072,802.128">
<textline bbox="262.250,792.168,339.072,802.128">
Traceback (most recent call last):
  File "/path/.venv/bin/pdf2txt.py", line 126, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/path/.venv/bin/pdf2txt.py", line 121, in main
    outfp = extract_text(**vars(A))
  File "/path/.venv/bin/pdf2txt.py", line 61, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/high_level.py", line 83, in extract_text_to_fp
    interpreter.process_page(page)    
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 837, in process_page
    self.device.end_page(page)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 56, in end_page
    self.receive_layout(self.cur_item)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 537, in receive_layout
    render(ltpage)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 483, in render
    render(child)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 517, in render
    render(child)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 508, in render
    render(child)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 521, in render
    (enc(item.fontname, None), bbox2str(item.bbox), item.size))
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/utils.py", line 277, in enc
    x = x.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;').replace('"', '&quot;')
TypeError: a bytes-like object is required, not 'str'
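The traceback shows `x.replace('&', '&amp;')` being called on a bytes value (here the font name). A small sketch of an escaping helper that accepts both bytes and str, assuming UTF-8 is an acceptable decoding. The name `enc_text` is hypothetical and this is not the library's `enc`:

```python
def enc_text(x):
    """Escape XML special characters, decoding bytes input first."""
    if isinstance(x, bytes):
        # Assumption: undecodable bytes may be replaced rather than raising.
        x = x.decode("utf-8", errors="replace")
    return (
        x.replace("&", "&amp;")
        .replace(">", "&gt;")
        .replace("<", "&lt;")
        .replace('"', "&quot;")
    )


print(enc_text(b"Times&Roman"))
```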

Crash in pdf2txt.py, due to recently added code

Running pdf2txt.py on the attached PDF crashes with an attribute error in recently added code, commit 82af7f0 (see #56).

175.pdf

bash-3.2$ python3 /usr/local/bin/pdf2txt.py 175.pdf
INFO:pdfminer.pdfdocument:xref found: pos=b'774066'
INFO:pdfminer.pdfdocument:read_xref_from: start=774066, token=/b'xref'
INFO:pdfminer.pdfdocument:xref objects: {2: (None, 9, 0), 3: (None, 400798, 0), 4: (None, 400895, 0), 5: (None, 773855, 0), 6: (None, 401082, 0), 7: (None, 773571, 0), 8: (None, 773668, 0), 9: (None, 773919, 0), 10: (None, 773970, 0)}
INFO:pdfminer.pdfdocument:trailer: {'Size': 10, 'Root': <PDFObjRef:8>, 'Info': <PDFObjRef:9>}
INFO:pdfminer.pdfdocument:trailer: {'Size': 10, 'Root': <PDFObjRef:8>, 'Info': <PDFObjRef:9>}
Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 129, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/usr/local/bin/pdf2txt.py", line 124, in main
    outfp = extract_text(**vars(A))
  File "/usr/local/bin/pdf2txt.py", line 64, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/usr/local/lib/python3.6/site-packages/pdfminer/high_level.py", line 81, in extract_text_to_fp
    check_extractable=True):
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfpage.py", line 121, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 579, in __init__
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 164, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 84, in resolve1
    x = x.resolve(default=default)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 71, in resolve
    return self.doc.getobj(self.objid)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 689, in getobj
    obj = self._getobj_parse(index, objid)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 655, in _getobj_parse
    while kwd is not self.KEYWORD_OBJ:
AttributeError: 'PDFDocument' object has no attribute 'KEYWORD_OBJ'

NameError: global name 'ImageWriter' is not defined

When running the command below on pdfminer.six version 20160202 with Python 2.7.10, the error "NameError: global name 'ImageWriter' is not defined" occurred.

$ pdf2txt.py -O myoutput -o myoutput/myfile.html -t html -p 1,3 myfile.pdf

In setup.py, URL of project is a 404

The URL should be https://github.com/pdfminer/pdfminer.six/ but is https://github.com/pdfminer/pdfminer
Then, it's also wrong in https://pypi.python.org/pypi/pdfminer.six/20170720

When's the next PyPI release?

Hi,

We're currently in the process of upgrading a codebase from Python 2 to 3, and running into the bug in #15

This has been fixed in master, and an RC with a release date of Jan. 19th was created. When can the release be expected? For the time being, I'll run off a known-good commit, but I'd prefer to install from PyPI at all times.

Add option to ignore Django

I'm running into an issue with pdfminer trying to import settings from Django. I have my virtualenv configured to use the system site-packages so I don't have to constantly recompile packages like numpy. I also happen to have Django installed in my system site-packages directory. Even though my project doesn't need/use Django, pdfminer still tries to access django.conf.settings.PDF_MINER_IS_STRICT.

It would be nice to have a way to ignore Django if it's installed but not actually used.
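A possible shape for such an opt-out, sketched with an environment variable. Both the variable name `PDFMINER_IGNORE_DJANGO` and the function `load_strict` are hypothetical and do not exist in pdfminer.six:

```python
import os


def load_strict(default=True):
    """Return the strict flag, consulting Django only if not opted out."""
    if os.environ.get("PDFMINER_IGNORE_DJANGO"):
        return default
    try:
        from django.conf import settings

        return getattr(settings, "PDF_MINER_IS_STRICT", default)
    except Exception:
        # Django is missing or not configured: fall back to the default.
        return default
```

With the variable set, Django is never imported, so a system-wide Django install has no effect on the result.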

pip version?

Hi,

Is there a plan to release a version on pip?

MaxInt Error

The problem arises when you try to run pdf2txt. The error trace states that maxint cannot be imported. I'm running Python 3.5.1. After researching the error, I've come to the conclusion that Python 3.x removed the system constant maxint, hence the failed import.
Um...
Please send aid.
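For reference, `sys.maxint` was indeed removed in Python 3, while `sys.maxsize` is available on both Python 2.6+ and 3.x. A minimal compatibility sketch (the constant name `MAXINT` is just for illustration):

```python
import sys

# sys.maxint exists only on Python 2; fall back to sys.maxsize elsewhere.
MAXINT = getattr(sys, "maxint", sys.maxsize)
print(MAXINT)
```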

image.py: TypeError: object of type 'zip' has no len() in Python 3.5.1

When extracting text with images, there is an error: "TypeError: object of type 'zip' has no len()".

"File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pdfminer/image.py", line 74, in export_image
if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
TypeError: object of type 'zip' has no len()"

I converted also a PDF file "i1040nr.pdf" in your test set and there is the same error.
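In Python 3, `zip()` returns a lazy iterator, which supports neither `len()` nor indexing. Wrapping it in `list()` restores the Python 2 behaviour. The filter names and parameters below are made-up sample values, not data from pdfminer:

```python
names = ["DCTDecode"]
params = [{}]

# list(...) materializes the pairs so len() and indexing work on Python 3.
filters = list(zip(names, params))
is_single_dct = len(filters) == 1 and filters[0][0] == "DCTDecode"
print(is_single_dct)
```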

pdfminer.six failed to extract layout from some pdf files

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdevice import PDFDevice

rsrcmgr = PDFResourceManager()
laparams = LAParams()
laparams.char_margin = 0.5
laparams.word_margin = 0.5

device = PDFPageAggregator(rsrcmgr, laparams=laparams)

interpreter = PDFPageInterpreter(rsrcmgr, device)

for page in PDFPage.get_pages(fp):
    interpreter.process_page(page)
    layout = device.get_result()

For some pdf files, it appears that device.get_result() returns an object whose _objs has 0 length. The pdf files contain a table with cells holding text or numbers. (I used to use pdfminer3k; under that package, these pdf files raise a zlib error exception.) (For some reason I can't attach the offending pdf here.)

Add LICENSE to MANIFEST?

Hey-lo,

I'm building a version of PDFMiner.six using conda for conda-forge. When possible, we try to include a link to the license file in the meta.yaml specification; doing so requires both:

a license file be included in the package
that license be indexed in MANIFEST.in file.

Would you consider adding a copy of the license to the bundle and updating MANIFEST.in to include it?

Extra quotes in PSLiteral.repr

The .six fork adds extra quotes to PSLiteral.__repr__:

ipdb> from pdfminer.psparser import PSLiteral
ipdb> PSLiteral("Name")
/'Name'

... where regular pdfminer would just print /Name.

This seems to be because of this line, which switched from using '/%s' to '/%r'.

Should be a one-character fix, unless there's some reason using %r is important?

(Seems like a minor issue, I know, but pdfquery uses __repr__ for serializing PDFs, so this becomes a blocker for py3 support.)

Thanks!
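The difference between the two format specifiers can be reproduced in isolation. The class below is a stand-in written for this example, not the real `PSLiteral`:

```python
class Literal:
    """Minimal stand-in for a named literal object."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # %s inserts the plain name; %r would insert repr(name) with quotes.
        return "/%s" % self.name


print(repr(Literal("Name")))
print("/%r" % "Name")
```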

Python 3.5.2 import error

>>> pdfminer.__version__
'1.3.0'
>>> from pdfminer.pdfpage import PDFPage
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pdfminer.pdfpage'

It works OK in Python 2. Why?

#py2 demo
from bs4 import BeautifulSoup
import requests
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO  import StringIO
from io import open
from pdfminer.pdfpage import PDFPage
def pdf_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    f = requests.get(url).content
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
print pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')

Remove #!/usr/bin/env python from library files

Would it be possible to do this?

I'm maintaining pdfminer in Fedora (and for F24+ I've switched the package over to pdfminer.six; nice work, by the way!), and the use of /usr/bin/env python at the top of the library files is... confusing rpm's automatic dependency detection. As a result, python3-pdfminer thinks it needs a python2 install on the system...

To fix this, I wrote a patch yesterday to remove these lines and rebuilt the package, and things appear to still work just fine. I'd be happy to open a pull request for it assuming that this is a desirable change.

Looking at the code, it seems that:

Many of the library files (i.e. the code under pdfminer/) don't actually do anything if run directly.
Of those that do, they seem to just run tests. Could we move those tests to the tests/ directory?

Incorrectly Determining Height of Characters

WrongFontSizes3.pdf

The following simple python code illustrates a bug with parsing the attached PDF file. Specifically, it incorrectly determines the height of text. Namely it thinks the small text is much larger than the big text.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTChar

def parse_pages():

    fp = open('WrongFontSizes3.pdf', 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams(char_margin=3.5, all_texts=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        yield layout

if __name__ == '__main__':
    for page in parse_pages():
        for tbox in page:
            if not isinstance(tbox, LTTextBox):
                continue
            for line in tbox:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    print(char.get_text(), char.size)

Output:

B 29.4555
i 29.4555
g 29.4555
T 29.4555
e 29.4555
x 29.4555
t 29.4555
S 66.96
m 66.96
a 66.96
l 66.96
l 66.96
  66.96
  66.96
T 66.96
e 66.96
x 66.96
t 66.96

Process finished with exit code 0

Page creation fails because of wrong case

I've got a PDF (can't share it because of sensitive information) that fails during page creation via pdfminer.pdfpage.PDFPage.create_pages because it returns an empty iterable, relevant code with my patch:

    @classmethod
    def create_pages(klass, document, debug=0):
        def search(obj, parent):
            if isinstance(obj, int):
                objid = obj
                tree = dict_value(document.getobj(objid)).copy()
            else:
                objid = obj.objid
                tree = dict_value(obj).copy()
            for (k, v) in parent.iteritems():
                if k in klass.INHERITABLE_ATTRS and k not in tree:
                    tree[k] = v
            # FIXME: wrong case?
            tree_type = tree.get('Type', tree.get('type'))
            if tree_type is LITERAL_PAGES and 'Kids' in tree:
                if 1 <= debug:
                    print >>sys.stderr, 'Pages: Kids=%r' % tree['Kids']
                for c in list_value(tree['Kids']):
                    for x in search(c, tree):
                        yield x
            elif tree_type is LITERAL_PAGE:
                if 1 <= debug:
                    print >>sys.stderr, 'Page: %r' % tree
                yield (objid, tree)
        pages = False
        if 'Pages' in document.catalog:
            for (objid, tree) in search(document.catalog['Pages'], document.catalog):
                yield klass(document, objid, tree)
                pages = True
        if not pages:
            # fallback when /Pages is missing.
            for xref in document.xrefs:
                for objid in xref.get_objids():
                    try:
                        obj = document.getobj(objid)
                        if isinstance(obj, dict) and obj.get('Type') is LITERAL_PAGE:
                            yield klass(document, objid, obj)
                    except PDFObjectNotFound:
                        pass
        return

The relevant bit is tree_type = tree.get('Type', tree.get('type')) - the actual PDF stream has a lowercase /type instead of the expected /Type, causing the generator to never yield, which in turn causes StopIteration in pdfquery.

According to the spec (1.7, page 57-58), this is valid and /Type is a different name object than /type. However, in this case, the meaning is the same, and probably the PDF generator is the 'offending' root cause here.

Would a patch like this where capitalized object is checked first with fallback to lowercase be acceptable?
If not, what would be the best way to handle this?
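The proposed lookup order can be isolated into a small helper, shown here on plain dictionaries. The name `get_type` is hypothetical; in the patch above the same logic is written inline as `tree.get('Type', tree.get('type'))`:

```python
def get_type(tree):
    """Prefer the spec-defined /Type key, then fall back to lowercase /type."""
    value = tree.get("Type")
    if value is None:
        value = tree.get("type")
    return value


print(get_type({"type": "Page"}))
```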

PDF text not being positioned when converted to HTML

I'm having trouble converting pdf's into html. Everything seems to work fine except the text is not positioned correctly on the page. It seems like all the text is being bunched together into a few span tags.

I've tried the following to no avail:

Tried converting different pdf's
Tried using various versions of pdfminer.six (tried all three tagged versions listed on github as well as the current pypi version)
Tried playing with the LAParams settings

I also received some encoding errors, which I was able to get past by switching from "from io import StringIO" to "from six import BytesIO".

Has anyone had any success in converting pdf's to html? If so would you mind sharing your configuration? I've attached a sample config code and html output file for reference:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from six import BytesIO

def convert_pdf_to_html(pdf_path, html_path):
    """Converts PDF to HTML file
    ARGS:
    pdf_path: full path of pdf file to convert to html
    html_path: full path of html file containing extracted pdf data
    """

    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'UTF-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec)

    fp = open(pdf_path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    fstr = ''
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fstr = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()

    fstr = fstr.replace(b'\n', b"")
    html_file = open(html_path, 'wb')
    html_file.write(fstr)

ice_dom.zip

TypeError: 'str' does not support the buffer interface at pdfinterp.execute

I'm trying to migrate some pdfminer code to python3 (which was working with the upstream pdfminer on python2.7) using this version of pdfminer. It fails on:

  File "mycode.py", line 123, in main
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 834, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "pdfminer/pdfinterp.py", line 846, in render_contents
    self.execute(list_value(streams))
  File "pdfminer/pdfinterp.py", line 870, in execute
    func(*args)
  File "pdfminer/pdfinterp.py", line 811, in do_Do
    interpreter.render_contents(resources, [xobj], ctm=mult_matrix(matrix, self.ctm))
  File "pdfminer/pdfinterp.py", line 846, in render_contents
    self.execute(list_value(streams))
  File "pdfminer/pdfinterp.py", line 862, in execute
    method = 'do_%s' % name.replace('*', '_a').replace('"', '_w').replace("'", '_q')
TypeError: 'str' does not support the buffer interface

This seems to be a string conversion issue due to the following new code in psparser.py:

def keyword_name(x):
    if not isinstance(x, PSKeyword):
        # (snip)
    else:
        name=x.name
        if six.PY3:
            try:
                name = str(name,'utf-8')
            except:
                pass
    return name

Sticking a 'raise' in there (rather than pass) shows that the utf-8 decoding is failing ("invalid start byte"), and indeed the name looks like binary junk. There are a lot of these bad keyword names in this particular PDF, and they're all on the same page, so it may well be a malformed PDF or a parser bug elsewhere. (Sorry, I can't share the PDF.) Nevertheless, pdfminer should probably be able to handle this more robustly, because these bad names would have been ignored by execute() if STRICT was off, which it is by default in the original pdfminer.

So, I have two sub-buglets here (sorry for lumping them together):

we shouldn't assume that every keyword name can be decoded as utf-8; maybe you could pass errors='ignore' to the str() conversion, or at least return something that's a valid string type, rather than returning the underlying bytes
settings.STRICT should probably remain off by default if you don't want to break existing code
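The first point could look roughly like the sketch below, which always returns a text string. The helper name `keyword_name_text` is illustrative and not part of pdfminer:

```python
def keyword_name_text(name):
    """Return a str for a keyword name, dropping undecodable bytes."""
    if isinstance(name, bytes):
        # errors="ignore" skips invalid UTF-8 sequences instead of raising.
        return name.decode("utf-8", errors="ignore")
    return name


print(keyword_name_text(b"BT"))
print(repr(keyword_name_text(b"\xffq")))
```

A caller that builds `'do_%s' % name` then always receives a string, and unknown operator names can be skipped when strict mode is off.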

Error on install, UBUNTU 16 LTS

Errors

After running pip install pdfminer.six, the download reached 100% but the install finished with a permission error:

Successfully built pdfminer.six
Installing collected packages: pdfminer.six
Exception:
Traceback (most recent call last):
  File "/home/user/.local/lib/python2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
.....
OSError: [Errno 13] Permission ... '/usr/local/lib/python2.7/dist-packages/pdfminer'

Next, repeating with sudo, sudo pip install pdfminer.six resulted in

The directory '/home/user/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/user/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pdfminer.six
Requirement already satisfied: six in /usr/lib/python2.7/dist-packages (from pdfminer.six)
Requirement already satisfied: pycrypto in /usr/lib/python2.7/dist-packages (from pdfminer.six)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20170419

but there is no way to call it:

pdf2txt.py myFile.pdf produced error "/usr/bin/env: “python\r”: not found"

setup.py Python 2.6 failure

Downloading pdfminer.six-20151013.zip (4.2MB)
  100% |################################| 4.2MB 141kB/s 
  Traceback (most recent call last):
    File "<string>", line 20, in <module>
    File "/tmp/pip-build-QgQ8RT/pdfminer.six/setup.py", line 12, in <module>
      install_requires=['six', 'chardet'] if sys.version_info.major>2 else ['six'],
  AttributeError: 'tuple' object has no attribute 'major'
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):

    File "<string>", line 20, in <module>

    File "/tmp/pip-build-QgQ8RT/pdfminer.six/setup.py", line 12, in <module>

      install_requires=['six', 'chardet'] if sys.version_info.major>2 else ['six'],

  AttributeError: 'tuple' object has no attribute 'major'
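The named attributes of `sys.version_info` (such as `.major`) were only added in Python 2.7, so on 2.6 it is a plain tuple. Index access works on every version. A sketch of the same condition written that way:

```python
import sys

# sys.version_info[0] is the major version on Python 2.6 as well as later versions.
major = sys.version_info[0]
install_requires = ['six', 'chardet'] if major > 2 else ['six']
print(install_requires)
```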

Extract painting information from PDF

Today, the parser ignores the extracted painting information (stroke, colors, fill, etc.) and saves only the linewidth. I created a patch to add more information, which helps in some cases.

PDF to HTML conversion issues

Hi,

I'm trying to convert a simple PDF to HTML using:
pdf2txt.py test.pdf -t html -o test.html

Here is the test PDF file:
test.pdf

and here is the output html:

html source:

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:118px; width:448px; height:45px;"><span style="font-family: ; font-size:16px">The Portable Document Format (PDF) is the world’s leading language for describing <br>the printed page</span><span style="font-family: ; font-size:15px"> <br></span><span style="font-family: ; font-size:15px">	<br></span></div><span style="position:absolute; border: black 1px solid; left:72px; top:121px; width:445px; height:13px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:135px; width:86px; height:13px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

Now, the problem is that the width of the line is computed incorrectly, making it wrap differently than in the original doc. This can lead to something like this:

Is there a fix for this issue? If not can you guide me where to look so that I can make a PR with the fix?

I think this tool may be helpful for what we need and in this case we can contribute to it.

Thx a lot!

Please update version on pypi

https://pypi.python.org/pypi/pdfminer2 has v20151206, while latest seems to be v 20160614.
Thanks!

no module named 'pdfminor.settings'

Hey man! Thanks for doing this. I downloaded the package and followed your instructions. Unfortunately, when I try to scrape a pdf, I get an Import Error which says: "no module named 'pdfminor.settings'". I checked the folder pdfminor to see if the settings file was missing, but it isn't. Any idea what the problem might be?
cheers, ed

CCITT filters don't handle bytestrings properly

In the feedbytes methods, character conversion to byte happens via ord. On Py3 this is not needed, since we're dealing with bytestrings directly.

This was also mentioned in #24 and subsequently solved, but a couple of cases were still missing.
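The underlying difference: iterating over `bytes` in Python 3 already yields integers, so calling `ord` on each element raises a TypeError. A small sketch that handles both cases. The helper `byte_values` is illustrative and not part of the CCITT code:

```python
def byte_values(data):
    """Return integer byte values for bytes input or 1-character str items."""
    # On Python 3, each element of a bytes object is already an int.
    return [b if isinstance(b, int) else ord(b) for b in data]


print(byte_values(b"\x00\xff"))
```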

pdfinterp unorderable types

I'm receiving this error when working with certain PDFs. Because of the nature of the data I'm working with, I'm not at liberty to post a sample file but I've had the same issue with several files in the data set I'm working with.

  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 864, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 875, in execute
    (_, obj) = parser.nextobject()
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/psparser.py", line 583, in nextobject
    (pos, token) = self.nexttoken()
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/psparser.py", line 509, in nexttoken
    self.fillbuf()
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 248, in fillbuf
    if self.charpos < len(self.buf):
TypeError: unorderable types: tuple() < int()

Printing the self.charpos variable immediately before that comparison line shows a bunch of integer output as expected and then this right before the error:

(<bound method PSBaseParser._parse_comment of <PDFContentParser: <_io.BytesIO object at 0x7f67996495c8>, bufpos=8192>>, 4096)

Python 3.4 run pdf2txt.py ImportError: No Module named 'six'

I updated Python from 2.7 to 3.4 on my laptop (Windows).

As a test, I ran the command-line command pdf2txt.py simple1.pdf.

It returns ImportError: no module named 'six'.

I guess there's something wrong with the third line: import six

Does anyone know how to fix it?
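
One possible cause, stated as an assumption rather than a diagnosis: packages installed under Python 2.7 are not shared with a separate 3.4 installation, so six may simply be missing for the new interpreter. A small sketch to check which interpreter is running and whether it can find the module (the function name is_importable is made up for this example):

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Show which Python is actually running, then whether it can see six.
print(sys.executable)
print("six importable:", is_importable("six"))
```

If this prints False when run with the same interpreter that runs pdf2txt.py, installing six for that interpreter would be the next thing to try.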

Question: Can pdfminer retrieve text & bboxes without layout?

Is it possible to retrieve all the text on a page, with each fragment returned together with its bounding box, i.e., (x1, y1, x2, y2, text), with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis with minimal overhead.

/usr/bin/env: ‘python\r’: No such file or directory

pdf2txt.py fails to run with:

/usr/bin/env: ‘python\r’: No such file or directory

This appears to be due to a DOS carriage return in the shebang line. Running dos2unix pdf2txt.py appears to fix the issue.
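
For systems without dos2unix, the same conversion can be sketched in a few lines of Python. This is only an illustration of the workaround described above; the function name strip_carriage_returns is hypothetical, and it rewrites the file in place, so keep a copy if in doubt.

```python
from pathlib import Path


def strip_carriage_returns(path):
    """Convert CRLF line endings in a file to LF.

    Returns True if the file was changed, False if it had no CRLF.
    """
    target = Path(path)
    raw = target.read_bytes()
    fixed = raw.replace(b"\r\n", b"\n")
    if fixed == raw:
        return False
    # Write back in place; the shebang line no longer ends in '\r'.
    target.write_bytes(fixed)
    return True
```

Calling strip_carriage_returns on the installed pdf2txt.py script should have the same effect as dos2unix on that file.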

Test environment:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial

pdfminer.six (20160614) - from PyPI via pip

Running in a virtualenv

$ virtualenv --version
15.0.1

$ python --version
Python 3.5.1+

converting to text outside of command line

Hi there,

I'm currently trying to use pdfminer within a Jupyter notebook to convert PDF files to text, but I'm failing miserably :/ I know that you provide the command-line tool pdf2txt.py, but isn't there another way to do this? Let's say I use the example code you provided up to the following point:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# Open a PDF document.
fp = open('sample.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)

How could I create a text file out of this? Would it be possible for you to provide a function with the functionality of pdf2txt.py?

Thanks anyway for the package :)
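
For reference, the current README shows a high-level helper, extract_text from pdfminer.high_level, that can be called from Python without the command line. The sketch below wraps it to write a text file. It assumes a release that ships pdfminer.high_level, which may not apply to the version discussed in this thread. The function name pdf_to_text_file and its extract parameter are hypothetical and exist only so the helper can be tried without a real PDF.

```python
from pathlib import Path


def pdf_to_text_file(pdf_path, txt_path, extract=None):
    """Extract text from pdf_path, write it to txt_path, return its length.

    Hypothetical wrapper; `extract` defaults to pdfminer's extract_text.
    """
    if extract is None:
        # Imported lazily so the wrapper loads even without pdfminer.six.
        from pdfminer.high_level import extract_text as extract
    text = extract(pdf_path)
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)
```

In a notebook, this would be called as pdf_to_text_file("sample.pdf", "sample.txt").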

/home/ubuntu/anaconda3/envs/py35/lib/python3.5/site-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/utils.py in make_compat_str(in_str)
     24 def make_compat_str(in_str):
     25     "In Py2, does nothing. In Py3, converts to string, guessing encoding."
---> 26     assert isinstance(in_str, (bytes, str, unicode))
     27     if six.PY3 and isinstance(in_str, bytes):
     28         enc = chardet.detect(in_str)

AssertionError: