pdfminer.six's Introduction

pdfminer.six

We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the source code of the PDF. It can also be used to get the exact location, font, or color of the text.

It is built in a modular way so that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for purposes other than text analysis.

Check out the full documentation on Read the Docs.

Features

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • Extract content as text, images, html or hOCR.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Support for extracting images (JPG, JBIG2, Bitmaps).
  • Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode)
  • Support for RC4 and AES encryption.
  • Support for AcroForm interactive form extraction.
  • Table of contents extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.

How to use

  • Install Python 3.8 or newer.

  • Install pdfminer.six.

    pip install pdfminer.six

  • (Optionally) install extra dependencies for extracting images.

    pip install 'pdfminer.six[image]'

  • Use the command-line interface to extract text from a PDF.

    pdf2txt.py example.pdf

  • Or use it with Python.

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.

pdfminer.six's Issues

Type error 'str' does not support the buffer interface

The attached PDF produces this error:

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 4, in <module>
    __import__('pkg_resources').run_script('pdfminer.six==20170119', 'pdf2txt.py')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 127, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 122, in main
    outfp = extract_text(**vars(A))
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/high_level.py", line 83, in extract_text_to_fp
    interpreter.process_page(page)    
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 862, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 362, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 212, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 203, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdffont.py", line 672, in __init__
    CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
TypeError: a bytes-like object is required, not 'str'

stamp-no.pdf

AssertionError during layout analysis

I am using the excellent pdfminer.six package for analysis of text in PDFs that my clients receive from their clients.

I hit an assert failure while using the PDFPageAggregator converter. Here are the code, PDF, and stack trace:

https://github.com/hughsw/pdfminer.six/blob/master/tools/pdfstats.py

arm_ed_t_board_elektor_magazine_article.pdf

Traceback (most recent call last):
  File "./pdfstats.py", line 81, in <module>
    sys.exit(main(sys.argv[1:]))
  File "./pdfstats.py", line 71, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 851, in process_page
    self.device.end_page(page)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 677, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 724, in analyze
    LTLayoutContainer.analyze(self, laparams)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 684, in analyze
    textboxes = list(self.group_textlines(laparams, textlines))
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 579, in group_textlines
    neighbors = line.find_neighbors(plane, laparams.line_margin)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 387, in find_neighbors
    return [obj for obj in objs
  File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 387, in <listcomp>
    return [obj for obj in objs
  File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 373, in find
    for k in self._getrange(bbox):
  File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 335, in _getrange
    for y in drange(y0, y1, self.gridsize):
  File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 173, in drange
    assert v0 < v1
AssertionError: (807.874, 807.874, 50)
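
The assertion fails because both bounds are equal (807.874 and 807.874), which a strict v0 < v1 check rejects. The following is a minimal sketch of a more tolerant helper. The name drange_tolerant is hypothetical and the grid arithmetic is an assumption made for illustration, not a copy of pdfminer's code:

```python
def drange_tolerant(v0, v1, d):
    """Return the grid cells covering [v0, v1]; accepts v0 == v1.

    Hypothetical helper: the cell arithmetic is an assumption made for
    illustration, not a copy of pdfminer.utils.drange.
    """
    if v0 > v1:
        raise ValueError("v0 must not exceed v1: %r > %r" % (v0, v1))
    return range(int(v0) // d, int(v1 + d) // d)


# A zero-height bounding box, as in the traceback above, maps to one cell.
cells = list(drange_tolerant(807.874, 807.874, 50))
```

Only reversed bounds are rejected here, so a degenerate (zero-height) text line no longer aborts layout analysis.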

PDF to HTML conversion issues

Hi,

I'm trying to convert a simple PDF to HTML using:
pdf2txt.py test.pdf -t html -o test.html

Here is the test PDF file:
test.pdf

and here is the output html:
screen shot 2017-06-14 at 8 17 58 pm

html source:

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:118px; width:448px; height:45px;"><span style="font-family: ; font-size:16px">The Portable Document Format (PDF) is the world’s leading language for describing <br>the printed page</span><span style="font-family: ; font-size:15px"> <br></span><span style="font-family: ; font-size:15px">	<br></span></div><span style="position:absolute; border: black 1px solid; left:72px; top:121px; width:445px; height:13px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:135px; width:86px; height:13px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

The problem is that the width of the line is computed incorrectly, so the text wraps differently than in the original document. This can lead to something like this:
screen shot 2017-06-14 at 8 21 36 pm

Is there a fix for this issue? If not, can you point me to where to look so that I can make a PR with the fix?

I think this tool may be helpful for what we need and in this case we can contribute to it.

Thx a lot!

Add option to ignore Django

I'm running into an issue with pdfminer trying to import settings from Django. I have my virtualenv configured to use the system site-packages so I don't have to constantly recompile packages like numpy. I also happen to have Django installed in my system site-packages directory. Even though my project doesn't need/use Django, pdfminer still tries to access django.conf.settings.PDF_MINER_IS_STRICT.

It would be nice to have a way to ignore Django if it's installed but not actually used.
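
One possible shape for such an opt-out is a guarded lookup that falls back to a default whenever Django is missing or not configured. This is only a sketch: load_strict_flag is a hypothetical helper, and only the setting name PDF_MINER_IS_STRICT comes from the report above.

```python
def load_strict_flag(default=False):
    """Return PDF_MINER_IS_STRICT from Django settings, or `default`.

    Hypothetical helper. Any failure (Django missing, or installed but not
    configured for this project) falls back to the default.
    """
    try:
        from django.conf import settings
        return bool(getattr(settings, "PDF_MINER_IS_STRICT", default))
    except Exception:
        # ImportError when Django is absent; Django raises its own error
        # when settings are accessed without being configured.
        return default


STRICT = load_strict_flag()
```

Catching a broad exception is a deliberate choice in this sketch, because an unconfigured Django raises its own exception type rather than ImportError.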

pip version?

Hi,

Is there a plan to release a version on pip?

Error when dump all

When I run this command on this file:

dumppdf -a invalid.pdf

I receive this error message:

$ dumppdf -a invalid.pdf 
<pdf>Traceback (most recent call last):
  File "/usr/bin/dumppdf", line 268, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/bin/dumppdf", line 265, in main
    dumpall=dumpall, codec=codec, extractdir=extractdir)
  File "/usr/bin/dumppdf", line 216, in dumppdf
    dumpallobjs(outfp, doc, codec=codec)
  File "/usr/bin/dumppdf", line 102, in dumpallobjs
    obj = doc.getobj(objid)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 658, in getobj
    assert objid != 0
AssertionError

Crash in pdf2txt.py, due to recently added code

Running pdf2txt.py on the attached PDF crashes with an attribute error in recently added code, commit 82af7f0 (see #56).

175.pdf

bash-3.2$ python3 /usr/local/bin/pdf2txt.py 175.pdf
INFO:pdfminer.pdfdocument:xref found: pos=b'774066'
INFO:pdfminer.pdfdocument:read_xref_from: start=774066, token=/b'xref'
INFO:pdfminer.pdfdocument:xref objects: {2: (None, 9, 0), 3: (None, 400798, 0), 4: (None, 400895, 0), 5: (None, 773855, 0), 6: (None, 401082, 0), 7: (None, 773571, 0), 8: (None, 773668, 0), 9: (None, 773919, 0), 10: (None, 773970, 0)}
INFO:pdfminer.pdfdocument:trailer: {'Size': 10, 'Root': <PDFObjRef:8>, 'Info': <PDFObjRef:9>}
INFO:pdfminer.pdfdocument:trailer: {'Size': 10, 'Root': <PDFObjRef:8>, 'Info': <PDFObjRef:9>}
Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 129, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/usr/local/bin/pdf2txt.py", line 124, in main
    outfp = extract_text(**vars(A))
  File "/usr/local/bin/pdf2txt.py", line 64, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/usr/local/lib/python3.6/site-packages/pdfminer/high_level.py", line 81, in extract_text_to_fp
    check_extractable=True):
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfpage.py", line 121, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 579, in __init__
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 164, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 84, in resolve1
    x = x.resolve(default=default)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 71, in resolve
    return self.doc.getobj(self.objid)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 689, in getobj
    obj = self._getobj_parse(index, objid)
  File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 655, in _getobj_parse
    while kwd is not self.KEYWORD_OBJ:
AttributeError: 'PDFDocument' object has no attribute 'KEYWORD_OBJ'

Add LICENSE to MANIFEST?

Hey-lo,

I'm building a version of PDFMiner.six using conda for conda-forge. When possible, we try to include a link to the license file in the meta.yaml specification; doing so requires both:

  • a license file be included in the package
  • that license be indexed in MANIFEST.in file.

Would you consider adding a copy of the license to the bundle and updating MANIFEST.in to include it?

Python 3.4 run pdf2txt.py ImportError: No Module named 'six'

I updated Python from 2.7 to 3.4 on my laptop (Windows).

As a test, I ran the command pdf2txt.py simple1.pdf.

It returned ImportError: No module named 'six'.

I guess there's something wrong with the third line: import six

Does anyone know how to fix it?

TypeError: 'str' does not support the buffer interface at pdfinterp.execute

I'm trying to migrate some pdfminer code to python3 (which was working with the upstream pdfminer on python2.7) using this version of pdfminer. It fails on:

  File "mycode.py", line 123, in main
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 834, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "pdfminer/pdfinterp.py", line 846, in render_contents
    self.execute(list_value(streams))
  File "pdfminer/pdfinterp.py", line 870, in execute
    func(*args)
  File "pdfminer/pdfinterp.py", line 811, in do_Do
    interpreter.render_contents(resources, [xobj], ctm=mult_matrix(matrix, self.ctm))
  File "pdfminer/pdfinterp.py", line 846, in render_contents
    self.execute(list_value(streams))
  File "pdfminer/pdfinterp.py", line 862, in execute
    method = 'do_%s' % name.replace('*', '_a').replace('"', '_w').replace("'", '_q')
TypeError: 'str' does not support the buffer interface

This seems to be a string conversion issue due to the following new code in psparser.py:

def keyword_name(x):
    if not isinstance(x, PSKeyword):
        # (snip)
    else:
        name=x.name
        if six.PY3:
            try:
                name = str(name,'utf-8')
            except:
                pass
    return name

Sticking a 'raise' in there (rather than pass) shows that the utf-8 decoding is failing ("invalid start byte"), and indeed the name looks like binary junk. There are a lot of these bad keyword names in this particular PDF, and they're all on the same page, so it may well be a malformed PDF or a parser bug elsewhere. (Sorry, I can't share the PDF.) Nevertheless, pdfminer should probably be able to handle this more robustly, because these bad names would have been ignored by execute() if STRICT was off, which it is by default in the original pdfminer.

So, I have two sub-buglets here (sorry for lumping them together):

  • we shouldn't assume that every keyword name can be decoded as utf-8; maybe you could pass errors='ignore' to the str() conversion, or at least return something that's a valid string type, rather than returning the underlying bytes
  • settings.STRICT should probably remain off by default if you don't want to break existing code
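
The first suggestion can be sketched as follows. The function name keyword_name_to_str is hypothetical; it only illustrates decoding with errors='ignore' so that a text string is always returned:

```python
def keyword_name_to_str(name):
    """Return a text string for a keyword name that may be raw bytes."""
    if isinstance(name, bytes):
        # errors="ignore" drops undecodable bytes instead of raising.
        return name.decode("utf-8", errors="ignore")
    return str(name)


clean = keyword_name_to_str(b"BT")
junk = keyword_name_to_str(b"\xff\xfeq")
```

With this approach, binary junk in a keyword name yields a shortened (possibly empty) string rather than a bytes object, so later string operations such as replace() do not raise a TypeError.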

Conversion to xml missing text information

@goulu : Thanks for this awesome package. It works like a charm. It actually resolves this issue which I was facing while using pdfminer3k.

I have run into an issue with this pdf file. I am trying to get an xml output from it by running pdf2txt.py -A -o output.xml -t xml 2b.pdf. But the output xml just contains the following and misses all the text information:

capture

Interestingly, when I convert this file to xml using pdfminer3k it gives a "list index out of range" error at this line. And if I change the code at that line to the following then it works.

if x:
  try:
    objid1 = x[-2]
    genno = x[-1]
  except:
    return None

Can you please help?

pdfinterp unorderable types

I'm receiving this error when working with certain PDFs. Because of the nature of the data I'm working with, I'm not at liberty to post a sample file but I've had the same issue with several files in the data set I'm working with.

  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 864, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 875, in execute
    (_, obj) = parser.nextobject()
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/psparser.py", line 583, in nextobject
    (pos, token) = self.nexttoken()
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/psparser.py", line 509, in nexttoken
    self.fillbuf()
  File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 248, in fillbuf
    if self.charpos < len(self.buf):
TypeError: unorderable types: tuple() < int()

Printing the self.charpos variable immediately before that comparison line shows a bunch of integer output as expected and then this right before the error:

(<bound method PSBaseParser._parse_comment of <PDFContentParser: <_io.BytesIO object at 0x7f67996495c8>, bufpos=8192>>, 4096)

Possible performance improvements

I looked at psparser.py and saw the bytesindex function, whose purpose is to replace byteobj[from:to], and more precisely byteobj[idx], so that it returns the same result on Python 2 and 3.

From my understanding and experiments, the only difference between Python 2 and 3 when getting an element or slice from a bytes object is that in Python 3 getting a single element returns an integer instead of a byte string. Getting a slice behaves the same in both Python 2 and 3.

However, the implementation of bytesindex differs from how slices work if to is a negative value. In the current implementation, if -1 (or any other negative value) is passed as to, the result extends to the end of the byte string instead of to the end minus one byte (or minus the exact number of bytes). Because of that implementation detail, all usages of bytesindex that need every byte up to the end of the byte string are misleading, because they pass -1 as the argument.

The possible performance improvement comes from fewer function calls if a proper slice is used instead of the function. It would also make it more obvious to the reader exactly which data is taken from the byte string.
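
The indexing and slicing behaviour described above can be checked with plain Python 3 bytes objects, without any pdfminer code:

```python
data = b"abcdef"

first_as_int = data[0]        # indexing yields an int on Python 3
first_as_bytes = data[0:1]    # a one-byte slice stays a bytes object
to_the_end = data[2:]         # everything from offset 2 to the end
all_but_last = data[2:-1]     # a negative stop excludes the final byte
```

The last two lines show the distinction at issue: an ordinary slice with a stop of -1 drops the final byte, so it is not a way to say "up to the end".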

TypeError: ord() expected string of length 1, but int found in pdfminer.utils.decode_text()

  File "mycode.py", line 123, in foo
    for (level, title, destname, actionref, _) in doc.get_outlines():
  File "pdfminer/pdfdocument.py", line 703, in search
    for x in search(entry['First'], level+1):
  File "pdfminer/pdfdocument.py", line 697, in search
    title = decode_text(str_value(entry['Title']))
  File "pdfminer/utils.py", line 271, in decode_text
    return ''.join(PDFDocEncoding[ord(c)] for c in s)
  File "pdfminer/utils.py", line 271, in <genexpr>
    return ''.join(PDFDocEncoding[ord(c)] for c in s)
TypeError: ord() expected string of length 1, but int found

I believe the fix for this in Python3 is pretty simple; we shouldn't use ord():

--- a/pdfminer/utils.py
+++ b/pdfminer/utils.py
@@ -268,7 +268,7 @@ def decode_text(s):
     if s.startswith(b'\xfe\xff'):
         return six.text_type(s[2:], 'utf-16be', 'ignore')
     else:
-        return ''.join(PDFDocEncoding[ord(c)] for c in s)
+        return ''.join(PDFDocEncoding[c] for c in s)

# enc

... However, the reason this is a bug report and not a pull request is that I doubt it's correct for Py2, and don't really know what the correct portable thing to do is.
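
One candidate for a portable approach is to iterate over a bytearray, which yields integers on both Python 2 and Python 3. The sketch below uses a stand-in table instead of pdfminer's real PDFDocEncoding, and decode_with_table is a hypothetical name; it has only been reasoned through for Python 3.

```python
# Stand-in table for illustration only; pdfminer's PDFDocEncoding differs.
FAKE_DOC_ENCODING = [chr(i) for i in range(256)]


def decode_with_table(s, table=FAKE_DOC_ENCODING):
    """Map each byte of s through a 256-entry lookup table."""
    # Iterating over a bytearray yields ints, so no ord() call is needed.
    return "".join(table[c] for c in bytearray(s))


decoded = decode_with_table(b"AB")
```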

Extract painting information from PDF

Today, the parser ignores the painting information it extracts (stroke, colors, fill, etc.), saving only the line width. I created a patch to add more information, which helps in some cases.

Problem with uploaded package on PyPi

When I install pdfminer from PyPI, the source is different from what I download from GitHub for the same tag.

One example is logging, specifically the change 1d54ecd, which should be present since version 20160614 (one year ago). Two new versions have been released since then.

When I download the package from PyPI for version 20170419 (https://pypi.python.org/packages/43/71/b592b9b384c9bc4429e9a35cc9d61a5eb7fabef2208140c30550a474defe/pdfminer.six-20170419.tar.gz#md5=c43b443ad759441adb07fde5f1ca3435) this change is not there. But when I download the archive from GitHub for that tag (https://github.com/pdfminer/pdfminer.six/archive/20170419.zip) everything is there.

Now I'm forced to work around this in requirements.txt by adding:

https://github.com/pdfminer/pdfminer.six/archive/20170419.zip#egg=pdfminer.six==20170419

Which is not ideal.

I'm wondering what the issue with the package on PyPI may be?

Extra quotes in PSLiteral.__repr__

The .six fork adds extra quotes to PSLiteral.__repr__:

ipdb> from pdfminer.psparser import PSLiteral
ipdb> PSLiteral("Name")
/'Name'

... where regular pdfminer would just print /Name.

This seems to be because of this line, which switched from using '/%s' to '/%r'.

Should be a one-character fix, unless there's some reason using %r is important?

(Seems like a minor issue, I know, but pdfquery uses __repr__ for serializing PDFs, so this becomes a blocker for py3 support.)

Thanks!
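
The difference between the two format specifiers can be reproduced with plain string formatting, independent of pdfminer:

```python
name = "Name"

with_s = "/%s" % name   # uses str(name)
with_r = "/%r" % name   # uses repr(name), which adds the quotes
```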

UnicodeEncodeError: 'ascii' codec can't encode character

I get a UnicodeEncodeError when using pdfminer (the version d79612c from git)

pdfminer_sample3.py

Download https://www.dropbox.com/s/khjfr63o82fa5yn/numbers-test-document.pdf?dl=0 and execute the following script:

#!/usr/bin/env python

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt("numbers-test-document.pdf"))

Error message

Traceback (most recent call last):
  File "pdfminer_sample3.py", line 32, in <module>
    print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
  File "pdfminer_sample3.py", line 14, in convert_pdf_to_txt
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/converter.py", line 186, in __init__
    PDFConverter.__init__(self, rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/converter.py", line 173, in __init__
    self.outfp.write(u"é")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

/usr/bin/env: ‘python\r’: No such file or directory

pdf2txt.py fails to run with:

/usr/bin/env: ‘python\r’: No such file or directory

This appears to be due to a DOS carriage return in the shebang line. Running dos2unix pdf2txt.py appears to fix the issue.
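
Where dos2unix is not available, the same conversion can be sketched in Python. Both function names below are hypothetical, and the script path passed to fix_script is whatever location pip installed pdf2txt.py to:

```python
def to_unix_newlines(data):
    """Convert CRLF line endings in a bytes buffer to LF."""
    return data.replace(b"\r\n", b"\n")


def fix_script(path):
    """Rewrite the file at `path` in place with LF line endings."""
    with open(path, "rb") as fp:
        data = fp.read()
    with open(path, "wb") as fp:
        fp.write(to_unix_newlines(data))


sample = to_unix_newlines(b"#!/usr/bin/env python\r\nimport sys\r\n")
```

Working on bytes avoids any newline translation by the Python I/O layer.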

Test environment:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial

pdfminer.six (20160614) - from PyPi via pip

Running in a virtualenv

$ virtualenv --version
15.0.1

$ python --version
Python 3.5.1+

no module named 'pdfminor.settings'

Hey man! Thanks for doing this. I downloaded the package and followed your instructions. Unfortunately, when I try to scrape a PDF, I get an ImportError which says: "no module named 'pdfminor.settings'". I checked whether the settings file was missing from the pdfminor folder, but it isn't. Any idea what the problem might be?
Cheers, ed

pdfminer.six failed to extract layout from some pdf files

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdevice import PDFDevice

rsrcmgr = PDFResourceManager()
laparams = LAParams()
laparams.char_margin = 0.5
laparams.word_margin = 0.5

device = PDFPageAggregator(rsrcmgr, laparams=laparams)

interpreter = PDFPageInterpreter(rsrcmgr, device)

# fp is assumed to be the PDF file opened in binary mode, e.g. open(path, 'rb')
for page in PDFPage.get_pages(fp):
    interpreter.process_page(page)
    layout = device.get_result()

For some pdf files, it appears that device.get_result() returns an object whose _objs has 0 length. The pdf files contain a table with cells holding text or numbers. (I used to use pdfminer3k; under that package, these pdf files raised an exception for a zlib error.) (For some reason I can't attach the offending pdf here.)

UnicodeDecodeError

I am using Python 2.7.10. When running the following code I get a UnicodeDecodeError.

This is the code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from StringIO import StringIO

rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
fp = file('c:\users\public\data\pdfs\policy.pdf', 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 1
caching = True
pagenos=set()

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

text = retstr.getvalue()

fp.close()
device.close()

And this is the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 790, in runfile
    execfile(filename, namespace)
  File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 77, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/Public/Public Software/WinPython32/python-2.7.10/Scripts/pdfextract.py", line 28, in <module>
    text = retstr.getvalue()
  File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\StringIO.py", line 272, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

After some digging, turns out the problem is that in PDFConverter the following thing happens:

class PDFConverter(PDFLayoutAnalyzer):

    def __init__(self, rsrcmgr, outfp, codec='utf-8', pageno=1, laparams=None):
        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
        self.outfp = outfp
        self.codec = codec
        if hasattr(self.outfp, 'mode'):
            if 'b' in self.outfp.mode:
                self.outfp_binary = True
            else:
                self.outfp_binary = False
        else:
            import io
            if isinstance(self.outfp, io.BytesIO):
                self.outfp_binary = True
            elif isinstance(self.outfp, io.StringIO):
                self.outfp_binary = False
            else:
                try:
                    self.outfp.write(u"é")
                    self.outfp_binary = False
                except TypeError:
                    self.outfp_binary = True
        return

As I am using StringIO from the StringIO module, the buflist in my StringIO object ends up with the u'é' entry, which is of unicode type. Later, when the code writes text from the PDF to this buffer, it writes str types. This mixing causes StringIO to throw a UnicodeDecodeError when it tries to join them all (in the getvalue() call).

I'm not that experienced with Python, but I managed to get it working by replacing that particular line with:

self.outfp.write(u"é".encode(codec,'ignore'))

But maybe this defeats the purpose of the line (?).

I found a post on StackOverflow with some information that I thought was relevant:

http://stackoverflow.com/questions/5701372/what-caused-this-traceback

Remove #!/usr/bin/env python from library files

Would it be possible to do this?

I'm maintaining pdfminer in Fedora (and for F24+ I've switched the package over to pdfminer.six; nice work, by the way!), and the use of /usr/bin/env python at the top of the library files is... confusing rpm's automatic dependency detection. As a result, python3-pdfminer thinks it needs a python2 install on the system...

To fix this, I wrote a patch yesterday to remove these lines and rebuilt the package, and things appear to still work just fine. I'd be happy to open a pull request for it assuming that this is a desirable change.

Looking at the code, it seems that:

  • Many of the library files (i.e. the code under pdfminer/) don't actually do anything if run directly.
  • Of those that do, they seem to just run tests. Could we move those tests to the tests/ directory?

converting to text outside of command line

Hi there,

I'm currently trying to use pdfminer within a Jupyter notebook to convert pdf files to text but fail miserably :/ I know that you provide the command-line tool pdf2txt.py, but isn't this also possible in another way? Let's say I use the example code you provided up to the following point:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# Open a PDF document.
fp = open('sample.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)

How could I create a text file out of this then? Would it be possible to provide a function that offers the pdf2txt.py functionality?

Thanks anyway for the package :)
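
For readers with the same question: the README above shows a high-level extract_text function in pdfminer.high_level. Assuming an installed release that provides it, writing a text file could be sketched like this (pdf_to_text_file and its parameters are hypothetical names):

```python
def pdf_to_text_file(pdf_path, txt_path):
    """Extract the text of pdf_path and write it to txt_path as UTF-8."""
    # Imported here so the sketch can be loaded without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

In a notebook this would be called as pdf_to_text_file("sample.pdf", "sample.txt"), with no need to go through the command-line script.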

XMLConverter error

I can't

pdf2txt.py -t xml something.pdf 
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,595.320,841.920" rotate="0">
<textbox id="0" bbox="262.250,792.168,339.072,802.128">
<textline bbox="262.250,792.168,339.072,802.128">
Traceback (most recent call last):
  File "/path/.venv/bin/pdf2txt.py", line 126, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/path/.venv/bin/pdf2txt.py", line 121, in main
    outfp = extract_text(**vars(A))
  File "/path/.venv/bin/pdf2txt.py", line 61, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/high_level.py", line 83, in extract_text_to_fp
    interpreter.process_page(page)    
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 837, in process_page
    self.device.end_page(page)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 56, in end_page
    self.receive_layout(self.cur_item)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 537, in receive_layout
    render(ltpage)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 483, in render
    render(child)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 517, in render
    render(child)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 508, in render
    render(child)
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 521, in render
    (enc(item.fontname, None), bbox2str(item.bbox), item.size))
  File "/path/.venv/lib/python3.5/site-packages/pdfminer/utils.py", line 277, in enc
    x = x.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;').replace('"', '&quot;')
TypeError: a bytes-like object is required, not 'str'
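
The last frame suggests that `enc` received a bytes value (the font name) and then called `replace` with str arguments, which Python 3 rejects. A minimal sketch of an escape helper that decodes first is below. The name `escape_xml` and its signature are hypothetical, and this is not necessarily how the library fixed the problem.

```python
def escape_xml(value, codec="utf-8"):
    """Escape a text fragment for XML output, accepting str or bytes."""
    if isinstance(value, bytes):
        # bytes.replace() with str arguments raises the TypeError shown in
        # the traceback on Python 3, so decode to str before escaping.
        value = value.decode(codec, errors="replace")
    return (
        value.replace("&", "&amp;")
        .replace(">", "&gt;")
        .replace("<", "&lt;")
        .replace('"', "&quot;")
    )
```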

Question: Can pdfminer retrieve text & bboxes without layout?

Is it possible to just retrieve all the text on the page with each fragment returned with its bounding box, i.e., (x1, y1, x2, y2, text) -- with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis with minimal overheads.
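
A rough sketch of one way this might be done, assuming that `PDFPageAggregator` skips layout analysis when `laparams=None` and then returns a page whose direct children include the individual characters. The helper names `fragments` and `page_fragments` are hypothetical, and characters nested inside figures are not handled here.

```python
def fragments(items):
    """Yield (x0, y0, x1, y1, text) for every item that carries text."""
    for item in items:
        if hasattr(item, "get_text") and hasattr(item, "bbox"):
            x0, y0, x1, y1 = item.bbox
            yield (x0, y0, x1, y1, item.get_text())


def page_fragments(pdf_path):
    """Yield one list of fragments per page, without layout analysis."""
    # Imported lazily so that fragments() can be used on its own.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # Assumption: laparams=None leaves the page as a flat list of objects.
    device = PDFPageAggregator(rsrcmgr, laparams=None)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield list(fragments(device.get_result()))
```

Under that assumption each fragment is a single character, so grouping into words or lines would be left to the caller's own layout analysis.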

MaxInt Error

The problem arises when running pdf2txt. The traceback states that maxint cannot be imported. I'm running Python 3.5.1. After researching the error, I've concluded that Python 3 removed the sys.maxint constant, hence the failed import.
Um...
Please send aid.
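
For reference, a minimal sketch of a lookup that works on both major versions, since `sys.maxsize` exists on Python 2.6+ and Python 3. The name `MAXINT` is hypothetical, and this is not necessarily how the library resolved the issue.

```python
import sys

# sys.maxint exists only on Python 2; sys.maxsize is available on both,
# so it serves as the fallback when maxint is missing.
MAXINT = getattr(sys, "maxint", sys.maxsize)
```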

Incorrectly Determining Height of Characters

WrongFontSizes3.pdf

The following simple Python code illustrates a bug when parsing the attached PDF file: it determines the height of the text incorrectly, reporting the small text as much larger than the big text.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTChar

def parse_pages():

    fp = open('WrongFontSizes3.pdf', 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams(char_margin=3.5, all_texts=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        yield layout

if __name__ == '__main__':
    for page in parse_pages():
        for tbox in page:
            if not isinstance(tbox, LTTextBox):
                continue
            for line in tbox:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    print(char.get_text(), char.size)

Output:

B 29.4555
i 29.4555
g 29.4555
T 29.4555
e 29.4555
x 29.4555
t 29.4555
S 66.96
m 66.96
a 66.96
l 66.96
l 66.96
  66.96
  66.96
T 66.96
e 66.96
x 66.96
t 66.96

Process finished with exit code 0

PyCrypto and windows

I tried to install pdfminer with pip on Windows, but got an error while building the wheel for PyCrypto. The problem was solved by changing a line in the setup file from requires = ['six', 'pycrypto'] to requires = ['six', 'pycryptodome'].

I think it would be nice to detect Windows and set requires accordingly.
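
The suggestion could look roughly like the sketch below. The function name `crypto_requirement` is hypothetical, and checking `sys.platform` in the setup file is only one option.

```python
import sys


def crypto_requirement(platform=None):
    """Pick the crypto dependency name for the given platform string."""
    platform = sys.platform if platform is None else platform
    # sys.platform starts with "win" on Windows (for example "win32").
    return "pycryptodome" if platform.startswith("win") else "pycrypto"


requires = ["six", crypto_requirement()]
```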

Python 3.5.2 import error

>>> pdfminer.__version__
'1.3.0'
>>> from pdfminer.pdfpage import PDFPage
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pdfminer.pdfpage'

It works fine in Python 2. Why?

#py2 demo
from bs4 import BeautifulSoup
import requests
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO  import StringIO
from io import open
from pdfminer.pdfpage import PDFPage
def pdf_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    f = requests.get(url).content
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
print pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')

image.py: TypeError: object of type 'zip' has no len() in Python 3.5.1

Extracting text with images raises the error "TypeError: object of type 'zip' has no len()":

  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pdfminer/image.py", line 74, in export_image
    if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
TypeError: object of type 'zip' has no len()

I also converted the PDF file "i1040nr.pdf" from your test set and got the same error.
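
The underlying behaviour can be reproduced without pdfminer: on Python 3, `zip` returns a lazy iterator, so `len()` and indexing fail until it is materialised. The variable names below are hypothetical stand-ins, not the library's own code.

```python
# Hypothetical stand-ins for the filter names and parameters of an image stream.
names = ["DCTDecode"]
params = [{}]

lazy = zip(names, params)            # Python 3: an iterator without len()
filters = list(zip(names, params))   # a list supports len() and indexing

first_name = filters[0][0] if len(filters) == 1 else None
```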

TypeError: '<' not supported between instances of 'tuple' and 'int'

I'm occasionally getting an error:

File "c:\Anaconda3\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\psparser.py", line 233, in fillbuf
if self.charpos < len(self.buf):
TypeError: '<' not supported between instances of 'tuple' and 'int'

As far as I can tell, the only place that could cause this is line 350 in psparser.py:
return (self._parse_comment, len(s))
where the function _parse_comment(self, s, i) does in fact return a tuple.

I hope this is enough info.
Patrick

Page creation fails because of wrong case

I've got a PDF (can't share it because of sensitive information) that fails during page creation via pdfminer.pdfpage.PDFPage.create_pages: the method returns an empty iterable. The relevant code, with my patch applied:

    @classmethod
    def create_pages(klass, document, debug=0):
        def search(obj, parent):
            if isinstance(obj, int):
                objid = obj
                tree = dict_value(document.getobj(objid)).copy()
            else:
                objid = obj.objid
                tree = dict_value(obj).copy()
            for (k, v) in parent.iteritems():
                if k in klass.INHERITABLE_ATTRS and k not in tree:
                    tree[k] = v
            # FIXME: wrong case?
            tree_type = tree.get('Type', tree.get('type'))
            if tree_type is LITERAL_PAGES and 'Kids' in tree:
                if 1 <= debug:
                    print >>sys.stderr, 'Pages: Kids=%r' % tree['Kids']
                for c in list_value(tree['Kids']):
                    for x in search(c, tree):
                        yield x
            elif tree_type is LITERAL_PAGE:
                if 1 <= debug:
                    print >>sys.stderr, 'Page: %r' % tree
                yield (objid, tree)
        pages = False
        if 'Pages' in document.catalog:
            for (objid, tree) in search(document.catalog['Pages'], document.catalog):
                yield klass(document, objid, tree)
                pages = True
        if not pages:
            # fallback when /Pages is missing.
            for xref in document.xrefs:
                for objid in xref.get_objids():
                    try:
                        obj = document.getobj(objid)
                        if isinstance(obj, dict) and obj.get('Type') is LITERAL_PAGE:
                            yield klass(document, objid, obj)
                    except PDFObjectNotFound:
                        pass
        return

The relevant bit is tree_type = tree.get('Type', tree.get('type')) - the actual PDF stream has a lowercase /type instead of the expected /Type, causing the generator to never yield, which in turn causes StopIteration in pdfquery.

According to the spec (1.7, page 57-58), this is valid and /Type is a different name object than /type. However, in this case, the meaning is the same, and probably the PDF generator is the 'offending' root cause here.

  1. Would a patch like this where capitalized object is checked first with fallback to lowercase be acceptable?
  2. If not, what would be the best way to handle this?

pdfminer.six removes strings from .pdf file

I ran into an issue where pdfminer.six replaces strings such as 'fi' and 'ff' in the text of my PDF file with a character that is displayed in the console as a question mark (?). I guess it is some non-ASCII character, since searching for a literal '?' does not find it. I found that these strings ('fi', 'ff' and so on) appear in the file pdffont.py in a list called STANDARD_STRINGS. I tried commenting them out to see if that would fix my problem, but it did not.

The PKG_INFO file of pdfminer.six says:
Metadata-Version: 1.1
Name: pdfminer.six
Version: 20160614
Summary: PDF parser and analyzer

If more info is needed to fix the issue, let me know. I can also provide the PDF file that produces the issue. Other than that, keep up the good work. I really enjoy pdfminer.six!
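
One guess, not confirmed by the report, is that the unknown character is a Unicode ligature such as U+FB01 (fi) or U+FB00 (ff) that the console cannot display. If so, a sketch like the following would expand it after extraction. The helper name `expand_ligatures` is hypothetical.

```python
import unicodedata


def expand_ligatures(text):
    """Expand compatibility characters such as the fi and ff ligatures."""
    # NFKC replaces U+FB01 with "fi" and U+FB00 with "ff". It also rewrites
    # other compatibility characters, such as full-width forms.
    return unicodedata.normalize("NFKC", text)
```

Because NFKC changes more than ligatures, it may not suit text where those other characters must be preserved.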

pycrypto install error

C:\Python36\Scripts>.\pip3 install pdfminer.six
Collecting pdfminer.six
Using cached pdfminer.six-20170419.tar.gz
Requirement already satisfied: six in c:\python36\lib\site-packages (from pdfminer.six)
Collecting pycrypto (from pdfminer.six)
Using cached pycrypto-2.6.1.tar.gz
Collecting chardet (from pdfminer.six)
Using cached chardet-3.0.3-py2.py3-none-any.whl
Installing collected packages: pycrypto, chardet, pdfminer.six
Running setup.py install for pycrypto ... error
Complete output from command c:\python36\python.exe -u -c "import setuptools, tokenize;file='C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\john\AppData\Local\Temp\pip-d96gbthz-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\Crypto
copying lib\Crypto\pct_warnings.py -> build\lib.win-amd64-3.6\Crypto
copying lib\Crypto_init_.py -> build\lib.win-amd64-3.6\Crypto
creating build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\hashalgo.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\HMAC.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD2.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD4.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD5.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\RIPEMD.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA224.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA256.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA384.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA512.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash_init_.py -> build\lib.win-amd64-3.6\Crypto\Hash
creating build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\AES.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\ARC2.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\ARC4.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\blockalgo.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\Blowfish.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\CAST.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\DES.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\DES3.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\PKCS1_OAEP.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\PKCS1_v1_5.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\XOR.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher_init_.py -> build\lib.win-amd64-3.6\Crypto\Cipher
creating build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\asn1.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\Counter.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\number.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\py3compat.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\randpool.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\RFC1751.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\winrandom.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util_number_new.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util_init_.py -> build\lib.win-amd64-3.6\Crypto\Util
creating build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\random.py -> build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random_UserFriendlyRNG.py -> build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random_init_.py -> build\lib.win-amd64-3.6\Crypto\Random
creating build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\FortunaAccumulator.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\FortunaGenerator.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\SHAd256.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna_init_.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
creating build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\fallback.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\nt.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\posix.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\rng_base.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG_init_.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
creating build\lib.win-amd64-3.6\Crypto\SelfTest
copying lib\Crypto\SelfTest\st_common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest
copying lib\Crypto\SelfTest_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_AES.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_ARC2.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_ARC4.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_Blowfish.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_CAST.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_DES.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_DES3.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_pkcs1_15.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_pkcs1_oaep.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_XOR.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_HMAC.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD2.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD4.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD5.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_RIPEMD.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA224.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA256.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA384.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA512.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_AllOrNothing.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_chaffing.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_KDF.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_rfc1751.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
creating build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_DSA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_ElGamal.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_importKey.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_RSA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test_random.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test_rpoolcompat.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test__UserFriendlyRNG.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_FortunaAccumulator.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_FortunaGenerator.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_SHAd256.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_fallback.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_generic.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_nt.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_posix.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_winrandom.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_asn1.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_Counter.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_number.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_winrandom.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\test_pkcs1_15.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\test_pkcs1_pss.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature_init_.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
creating build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\AllOrNothing.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\Chaffing.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\KDF.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol_init_.py -> build\lib.win-amd64-3.6\Crypto\Protocol
creating build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\DSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\ElGamal.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\pubkey.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\RSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey_DSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey_RSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey_slowmath.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey_init_.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
creating build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\PKCS1_PSS.py -> build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\PKCS1_v1_5.py -> build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature_init_.py -> build\lib.win-amd64-3.6\Crypto\Signature
Skipping optional fixer: buffer
Skipping optional fixer: idioms
Skipping optional fixer: set_literal
Skipping optional fixer: ws_comma
running build_ext
warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
building 'Crypto.Random.OSRNG.winrandom' extension
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
creating build\temp.win-amd64-3.6\Release\src
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Isrc/ -Isrc/inc-msvc/ -Ic:\python36\include -Ic:\python36\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /Tcsrc/winrand.c /Fobuild\temp.win-amd64-3.6\Release\src/winrand.obj
winrand.c
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(26): error C2061: syntax error: identifier 'intmax_t'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2061: syntax error: identifier 'rem'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(28): error C2059: syntax error: '}'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2061: syntax error: identifier 'imaxdiv_t'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(40): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2146: syntax error: missing ')' before identifier '_Number'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2061: syntax error: identifier '_Number'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(42): error C2059: syntax error: ')'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(45): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2146: syntax error: missing ')' before identifier '_Numerator'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2061: syntax error: identifier '_Numerator'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ','
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(48): error C2059: syntax error: ')'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(50): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(56): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(63): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(69): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(76): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(82): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(89): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(95): error C2143: syntax error: missing '{' before '__cdecl'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

----------------------------------------

Command "c:\python36\python.exe -u -c "import setuptools, tokenize;file='C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\john\AppData\Local\Temp\pip-d96gbthz-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\

Organization icon and members, spelling

This is not a technical issue. It's more about increasing public trust in this repo and its organization.

Chunk by words or lines in XML

I'm trying to extract all the text from a PDF with this version of PDFMiner, but it chunks the output by letter even when I change the -M, -L or -W options.

I need to extract it in XML format. Is there any option to extract word by word or line by line?

Thanks

ModuleNotFoundError

Hi,

I am using Python 3.6 and I cannot set up pdfminer.six.
When running pdf2txt.py samples/simple1.pdf, an error appears:
ModuleNotFoundError: No module named 'pdfminer.settings'

Has anyone run into the same problem?

Thank you very much in advance for your help!

NameError: global name 'ImageWriter' is not defined

Running the command below with pdfminer.six version 20160202 on Python 2.7.10 fails with NameError: global name 'ImageWriter' is not defined.

$ pdf2txt.py -O myoutput -o myoutput/myfile.html -t html -p 1,3 myfile.pdf

convert to tag does not work in python 3

I am trying to convert a PDF to a tag file. It worked perfectly fine in Python 2. When I tried the same thing in Python 3, I got this error. Is there any workaround?

/home/ubuntu/anaconda3/envs/py35/lib/python3.5/site-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/utils.py in make_compat_str(in_str)
     24 def make_compat_str(in_str):
     25     "In Py2, does nothing. In Py3, converts to string, guessing encoding."
---> 26     assert isinstance(in_str, (bytes, str, unicode))
     27     if six.PY3 and isinstance(in_str, bytes):
     28         enc = chardet.detect(in_str)

AssertionError:

CCITT filters don't handle bytestrings properly

In the feedbytes methods, character conversion to byte happens via ord. On Py3 this is not needed, since we're dealing with bytestrings directly.

This was also mentioned in #24 and subsequently solved, but a couple of cases were still missing.
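
The difference described above can be sketched in isolation: iterating over a byte string yields integers on Python 3 and one-character strings on Python 2, so `ord` is only needed in the latter case. The helper name `byte_values` is hypothetical and is not the library's `feedbytes` implementation.

```python
def byte_values(data):
    """Yield integer byte values from a byte string on Python 2 and 3."""
    for item in data:
        # Python 3 iterates bytes as ints; Python 2 yields 1-char strings.
        yield item if isinstance(item, int) else ord(item)
```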

setup.py Python 2.6 failure

Downloading pdfminer.six-20151013.zip (4.2MB)
  100% |################################| 4.2MB 141kB/s 
  Traceback (most recent call last):
    File "<string>", line 20, in <module>
    File "/tmp/pip-build-QgQ8RT/pdfminer.six/setup.py", line 12, in <module>
      install_requires=['six', 'chardet'] if sys.version_info.major>2 else ['six'],
  AttributeError: 'tuple' object has no attribute 'major'
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "<string>", line 20, in <module>
    File "/tmp/pip-build-QgQ8RT/pdfminer.six/setup.py", line 12, in <module>
      install_requires=['six', 'chardet'] if sys.version_info.major>2 else ['six'],
  AttributeError: 'tuple' object has no attribute 'major'
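
On Python 2.6, `sys.version_info` is a plain tuple without named attributes, which matches the AttributeError above. A minimal sketch of the same expression using index access, which works on 2.6 and later, follows. It is an illustration only, not necessarily the change that was merged.

```python
import sys

# Index access works on Python 2.6, where version_info has no .major field.
install_requires = ["six", "chardet"] if sys.version_info[0] > 2 else ["six"]
```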

PDF text not being positioned when converted to HTML

I'm having trouble converting PDFs to HTML. Everything seems to work fine, except that the text is not positioned correctly on the page. It seems like all the text is being bunched together into a few span tags.

I've tried the following to no avail:

  • Converting different PDFs
  • Using various versions of pdfminer.six (all three tagged versions listed on GitHub as well as the current PyPI version)
  • Playing with the LAParams settings

I also received some encoding errors, which I was able to get past by switching from "from io import StringIO" to "from six import BytesIO".

Has anyone had any success converting PDFs to HTML? If so, would you mind sharing your configuration? I've attached sample config code and an HTML output file for reference:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from six import BytesIO

def convert_pdf_to_html(pdf_path, html_path):
    """Converts PDF to HTML file
    ARGS:
    pdf_path: full path of pdf file to convert to html
    html_path: full path of html file containing extracted pdf data
    """

    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'UTF-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec)

    fp = open(pdf_path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    fstr = ''
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fstr = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()

    fstr = fstr.replace(b'\n', b"")
    html_file = open(html_path, 'wb')
    html_file.write(fstr)

ice_dom.zip
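
One thing that stands out in the code above is that `laparams` is created but never handed to `HTMLConverter`. In pdfminer.six the converter only runs layout analysis when it receives an `LAParams` object, so this may be why the text ends up bunched together. That is a guess from the snippet, not a confirmed diagnosis. A minimal sketch of the same conversion with the parameter passed through follows; the imports sit inside the function only so that the definition loads where pdfminer.six is not installed.

```python
def convert_pdf_to_html(pdf_path, html_path):
    """Convert a PDF to an HTML file with layout analysis enabled."""
    # Imported lazily; these are the same pdfminer modules the issue uses.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()  # default layout parameters; tune as needed
    with open(pdf_path, 'rb') as fp, open(html_path, 'wb') as out:
        # Passing laparams is the only functional change from the issue's code.
        device = HTMLConverter(rsrcmgr, out, codec='utf-8', laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        # close() writes the HTML footer, so call it before the file closes.
        device.close()
```

Writing straight to the output file also avoids the intermediate BytesIO buffer. Unlike the original, this version keeps the newlines in the generated HTML.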

Error on install, UBUNTU 16 LTS

Errors

  1. After pip install pdfminer.six ... 100% ... but finished with a permission error:
Successfully built pdfminer.six
Installing collected packages: pdfminer.six
Exception:
Traceback (most recent call last):
  File "/home/user/.local/lib/python2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
.....
OSError: [Errno 13] Permission ... '/usr/local/lib/python2.7/dist-packages/pdfminer'
  2. Next, repeating with sudo, sudo pip install pdfminer.six resulted in:
The directory '/home/user/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/user/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pdfminer.six
Requirement already satisfied: six in /usr/lib/python2.7/dist-packages (from pdfminer.six)
Requirement already satisfied: pycrypto in /usr/lib/python2.7/dist-packages (from pdfminer.six)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20170419

But there is no way to call it:

Running pdf2txt.py myFile.pdf produced the error "/usr/bin/env: “python\r”: not found"
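
The `python\r` in that message points to Windows-style CRLF line endings in the installed script: `env` reads the carriage return on the shebang line as part of the interpreter name. A minimal sketch of a workaround, assuming it is acceptable to rewrite the script in place. The helper name `strip_carriage_returns` is hypothetical, and the location of pdf2txt.py on a given system has to be looked up first.

```python
def strip_carriage_returns(path):
    """Rewrite the file at `path` with LF line endings.

    Hypothetical helper for illustration. Returns True if the file was
    changed, False if it contained no CRLF sequences.
    """
    with open(path, 'rb') as f:
        data = f.read()
    fixed = data.replace(b'\r\n', b'\n')
    if fixed == data:
        return False
    with open(path, 'wb') as f:
        f.write(fixed)
    return True
```

Rewriting a script installed with sudo will likely need the same privileges. Passing the script to the interpreter explicitly (python followed by the full path to pdf2txt.py) sidesteps the shebang line without modifying the file.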

When's the next PyPI release?

Hi,

We're currently in the process of upgrading a codebase from Python 2 to 3, and are running into the bug in #15.

This has been fixed in master, and an RC with release date Jan. 19th was created. When can this be expected? For the time being, I'll run off a known-good commit, but I'd prefer to install from PyPI at all times.
