Extract text from any document. No muss. No fuss.
Originally written by @deanmalmgren. Maintained by the good people at @jazzband
Home Page: http://textract.readthedocs.io
License: MIT License
I think it would be nice to support .xls/.xlsx files; I think I can do it using the xlrd library.
Pdfminer and pdftotext
I found your project here by way of some Tika committers I work with.
Hi Dean,
I love the idea of textract. We've built a similar capability in Python here where I work. It was a non-Tika solution and had acceptable performance. I saw textract does not support spreadsheets. I developed a lib that wrapped around CSV, XLS, and XLSX so that I could interface with any of those file formats through a single API -- for the purpose of retrieving the text/cells from such data. ...
Anyhow, all that is background leading up to the Java variant of my work in this realm, which we have open sourced. The python stuff I refer to above was not open sourced, but it helped me scope and design the XText library (https://github.com/OpenSextant/Xponents, XText module). As it mentions, it is mainly a wrapper around Tika, but it allows others to write conversion routines and extend it.
The major topic I see missing in the various threads is managing the metadata. I'm not as much concerned with Java vs. Python or Tika vs. non-Tika. When I convert a random document to text, there is no easy place to record the metadata related to the text and document. There are some standard metadata fields, but the user of these text-extractor solutions still needs to do a lot of work to manage the "extraction" process. Arguably, metadata properties are part of the text. In XText I put some thought into how to do this... and into doing it in a non-Java, non-Tika fashion so I can use the output of text extraction downstream in Python or other solutions.
I've seen this discussion of the metadata issue from the Tika side, which is XML- and Tika-specific: http://wiki.apache.org/tika/MetadataDiscussion. By contrast, in XText I've gone with a JSON approach. I think a step beyond our respective libraries/solutions is to continue the discussion about the metadata: it's all right there and available during the conversion process, so it seems like a natural place to implement this. Think "textract = text + metadata".
Cheers,
Marc
The `.doc` parser currently uses `antiword` to extract content from word document files. This is a great starting point, but it might be nice to replicate the behavior of the `.pdf` parser and have a pure python fallback method to make textract usable across platforms. From a minute of googling around, it looks like others have started down this path:
I'm not sure if it makes more sense to roll our own or just use these other packages to extract text in the right way (I have a slight bias for this approach), but I thought I'd throw this issue together in case it inspires ideas or contributions from others.
@mubaldino mentioned this in #18 but I thought I'd open a separate issue to have a more focused conversation on this particular feature
Other tools, such as Tika, also extract metadata that is embedded in the document. Is this something that we should also (optionally) extract with textract?
From the outset, the goal of this project has been to provide useful text extraction upstream of any subsequent natural language processing, analysis, and modeling. To the extent that metadata is also important for such applications (I've certainly used metadata in my projects before), I'm completely open to adding this functionality but I do have a strong opinion that parsers should not be required to extract metadata. The most important first step is to extract the text content; metadata can always be extracted later.
If we do end up switching to class-based parsers in #39, this would be relatively trivial to implement on a parser-by-parser basis by just adding a `metadata` method to the parser class.
What do others think about this?
Any thoughts on format (json vs xml vs csv)? My initial preference would be for dictionaries and json but could be convinced otherwise.
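To make the idea concrete, a class-based parser with an optional metadata hook might look something like this sketch (the class and method names here are hypothetical, not textract's actual API):

```python
class BaseParser(object):
    """Hypothetical base class for the class-based parsers discussed in #39."""

    def extract(self, filename):
        """Required: return the text content of the document."""
        raise NotImplementedError

    def metadata(self, filename):
        """Optional: parsers are not required to extract metadata."""
        return {}


class TxtParser(BaseParser):
    def extract(self, filename):
        with open(filename) as stream:
            return stream.read()
```

A dictionary return value for `metadata` would map naturally onto json output if that format wins out.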
please add support for .tiff
including this:
https://pypi.python.org/pypi/SpeechRecognition/
I'm getting test failures due to md5sum checks, despite the output looking okay. Some of the other md5sum tests pass fine.
Pass: eml, ps, json, odt
Fail: gif, jpg, png
Is there a known issue, or a known fix? Could this have anything to do with a potentially different version of tesseract? Should tests compare trimmed output text instead of md5sums?
Result of tesseract --version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
Below is my output from "./bin/textract ./tests/jpg/i_heart_jpegs.jpg > out.txt," which has the md5sum of 74b5fcffef2aa3e284dccc0cca577d47 and has two trailing newlines.
"""
EAS TEST
THIS IS A TEST OF THE NATIONAL
EMERGENCY ALERT SYSTEM
THERE IS NO ACTUAL EMERGENCY
"""
Provide support for .png image file format using tesseract ocr.
When I try to parse a .xls file using textract, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/textract-0.5.1-py2.7.egg/textract/parsers/__init__.py", line 38, in process
raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension .xls is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
Kindly resolve.
I'm seeing the Python package hang on large PDFs. To reproduce, use a large pdf like this World Bank Annual Report (pdf) and run `textract.process('filename.pdf')`. I get no output and the command doesn't complete.
`Popen.wait()` in shell.py deadlocks because the output fills up the OS pipe buffer. Switching to `Popen.communicate()` as suggested in the link above fixes the issue, but since `communicate()` returns the output (and error) text directly, I had to modify shell and the pdf parser to return text instead of a pipe object.
Happy to submit a pull request, but is it wise to change the return value of shell (and therefore need to change every parser)? Is there something else we can return that we can call `stdout.read()` on?
FWIW, this post suggests using `tempfile.TemporaryFile()` to work around the buffer issue, but I couldn't figure out how to call `pipe.stdout.read()` with the tempfile.
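A minimal sketch of the `communicate()`-based approach (this is not the actual shell.py code; the function name is illustrative):

```python
import subprocess

def run_command(command):
    """Run a shell command and return its stdout as bytes.  communicate()
    reads both pipes as the process runs, so large outputs can't fill the
    OS pipe buffer and deadlock the way Popen.wait() does."""
    pipe = subprocess.Popen(
        command, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    stdout, stderr = pipe.communicate()
    if pipe.returncode != 0:
        raise RuntimeError("command failed: %r" % (stderr,))
    return stdout
```

Note the trade-off raised above: callers now get text back rather than a pipe object to read from.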
It would be nice if you didn't have to install all of the debian packages and then install all of the python packages on top of them. stdeb looks like an interesting project that could potentially make that easier.
Howdy!
It looks like the setup.py for textract 0.5.0 is b0rk'd and missing the README.rst file:
Downloading/unpacking textract
Getting page https://pypi.python.org/simple/textract/
URLs to search for versions for textract:
* https://pypi.python.org/simple/textract/
Analyzing links from page https://pypi.python.org/simple/textract/
Found link https://pypi.python.org/packages/source/t/textract/textract-0.1.0.tar.gz#md5=8650df3aa0d72494204a1bc05d689b7a (from https://pypi.python.org/simple/textract/), version: 0.1.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.2.0.tar.gz#md5=0d7d5ebf3c435c869ce52c1c59f6f072 (from https://pypi.python.org/simple/textract/), version: 0.2.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.3.0.tar.gz#md5=3fcb1db61bfaea3b9658ae0a5a805b41 (from https://pypi.python.org/simple/textract/), version: 0.3.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.4.0.tar.gz#md5=931dc639060e3deb481c26f49bafd5c8 (from https://pypi.python.org/simple/textract/), version: 0.4.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.5.0.tar.gz#md5=a2eb34a8c66e64b3a6389adfd9707576 (from https://pypi.python.org/simple/textract/), version: 0.5.0
Using version 0.5.0 (newest of versions: 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0)
Downloading textract-0.5.0.tar.gz
Downloading from URL https://pypi.python.org/packages/source/t/textract/textract-0.5.0.tar.gz#md5=a2eb34a8c66e64b3a6389adfd9707576 (from https://pypi.python.org/simple/textract/)
Running setup.py (path:/tmp/pip_build_root/textract/setup.py) egg_info for package textract
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/textract/setup.py", line 11, in <module>
with open("README.rst") as stream:
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/textract/setup.py", line 11, in <module>
with open("README.rst") as stream:
IOError: [Errno 2] No such file or directory: 'README.rst'
----------------------------------------
Cleaning up...
Removing temporary dir /tmp/pip_build_root...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/textract
Exception information:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1229, in prepare_files
req_to_install.run_egg_info()
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 325, in run_egg_info
command_desc='python setup.py egg_info')
File "/usr/lib/python2.7/dist-packages/pip/util.py", line 697, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/textract
This was installing fine until around 1900 PDT on Friday, 20140808.
Thanks and cheers!
Provide support for .jpeg image file format using tesseract ocr.
There appears to be something strange going on with the odt parser. In this document, for example, I'd expect the textract output to be something like:
the quick
brown fox
jumps
over
the
lazy dog
but instead we get something like this:
the quick
brown fox
jumps
over
the
lazy dog
jumps over the
It looks like the list elements appear twice and out of order, probably because of how we are extracting the text: first extracting `text:p` elements, then `text:h` elements, and finally `text:list` elements.
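One way to avoid the duplication would be to walk content.xml in document order rather than extracting `text:p`, `text:h`, and `text:list` separately; a stdlib-only sketch (not the current parser implementation):

```python
import zipfile
from xml.etree import ElementTree

TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

def odt_text(path):
    """Collect paragraphs and headings in the order they appear in
    content.xml.  List items are text:p elements nested inside text:list,
    so a single document-order walk sees each one exactly once."""
    with zipfile.ZipFile(path) as archive:
        root = ElementTree.fromstring(archive.read("content.xml"))
    chunks = []
    for element in root.iter():
        if element.tag in (TEXT_NS + "p", TEXT_NS + "h"):
            chunks.append("".join(element.itertext()))
    return "\n".join(chunks)
```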
The xlsx reader fails to read data from multiple sheets: it reads data from only one sheet rather than either failing or reading all of them.
The present csv parser is working on a temp hack, so let's make it better!
TODO:
I have some initial code up; see Pull Request #75.
Provide support for .gif image file format using tesseract ocr.
Python's email package is fine. The question is what content to extract.
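For the common case of grabbing the text/plain parts, a sketch using only the stdlib email package:

```python
from email import message_from_string

def eml_text(raw):
    """Concatenate the text/plain parts of a raw RFC 2822 message,
    decoding each part with its declared charset (utf-8 as a fallback)."""
    message = message_from_string(raw)
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, "replace"))
    return "\n".join(parts)
```

Whether to also pull in headers (Subject, From) or attachments is exactly the open question here.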
The audio and image extraction libraries are using temporary files and relatively complicated shell commands to work around the fact that we need to use temporary files. This seems like a perfect opportunity to use a `with` statement and a temporary file context manager.
I'm putting this here as a placeholder in case someone else can get to it before me.
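The context-manager version might be sketched like this (the name is illustrative, not the current implementation):

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def temporary_filename(suffix=""):
    """Yield the name of a fresh temporary file that a shell command can
    write to, and remove it when the with-block exits.  mkstemp also
    guarantees a unique path per call, so concurrent extractions would
    not collide on a shared filename."""
    file_descriptor, path = tempfile.mkstemp(suffix=suffix)
    os.close(file_descriptor)
    try:
        yield path
    finally:
        os.remove(path)
```

Usage would be `with temporary_filename(".txt") as name: ...` around the shell call.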
This might be a bugbear and is definitely after .eml...but it would be nice.
redo the README in rst instead of markdown
It looks like every image processed with textract has a message like `Tesseract Open Source OCR Engine v3.02 with Leptonica` at the beginning. This appears to be printed to standard out and is not a reflection of the text that is actually in the document:
vagrant@dev:/vagrant$ textract tests/gif/i_heart_gifs.gif
Tesseract Open Source OCR Engine v3.02 with Leptonica
TEXT
vagrant@dev:/vagrant$ tesseract tests/gif/i_heart_gifs.gif kk.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
vagrant@dev:/vagrant$ cat kk.txt.txt
TEXT
vagrant@dev:/vagrant$ tesseract tests/jpg/i_heart_jpegs.jpg gg.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
vagrant@dev:/vagrant$ tesseract tests/jpg/i_heart_jpegs.jpg gg.txt > kk
vagrant@dev:/vagrant$ cat kk
Tesseract Open Source OCR Engine v3.02 with Leptonica
This should be a simple fix of just redirecting the output of the tesseract command in a smarter way, but I'm in the middle of something else and thought I'd create a placeholder issue in case someone else has the time to address it.
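Until the output redirection is sorted out, a stopgap could be to filter the banner line out of whatever tesseract prints; a sketch (assuming the banner always starts with the same prefix):

```python
BANNER_PREFIX = "Tesseract Open Source OCR Engine"

def strip_tesseract_banner(output):
    """Remove tesseract's version banner line, keeping only the OCR text."""
    kept = [line for line in output.splitlines()
            if not line.startswith(BANNER_PREFIX)]
    return "\n".join(kept).strip()
```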
Great work, thx. It would be nice to include .odt support. This python project seems to support .odt: https://github.com/btimby/fulltext/
antiword
there has to be a way to do this, help me google
It currently rebuilds every single time, which is rather slow when you've locally created all of the files to begin with.
There can be text information embedded in djvu documents.
I've coded in '.wav' file-type support - PR #56.
Let's get more audio file-types supported!
Ideas:
I'm open to all suggestions!
The current implementation uses the same file `tmpout`, which causes problems if you have two threads concurrently extracting text using tesseract.
Apache Tika supports a pretty wide range of formats and appears to have many of the same goals, namely extracting text from a very wide range of formats. It seems like it might be good to integrate the Apache Tika extraction capabilities into textract, not unlike how we use external libraries like `antiword` or `python-pptx` to extract content.
The nice thing about this is that it would provide yet another means of extracting raw text alongside the methods we already have. For PDFs, for example, we expose methods to extract via the `pdfminer` python package as well as the command line utility `pdftotext`, so it would be natural to just add another `tika` extraction method. We could even use tika for better fallback behavior here when there aren't any natively written extraction methods specified.
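If the java dependency is acceptable, a `tika` extraction method could be as thin as shelling out to the tika-app CLI; a sketch (the jar location is an assumption, and `--text` is tika-app's plain-text flag):

```python
import subprocess

TIKA_JAR = "tika-app.jar"  # assumed path to the Apache Tika CLI jar

def tika_command(path, jar=TIKA_JAR):
    """Build the CLI invocation for plain-text extraction."""
    return ["java", "-jar", jar, "--text", path]

def tika_text(path):
    """Requires a Java runtime and the tika-app jar on disk."""
    return subprocess.check_output(tika_command(path)).decode("utf-8")
```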
The downside is that Tika is written in java and doesn't appear to be the easiest thing to install for maven n00bs like me. Python bindings exist but even those carry big caveats about installation.
Random thought: it would be interesting to do an experiment to look at how effective Tika is at extracting text versus the other utilities that are currently included in textract. Given that we often care about the accuracy of word sequences (or even more forgivingly, word frequencies), maybe we can construct a test to see where it makes sense to include Tika and where (if anywhere) it doesn't perform as well.
This idea came up on twitter (here, here, and here) and I should probably get back to them at some point once we figure out what to do here.
@ShawnMilo did my changes break your stuff? (Sorry once again!) Maybe we can work together to get this all resolved to support two development environments sometime next week.
I think that support for open formats such as .odt would be useful. Great project!
These parsers currently use the `tesseract` shell command to extract content from image files. This is a great starting point (!!!), but it might be nice to replicate the behavior of the `.pdf` parser and have a pure python fallback method to make textract more usable across platforms. From a minute of googling around, it doesn't look like there are any real packages that do this, but scikit-learn does have some examples of doing character classification that might be useful:
This sounds like a pretty serious undertaking (and not terribly urgent given the relative portability of the `tesseract-ocr` package), but I thought I'd create this issue for posterity in case someone knows of a python-based fallback that would be appropriate in these situations.
...but I'm running out of ideas to fix. Grrrrrr. Something bad is happening with the virtualenv on RTD that I don't fully understand.
A text file whose filename has no extension is considered unsupported.
raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
I tried to use textract on a pdf and got this error message in response. It doesn't tell me anything useful I can do to fix the problem:
Command failed with exit code -5
I had previously gotten a different error message (I think 127) when trying to run textract on a pdf, and it turned out I needed to install poppler.
Not sure what's going on...
My girlfriend uses a screen reader, JAWS, and I could not help but wonder how applicable this would be to a visually impaired user. It uses a text-to-speech formula that reads EVERYTHING aloud, including slashes and < > marks. I have no coding skills at all, so I am unable to contribute to this project myself, but if I am able to bring some awareness of this type of potential user, I feel as though I may be contributing in a small manner.
I know she has great difficulty with PDF files, and other image based text formats, and I think this project has wonderful potential! I wish you guys the best and hope everything blooms delightfully!!
When tab completing on filenames, it adds a space. We probably need to specify that `filename` is a file and not a string for this to work properly.
The `.ps` parser currently uses `pstotext` to extract content from postscript files. This is a great starting point, but it might be nice to replicate the behavior of the `.pdf` parser and have a pure python fallback method to make textract usable across platforms. From a minute of googling around, it looks like others have started down this path:
I'm not sure if it makes more sense to roll our own or just use these other packages to extract text in the right way (I have a slight bias for this approach), but I thought I'd throw this issue together in case it inspires ideas or contributions from others.
Microsoft exchange email format is tricky. Related to #4
As far as I can see, a directory is not an option when choosing a file.
That would be a nice feature, something like:
This would be very helpful while translating for example web sites, and extracting all the strings to a resource file. We can discuss about it, if you think it makes sense.
Thanks
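A directory mode could be as simple as walking the tree and handing each supported file to the existing single-file path; a stdlib sketch (the extension set is an illustrative subset, and `process` stands in for the per-file entry point):

```python
import os

SUPPORTED = {".pdf", ".docx", ".odt", ".txt"}  # illustrative subset

def iter_supported_files(directory):
    """Yield every file under *directory* whose extension looks supported."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                yield os.path.join(root, name)

# for path in iter_supported_files("site/"):
#     text = process(path)  # hypothetical per-file entry point
```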
Suggestion for adding support for extension .ppt.
This currently uses the `unrtf` command line tool, but it would be nice to have a pure python extraction method as a fallback.
@pudo proposed this idea in #66 (comment) and I wanted to be sure to capture it before I forget.
With the way that the pdf parser currently works, you have to know beforehand whether the pdf is a scanned image or whether it has embedded text. This is inconvenient for end users. A better option would be:
textract some_pdf.pdf # try to extract embedded text first. if that fails, try OCR
textract -m tesseract scanned.pdf # do OCR
textract -m pdftotext embedded.pdf # do text extraction with pdftotext utility
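The fallback logic itself is small; a sketch where `extract_embedded` and `ocr` are hypothetical stand-ins for the pdftotext and tesseract methods:

```python
def extract_pdf_text(path, extract_embedded, ocr):
    """Try embedded-text extraction first and fall back to OCR only when
    it yields nothing, so users don't need to know the pdf type up front."""
    text = extract_embedded(path)
    if text and text.strip():
        return text
    return ocr(path)
```

Passing `-m` would simply bypass this and force one method.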
I am getting a "Command failed with exit code 127" message when I try to convert a PDF on my Mac OS X machine.