Extract text from any document. No muss. No fuss.
Originally written by @deanmalmgren. Maintained by the good people at @jazzband
Home Page: http://textract.readthedocs.io
License: MIT License
I think it would be nice to support .xls/.xlsx files; I think I can do it using the xlrd library.
Pdfminer and pdftotext
I found your project here by way of some Tika committers I work with.
Hi Dean,
I love the idea of textract. We've built a similar capability in Python here where I work. It was a non-Tika solution and had acceptable performance. I saw textract does not support spreadsheets. I developed a lib that wrapped around CSV, XLS, and XLSX so that I could interface with any of those file formats through a single API -- for the purpose of retrieving the text/cells from such data. ...
Anyhow, all that is background leading up to the Java variant of my work in this realm, which we have open sourced. The python stuff I refer to above was not open sourced, but it helped me scope and design the XText library (https://github.com/OpenSextant/Xponents, XText module). As it mentions, it is mainly a wrapper around Tika, but it allows others to write conversion routines and extend it.
The major topic I see missing in the various threads is managing the metadata. I'm not as much concerned with Java vs. Python or Tika vs. non-Tika. When I convert a random document to text, there is no easy place to record the metadata related to the text and document. There are some standard metadata fields, but the user of these text-extractor solutions still needs to do a lot of work to manage the "extraction" process. Arguably, metadata properties are part of the text. In XText I put some thought into how to do this... and into doing it in a non-Java, non-Tika fashion so I can use the output of text extraction downstream in Python or other solutions.
I've seen this discussion of the metadata issue from the Tika side, which is XML- and Tika-specific: http://wiki.apache.org/tika/MetadataDiscussion. By contrast, in XText I've gone with a JSON approach. I think a step beyond our respective libraries/solutions is to continue the discussion about the metadata: it's all right there and available during the conversion process, so it seems like a natural place to implement this. Think "textract = text + metadata".
Cheers,
Marc
The `.doc` parser currently uses `antiword` to extract content from word document files. This is a great starting point, but it might be nice to replicate the behavior of the `.pdf` parser and have a pure python fallback method to make textract usable across platforms. From a minute of googling around, it looks like others have started down this path:
I'm not sure if it makes more sense to roll our own or just use these other packages to extract text in the right way (I have a slight bias for this approach), but I thought I'd throw this issue together in case it inspires ideas or contributions from others.
@mubaldino mentioned this in #18 but I thought I'd open a separate issue to have a more focused conversation on this particular feature
Other tools, such as Tika, also extract metadata that is embedded in the document. Is this something that we should also (optionally) extract with textract?
From the outset, the goal of this project has been to provide useful text extraction upstream of any subsequent natural language processing, analysis, and modeling. To the extent that metadata is also important for such applications (I've certainly used metadata in my projects before), I'm completely open to adding this functionality but I do have a strong opinion that parsers should not be required to extract metadata. The most important first step is to extract the text content; metadata can always be extracted later.
If we do end up switching to class-based parsers in #39, this would be relatively trivial to implement on a parser-by-parser basis by just adding a `metadata` method to the parser class.
What do others think about this?
Any thoughts on format (json vs xml vs csv)? My initial preference would be for dictionaries and json but could be convinced otherwise.
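To make the idea concrete, a class-based parser with an optional metadata hook might look something like this sketch (the class and method names here are hypothetical, not textract's actual API):

```python
class BaseParser(object):
    """Hypothetical base class for the class-based parsers discussed in #39."""

    def extract(self, filename):
        """Required: return the text content of the document."""
        raise NotImplementedError

    def metadata(self, filename):
        """Optional: parsers are not required to extract metadata."""
        return {}


class TxtParser(BaseParser):
    def extract(self, filename):
        with open(filename) as stream:
            return stream.read()
```

A dictionary return value for `metadata` would map naturally onto json output if that format wins out.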
please add support for .tiff
including this:
https://pypi.python.org/pypi/SpeechRecognition/
I'm getting test failures due to md5sum checks, despite the output looking okay. Some of the other md5sum tests pass fine.
Pass: eml, ps, json, odt
Fail: gif, jpg, png
Is there a known issue, or a known fix? Could this have anything to do with a potentially different version of tesseract? Should tests compare trimmed output text instead of md5sums?
Result of tesseract --version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
Below is my output from "./bin/textract ./tests/jpg/i_heart_jpegs.jpg > out.txt," which has the md5sum of 74b5fcffef2aa3e284dccc0cca577d47 and has two trailing newlines.
"""
EAS TEST
THIS IS A TEST OF THE NATIONAL
EMERGENCY ALERT SYSTEM
THERE IS NO ACTUAL EMERGENCY
"""
Provide support for .png image file format using tesseract ocr.
When I try to parse a .xls file using textract, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/textract-0.5.1-py2.7.egg/textract/parsers/__init__.py", line 38, in process
raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension .xls is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
Kindly resolve.
I'm seeing the Python package hang on large PDFs. To reproduce, use a large pdf like this World Bank Annual Report (pdf) and run `textract.process('filename.pdf')`. I get no output and the command doesn't complete.
`Popen.wait()` in shell.py deadlocks because the output fills up the OS pipe buffer. Switching to `Popen.communicate()` as suggested in the link above fixes the issue, but since `communicate()` returns the output (and error) text directly, I had to modify shell and the pdf parser to return text instead of a pipe object.
Happy to submit a pull request, but is it wise to change the return value of shell (and therefore need to change every parser)? Is there something else we can return that we can call `stdout.read()` on?
FWIW, this post suggests using `tempfile.TemporaryFile()` to work around the buffer issue, but I couldn't figure out how to call `pipe.stdout.read()` with the tempfile.
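A minimal sketch of the `communicate()`-based approach (this is not the actual shell.py code; the function name is illustrative):

```python
import subprocess

def run_command(command):
    """Run a shell command and return its stdout as bytes.  communicate()
    reads both pipes as the process runs, so large outputs can't fill the
    OS pipe buffer and deadlock the way Popen.wait() does."""
    pipe = subprocess.Popen(
        command, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    stdout, stderr = pipe.communicate()
    if pipe.returncode != 0:
        raise RuntimeError("command failed: %r" % (stderr,))
    return stdout
```

Note the trade-off raised above: callers now get text back rather than a pipe object to read from.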
It would be nice if you didn't have to install all of the debian packages and then install all of the python packages on top of them. stdeb looks like an interesting project that could potentially make that easier.
Howdy!
It looks like the setup.py for textract 0.5.0 is b0rk'd and missing the README.rst file:
Downloading/unpacking textract
Getting page https://pypi.python.org/simple/textract/
URLs to search for versions for textract:
* https://pypi.python.org/simple/textract/
Analyzing links from page https://pypi.python.org/simple/textract/
Found link https://pypi.python.org/packages/source/t/textract/textract-0.1.0.tar.gz#md5=8650df3aa0d72494204a1bc05d689b7a (from https://pypi.python.org/simple/textract/), version: 0.1.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.2.0.tar.gz#md5=0d7d5ebf3c435c869ce52c1c59f6f072 (from https://pypi.python.org/simple/textract/), version: 0.2.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.3.0.tar.gz#md5=3fcb1db61bfaea3b9658ae0a5a805b41 (from https://pypi.python.org/simple/textract/), version: 0.3.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.4.0.tar.gz#md5=931dc639060e3deb481c26f49bafd5c8 (from https://pypi.python.org/simple/textract/), version: 0.4.0
Found link https://pypi.python.org/packages/source/t/textract/textract-0.5.0.tar.gz#md5=a2eb34a8c66e64b3a6389adfd9707576 (from https://pypi.python.org/simple/textract/), version: 0.5.0
Using version 0.5.0 (newest of versions: 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0)
Downloading textract-0.5.0.tar.gz
Downloading from URL https://pypi.python.org/packages/source/t/textract/textract-0.5.0.tar.gz#md5=a2eb34a8c66e64b3a6389adfd9707576 (from https://pypi.python.org/simple/textract/)
Running setup.py (path:/tmp/pip_build_root/textract/setup.py) egg_info for package textract
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/textract/setup.py", line 11, in <module>
with open("README.rst") as stream:
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/textract/setup.py", line 11, in <module>
with open("README.rst") as stream:
IOError: [Errno 2] No such file or directory: 'README.rst'
----------------------------------------
Cleaning up...
Removing temporary dir /tmp/pip_build_root...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/textract
Exception information:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1229, in prepare_files
req_to_install.run_egg_info()
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 325, in run_egg_info
command_desc='python setup.py egg_info')
File "/usr/lib/python2.7/dist-packages/pip/util.py", line 697, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/textract
This was installing fine until around 1900 PDT on Friday, 20140808.
Thanks and cheers!
Provide support for .jpeg image file format using tesseract ocr.
There appears to be something strange going on with the odt parser. In this document, for example, I'd expect the textract output to be something like:
the quick
brown fox
jumps
over
the
lazy dog
but instead we get something like this:
the quick
brown fox
jumps
over
the
lazy dog
jumps over the
It looks like the list elements appear twice and out of order, probably because of how we are extracting the text: first extracting `text:p` elements, then `text:h` elements, and finally `text:list` elements.
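One way to avoid the duplication would be to walk content.xml in document order rather than extracting `text:p`, `text:h`, and `text:list` separately; a stdlib-only sketch (not the current parser implementation):

```python
import zipfile
from xml.etree import ElementTree

TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

def odt_text(path):
    """Collect paragraphs and headings in the order they appear in
    content.xml.  List items are text:p elements nested inside text:list,
    so a single document-order walk sees each one exactly once."""
    with zipfile.ZipFile(path) as archive:
        root = ElementTree.fromstring(archive.read("content.xml"))
    chunks = []
    for element in root.iter():
        if element.tag in (TEXT_NS + "p", TEXT_NS + "h"):
            chunks.append("".join(element.itertext()))
    return "\n".join(chunks)
```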
The xlsx reader fails to read data from multiple sheets: it reads data from only one sheet rather than either failing or reading all of them.
The present csv parser is working on a temp hack, so let's make it better!
TODO:
I have some initial code up; see Pull Request #75.
Provide support for .gif image file format using tesseract ocr.
Python's email package is fine. The question is what content to extract.
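For the common case of grabbing the text/plain parts, a sketch using only the stdlib email package:

```python
from email import message_from_string

def eml_text(raw):
    """Concatenate the text/plain parts of a raw RFC 2822 message,
    decoding each part with its declared charset (utf-8 as a fallback)."""
    message = message_from_string(raw)
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, "replace"))
    return "\n".join(parts)
```

Whether to also pull in headers (Subject, From) or attachments is exactly the open question here.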
The audio and image extraction libraries are using temporary files and relatively complicated shell commands to work around the fact that we need to use temporary files. This seems like a perfect opportunity to use a `with` statement and a temporary file context manager.
I'm putting this here as a placeholder in case someone else can get to it before me.
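The context-manager version might be sketched like this (the name is illustrative, not the current implementation):

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def temporary_filename(suffix=""):
    """Yield the name of a fresh temporary file that a shell command can
    write to, and remove it when the with-block exits.  mkstemp also
    guarantees a unique path per call, so concurrent extractions would
    not collide on a shared filename."""
    file_descriptor, path = tempfile.mkstemp(suffix=suffix)
    os.close(file_descriptor)
    try:
        yield path
    finally:
        os.remove(path)
```

Usage would be `with temporary_filename(".txt") as name: ...` around the shell call.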
This might be a bugbear and is definitely after .eml...but it would be nice.
redo the README in rst instead of markdown
It looks like every image processed with textract has a message like `Tesseract Open Source OCR Engine v3.02 with Leptonica` at the beginning. This appears to be printed to standard out and is not a reflection of the text that is actually in the document:
vagrant@dev:/vagrant$ textract tests/gif/i_heart_gifs.gif
Tesseract Open Source OCR Engine v3.02 with Leptonica
TEXT
vagrant@dev:/vagrant$ tesseract tests/gif/i_heart_gifs.gif kk.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
vagrant@dev:/vagrant$ cat kk.txt.txt
TEXT
vagrant@dev:/vagrant$ tesseract tests/jpg/i_heart_jpegs.jpg gg.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
vagrant@dev:/vagrant$ tesseract tests/jpg/i_heart_jpegs.jpg gg.txt > kk
vagrant@dev:/vagrant$ cat kk
Tesseract Open Source OCR Engine v3.02 with Leptonica
This should be a simple fix of just redirecting the output of the tesseract command in a smarter way, but I'm in the middle of something else and thought I'd create a placeholder issue in case someone else has the time to address it.
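Until the output redirection is sorted out, a stopgap could be to filter the banner line out of whatever tesseract prints; a sketch (assuming the banner always starts with the same prefix):

```python
BANNER_PREFIX = "Tesseract Open Source OCR Engine"

def strip_tesseract_banner(output):
    """Remove tesseract's version banner line, keeping only the OCR text."""
    kept = [line for line in output.splitlines()
            if not line.startswith(BANNER_PREFIX)]
    return "\n".join(kept).strip()
```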
Great work, thx. It would be nice to include .odt support. This python project seems to support .odt: https://github.com/btimby/fulltext/
antiword
there has to be a way to do this, help me google
It currently rebuilds every single time, which is rather slow when you've locally created all of the files to begin with.
There can be text information embedded in djvu documents.
I've coded in '.wav' file-type support - PR #56.
Let's get more audio file-types supported!
Ideas:
I'm open to all suggestions!
The current implementation uses the same file `tmpout`, which causes problems if you have two threads concurrently extracting text using tesseract.
Apache Tika supports a pretty wide range of formats and appears to have many of the same goals, namely extracting text from a very wide range of formats. It seems like it might be good to integrate the Apache Tika extraction capabilities into textract, not unlike how we use external libraries like `antiword` or `python-pptx` to extract content.
The nice thing about this is that it would provide yet another means of extracting raw text alongside the methods we already have. For PDFs, for example, we expose methods to extract via the `pdfminer` python package as well as the command line utility `pdftotext`, so it would be natural to just add another `tika` extraction method. We could even use tika for better fallback behavior here when there aren't any natively written extraction methods specified.
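If the java dependency is acceptable, a `tika` extraction method could be as thin as shelling out to the tika-app CLI; a sketch (the jar location is an assumption, and `--text` is tika-app's plain-text flag):

```python
import subprocess

TIKA_JAR = "tika-app.jar"  # assumed path to the Apache Tika CLI jar

def tika_command(path, jar=TIKA_JAR):
    """Build the CLI invocation for plain-text extraction."""
    return ["java", "-jar", jar, "--text", path]

def tika_text(path):
    """Requires a Java runtime and the tika-app jar on disk."""
    return subprocess.check_output(tika_command(path)).decode("utf-8")
```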
The downside is that Tika is written in java and doesn't appear to be the easiest thing to install for maven n00bs like me. Python bindings exist but even those carry big caveats about installation.
Random thought: it would be interesting to do an experiment to look at how effective Tika is at extracting text versus the other utilities that are currently included in textract. Given that we often care about the accuracy of word sequences (or even more forgivingly, word frequencies), maybe we can construct a test to see where it makes sense to include Tika and where (if anywhere) it doesn't perform as well.
This idea came up on twitter (here, here, and here) and I should probably get back to them at some point once we figure out what to do here.
@ShawnMilo did my changes break your stuff? (Sorry once again!) Maybe we can work together to get this all resolved to support two development environments sometime next week.
I think that support for open formats such as .odt would be useful. Great project!
These parsers currently use the `tesseract` shell command to extract content from image files. This is a great starting point (!!!), but it might be nice to replicate the behavior of the `.pdf` parser and have a pure python fallback method to make textract more usable across platforms. From a minute of googling around, it doesn't look like there are any real packages that do this, but scikit-learn does have some examples of doing character classification that might be useful:
This sounds like a pretty serious undertaking (and not terribly urgent given the relative portability of the `tesseract-ocr` package), but I thought I'd create this issue for posterity in case someone knows of a python-based fallback that would be appropriate in these situations.
...but I'm running out of ideas to fix. Grrrrrr. Something bad is happening with the virtualenv on RTD that I don't fully understand.
A text file whose filename has no extension is considered unsupported.
raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
I tried to use textract on a pdf and got this error message in response. It doesn't tell me anything useful I can do to fix the problem:
Command failed with exit code -5
I had previously gotten a different error message (I think 127) when trying to run textract on a pdf, and it turned out I needed to install poppler.
Not sure what's going on...
My girlfriend uses a screen reader, JAWS, and I could not help but wonder how applicable this would be to a visually impaired user. It uses a text-to-speech formula that reads EVERYTHING aloud, including slashes and < > marks. I have no coding skills at all, so I am unable to contribute to this project myself, but if I am able to bring some awareness of this type of potential user, I feel as though I may be contributing in a small manner.
I know she has great difficulty with PDF files, and other image based text formats, and I think this project has wonderful potential! I wish you guys the best and hope everything blooms delightfully!!
When tab completing on filenames, it adds a space. We probably need to specify that `filename` is a file and not a string for this to work properly.
The `.ps` parser currently uses `pstotext` to extract content from postscript files. This is a great starting point, but it might be nice to replicate the behavior of the `.pdf` parser and have a pure python fallback method to make textract usable across platforms. From a minute of googling around, it looks like others have started down this path:
I'm not sure if it makes more sense to roll our own or just use these other packages to extract text in the right way (I have a slight bias for this approach), but I thought I'd throw this issue together in case it inspires ideas or contributions from others.
Microsoft exchange email format is tricky. Related to #4
As far as I can see, a directory is not an option when choosing a file.
That would be a nice feature, something like:
This would be very helpful while translating for example web sites, and extracting all the strings to a resource file. We can discuss about it, if you think it makes sense.
Thanks
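A directory mode could be as simple as walking the tree and handing each supported file to the existing single-file path; a stdlib sketch (the extension set is an illustrative subset, and `process` stands in for the per-file entry point):

```python
import os

SUPPORTED = {".pdf", ".docx", ".odt", ".txt"}  # illustrative subset

def iter_supported_files(directory):
    """Yield every file under *directory* whose extension looks supported."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                yield os.path.join(root, name)

# for path in iter_supported_files("site/"):
#     text = process(path)  # hypothetical per-file entry point
```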
Suggestion for adding support for extension .ppt.
This currently uses the `unrtf` command line tool, but it would be nice to have a pure python extraction method as a fallback.
@pudo proposed this idea in #66 (comment) and I wanted to be sure to capture it before I forget.
With the way that the pdf parser currently works, you have to know beforehand whether the pdf is a scanned image or whether it has embedded text. This is inconvenient for end users. A better option would be:
textract some_pdf.pdf # try to extract embedded text first. if that fails, try OCR
textract -m tesseract scanned.pdf # do OCR
textract -m pdftotext embedded.pdf # do text extraction with pdftotext utility
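The fallback logic itself is small; a sketch where `extract_embedded` and `ocr` are hypothetical stand-ins for the pdftotext and tesseract methods:

```python
def extract_pdf_text(path, extract_embedded, ocr):
    """Try embedded-text extraction first and fall back to OCR only when
    it yields nothing, so users don't need to know the pdf type up front."""
    text = extract_embedded(path)
    if text and text.strip():
        return text
    return ocr(path)
```

Passing `-m` would simply bypass this and force one method.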
I am getting a "Command failed with exit code 127" message when I try to convert a PDF on my Mac OS X machine.