integrate with apache tika? about textract HOT 11 OPEN

deanmalmgren commented on May 18, 2024

integrate with apache tika?

from textract.

Comments (11)

w1kke commented on May 18, 2024

Does this solve the problem?

Using jnius: finally, I remembered a library I spotted once, called jnius, that should be made exactly for that purpose: using Java libraries from Python, without the need for wrappers, running the whole thing in a JVM, etc. In the end, I opted for doing it this way.
Setting up pyjnius

Setting things up was pretty straight-forward, as it was just a matter of:

pip install cython
pip install git+git://github.com/kivy/pyjnius.git
Then, I downloaded the tika-app jar, and put it somewhere.

From that point, using the library was a breeze:

# If you put the jar in a non-standard location, you need to
# prepare the CLASSPATH before importing jnius
import os
os.environ['CLASSPATH'] = "/path/to/tika-app.jar"

from jnius import autoclass

# Import the Java classes we are going to need
Tika = autoclass('org.apache.tika.Tika')
Metadata = autoclass('org.apache.tika.metadata.Metadata')
FileInputStream = autoclass('java.io.FileInputStream')

tika = Tika()
meta = Metadata()
# filename is the path of the document to parse
text = tika.parseToString(FileInputStream(filename), meta)
That's it! Now, you can just access the text transcript from text, and the file metadata is stored in meta (have a look at the .names() and .get(name) methods).
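The .names() and .get(name) methods mentioned above can be wrapped in a small helper to get the metadata as a plain Python dict. This is only a sketch: metadata_to_dict is a hypothetical name, and it assumes meta behaves like the Metadata object created in the snippet above (names() returns the field names, get(name) returns a single value).

```python
def metadata_to_dict(meta):
    """Collect every metadata field into a plain dict.

    Assumes `meta` behaves like org.apache.tika.metadata.Metadata:
    names() yields the field names and get(name) returns one value.
    """
    return {name: meta.get(name) for name in meta.names()}
```

After the parseToString call above, metadata_to_dict(meta) would give something that is easy to log or serialize.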

Integrating this with django and celery tasks was straightforward.

Of course, have a look at the Tika API Documentation for more information on the available methods, signatures, etc.

Taken from this source:
http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html

GomesNayagam commented on May 18, 2024

We run the text extraction process (using Tika) as a microservice (Thrift), and my Python client, or any other client, can talk to it with the Apache Thrift client. That way I keep the complexity out of the client, and at the same time the native JVM performance etc. is intact. It is my personal choice, of course.

deanmalmgren commented on May 18, 2024

@w1kke that sounds really promising! I like the idea of using pyjnius to get this to work. The java dependency for installation doesn't sound too terrible and I like how actively developed the pyjnius project is.

@chrismattmann sorry to rope you into this conversation, but do you have any thoughts on using pyjnius to autoclass Tika vs using your python bindings?

Ideally it would be great if we can keep the installation of textract as simple and reliable as possible, with one apt-get install command and one pip install textract command.

@GomesNayagam Can you elaborate a bit more on how Thrift works? Do you have to have a Thrift server process running in order to use Tika with python then? If so, I think I'd prefer to go with either pyjnius or the tika-python options as it makes using textract that much cleaner.

bitsgalore commented on May 18, 2024

For what it's worth, I did some quick experiments trying to get the Tika detector interface to work in Python using Pyjnius some months ago. See this repo:

https://github.com/bitsgalore/tikadetect

But then I ran into this:

bitsgalore/tikadetect#1

After which I gave up on it. Just thought I'd drop those links here, just in case it's useful!

chrismattmann commented on May 18, 2024

Hi @deanmalmgren thanks for roping me in! To answer your question, I'm open to using pyjnius if it's better than JCC. Right now I'm experimenting with just making Tika available from Python in the easiest manner possible. JCC has its issues mostly having to do with weird configuration that's kind of unknown to me at times. I modeled my tika-python wrapper after Aptivate Tika which to my knowledge about a year ago was really the only well defined effort to expose Tika to Python. It had been forked many times, adapted in various ways, etc.

I'm happy to explore other Python integration facilities, with the ultimate goal of getting this pushed upstream into Apache Tika. I'd like Python support to be a focus there since there are many folks working to improve Tika at Apache and since we maintain other bindings there (e.g., a basic .NET facility, even though there are other examples of such bindings, e.g., http://kevm.github.io/tikaondotnet/).

That said, in looking at textract, I'm wondering - how are its goals different than Apache Tika's? If they are the same, it would be good to join forces and simply make a great Python version of Tika and then provide that to the community.

Thoughts? Thanks for checking out tika-python. I'm currently using it on the DARPA XDATA project in concert with a simple ETL library (https://github.com/chrismattmann/etllib) and Apache OODT (http://oodt.apache.org/) and Apache Solr (http://lucene.apache.org/solr/).

GomesNayagam commented on May 18, 2024

@deanmalmgren Yes, we run the Tika jar as a separate process behind Thrift (a Java service) and use a Python Thrift client to access the functionality. Since our use case is online collaboration, we need a way to deal with this problem that scales. For your case you can go with your approach, or use a Python subprocess to run the jar in command-line mode and get the result.
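The subprocess route could look roughly like the sketch below. It assumes java is on the PATH and that a tika-app jar has been downloaded somewhere; the function names and the UTF-8 decoding are illustrative assumptions, while --text is the Tika App option for plain-text output.

```python
import subprocess


def build_tika_command(jar_path, filename, java="java"):
    """Build the command line that runs the Tika App in plain-text mode."""
    return [java, "-jar", jar_path, "--text", filename]


def extract_text_via_cli(jar_path, filename):
    """Run the Tika App in a fresh JVM and return whatever it prints.

    Assumption: the output is decoded as UTF-8, replacing undecodable
    bytes; the real output encoding depends on the JVM/platform settings.
    """
    result = subprocess.run(
        build_tika_command(jar_path, filename),
        capture_output=True,
        check=True,  # raise CalledProcessError if the JVM exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Each call starts a new JVM, so this is simple but slow when parsing many files.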

deanmalmgren commented on May 18, 2024

@bitsgalore thanks for sharing the links; that's great to know.

deanmalmgren commented on May 18, 2024

@chrismattmann Thanks for sharing your thoughts on how you developed the tika python bindings and the pros/cons of pyjnius vs JCC.

> That said, in looking at textract, I'm wondering - how are its goals different than Apache Tika's? If they are the same, it would be good to join forces and simply make a great Python version of Tika and then provide that to the community.

This is a great question and admittedly something I have been grappling with quite a bit since I was first made aware of Tika.

One key difference, at least as far as I understand Tika, is that Tika provides one and only one parser class for each document type. Textract, on the other hand, is parser method agnostic, meaning that we could have multiple ways of extracting content from the same document type. For example, you can currently parse PDFs either with the pdftotext command line utility or with the pdfminer python package, and this can be controlled with the --method command line argument or the method kwarg to textract.process. Sometimes there is a tradeoff between accuracy and performance, and I think it's important to give users flexibility when parsing content.
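To make the "method agnostic" idea concrete, here is a minimal sketch of dispatching on file extension plus a method name. It is not textract's actual implementation: PARSERS, register, and process are hypothetical names used only for illustration.

```python
import os

# Hypothetical registry: extension -> {method name -> callable(path) -> text}
PARSERS = {}


def register(extension, method):
    """Decorator that registers a parser function for an extension/method pair."""
    def decorator(func):
        PARSERS.setdefault(extension, {})[method] = func
        return func
    return decorator


def process(path, method=None):
    """Extract text from `path`, optionally forcing a specific method."""
    extension = os.path.splitext(path)[1].lower()
    methods = PARSERS.get(extension)
    if not methods:
        raise ValueError("no parser registered for %r" % extension)
    if method is None:
        # The first method registered for an extension acts as the default.
        method = next(iter(methods))
    if method not in methods:
        raise ValueError("unknown method %r for %r" % (method, extension))
    return methods[method](path)
```

With the real package, the equivalent choice is made through the method kwarg, e.g. textract.process(filename, method='pdfminer').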

In that vein, it seems natural to extend textract to have Tika support (either by JCC or pyjnius) because it is yet another way to extract text. Provided it's as easy to install textract as it is today (with one apt-get install command and one pip install command), this seems like a great addition.

I'm certainly not opposed to creating a "Tika for python" but think that realistically other tools beyond Tika are probably just as good, if not better, at extracting content. The intent here is to pull all those together in one easy to use way.

What do you think about this? Am I full of shit or is this a worthy goal?

chrismattmann commented on May 18, 2024

Hi @deanmalmgren thanks for your reply. My thoughts are below:

> One key difference, at least as far as I understand Tika, is that Tika provides one and only one parser class for each document type. Textract, on the other hand, is parser method agnostic meaning that we could have multiple ways of extracting content from the same document type. For example, you can currently either parse PDFs with the pdftotext command line utility or with the pdfminer python package and this can be controlled with the --method command line argument or the method kwarg to textract.process. Sometimes there is a tradeoff on accuracy vs performance and I think it's important to give users flexibility when parsing content.

Tika doesn't only provide one type of parser class for each document type. We have all sorts of ways to combine them (e.g., we have a CompositeParser construct that combines various underlying Parsers to form new and more powerful ones; we have FallbackParsers that try and parse first, and if unsuccessful, use an ordered List of Parsers to fall back on; we have ForkParsers which fork out new processes to control parsers, etc.) The mapping of Parsers to MIME types is also something that is a 1...N relationship (each parser declares its supported MIME types and they can be overlapping).
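The fallback idea described above (try one parser, and if it fails, move down an ordered list) can be sketched in a few lines of Python. This only illustrates the pattern; it is not Tika's Java API, and parse_with_fallback is a hypothetical name.

```python
def parse_with_fallback(path, parsers):
    """Try each parser callable in order and return the first successful result.

    `parsers` is an ordered list of callables taking a path and returning text.
    If every parser raises, a RuntimeError listing the failures is raised.
    """
    failures = []
    for parser in parsers:
        try:
            return parser(path)
        except Exception as exc:  # any failure means "try the next parser"
            name = getattr(parser, "__name__", repr(parser))
            failures.append("%s: %s" % (name, exc))
    raise RuntimeError("all parsers failed for %r: %s" % (path, "; ".join(failures)))
```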

As for other tools besides Tika being good or better at extracting content - that is entirely possible and I've seen it. Tika isn't meant to be the best parsing toolkit in the world - its goal is to find all those toolkits and to integrate them. So far the MIME registry we have is compliant with the 1200+ types defined in the IANA registry, and we pretty much have parsers for all of those different types, with more and more being added each day. I have funding from NASA, DARPA and the NSF and various other reimbursable efforts (e.g., with bioinformatics companies, etc.), and so we are working to add more and more parsers and support. We also have a healthy, active community of developers at Apache (I would say currently there are between 8-10 active developers working on Tika, not simply at NASA, but at various companies and agencies). I'm about to publish a blog post as well showing where we're taking Tika in some areas, especially concerning Machine Translation (once you identify language and you have the ability to parse text and metadata from it, why not unify the languages and translate the text, the metadata, etc., too?). We are actively working in that area now. Tika really does aim to be the "digital babel fish". Not sure if you saw, but the 1st and 4th chapters from the Tika in Action book are available; the 1st one gives the motivation and case for the "Digital Babel Fish": http://manning.com/mattmann/SampleChapter-1.pdf and has some more insight into my and the team's motivations over the years.

So, at the end of the day, I'm definitely biased and think that the goals for textract are actually quite overlapping with Tika. That doesn't mean you have to be swallowed into the Tika project and you may decide, nope, I'm going forward and doing my own thing. If you do, that's fine and Tika can be one of your dependencies since we want anyone to use it and permissively license it under the Apache License version 2. Heck, we don't even mind people competing with Tika and if you build a better library, more power to ya! But, consider this an invite to our team, since I think your goals and philosophies and your code (that you are developing) would be most welcomed in the Apache Tika project.

Cheers!

Gagravarr commented on May 18, 2024

If keeping it simple to install things is an objective, then Tika provides two "single jar" executables that you can run to do your parsing. One is the Tika App (tika-app.jar), which will require forking a new JVM each time, but provides a very simple way to feed Tika the file and get back text or metadata. The other is the Tika Server (tika-server.jar), which provides REST-ful services to do things like detection, plain text extraction, html extraction etc. (Tika Server has almost, but not quite, as many endpoints as the Tika App has options). If you start a Tika Server, there's a one-time startup cost, after which it's very quick to send over files and get back the parsed response.

Otherwise, the Tika App jar contains all of Tika, along with the CLI + GUI interfaces. To keep things simple for Java programmers, we provide the OSGi bundle. For Python users, there's something to be said for grabbing the Tika App jar, adding that single jar to your classpath, then calling the normal Tika methods from within that (skipping the CLI classes). That would make it very simple for you to add Tika in, without the need to play with Maven (which while good, is quite a lot of work to use for a non-Java project)
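For the Tika Server option mentioned above, a Python client needs nothing beyond the standard library. The sketch below assumes a server is already running on its default port 9998 and uses the /tika endpoint with a PUT request; the function names and the UTF-8 decoding of the response are assumptions for illustration.

```python
import urllib.request


def build_tika_request(data, server="http://localhost:9998"):
    """Build a PUT request that asks Tika Server for the plain text of `data`."""
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text_via_server(path, server="http://localhost:9998"):
    """Send the file at `path` to a running Tika Server and return its text."""
    with open(path, "rb") as handle:
        data = handle.read()
    with urllib.request.urlopen(build_tika_request(data, server)) as response:
        # Assumption: the plain-text response is UTF-8 encoded.
        return response.read().decode("utf-8")
```

Compared with forking the Tika App per file, the JVM startup cost is paid once when the server starts.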

deanmalmgren commented on May 18, 2024

Thanks everybody for your thoughts on this. I'm still not exactly sure what makes sense here (hoping for an epiphany after a little time to consider the options), but in e6cf734 I started a related projects portion of the documentation so that we can list this (and other) packages that have similar goals.
