behemoth's People

Contributors

cklaussner, gsingers, jnioche, kiranchitturi, lewismc, mumrah, smarthi


behemoth's Issues

Language Identification

Solr has a nice language detection module that is pluggable with detectors from Tika, IBM, etc. It would be nice if we could hook this into the Hadoop side of things.
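
A minimal sketch of what such a hook could look like, using Tika's LanguageIdentifier; the processor wiring and the BehemothDocument accessors are assumptions here:

    import org.apache.hadoop.io.Text;
    import org.apache.tika.language.LanguageIdentifier;
    import com.digitalpebble.behemoth.BehemothDocument;

    // Hypothetical processor: detect the language of a document's text
    // and store it in the document metadata.
    public class LanguageIdProcessor {
        public BehemothDocument process(BehemothDocument doc) {
            String text = doc.getText();
            if (text != null && text.length() > 0) {
                LanguageIdentifier identifier = new LanguageIdentifier(text);
                // only keep the guess when Tika is reasonably confident
                if (identifier.isReasonablyCertain())
                    doc.getMetadata().put(new Text("lang"),
                                          new Text(identifier.getLanguage()));
            }
            return doc;
        }
    }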

Options to replace input with output of job

The jobs currently generate a new seqfile. It would be great to have a '-r input' option to replace the input with the output if the job is successful, while keeping '-i input -o output' for the current behaviour.
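
A sketch of the proposed '-r' semantics; the job runner call and the argument handling are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Run the job against a temporary path, then swap it over the input
    // only if the job completed successfully.
    Path input = new Path(inputArg);                  // value of '-r input'
    Path tmpOut = new Path(input.getParent(), input.getName() + ".tmp");
    boolean success = runJob(input, tmpOut);          // hypothetical job runner
    if (success) {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.delete(input, true);                       // drop the original seqfile
        fs.rename(tmpOut, input);                     // promote the new output
    }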

Output to LucidWorks 2.1

Hey Julien,

Would you be open to a patch that makes Behemoth work with LucidWorks Enterprise? It's a standalone module (you can see it on my fork under the LWE branch) and it only requires Solr dependencies. In other words it's all open source; the library I use is just the Solr one that ships with LucidWorks. It also pretty much shows how Behemoth should be updated for Solr 4.

The reason I ask is that I'm tired of having to merge.

Thanks,
Grant

Use regular expressions for annotation and type filters

The UIMA and GATE annotation and type filters are configured using strings; by default, if nothing is specified by the user, no annotations are produced in the output. It would be better to use regular expressions instead and, by default, allow any type and feature to be added to the Behemoth document.
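
A sketch of what a regex-based filter could look like; the configuration key is hypothetical, and the default ".*" keeps everything:

    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;

    public class AnnotationTypeFilter {
        private final Pattern typePattern;

        public AnnotationTypeFilter(Configuration conf) {
            // hypothetical key; default to matching any annotation type
            typePattern = Pattern.compile(
                    conf.get("behemoth.annotations.filter", ".*"));
        }

        // keep an annotation if its type matches the configured pattern
        public boolean keep(String annotationType) {
            return typePattern.matcher(annotationType).matches();
        }
    }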

CorpusGenerator never invokes document.setText

I was trying to use the UIMA module with a text corpus without success (stuff that should have been annotated wasn't). After some inspection of the source code, I found that the problem was that CorpusGenerator never calls setText and uses setContent exclusively. The catch is that UIMAMapper silently discards documents with null text (unless you adjust the log levels so debug messages are emitted).

I have... fixed? this problem by adding two new options to CorpusGenerator:

  1. mode: text or binary. Text mode uses setText; binary mode calls setContent.
  2. charset: defaults to UTF-8. Used to convert the bytes read from the input into text.

I don't know if I'm using Behemoth the wrong way or if this is a genuine issue. If the latter, you might be interested in a pull request.
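
A minimal sketch of what the fix looks like inside the CorpusGenerator read loop; the surrounding variables are illustrative:

    import java.nio.charset.Charset;

    // content holds the raw bytes read for one input file,
    // doc is the BehemothDocument being populated
    if ("text".equals(mode)) {
        // decode the bytes so that UIMAMapper sees a non-null text field
        doc.setText(new String(content, Charset.forName(charsetName)));
    } else {
        doc.setContent(content);   // previous behaviour: binary only
    }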

Versioning BehemothDocument

Currently BehemothDocument does not contain a version number, which will make it difficult to maintain compatibility between versions if fields are added in the future.

Also consider the possibility of using Avro for serialization / deserialization.
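
A sketch of how a version byte could be threaded through the Writable methods; the field layout here is illustrative, not BehemothDocument's actual one:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class VersionedDocument implements Writable {
        private static final byte CURRENT_VERSION = 1;
        private String url;

        public void write(DataOutput out) throws IOException {
            out.writeByte(CURRENT_VERSION);   // version is always written first
            out.writeUTF(url);
        }

        public void readFields(DataInput in) throws IOException {
            byte version = in.readByte();
            if (version > CURRENT_VERSION)
                throw new IOException("unknown version: " + version);
            url = in.readUTF();
            // fields added in later versions would be guarded by
            // 'if (version >= N)' checks here
        }
    }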

Tests can't be run by more than one person

The tests create files like

/tmp/sfcmt/.foo.crc

These files are owned by the first person to run the tests, so they cannot be recreated or modified by the next person who runs them.

Suggested change, write to

/tmp/[username]/sfcmt/.foo.crc

(I suppose I should create a patch for something so trivial)
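
A sketch of the suggested change:

    import org.apache.hadoop.fs.Path;

    // build a per-user scratch directory so that two users' test runs
    // don't collide on the same /tmp files
    String user = System.getProperty("user.name");
    Path testDir = new Path("/tmp/" + user + "/sfcmt");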

ClassNotFoundException org.apache.mahout.math.Vector

Hello,

I tried out Behemoth and walked through the tutorial successfully until I wanted to create the vectors for Mahout.

When running the SparseVectorsFromBehemoth command, the Hadoop job fails after several ClassNotFoundExceptions.

The missing classes are "org.apache.mahout.math.Vector" and "org.apache.mahout.math.function.ObjectLongProcedure".

I am using hadoop 1.0.0.

I hope this isn't the wrong place for my request; I didn't find any information about a mailing list or another place for support :-)

Upgrade to Mahout 0.10.0

As suggested by @jnioche, an upgrade to Mahout 0.10 would be nice.
FYI, Mahout has implemented profiles for Hadoop 1.2.1 and 2.x as of Mahout 0.10, so this is certainly possible without polluting too much of the behemoth-mahout module.
I'll get to work on this when I can.

Ingest times with CorpusGenerator

I am seeing quite slow ingest times with CorpusGenerator - around an hour for the Enron dataset. Possible improvements to consider:

  • Perform the ingest as a map task.
  • Load directly from .zip, .tar or .tgz files, as Forqlift does (see the sketch after this list).
  • Stage the SequenceFile on local disk before loading it into HDFS.
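
A rough sketch of the zip-based loading, writing raw bytes to a SequenceFile in one pass; in Behemoth the value would be a BehemothDocument rather than a BytesWritable:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ZipIngest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // one SequenceFile for the whole archive instead of one
            // HDFS round trip per input file
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    FileSystem.get(conf), conf, new Path(args[1]),
                    Text.class, BytesWritable.class);
            ZipInputStream zip = new ZipInputStream(new FileInputStream(args[0]));
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                for (int n; (n = zip.read(chunk)) != -1; )
                    buffer.write(chunk, 0, n);
                writer.append(new Text(entry.getName()),
                              new BytesWritable(buffer.toByteArray()));
            }
            writer.close();
            zip.close();
        }
    }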

Upgrade to Mahout 0.9

A current limitation of the Mahout module is that it requires Hadoop 0.20.203 to be installed; this is a Mahout-specific dependency.
I propose upgrading to Mahout 0.9, as this will enable the use of Hadoop 1.2.1, in line with Behemoth's current Hadoop dependency.

Classloader problems with job files that include behemoth.core.jar

This is probably a Hadoop class loading issue rather than a problem with Behemoth, but worth being aware of ...

For both versions of Behemoth

bin/hadoop fs -libjars behemoth-core.job -text test/corpus

works fine but

bin/hadoop fs -libjars behemoth-gate.job -text test/corpus

gives

 INFO util.NativeCodeLoader: Loaded the native-hadoop library
 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
 INFO compress.CodecPool: Got brand-new decompressor
 text: java.io.IOException: WritableName can't load class: com.digitalpebble.behemoth.BehemothDocument

Use warc-hadoop library

https://github.com/ept/warc-hadoop could be used as a dependency for handling the WARC format in Hadoop. This would be cleaner than keeping a copy of the lemurproject classes as we currently do.
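
A sketch of what the job setup could look like; the package and class names are taken from the warc-hadoop README and should be double-checked against the library:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import com.martinkl.warc.mapreduce.WARCInputFormat;

    // read WARC records via the library instead of the bundled
    // lemurproject classes; each map() call then receives one
    // com.martinkl.warc.WARCWritable holding a record
    Configuration conf = new Configuration();
    Job job = new Job(conf, "warc import");
    job.setInputFormatClass(WARCInputFormat.class);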

Refactor Tika as a DocumentProcessor

Tika is currently called implicitly by the component that converts files into BehemothDocuments and by the GATE annotator. It would be better to deploy it as a DocumentProcessor instead, so that it can be used on its own and in an explicit manner.
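
A rough sketch of an explicit Tika step; the exact DocumentProcessor signature in Behemoth is assumed here:

    import java.io.ByteArrayInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import com.digitalpebble.behemoth.BehemothDocument;

    public class TikaProcessor {
        private final AutoDetectParser parser = new AutoDetectParser();

        // parse the raw content and store the extracted text back on
        // the document, instead of doing this implicitly elsewhere
        public BehemothDocument process(BehemothDocument doc) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            parser.parse(new ByteArrayInputStream(doc.getContent()),
                         handler, new Metadata(), new ParseContext());
            doc.setText(handler.toString());
            return doc;
        }
    }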

UIMAMapper to use UIMAProcessor

This hasn't been finished yet and the code is duplicated. The refactoring will also make it possible to reuse the JVM and avoid re-instantiating the UIMA pipeline every time.

Provide Interface Packaging

It would be great to have a small packaging jar that contains just the things necessary for downstream use. Initially it is probably just the BehemothDocument definition, but it might include other things. The thinking is that if I can share the structure of what's in a given SequenceFile, then downstream users need not pull in all the other pieces/dependencies. For instance, in my case I want to process rich docs using Tika, output BehemothDocuments, and then convert them to Mahout vectors. If there is a small piece that gives Mahout the appropriate definitions, I can stay pretty well decoupled from Behemoth while still using it effectively.

Exception when calling DistributedCache.purgeCache(job) in GATEDriver.java

This may affect UIMADriver.java and SOLRIndexerJob.java as they call purgeCache(job) as well

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.filecache.DistributedCache.purgeCache(Lorg/apache/hadoop/conf/Configuration;)V
at com.digitalpebble.behemoth.gate.GATEDriver.run(GATEDriver.java:112)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.digitalpebble.behemoth.gate.GATEDriver.main(GATEDriver.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Unnecessary jars being included in .job files

Various unnecessary jars were being incorporated into all the .job files, e.g.

hsqldb, kfs, jets3t, jsp-2.1, jsp-api-2.1, jasper-compiler, jasper-runtime, jetty-util, ant

These are Hadoop transitive dependencies that were being put in every job file; this can be avoided by using the transitive="false" option. Mahout was also pulling in unnecessary dependencies, such as the Eclipse IDE core, which can be avoided by excluding specific transitive dependencies.

However, removing these transitive dependencies caused UIMA to break, because it does not declare runtime dependencies on commons-logging and junit even though it requires them. These dependencies therefore have to be explicitly added in the ivy.xml file of the behemoth/uima module.

In Maven it is possible to distinguish between build and test dependencies using scopes; in Ivy it should be possible to do the same using configurations. However, GATE declares some dependencies as optional, such as XStream or gate-compiler-jdt, even though Behemoth actually requires them. Ivy does not include optional dependencies when configurations are used unless this is specifically overridden; this problem was not apparent without configurations. It can be overcome by explicitly including the optional transitive dependencies.

After this work, the job files were significantly smaller:

Job file              Before (bytes)   After (bytes)
behemoth-core.job          9,032,525        786,483
behemoth-gate.job         36,641,044     25,533,380
behemoth-io.job            9,753,364      1,508,750
behemoth-mahout.job       16,745,380      8,254,708
behemoth-solr.job          9,977,776      2,065,084
behemoth-tika.job         29,927,363     21,936,385
behemoth-uima.job         11,195,683      3,214,421

However, there is a danger that some jars have been excluded which, although not explicitly declared as required transitive dependencies, are actually needed. script.sh was a useful resource for identifying these problems, but at the moment it does not test the io/nutch, solr or mahout modules.

Tika Components

I've got some components for a MapReduce job for dealing with rich documents and converting them to Behemoth docs.

Mahout : add Lucene Tokenisation

The Lucene tokenisation has been replaced with annotation types/values taken from the Behemoth docs. It would be good to add the Lucene tokenisation back, as in the original Mahout class, so that users who need Behemoth mostly for converting from Nutch or parsing with Tika don't have to use the GATE or UIMA modules just to get tokens.
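
A minimal sketch of that kind of Lucene tokenisation; the analyzer choice is illustrative and its constructor varies between Lucene versions:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneTokenize {
        public static void main(String[] args) throws Exception {
            String text = "Behemoth makes Hadoop document processing easier";
            StandardAnalyzer analyzer = new StandardAnalyzer();
            TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken())
                System.out.println(term.toString()); // tokens for the Mahout vectorizer
            stream.end();
            stream.close();
        }
    }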

Elasticsearch module

Hi @jnioche I'm working on an ES search module as part of using Behemoth in an ongoing project. I'll send you a PR ASAP.

Unable to Index Tika file to Solr using behemoth

/hadoop/bin$ hadoop jar /usr/local/behemoth/solr/target/behemoth-solr-1.0-SNAPSHOT-job.jar com.digitalpebble.behemoth.solr.SOLRIndexerJob shake-behemoth-tika collection1 localhost:9983
Warning: $HADOOP_HOME is deprecated.

13/06/19 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 2
13/06/19 10:26:07 INFO mapred.JobClient: Running job: job_201306190820_0004
13/06/19 10:26:08 INFO mapred.JobClient: map 0% reduce 0%
13/06/19 10:26:22 INFO mapred.JobClient: Task Id : attempt_201306190820_0004_m_000000_0, Status : FAILED
java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
at org.apache.solr.common.cloud.ClusterState.load(ClusterState.java:297)
at org.apache.solr.common.cloud.ClusterState.load(ClusterState.java:270)
at org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:274)
at org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:142)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:165)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:122)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:107)
at com.digitalpebble.behemoth.solr.SOLRWriter.write(SOLRWriter.java:88)
at com.digitalpebble.behemoth.solr.SOLROutputFormat$1.write(SOLROutputFormat.java:48)
at com.digitalpebble.behemoth.solr.SOLROutputFormat$1.write(SOLROutputFormat.java:40)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:847)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:38)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
