behemoth's People

Contributors

cklaussner, gsingers, jnioche, kiranchitturi, lewismc, mumrah, smarthi


behemoth's Issues

Language Identification

Solr has a nice language detection module that is pluggable with detectors from Tika, IBM, etc. It would be nice if we could hook this into the Hadoop side of things.
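
A minimal sketch of what such a hook could look like, using Tika's LanguageIdentifier; the processor wiring and the BehemothDocument accessors are assumptions here:

    import org.apache.hadoop.io.Text;
    import org.apache.tika.language.LanguageIdentifier;
    import com.digitalpebble.behemoth.BehemothDocument;

    // Hypothetical processor: detect the language of a document's text
    // and store it in the document metadata.
    public class LanguageIdProcessor {
        public BehemothDocument process(BehemothDocument doc) {
            String text = doc.getText();
            if (text != null && text.length() > 0) {
                LanguageIdentifier identifier = new LanguageIdentifier(text);
                // only keep the guess when Tika is reasonably confident
                if (identifier.isReasonablyCertain())
                    doc.getMetadata().put(new Text("lang"),
                                          new Text(identifier.getLanguage()));
            }
            return doc;
        }
    }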

Options to replace input with output of job

The jobs currently generate a new seqfile. It would be great to have a '-r input' option to replace the input with the output if the job is successful, while keeping '-i input -o output' for the current behaviour.
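
A sketch of the proposed '-r' semantics; the job runner call and the argument handling are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Run the job against a temporary path, then swap it over the input
    // only if the job completed successfully.
    Path input = new Path(inputArg);                  // value of '-r input'
    Path tmpOut = new Path(input.getParent(), input.getName() + ".tmp");
    boolean success = runJob(input, tmpOut);          // hypothetical job runner
    if (success) {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.delete(input, true);                       // drop the original seqfile
        fs.rename(tmpOut, input);                     // promote the new output
    }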

Output to LucidWorks 2.1

Hey Julien,

Would you be open to a patch that makes Behemoth work with LucidWorks Enterprise? It's a standalone module (you can see it on my fork under the LWE branch) and it only requires Solr dependencies. In other words it's all open source; the library I use is just the Solr one that ships with LucidWorks. It also pretty much shows how Behemoth should be updated for Solr 4.

The reason I ask is that I'm tired of having to merge.

Thanks,
Grant

Use regular expressions for annotation and type filters

The UIMA and GATE annotation and type filters are configured using strings; by default, if nothing is specified by the user, no annotations are produced in the output. It would be better to use regular expressions instead and, by default, allow any type and feature to be added to the Behemoth document.
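
A sketch of what a regex-based filter could look like; the configuration key is hypothetical, and the default ".*" keeps everything:

    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;

    public class AnnotationTypeFilter {
        private final Pattern typePattern;

        public AnnotationTypeFilter(Configuration conf) {
            // hypothetical key; default to matching any annotation type
            typePattern = Pattern.compile(
                    conf.get("behemoth.annotations.filter", ".*"));
        }

        // keep an annotation if its type matches the configured pattern
        public boolean keep(String annotationType) {
            return typePattern.matcher(annotationType).matches();
        }
    }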

CorpusGenerator never invokes document.setText

I was trying to use the UIMA module with a text corpus without success (stuff that should have been annotated wasn't). After some inspection of the source code, I found that the problem was that CorpusGenerator never calls setText and uses setContent exclusively. The catch is that UIMAMapper silently discards documents with null text (unless you adjust the log levels so debug messages are emitted).

I have... fixed? this problem by adding two new options to CorpusGenerator:

  1. mode: text or binary. Text mode uses setText; binary mode calls setContent.
  2. charset: defaults to UTF-8. Used to convert the bytes read from the input into text.

I don't know if I'm using Behemoth the wrong way or if this is a genuine issue. If the latter, you might be interested in a pull request.
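
A minimal sketch of what the fix looks like inside the CorpusGenerator read loop; the surrounding variables are illustrative:

    import java.nio.charset.Charset;

    // content holds the raw bytes read for one input file,
    // doc is the BehemothDocument being populated
    if ("text".equals(mode)) {
        // decode the bytes so that UIMAMapper sees a non-null text field
        doc.setText(new String(content, Charset.forName(charsetName)));
    } else {
        doc.setContent(content);   // previous behaviour: binary only
    }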

Versioning BehemothDocument

Currently BehemothDocument does not contain a version number, which will make it difficult to maintain compatibility between versions if fields are added in the future.

Also consider the possibility of using Avro for serialization / deserialization.
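
A sketch of how a version byte could be threaded through the Writable methods; the field layout here is illustrative, not BehemothDocument's actual one:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class VersionedDocument implements Writable {
        private static final byte CURRENT_VERSION = 1;
        private String url;

        public void write(DataOutput out) throws IOException {
            out.writeByte(CURRENT_VERSION);   // version is always written first
            out.writeUTF(url);
        }

        public void readFields(DataInput in) throws IOException {
            byte version = in.readByte();
            if (version > CURRENT_VERSION)
                throw new IOException("unknown version: " + version);
            url = in.readUTF();
            // fields added in later versions would be guarded by
            // 'if (version >= N)' checks here
        }
    }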

Tests can't be run by more than one person

The tests create files like

/tmp/sfcmt/.foo.crc

These files are owned by the first person to run the tests, so they cannot be recreated or modified by the next person who runs them.

Suggested change, write to

/tmp/[username]/sfcmt/.foo.crc

(I suppose I should create a patch for something so trivial)
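
A sketch of the suggested change:

    import org.apache.hadoop.fs.Path;

    // build a per-user scratch directory so that two users' test runs
    // don't collide on the same /tmp files
    String user = System.getProperty("user.name");
    Path testDir = new Path("/tmp/" + user + "/sfcmt");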

ClassNotFoundException org.apache.mahout.math.Vector

Hello,

I tried out Behemoth and walked through the tutorial successfully until I wanted to create the vectors for Mahout.

When running the SparseVectorsFromBehemoth command, the Hadoop job fails after several ClassNotFoundExceptions.

The missing classes are "org.apache.mahout.math.Vector" and "org.apache.mahout.math.function.ObjectLongProcedure".

I am using hadoop 1.0.0.

I hope this isn't the wrong place for my request; I didn't find any information about a mailing list or another place for support :-)

Upgrade to Mahout 0.10.0

As suggested by @jnioche, an upgrade to Mahout 0.10 would be nice.
FYI, Mahout has implemented profiles for Hadoop 1.2.1 and 2.x as of Mahout 0.10, so this is certainly possible without polluting too much of the behemoth-mahout module.
I'll get to work on this when I can.

Ingest times with CorpusGenerator

I am seeing quite slow ingest times with CorpusGenerator - around an hour for the Enron dataset. Possible improvements to consider:

  • Perform the ingest as a map task.
  • Load directly from .zip, .tar or .tgz files, as Forqlift does (see the sketch after this list).
  • Stage the SequenceFile on local disk before loading it into HDFS.
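
A rough sketch of the zip-based loading, writing raw bytes to a SequenceFile in one pass; in Behemoth the value would be a BehemothDocument rather than a BytesWritable:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ZipIngest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // one SequenceFile for the whole archive instead of one
            // HDFS round trip per input file
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    FileSystem.get(conf), conf, new Path(args[1]),
                    Text.class, BytesWritable.class);
            ZipInputStream zip = new ZipInputStream(new FileInputStream(args[0]));
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                for (int n; (n = zip.read(chunk)) != -1; )
                    buffer.write(chunk, 0, n);
                writer.append(new Text(entry.getName()),
                              new BytesWritable(buffer.toByteArray()));
            }
            writer.close();
            zip.close();
        }
    }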

Upgrade to Mahout 0.9

A current limitation of the Mahout module is that it requires Hadoop 0.20.203 to be installed; this is a Mahout-specific dependency.
I propose upgrading to Mahout 0.9, as this will enable the use of Hadoop 1.2.1, in line with Behemoth's current Hadoop dependency.

Classloader problems with job files that include behemoth.core.jar

This is probably a Hadoop class loading issue rather than a problem with Behemoth, but worth being aware of ...

For both versions of Behemoth

bin/hadoop fs -libjars behemoth-core.job -text test/corpus

works fine but

bin/hadoop fs -libjars behemoth-gate.job -text test/corpus

gives

 INFO util.NativeCodeLoader: Loaded the native-hadoop library
 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
 INFO compress.CodecPool: Got brand-new decompressor
 text: java.io.IOException: WritableName can't load class: com.digitalpebble.behemoth.BehemothDocument

Use warc-hadoop library

https://github.com/ept/warc-hadoop could be used as a dependency for handling the WARC format in Hadoop. This would be cleaner than keeping a copy of the lemurproject classes as we currently do.
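
A sketch of what the job setup could look like; the package and class names are taken from the warc-hadoop README and should be double-checked against the library:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import com.martinkl.warc.mapreduce.WARCInputFormat;

    // read WARC records via the library instead of the bundled
    // lemurproject classes; each map() call then receives one
    // com.martinkl.warc.WARCWritable holding a record
    Configuration conf = new Configuration();
    Job job = new Job(conf, "warc import");
    job.setInputFormatClass(WARCInputFormat.class);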

Refactor Tika as a DocumentProcessor

Tika is currently called implicitly by the component that converts files into BehemothDocuments and by the GATE annotator. It would be better to deploy it as a DocumentProcessor instead, so that it can be used on its own and in an explicit manner.
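
A rough sketch of an explicit Tika step; the exact DocumentProcessor signature in Behemoth is assumed here:

    import java.io.ByteArrayInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import com.digitalpebble.behemoth.BehemothDocument;

    public class TikaProcessor {
        private final AutoDetectParser parser = new AutoDetectParser();

        // parse the raw content and store the extracted text back on
        // the document, instead of doing this implicitly elsewhere
        public BehemothDocument process(BehemothDocument doc) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            parser.parse(new ByteArrayInputStream(doc.getContent()),
                         handler, new Metadata(), new ParseContext());
            doc.setText(handler.toString());
            return doc;
        }
    }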

UIMAMapper to use UIMAProcessor

This hasn't been finished yet and the code is duplicated. The refactoring will also make it possible to reuse the JVM and avoid re-instantiating the UIMA pipeline every time.

Provide Interface Packaging

It would be great to have a small packaging jar that contains just the things necessary for downstream use. Initially it is probably just the BehemothDocument definition, but it might include other things. The thinking is that if I can share the structure of what's in a given SequenceFile, then downstream users need not pull in all the other pieces/dependencies. For instance, in my case I want to process rich docs using Tika, output BehemothDocuments, and then convert them to Mahout vectors. If there is a small piece that gives Mahout the appropriate definitions, I can stay pretty well decoupled from Behemoth while still using it effectively.

Exception when calling DistributedCache.purgeCache(job) in GATEDriver.java

This may affect UIMADriver.java and SOLRIndexerJob.java as they call purgeCache(job) as well

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.filecache.DistributedCache.purgeCache(Lorg/apache/hadoop/conf/Configuration;)V
at com.digitalpebble.behemoth.gate.GATEDriver.run(GATEDriver.java:112)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.digitalpebble.behemoth.gate.GATEDriver.main(GATEDriver.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Unnecessary jars being included in .job files

Various unnecessary jars were being incorporated into all the .job files, e.g.

hsqldb, kfs, jets3t, jsp-2.1, jsp-api-2.1, jasper-compiler, jasper-runtime, jetty-util, ant

These are Hadoop transitive dependencies that were being put in every job file; this can be avoided by using the transitive="false" option. Mahout was also pulling in unnecessary dependencies, such as the Eclipse IDE core, which can be avoided by excluding specific transitive dependencies.

However, removing these transitive dependencies caused UIMA to break, because it does not declare runtime dependencies on commons-logging and junit even though it requires them. These dependencies therefore have to be explicitly added in the ivy.xml file of the behemoth/uima module.

In Maven it is possible to distinguish between build and test dependencies using scopes; in Ivy it should be possible to do the same using configurations. However, GATE declares some dependencies as optional, such as XStream or gate-compiler-jdt, even though Behemoth actually requires them. Ivy does not include optional dependencies when configurations are used unless this is specifically overridden; this problem was not apparent without configurations. It can be overcome by explicitly including the optional transitive dependencies.

After this work, the job files were significantly smaller:

Job file              Before (bytes)   After (bytes)
behemoth-core.job          9,032,525        786,483
behemoth-gate.job         36,641,044     25,533,380
behemoth-io.job            9,753,364      1,508,750
behemoth-mahout.job       16,745,380      8,254,708
behemoth-solr.job          9,977,776      2,065,084
behemoth-tika.job         29,927,363     21,936,385
behemoth-uima.job         11,195,683      3,214,421

However, there is a danger that some jars have been excluded which, although not explicitly declared as required transitive dependencies, are actually needed. script.sh was a useful resource for identifying these problems, but at the moment it does not test the io/nutch, solr or mahout modules.

Tika Components

I've got some components for a MapReduce job for dealing with rich documents and converting them to Behemoth docs.

Mahout : add Lucene Tokenisation

The Lucene tokenisation has been replaced with annotation types/values taken from the Behemoth docs. It would be good to add the Lucene tokenisation back, as in the original Mahout class, so that users who need Behemoth mostly for converting from Nutch or parsing with Tika don't have to use the GATE or UIMA modules just to get tokens.
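
A minimal sketch of that kind of Lucene tokenisation; the analyzer choice is illustrative and its constructor varies between Lucene versions:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneTokenize {
        public static void main(String[] args) throws Exception {
            String text = "Behemoth makes Hadoop document processing easier";
            StandardAnalyzer analyzer = new StandardAnalyzer();
            TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken())
                System.out.println(term.toString()); // tokens for the Mahout vectorizer
            stream.end();
            stream.close();
        }
    }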

Elasticsearch module

Hi @jnioche I'm working on an ES search module as part of using Behemoth in an ongoing project. I'll send you a PR ASAP.

Unable to Index Tika file to Solr using behemoth

/hadoop/bin$ hadoop jar /usr/local/behemoth/solr/target/behemoth-solr-1.0-SNAPSHOT-job.jar com.digitalpebble.behemoth.solr.SOLRIndexerJob shake-behemoth-tika collection1 localhost:9983
Warning: $HADOOP_HOME is deprecated.

13/06/19 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 2
13/06/19 10:26:07 INFO mapred.JobClient: Running job: job_201306190820_0004
13/06/19 10:26:08 INFO mapred.JobClient: map 0% reduce 0%
13/06/19 10:26:22 INFO mapred.JobClient: Task Id : attempt_201306190820_0004_m_000000_0, Status : FAILED
java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
at org.apache.solr.common.cloud.ClusterState.load(ClusterState.java:297)
at org.apache.solr.common.cloud.ClusterState.load(ClusterState.java:270)
at org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:274)
at org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:142)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:165)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:122)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:107)
at com.digitalpebble.behemoth.solr.SOLRWriter.write(SOLRWriter.java:88)
at com.digitalpebble.behemoth.solr.SOLROutputFormat$1.write(SOLROutputFormat.java:48)
at com.digitalpebble.behemoth.solr.SOLROutputFormat$1.write(SOLROutputFormat.java:40)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:847)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:38)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
