extract's Introduction

Extract

A cross-platform command line tool for parallelized, distributed content-extraction. Built on top of Apache Tika and an essential part of the engineering behind the Panama Papers, Swiss Leaks and Luxembourg Leaks investigations.

It supports Redis-backed queueing for distributed, parallel extraction and will write to Solr, plain text files or standard output.

For guidance and instructions, please see the wiki.

Credits and Collaboration

Initially developed by Matthew Caruana Galizia at ICIJ.

We welcome contributions! Please submit pull requests or contact us directly.

License

Copyright (c) 2018 International Consortium of Investigative Journalists. See LICENSE.

extract's People

Contributors

bamthomas, dependabot[bot], grenwi, julm, mattcg, mvanzalu, pirhoo, stephengrey

extract's Issues

Build error - log attached

I tried to build but got stuck in a test. That is not so severe in itself, I think, but I can't find any built application, so it is 'that' severe.

log (mvn install -e -X > log 2>&1) attached.

log.txt

Building extract on Ubuntu

The authors have not documented the steps or specifications needed for a successful build, and there is no guidance in this area.

  1. Ubuntu server https://www.ubuntu.com/download/server - Ubuntu 18.10

  2. download ISO - http://releases.ubuntu.com/18.10/ubuntu-18.10-live-server-amd64.iso

  3. VMware Workstation 14 Pro

  4. install Ubuntu / login as user / check current dir


environment details and pre-installation commands


ls
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer
javac -version
sudo apt install oracle-java8-set-default
mvn
apt-cache search maven
sudo apt-get install maven
mvn
mvn -version
sudo apt update
sudo apt install tesseract-ocr
git
ls
git clone https://github.com/ICIJ/extract
ls
cd extract/
ls


NOTE: open the pom.xml in the extract folder in a text editor and modify as shown below


<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-gpg-plugin</artifactId>
  <version>1.5</version>
  <executions>
    <execution>
      <id>sign-artifacts</id>
      <phase>verify</phase>
      <goals>
        <goal>sign</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <skip>true</skip>
  </configuration>
</plugin>

NOTE 2: go to the directory /home/userx/extract/extract-cli/, open the pom.xml file, and add this dependency:

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-simple</artifactId>
  <version>1.7.25</version>
</dependency>

Then download slf4j-simple-1.7.25.jar and put it in the /home/user/.m2/repository/org/slf4j/slf4j-api/1.7.25/ folder.

mvn install -DskipTests -Dgpg.skip
OR
mvn package -DskipTests -Dgpg.skip

echo 'export JAVA_OPTS="-Xms512m -Xmx1024m"' >> ~/.bashrc
source ~/.bashrc
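
A note on quoting in the echo step: when a double-quoted string contains further double quotes, the shell ends the string at the first inner quote, so the line that reaches ~/.bashrc loses its quotes and JAVA_OPTS would only receive the first option. The sketch below compares the two forms. It writes to a temporary file instead of ~/.bashrc, which is an assumption made here so that it can be run safely.

```shell
#!/usr/bin/env bash
# Compare nested double quotes with single-quote wrapping.
# A temporary file stands in for ~/.bashrc so nothing real is modified.
rc_file="$(mktemp)"

# Nested double quotes: the inner quotes close and reopen the string,
# so they never reach the file.
echo "export JAVA_OPTS="-Xms512m -Xmx1024m"" > "$rc_file"
nested_line="$(cat "$rc_file")"

# Single quotes on the outside keep the inner double quotes intact.
echo 'export JAVA_OPTS="-Xms512m -Xmx1024m"' > "$rc_file"
single_line="$(cat "$rc_file")"

printf 'nested: %s\n' "$nested_line"
printf 'single: %s\n' "$single_line"
rm -f "$rc_file"
```

Only the single-quoted form writes a line that sets both -Xms and -Xmx when the file is sourced.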

cd /home/userx/extract/extract-cli/

sudo apt-get install libxtst6:i386
sudo apt-get update
sudo apt-get install libxtst6
sudo updatedb
locate libXtst
sudo apt install libxext6
sudo apt-get install libxrender1 libxtst6 libxi6
java -jar extract-cli.jar

result

usage: extract [command] [options]
usage: extract help
usage: extract version

A cross-platform tool for distributed content-extraction by the data team
at the International Consortium of Investigative Journalists.

Commands

load-report
rollback
wipe-report
spew-dump
clean-report
view-report
inspect-dump
commit
load-queue
rehash
wipe-queue
delete
version
help
dump-queue
spew
copy
tag
queue
dump-report

Additional Image Formats

 jpg
 bmp
 gif
 wbmp
 png
 jpeg
 jbig2


Extract will use up to 1 GB of memory on this machine.

Please report issues at: https://github.com/ICIJ/extract/issues.

result


javac 1.8.0_191

Apache Maven 3.5.4
Maven home: /usr/share/maven
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.18.0-10-generic", arch: "amd64", family: "unix"

tesseract 4.0.0-beta.3-249-g607e
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE

/usr/lib/x86_64-linux-gnu/libXtst.so.6
/usr/lib/x86_64-linux-gnu/libXtst.so.6.1.0


https://gorails.com/setup/ubuntu/18.10

ruby --version
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
cd ..

curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
sudo apt-get update
sudo apt-get install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libyaml-dev libsqlite3-dev sqlite3 libxml2-dev libxslt1-dev libcurl4-openssl-dev software-properties-common libffi-dev nodejs yarn
cd
git clone https://github.com/rbenv/rbenv.git ~/.rbenv
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
exec $SHELL
git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
exec $SHELL

Better README

Could you please add some instructions on how to run this?

Build failure (org.icij.kaxxa:) .. could not be resolved

Hi - not succeeding in building, after following Wiki build instructions:
After 'mvn install' on command line...
error on Mac:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:00 min
[INFO] Finished at: 2017-08-13T12:55:02+02:00
[INFO] Final Memory: 21M/137M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project extract: Could not resolve dependencies for project org.icij.extract:extract:jar:2.0.0: The following artifacts could not be resolved: org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-events:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-io:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-sql:jar:1.0-SNAPSHOT: Could not find artifact org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT in jitpack.io (https://jitpack.io) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

REPRODUCING SAME ERROR (on AWS Ubuntu 15.6.0)

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:00 min
[INFO] Finished at: 2017-08-13T12:55:02+02:00
[INFO] Final Memory: 21M/137M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project extract: Could not resolve dependencies for project org.icij.extract:extract:jar:2.0.0: The following artifacts could not be resolved: org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-events:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-io:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-sql:jar:1.0-SNAPSHOT: Could not find artifact org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT in jitpack.io (https://jitpack.io) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Ignore system files within archives

Ignore files like .DS_Store inside archives. Otherwise an exception like the following is logged:

Sep 13, 2016 4:48:44 PM org.icij.extract.core.ParsingEmbeddedDocumentExtractor parseEmbedded
SEVERE: Unable to parse embedded document in document: Archive.zip.
org.apache.tika.exception.TikaException: Unsupported media type: multipart/appledouble.
        at org.icij.extract.core.ErrorParser.parse(ErrorParser.java:55)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
        at org.icij.extract.core.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:101)
        at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:219)
        at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:182)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.icij.extract.core.ParsingReader$ParsingTask.run(ParsingReader.java:267)
        at org.icij.extract.core.TextParsingReader$ParsingTask.run(TextParsingReader.java:87)
        at java.lang.Thread.run(Thread.java:745)
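
To illustrate which entries the request is about, here is a minimal sketch of a name-based filter. It is not how Extract handles archives; `is_system_file` is a hypothetical helper, and the patterns (`.DS_Store`, names starting with `._`, and the `__MACOSX/` folder that macOS adds to zip files) are an assumed list of macOS metadata entries that could be extended.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when an archive entry name looks like
# macOS metadata rather than a real document.
is_system_file() {
    local entry="$1"
    local name="${entry##*/}"   # strip any leading directories
    case "$entry" in
        __MACOSX/*|*/__MACOSX/*) return 0 ;;
    esac
    case "$name" in
        .DS_Store|._*) return 0 ;;
    esac
    return 1
}

# Example entry names as they might appear in a zip listing.
for entry in "report.pdf" ".DS_Store" "docs/.DS_Store" "__MACOSX/._report.pdf" "docs/notes.txt"; do
    if is_system_file "$entry"; then
        echo "skip  $entry"
    else
        echo "keep  $entry"
    fi
done
```

A check like this could run before an embedded entry is handed to the parser, so that AppleDouble files never trigger the "Unsupported media type" exception.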

File path treatment fails on Windows

I've discovered two problems relating to file path resolution (which I'd be happy to try to fix with some guidance).

  • For some reason, using --file-output-directory with spew only works with the first folder specified in a path. So specifying the output as C:/folder actually only sends something to C:/.
  • Inputting a file path with a colon (like C:/) causes an InvalidPathException. Altering the path so that it uses a server address, e.g. //myfs/, causes problems as well.

no -d option for spew as seen in docu

The documentation says:

extract spew -d /path/to/files -r redis -o file --file-output-directory /path/to/text

but:

export JAVA_OPTS='-Xms1024m -Xmx10240m'
extract spew -d /path/to/files -r redis -o file --file-output-directory /path/to/text

gives:

myname@extract:~/extract$ ./target/extract-capsule.x  spew -d /tmp/input -r redis -o file --file-output-directory /tmp/output
Jul 29, 2016 11:52:45 AM org.icij.extract.cli.Main main
SEVERE: Failed to parse command line arguments: Unrecognized option: -d

I guess I have to look at the working directory and pattern options instead.

Failed to execute goal org.apache.maven.plugins - method invoked with incorrect number of arguments

Trying to compile/install extract produces the following error:

  • method invoked with incorrect number of arguments; expected 1, found 0
$ mvn install -X -DskipTests

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] method invoked with incorrect number of arguments; expected 1, found 0
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14.640 s
[INFO] Finished at: 2017-10-30T04:52:02+01:00
[INFO] Final Memory: 39M/101M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project extract: Compilation failure
[ERROR] /home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] method invoked with incorrect number of arguments; expected 1, found 0
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project extract: Compilation failure
/home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] method invoked with incorrect number of arguments; expected 1, found 0

	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(java.base@9-internal/Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(java.base@9-internal/NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(java.base@9-internal/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(java.base@9-internal/Method.java:531)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
/home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] method invoked with incorrect number of arguments; expected 1, found 0

	at org.apache.maven.plugin.compiler.AbstractCompilerMojo.execute(AbstractCompilerMojo.java:1029)
	at org.apache.maven.plugin.compiler.CompilerMojo.execute(CompilerMojo.java:137)
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
	... 20 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Upgrade Java to version 11

Is your feature request related to a problem? Please describe.
We are currently using Java 8, which is largely deprecated. To benefit from new versions of dependencies, we would like to upgrade to Java 11.

spew spews exceptions for subdirectories

Maybe that is intended, but:

 ./target/extract-capsule.x  spew  -o stdout  /usr/share >/dev/null

This gives lots of errors like:

Jul 29, 2016 12:13:33 PM org.icij.extract.core.Consumer extract
SEVERE: The document stream could not be read: /usr/share/doc/console-setup.
org.apache.tika.io.TaggedIOException: Is a directory

Furthermore, I find the period at the end of the SEVERE line annoying: it is hard to know whether it is part of the reported name or not.
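
The "Is a directory" message suggests that directory entries are being opened as if they were documents. As a way to see which paths under a tree are regular files, independently of Extract, here is a small sketch using find. The directory layout is made up for the example, and how such a list would be passed to Extract is not covered here.

```shell
#!/usr/bin/env bash
# Build a small throwaway tree: two regular files and one subdirectory.
root="$(mktemp -d)"
mkdir -p "$root/doc/console-setup"
echo "hello" > "$root/doc/readme.txt"
echo "world" > "$root/doc/console-setup/notes.txt"

# -type f keeps regular files only, so directories such as
# doc/console-setup never appear in the list.
files="$(find "$root" -type f | sort)"
printf '%s\n' "$files"

file_count="$(printf '%s\n' "$files" | wc -l | tr -d ' ')"
echo "regular files: $file_count"
rm -rf "$root"
```

The subdirectory is traversed but not listed, which is the behaviour one would expect from a recursive scan that only reads documents.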

solr authentication options?

The documentation shows no way to log into a solr server secured with user/password authentication.
Can it be done with a command line option?

Build Failure: maven-gpg-plugin

The build fails at maven-gpg-plugin with the following error even if -Dpgp.skip-true is set.

$ mvn install -Dpgp.skip-true
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for org.icij.extract:extract-lib:jar:3.6.1
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-surefire-plugin is missing. @ org.icij.extract:extract:3.6.1, /home/arky/Code/Tika/extract/pom.xml, line 106, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for org.icij.extract:extract:pom:3.6.1
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-surefire-plugin is missing. @ line 106, column 21
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 
[INFO] Inspecting build with total of 3 modules...
[INFO] Installing Nexus Staging features:
[INFO]   ... total of 3 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] 
[INFO] ICIJ Extract                                                       [pom]
[INFO] extract-lib                                                        [jar]
[INFO] extract-cli                                                        [jar]
[INFO] 
[INFO] ----------------------< org.icij.extract:extract >----------------------
[INFO] Building ICIJ Extract 3.6.1                                        [1/3]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO] 
[INFO] --- maven-gpg-plugin:1.5:sign (sign-artifacts) @ extract ---
gpg: no default secret key: No secret key
gpg: signing failed: No secret key
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for ICIJ Extract 3.6.1:
[INFO] 
[INFO] ICIJ Extract ....................................... FAILURE [  0.132 s]
[INFO] extract-lib ........................................ SKIPPED
[INFO] extract-cli ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  0.601 s
[INFO] Finished at: 2021-03-29T23:18:19+07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.5:sign (sign-artifacts) on project extract: Exit code: 2 -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
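
One thing to check in the command above: the property is spelled pgp.skip-true, while the maven-gpg-plugin reads a user property named gpg.skip, and -D properties take their value after an equals sign. A minimal sketch of the corrected invocation follows. It assumes Maven is on the PATH and that the command is run from the project root; the -DskipTests flag is taken from the Ubuntu build notes in the earlier issue and is optional.

```shell
#!/usr/bin/env bash
# The property name is gpg.skip (not pgp.skip), with the value after '='.
skip_flag='-Dgpg.skip=true'

# Run the build only where Maven is installed and a pom.xml is present,
# so the snippet is harmless anywhere else.
if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
    mvn install -DskipTests "$skip_flag"
else
    echo "would run: mvn install -DskipTests $skip_flag"
fi
```

With the property spelled this way, the sign-artifacts execution should be skipped and no GPG secret key is needed for a local build.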
