pst-extraction's Introduction

PST Extraction

Place PST files in pst-extract/pst/, then run the steps below in order (a combined sketch follows the list).

  1. bin/explode_psts.sh - runs readpst to convert PST files to mbox
  2. bin/normalize_mbox.sh - converts mbox files to email JSON
  3. bin/run_spark_tika.sh - runs Tika to extract text from attachments
  4. bin/run_tika_content_join.sh - joins the extracted attachment text with the email JSON
  5. bin/run_spark_content_split.sh - removes base64-encoded attachments from the email JSON and writes them to a separate directory
  6. bin/run_spark_emailaddr.sh - extracts email addresses and assigns communities
  7. bin/run_spark_email_community_assign.sh - assigns communities to the email JSON objects
  8. bin/run_spark_topic_clustering.sh - assigns topic clusters to the email JSON objects output by community assignment
  9. bin/run_spark_mitie.sh - runs MITIE to generate entities and adds them to the email JSON produced by topic clustering
  10. bin/run_spark_es_ingest_emailaddr.sh - ingests email addresses into the Elasticsearch index
  11. bin/run_spark_es_ingest_attachments.sh - ingests attachments into the Elasticsearch index
  12. bin/run_spark_es_ingest_emails.sh - ingests emails with entities into the Elasticsearch index
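
The steps above are meant to run back to back. Below is a minimal sketch (not part of the project) of running the whole pipeline from the repository root, assuming PST files are already in pst-extract/pst/ and that each script reads the output directory written by the previous stage:

```bash
#!/usr/bin/env bash
# Minimal pipeline sketch: run each stage in order from the repository root.
# Assumes PST files are already in pst-extract/pst/ and that each script
# reads the output directory written by the previous stage.
set -euo pipefail

bin/explode_psts.sh                       # 1.  PST -> mbox (readpst)
bin/normalize_mbox.sh                     # 2.  mbox -> email JSON
bin/run_spark_tika.sh                     # 3.  extract attachment text with Tika
bin/run_tika_content_join.sh              # 4.  join attachment text with email JSON
bin/run_spark_content_split.sh            # 5.  split base64 attachments into their own directory
bin/run_spark_emailaddr.sh                # 6.  email address extraction + community assignment
bin/run_spark_email_community_assign.sh   # 7.  tag email JSON with communities
bin/run_spark_topic_clustering.sh         # 8.  tag email JSON with topic clusters
bin/run_spark_mitie.sh                    # 9.  add MITIE entities to the email JSON
bin/run_spark_es_ingest_emailaddr.sh      # 10. ingest email addresses into Elasticsearch
bin/run_spark_es_ingest_attachments.sh    # 11. ingest attachments into Elasticsearch
bin/run_spark_es_ingest_emails.sh         # 12. ingest emails with entities into Elasticsearch
```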

Extras

Location Extraction

Locations extracted from text

  1. bin/build_clavin_index.sh - sets up the location index (only needs to be run once)
  2. bin/run_location_extract.sh - extracts locations from the email text body; uses the output of the bin/run_spark_content_split.sh step (see the sketch below)
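
A minimal sketch of running the text-based location extraction; it assumes the content-split output from step 5 of the main pipeline is already in place:

```bash
# Build the CLAVIN location index once, then extract locations from email bodies.
# Assumes bin/run_spark_content_split.sh from the main pipeline has already run.
bin/build_clavin_index.sh      # one-time setup of the location index
bin/run_location_extract.sh    # extracts locations from the email text bodies
```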

Locations extracted by IP

  1. bin/setup_geo2ip.sh - sets up the GeoIP index
  2. bin/run_spark_originating_location.sh - extracts locations from originating IP addresses (see the sketch below)
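
A minimal sketch of the IP-based variant:

```bash
# Set up the GeoIP index, then resolve originating IP addresses to locations.
bin/setup_geo2ip.sh                      # GeoIP index setup
bin/run_spark_originating_location.sh    # extracts locations from originating IPs
```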



Workflow

(workflow diagram)

This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.

pst-extraction's People

Contributors

eickovic, jakobzlee, raparkhurst, scotthaleen


pst-extraction's Issues

step 3 complaining | ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException

I altered step 3 to the following command:
spark-submit --master local[*] --driver-memory 2g --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar --conf spark.storage.memoryFraction=1 --class newman.Driver lib/tika-extract_2.10-1.0.1.jar pst-json/ spark-attach/ etc/exts.txt

It is failing on:
/pst-extract$ spark-submit --master local[*] --driver-memory 2g --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar --conf spark.storage.memoryFraction=1 --class newman.Driver lib/tika-extract_2.10-1.0.1.jar pst-json/ spark-attach/ etc/exts.txt
INFO Running Spark version 1.5.0
WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN
SPARK_WORKER_INSTANCES was detected (set to '4').
This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --num-executors to specify the number of executors
  • Or set SPARK_EXECUTOR_INSTANCES
  • spark.executor.instances to configure the number of instances in the spark config.

WARN Your hostname, precise32 resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
WARN Set SPARK_LOCAL_IP if you need to bind to another address
INFO Changing view acls to: vagrant
INFO Changing modify acls to: vagrant
INFO SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vagrant); users with modify permissions: Set(vagrant)
INFO Slf4jLogger started
INFO Starting remoting
INFO Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.2.15:54231]
INFO Successfully started service 'sparkDriver' on port 54231.
INFO Registering MapOutputTracker
INFO Registering BlockManagerMaster
INFO Created local directory at /tmp/blockmgr-06245dd6-1764-4ac2-a818-f83c04546e51
INFO MemoryStore started with capacity 1781.8 MB
INFO HTTP File server directory is /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/httpd-11910909-50ad-4114-9d40-6a1688a10d72
INFO Starting HTTP Server
INFO jetty-8.y.z-SNAPSHOT
INFO Started SocketConnector@0.0.0.0:46457
INFO Successfully started service 'HTTP file server' on port 46457.
INFO Registering OutputCommitCoordinator
INFO jetty-8.y.z-SNAPSHOT
INFO Started SelectChannelConnector@0.0.0.0:4040
INFO Successfully started service 'SparkUI' on port 4040.
INFO Started SparkUI at http://10.0.2.15:4040
INFO Added JAR file:/pst-extract/lib/tika-app-1.10.jar at http://10.0.2.15:46457/jars/tika-app-1.10.jar with timestamp 1448671847626
INFO Added JAR file:/pst-extract/lib/commons-codec-1.10.jar at http://10.0.2.15:46457/jars/commons-codec-1.10.jar with timestamp 1448671847650
INFO Added JAR file:/pst-extract/lib/tika-extract_2.10-1.0.1.jar at http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar with timestamp 1448671847656
WARN Using default name DAGScheduler for source because spark.app.id is not set.
INFO Starting executor ID driver on host localhost
INFO Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57497.
INFO Server created on 57497
INFO Trying to register BlockManager
INFO Registering block manager localhost:57497 with 1781.8 MB RAM, BlockManagerId(driver, localhost, 57497)
INFO Registered BlockManager
Extension filter: List(doc, docx, txt, pdf, xls, xlsx, rtf, xml, html, htm, ppt, pptx)
WARN Failed to check whether UseCompressedOops is set; assuming yes
INFO ensureFreeSpace(123856) called with curMem=0, maxMem=1868326502
INFO Block broadcast_0 stored as values in memory (estimated size 121.0 KB, free 1781.7 MB)
INFO ensureFreeSpace(11436) called with curMem=123856, maxMem=1868326502
INFO Block broadcast_0_piece0 stored as bytes in memory (estimated size 11.2 KB, free 1781.6 MB)
INFO Added broadcast_0_piece0 in memory on localhost:57497 (size: 11.2 KB, free: 1781.8 MB)
INFO Created broadcast 0 from textFile at Driver.scala:133
INFO mapred.tip.id is deprecated. Instead, use mapreduce.task.id
INFO mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
INFO mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
INFO mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
INFO mapred.job.id is deprecated. Instead, use mapreduce.job.id
INFO Total input paths to process : 10
INFO Starting job: saveAsTextFile at Driver.scala:134
INFO Got job 0 (saveAsTextFile at Driver.scala:134) with 43 output partitions
INFO Final stage: ResultStage 0(saveAsTextFile at Driver.scala:134)
INFO Parents of final stage: List()
INFO Missing parents: List()
INFO Submitting ResultStage 0 (MapPartitionsRDD[3] at saveAsTextFile at Driver.scala:134), which has no missing parents
INFO ensureFreeSpace(104024) called with curMem=135292, maxMem=1868326502
INFO Block broadcast_1 stored as values in memory (estimated size 101.6 KB, free 1781.5 MB)
INFO ensureFreeSpace(34556) called with curMem=239316, maxMem=1868326502
INFO Block broadcast_1_piece0 stored as bytes in memory (estimated size 33.7 KB, free 1781.5 MB)
INFO Added broadcast_1_piece0 in memory on localhost:57497 (size: 33.7 KB, free: 1781.7 MB)
INFO Created broadcast 1 from broadcast at DAGScheduler.scala:861
INFO Submitting 43 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at saveAsTextFile at Driver.scala:134)
INFO Adding task set 0.0 with 43 tasks
INFO Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 0.0 in stage 0.0 (TID 0)
INFO Fetching http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar with timestamp 1448671847656
INFO Fetching http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp6999556033463677797.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/tika-extract_2.10-1.0.1.jar to class loader
INFO Fetching http://10.0.2.15:46457/jars/commons-codec-1.10.jar with timestamp 1448671847650
INFO Fetching http://10.0.2.15:46457/jars/commons-codec-1.10.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp5562645691215034148.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/commons-codec-1.10.jar to class loader
INFO Fetching http://10.0.2.15:46457/jars/tika-app-1.10.jar with timestamp 1448671847626
INFO Fetching http://10.0.2.15:46457/jars/tika-app-1.10.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp4410224147291656134.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/tika-app-1.10.jar to class loader
INFO Input split: file:/pst-extract/pst-json/output_part_000003:0+146275050
INFO Saved output of task 'attempt_201511280050_0000_m_000000_0' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000000
INFO attempt_201511280050_0000_m_000000_0: Committed
INFO Finished task 0.0 in stage 0.0 (TID 0). 2044 bytes result sent to driver
INFO Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 1.0 in stage 0.0 (TID 1)
INFO Finished task 0.0 in stage 0.0 (TID 0) in 1955996 ms on localhost (1/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:0+134217728
INFO Document is encrypted
INFO Saved output of task 'attempt_201511280050_0000_m_000001_1' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000001
INFO attempt_201511280050_0000_m_000001_1: Committed
INFO Finished task 1.0 in stage 0.0 (TID 1). 2044 bytes result sent to driver
INFO Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 2.0 in stage 0.0 (TID 2)
INFO Finished task 1.0 in stage 0.0 (TID 1) in 2028530 ms on localhost (2/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:134217728+134217728
INFO Saved output of task 'attempt_201511280050_0000_m_000002_2' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000002
INFO attempt_201511280050_0000_m_000002_2: Committed
INFO Finished task 2.0 in stage 0.0 (TID 2). 2044 bytes result sent to driver
INFO Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 3.0 in stage 0.0 (TID 3)
INFO Finished task 2.0 in stage 0.0 (TID 2) in 2334175 ms on localhost (3/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:268435456+134217728
ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException
INFO Document is encrypted
INFO Saved output of task 'attempt_201511280050_0000_m_000003_3' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000003
INFO attempt_201511280050_0000_m_000003_3: Committed
INFO Finished task 3.0 in stage 0.0 (TID 3). 2044 bytes result sent to driver
INFO Starting task 4.0 in stage 0.0 (TID 4, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 4.0 in stage 0.0 (TID 4)
INFO Finished task 3.0 in stage 0.0 (TID 3) in 2291524 ms on localhost (4/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:402653184+134217728
INFO Saved output of task 'attempt_201511280050_0000_m_000004_4' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000004
INFO attempt_201511280050_0000_m_000004_4: Committed
INFO Finished task 4.0 in stage 0.0 (TID 4). 2044 bytes result sent to driver
INFO Starting task 5.0 in stage 0.0 (TID 5, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Finished task 4.0 in stage 0.0 (TID 4) in 2228523 ms on localhost (5/43)
INFO Running task 5.0 in stage 0.0 (TID 5)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:536870912+106907138
ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException

... so what is the correct command for step 3, instead of the original recipe?

dependencies

Are there any dependencies that should be installed before running these scripts? I am getting errors at multiple points.
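
A hedged note rather than an authoritative answer: judging only from the tools the pipeline steps reference, the external dependencies appear to include readpst (libpst), Apache Spark, the Tika jars shipped under lib/, MITIE, and Elasticsearch. The package name below is an assumption for a Debian/Ubuntu host, not taken from the project docs:

```bash
# Assumption: Debian/Ubuntu host; package names are not from the project docs.
sudo apt-get install pst-utils   # provides readpst, used by bin/explode_psts.sh
# spark-submit must be on the PATH for the run_spark_*.sh scripts
# (the log in the previous issue shows Spark 1.5.0 being used)
# an Elasticsearch cluster is needed for the run_spark_es_ingest_*.sh scripts
```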

Original MIME type not stored / some attachments are missing an extension

Some images are delivered with filenames like "picture" with no extension. The MIME type is available in the original eml/mbox file but is not stored in ES. This causes misleading aggregation results and prevents the UI from loading the attachments correctly.

Missing extensions occur on other types, not just images.
The file-types aggregation for emails should also fall back to the MIME type rather than the extension.
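
Until the original MIME type is carried through to ES, one hedged workaround is to sniff the type from the attachment bytes rather than the filename; the directory name below is a hypothetical placeholder, not a path from the project:

```bash
# Hedged sketch: recover a MIME type for attachments saved without an extension
# by inspecting the file contents instead of the filename.
# "pst-extract/attachments" is a hypothetical directory, not from the project.
for f in pst-extract/attachments/*; do
  mime=$(file --brief --mime-type "$f")   # e.g. "picture" -> image/jpeg
  echo "$f  $mime"
done
```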
