
hadoop-pot's Introduction

Hadoop implementation of the Pooled Time Series (PoT) algorithm

A Java implementation of PoT using Apache Hadoop.

Dependencies

  • Maven (Version shouldn't matter much. Tested with 2.x and 3.x.)
  • OpenCV 2.4.x (Tested with 2.4.9 and 2.4.11)

Pre-requisites

If you get numpy-related errors when running brew install opencv, please run:

  1. pip install numpy

Now move on to OpenCV (more detailed instructions are in wiki/Installing-opencv):

  1. brew install opencv --with-java

The above should leave you with a:

/usr/local/Cellar/opencv/<VERSION>/share/OpenCV/java

directory, which contains the OpenCV dynamic library (dylib) along with the OpenCV jar file.
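
To sanity-check the install from Java (optional), a minimal sketch like the following should load the native library, assuming java.library.path points at the directory above. The class name here is just an illustration, not part of this repository:

import org.opencv.core.Core;

public class OpenCvLoadCheck {
    public static void main(String[] args) {
        // Loads the libopencv_java dylib found on -Djava.library.path
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        System.out.println("Loaded OpenCV " + Core.VERSION);
    }
}

Compile and run it with the OpenCV jar from that directory on the classpath and -Djava.library.path pointing at the same directory.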

Getting started

  1. cd hadoop-pot-assembly
  2. mvn install assembly:assembly
  3. Set OPENCV_JAVA_HOME, e.g., to export OPENCV_JAVA_HOME=/usr/local/Cellar/opencv/2.4.9/share/OpenCV/java
  4. Set POOLED_TIME_SERIES_HOME, e.g., to export POOLED_TIME_SERIES_HOME=$HOME/hadoop-pot/src/main
  5. Run pooled-time-series, e.g., by creating an alias, alias pooled-time-series="$POOLED_TIME_SERIES_HOME/bin/pooled-time-series"

The above should produce:

usage: pooled_time_series
 -d,--dir <directory>            A directory with image files in it
 -f,--file <file>                Path to a single file
 -h,--help                       Print this message.
 -j,--json                       Set similarity output format to JSON.
                                 Defaults to .txt
 -o,--outputfile <output file>   File containing similarity results.
                                 Defaults to ./similarity.txt
 -p,--pathfile <path file>       A file containing full absolute paths to
                                 videos. Previous default was
                                 memex-index_temp.txt

So, to call the code on, e.g., a directory of video files called data, you would run:

pooled-time-series -d data

Alternatively, you can create (independently of this tool) a file with absolute paths to video files, one per line, and then pass it to the above program with the -p flag.
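
For example, such a path file might look like this (the paths below are hypothetical placeholders, one absolute path per line):

/data/videos/clip_0001.mp4
/data/videos/clip_0002.mp4
/data/videos/clip_0003.mp4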

Running Hadoop Jobs

Config and Getting Started

Add the following to your .bashrc

export HADOOP_OPTS="-Djava.library.path=<path to OpenCV jar> -Dmapred.map.child.java.opts=-Djava.library.path=<path to OpenCV jar>"
alias pooled-time-series-hadoop="$POOLED_TIME_SERIES_HOME/bin/pooled-time-series-hadoop"

Build and clean up the jar for running

# Compile everything
mvn install assembly:assembly

# Drop the LICENSE file from our jar; otherwise it will give us headaches
zip -d target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar META-INF/LICENSE

Documentation moving to the wiki

We are moving our documentation to the wiki. Please bear with us and report issues as you find them.

Research Background and Detail

This is the source code used in the following conference paper [1]. It includes the pooled time series (PoT) representation framework as well as basic per-frame descriptor extraction, including histograms of optical flow (HOF) and histograms of oriented gradients (HOG). For more detailed information on the approach, please check the papers.

If you take advantage of this code for any academic purpose, please do cite:

[1] Mattmann, Chris A., and Madhav Sharan. "Scalable Hadoop-Based Pooled Time Series of Big Video Data from the Deep Web." Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. ACM, 2017.
[2] M. S. Ryoo, B. Rothrock, and L. Matthies, "Pooled Motion Features for First-Person Videos", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

https://arxiv.org/abs/1610.06669
http://arxiv.org/pdf/1412.6505v2.pdf

@inproceedings{mattmann2017scalable,
  title={Scalable Hadoop-Based Pooled Time Series of Big Video Data from the Deep Web},
  author={Mattmann, Chris A and Sharan, Madhav},
  booktitle={Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval},
  pages={117--120},
  year={2017},
  organization={ACM}
}

@inproceedings{ryoo2015pot,
 title={Pooled Motion Features for First-Person Videos},
 author={M. S. Ryoo and B. Rothrock and L. Matthies},
 booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year={2015},
 month={June},
 address={Boston, MA},
}

Evaluation

HMDB Dataset - http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/

hadoop-pot's People

Contributors

adhulipa, chrismattmann, mjjoyce, mryoo, smadha, soumyaravi


hadoop-pot's Issues

Unable to run SimilarityCalculation in Hadoop mode

Following the instructions in the README file for running in Hadoop mode, it says:

# Run the meanChiSquaredDistance job
hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
    org.pooledtimeseries.SimilarityCalculation SimilarityInput/ MeanChiOutput/

However, the code expects more than two arguments (see SimilarityCalculation.java:115 in the stack trace below).

The error I am getting:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
    at org.pooledtimeseries.SimilarityCalculation.main(SimilarityCalculation.java:115)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Please fix the code or update the README.
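
Based on the three-argument invocation quoted in the "Documentation is out of sync" issue below, the similarity job likely expects the mean chi-squared distances file as a third argument, e.g.:

hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
    org.pooledtimeseries.SimilarityCalculation SimilarityInput/ SimilarityOutput/ ./MeanChiOutput/meanChiSquaredDistances.txt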

Enhanced architecture for similarity jobs

HDFS is designed to work with large files. In the current architecture we produce two files per video, each roughly 1.5-2 MB, so for a set of 1,000 videos we produce 2,000 files totaling roughly 3-4 GB. With the current architecture we pass video file paths to the map job, and each mapper reads four files per map.

We need to think this through and, if possible, come up with an architecture in which mappers receive file content instead of file paths.

Below is an implementation of the Cartesian product pattern, which is not exactly the same but could be useful:
https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
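
One possible direction, sketched below purely as an illustration (this class does not exist in the repository, and a Hadoop 2.x client API is assumed): pack the small per-video feature files into a single SequenceFile keyed by path, so that mappers receive file content as values instead of paths to many small HDFS files.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFeatureFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[0])),        // output SequenceFile on HDFS
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // args[1..] are the local feature files produced by the first MR stage
            for (int i = 1; i < args.length; i++) {
                byte[] content = Files.readAllBytes(Paths.get(args[i]));
                writer.append(new Text(args[i]), new BytesWritable(content));
            }
        }
    }
}

The similarity mappers could then consume BytesWritable values directly instead of re-opening four small files per map.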

Remove unreadable videos

Some videos are not readable by OpenCV, and the app is unable to generate vectors for them.

These videos need to be removed from the input videos and from the video-pair inputs for further jobs.
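
A possible pre-filter, sketched below with the OpenCV 2.4.x Java bindings (the class and its use of the file-based VideoCapture constructor are assumptions, not code from this repository): print only the paths OpenCV can open, and use that filtered list as the input path file.

import org.opencv.core.Core;
import org.opencv.highgui.VideoCapture;

public class ReadableVideoFilter {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        for (String path : args) {
            VideoCapture capture = new VideoCapture(path); // assumes the file-based constructor in the 2.4.x bindings
            if (capture.isOpened()) {
                System.out.println(path);                  // keep: goes into the filtered path list
            } else {
                System.err.println("Skipping unreadable video: " + path);
            }
            capture.release();
        }
    }
}

Piping the stdout of such a check into a new path file would keep unreadable videos out of both the video list and the pair inputs.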

Documentation is out of sync

@smadha Can you describe the parameters to the SimilarityCalculation job?

This is what the README says, but it is not very clear:

# Run the similarity job (using the value calculated in the previous job)
hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar  \
org.pooledtimeseries.SimilarityCalculation SimilarityInput/ SimilarityOutput/ ./MeanChiOutput/meanChiSquaredDistances.txt
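
Reading the quoted command, the three parameters appear to be: the input directory (SimilarityInput/), the output directory (SimilarityOutput/), and the path to the meanChiSquaredDistances.txt file produced by the previous mean chi-squared job.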

Split /MeanChiSquareAndSimilarityInput to utilize multiple containers

The input to the MeanChiSquare and Similarity calculations is a list of file paths, and the mappers read from those paths. This hides the size of the actual input, since the real data lives in the files rather than in the path list. Hadoop splits the input at its default size, which might not be enough to utilize all containers.

We can increase the number of input splits, or split the input file before processing, to utilize more containers.
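
One way to do this, sketched below as an illustration rather than a description of the current driver code (NLineInputFormat is a standard Hadoop input format; the class name and the 10-lines-per-split figure are arbitrary): give each map task a fixed number of path lines so the small path-list file still fans out across many containers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitPathListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "similarity-with-nline-splits");
        job.setInputFormatClass(NLineInputFormat.class);
        // e.g. 10 video paths per map task; tune this to the cluster's container count
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... mapper/reducer and key/value classes would be set here as in the existing jobs ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}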

Cache combined FeatureVector along with "of" and "hog" series

The first set of MR jobs generates optical flow and oriented gradient vectors (of.txt and hog.txt) by reading the videos.

These vectors are then used to compute a FeatureVector after temporal pooling. During the similarity calculations (mean chi-squared and similarity calc) we compute the FeatureVector repeatedly for each video. [0]

We can cache this FeatureVector along with .of.txt and .hog.txt.
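
A minimal caching sketch, assuming a plain-text sidecar file next to each video's existing outputs (the class, method names, and .fv.txt extension are all hypothetical, not from this repository):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.stream.Collectors;

public class FeatureVectorCache {
    // Save one pooled vector as a single comma-separated line: <video>.fv.txt
    public static void saveFeatureVector(String videoPath, double[] vector) throws IOException {
        String line = Arrays.stream(vector)
                .mapToObj(String::valueOf)
                .collect(Collectors.joining(","));
        Files.write(Paths.get(videoPath + ".fv.txt"), line.getBytes(StandardCharsets.UTF_8));
    }

    // Load it back if present; callers fall back to recomputation when it is missing
    public static double[] loadFeatureVector(String videoPath) throws IOException {
        Path cached = Paths.get(videoPath + ".fv.txt");
        if (!Files.exists(cached)) {
            return null;
        }
        String[] parts = new String(Files.readAllBytes(cached), StandardCharsets.UTF_8).split(",");
        double[] vector = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            vector[i] = Double.parseDouble(parts[i]);
        }
        return vector;
    }
}

The mean chi-squared and similarity jobs could then try loadFeatureVector first and only recompute when the cache file is missing.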

