
spark-neighbors's Introduction

spark-neighbors

Spark-based approximate nearest neighbors (ANN) using locality-sensitive hashing (LSH)

Motivation

Spark's MLlib doesn't yet support locality-sensitive hashing or nearest neighbor search. While there are LSH Spark packages available for a variety of distance measures, it has been an open question how to support multiple distance measures and hashing schemes behind a unified interface. This library presents one possible way to do that.

Features

  • Computation of nearest neighbors for each point in a dataset
  • Computation of query point nearest neighbors against a dataset
  • Support for five distance measures:
    • Hamming distance via bit sampling LSH
    • Cosine distance via sign-random-projection LSH
    • Euclidean and Manhattan distances via scalar-random-projection LSH
    • Jaccard distance via Minhash LSH
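
For intuition, sign random projection hashes a vector by the signs of its dot products with random hyperplanes, so vectors separated by a small angle get similar bit signatures. A plain-Scala sketch of the idea (outside Spark, illustrative names only, not this library's internals):

```scala
import scala.util.Random

val rng = new Random(42)
val dims = 8
val signatureLength = 16

// One random Gaussian hyperplane per signature bit
val hyperplanes = Array.fill(signatureLength, dims)(rng.nextGaussian())

// Each signature bit is the sign of the dot product with one hyperplane
def signature(v: Array[Double]): Vector[Boolean] =
  hyperplanes.map(h => h.zip(v).map { case (a, x) => a * x }.sum >= 0).toVector

def bitsDiffering(a: Vector[Boolean], b: Vector[Boolean]): Int =
  a.zip(b).count { case (x, y) => x != y }

val v    = Array.fill(dims)(rng.nextGaussian())
val near = v.map(_ + 0.01 * rng.nextGaussian()) // tiny perturbation of v
val far  = Array.fill(dims)(rng.nextGaussian()) // unrelated vector

// Nearby vectors should disagree on fewer signature bits than distant ones
println(bitsDiffering(signature(v), signature(near)))
println(bitsDiffering(signature(v), signature(far)))
```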

Linking

You can link against this library (for Spark 1.6+) in your program at the following coordinates:

Using SBT:

libraryDependencies += "com.github.karlhigley" %% "spark-neighbors" % "0.2.2"

Using Maven:

<dependency>
    <groupId>com.github.karlhigley</groupId>
    <artifactId>spark-neighbors_2.10</artifactId>
    <version>0.2.2</version>
</dependency>

This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages com.github.karlhigley:spark-neighbors_2.10:0.2.2

Unlike using --jars, using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.

Usage

ANN models are created using the builder pattern:

import com.github.karlhigley.spark.neighbors.ANN

val annModel =
  new ANN(dimensions = 1000, measure = "hamming")
    .setTables(4)
    .setSignatureLength(64)
    .train(points)

An ANN model can compute a variable number of approximate nearest neighbors:

val neighbors = annModel.neighbors(10)
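
For context on the hamming example above: bit sampling LSH simply keeps a fixed random subset of bit positions as the signature, so points that differ in few bits rarely differ in the sampled positions. A plain-Scala sketch (outside Spark, not this library's internals):

```scala
import scala.util.Random

val rng = new Random(5)
val dims = 1000
val signatureLength = 64

// Bit sampling: fix a random subset of positions once, up front
val sampled = rng.shuffle((0 until dims).toVector).take(signatureLength)

// The signature is just the point's bits at the sampled positions
def signature(bits: Vector[Boolean]): Vector[Boolean] = sampled.map(bits)

val v = Vector.fill(dims)(rng.nextBoolean())
val u = v.updated(0, !v(0)) // differs from v in exactly one bit

// The signatures can disagree in at most one position (and usually none,
// since position 0 is unlikely to be among the 64 sampled)
println(signature(v).zip(signature(u)).count { case (a, b) => a != b })
```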

Distance Measures and Parameters

The supported distance measures are "hamming", "cosine", "euclidean", "manhattan", and "jaccard". All distance measures allow the number of hash tables and the length of the computed hash signatures to be configured as above. The hashing schemes for Euclidean, Manhattan, and Jaccard distances have some additional configurable parameters:

Euclidean and Manhattan Distances

This hash function depends on a bucket or quantization width. Higher widths lead to signatures that are more similar:

val annModel =
  new ANN(dimensions = 1000, measure = "euclidean")
    .setTables(4)
    .setSignatureLength(64)
    .setBucketWidth(5)
    .train(points)
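
Concretely, the scalar-random-projection hash is h(v) = floor((a · v + b) / w), where a is a random line, b a random offset, and w the bucket width, so nearby points tend to fall into the same bucket. A plain-Scala sketch (outside Spark, illustrative only):

```scala
import scala.util.Random

val rng = new Random(7)
val dims = 4
val w = 5.0 // bucket width: wider buckets let more-distant points collide

// Random projection line and a random offset within one bucket
val a = Array.fill(dims)(rng.nextGaussian())
val b = rng.nextDouble() * w

// Scalar-random-projection hash: which bucket the projection falls into
def hash(v: Array[Double]): Long =
  math.floor((a.zip(v).map { case (ai, vi) => ai * vi }.sum + b) / w).toLong

val p = Array(1.0, 2.0, 3.0, 4.0)
val q = Array(1.1, 2.0, 3.0, 4.1) // close to p, so likely the same bucket
println(hash(p), hash(q))
```
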

Jaccard Distance

Minhashing requires two additional parameters: a prime larger than the number of input dimensions and the number of minhash bands. The prime is used in the permutation functions that generate minhash signatures, and the number of bands is used in the process of generating candidate pairs from the signatures.

val annModel =
  new ANN(dimensions = 1000, measure = "jaccard")
    .setTables(4)
    .setSignatureLength(128)
    .setPrimeModulus(739)
    .setBands(16)
    .train(points)
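
Conceptually, each minhash value is the minimum of a hash-based permutation h(i) = (a·i + b) mod p over a point's nonzero indices, and the signature is cut into bands: two points become candidate neighbors if they agree on some entire band. A plain-Scala sketch of this standard construction (outside Spark, illustrative parameters):

```scala
import scala.util.Random

val rng = new Random(11)
val p = 739 // prime modulus, larger than the number of dimensions
val signatureLength = 8
val bands = 2
val rowsPerBand = signatureLength / bands

// One random permutation h(i) = (a*i + b) mod p per signature element
val perms = Array.fill(signatureLength)((1 + rng.nextInt(p - 1), rng.nextInt(p)))

// Minhash signature: for each permutation, the minimum over nonzero indices
def signature(indices: Set[Int]): Vector[Int] =
  perms.map { case (a, b) => indices.map(i => (a * i + b) % p).min }.toVector

// Two sets become candidates if they agree on at least one full band
def candidates(x: Set[Int], y: Set[Int]): Boolean =
  signature(x).grouped(rowsPerBand).toSeq
    .zip(signature(y).grouped(rowsPerBand).toSeq)
    .exists { case (bx, by) => bx == by }

val s1 = Set(1, 2, 3, 4, 5)
val s2 = Set(1, 2, 3, 4, 6) // high Jaccard similarity with s1
println(candidates(s1, s2))
```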


spark-neighbors's People

Contributors

jcynico, karlhigley


spark-neighbors's Issues

Help needed in using the library

Thank you for the very helpful library. I need some help setting the parameters for the Euclidean distance case. I have a set of very close 2-dimensional coordinates, but nearest neighbors are not identified at all unless the distance is 0. I am guessing I probably need to set the parameters better. Could you please help me with this?

61,139
63,140
64,129
68,128
71,140
73,141
75,128
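
Not from the maintainer, but one plausible diagnosis: with scalar random projection the bucket width must be large relative to the projected gaps between neighbors, or close points never share a bucket and no candidate pairs are generated. A rough plain-Scala illustration on the coordinates above (made-up projection line, not this library's internals):

```scala
import scala.util.Random

val rng = new Random(1)
val points = Seq(
  Array(61.0, 139.0), Array(63.0, 140.0), Array(64.0, 129.0),
  Array(68.0, 128.0), Array(71.0, 140.0), Array(73.0, 141.0),
  Array(75.0, 128.0)
)

// One random projection line, as scalar-random-projection LSH uses
val a = Array(rng.nextGaussian(), rng.nextGaussian())

def buckets(width: Double): Seq[Long] =
  points.map(v => math.floor((a(0) * v(0) + a(1) * v(1)) / width).toLong)

// A narrow bucket scatters these points (almost no collisions, hence no
// candidate pairs); a width on the order of the data's spread groups them
println(buckets(1.0).distinct.size)
println(buckets(50.0).distinct.size)
```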

Failed "compute manhattan nearest neighbors as a batch" test

"compute manhattan nearest neighbors as a batch" test failed in Travis CI.

[info] - compute manhattan nearest neighbors as a batch *** FAILED ***
[info] java.lang.AssertionError: assertion failed
[info] at scala.Predef$.assert(Predef.scala:165)
[info] at com.github.karlhigley.spark.neighbors.ANNSuite$.runAssertions(ANNSuite.scala:152)
[info] at com.github.karlhigley.spark.neighbors.ANNSuite$$anonfun$4.apply$mcV$sp(ANNSuite.scala:84)
[info] at com.github.karlhigley.spark.neighbors.ANNSuite$$anonfun$4.apply(ANNSuite.scala:71)
[info] at com.github.karlhigley.spark.neighbors.ANNSuite$$anonfun$4.apply(ANNSuite.scala:71)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)

For details: https://travis-ci.org/karlhigley/spark-neighbors/builds/139490517

error occurs when importing sbt project

The following is the error message. Can someone help me?

Error: Error while importing SBT project:
...

[info] Resolving org.scala-sbt#tracking;0.13.15 ...
[info] Resolving com.thoughtworks.paranamer#paranamer;2.6 ...
[info] Resolving org.scala-sbt#completion;0.13.15 ...
[info] Resolving org.scala-sbt#control;0.13.15 ...
[info] Resolving org.scala-sbt#sbt;0.13.15 ...
[info] Resolving org.scala-sbt#run;0.13.15 ...
[info] Resolving org.scala-sbt.ivy#ivy;2.3.0-sbt-48dd0744422128446aee9ac31aa356ee203cc9f4 ...
[info] Resolving org.scala-sbt#test-interface;1.0 ...
[info] Resolving com.jcraft#jsch;0.1.50 ...
[info] Resolving org.scala-lang#scala-compiler;2.10.6 ...
[info] Resolving jline#jline;2.14.3 ...
[info] Resolving org.scala-sbt#compiler-ivy-integration;0.13.15 ...
[info] Resolving org.scala-sbt#incremental-compiler;0.13.15 ...
[info] Resolving org.scala-sbt#logic;0.13.15 ...
[info] Resolving org.scala-sbt#main-settings;0.13.15 ...
[trace] Stack trace suppressed: run 'last :update' for the full output.
[trace] Stack trace suppressed: run 'last :ssExtractDependencies' for the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency: com.github.karlhigley#spark-neighbors_2.11;0.2.2: not found
[error] (*:ssExtractDependencies) sbt.ResolveException: unresolved dependency: com.github.karlhigley#spark-neighbors_2.11;0.2.2: not found
[error] Total time: 31 s, completed Jul 13, 2017 6:19:16 PM

ANNModel.neighbors() function returns null results

List<Tuple2<Object, SparseVector>> svList = new ArrayList<>();
svList.add(new Tuple2<Object, SparseVector>(3L,
    Vectors.sparse(20,
        new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 },
        new double[] { 5.0, 3.0, 4.0, 5.0, 5.0, 1.0, 5.0, 3.0, 4.0, 5.0,
                       5.0, 3.0, 4.0, 5.0, 5.0, 1.0, 5.0, 3.0, 4.0, 5.0 }).toSparse()));
svList.add(new Tuple2<Object, SparseVector>(4L,
    Vectors.sparse(20,
        new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 },
        new double[] { 1.0, 2.0, 1.0, 5.0, 1.0, 5.0, 1.0, 4.0, 1.0, 3.0,
                       1.0, 2.0, 1.0, 5.0, 1.0, 5.0, 1.0, 4.0, 1.0, 3.0 }).toSparse()));
svList.add(new Tuple2<Object, SparseVector>(6L,
    Vectors.sparse(20,
        new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 },
        new double[] { 5.0, 3.0, 4.0, 1.0, 5.0, 4.0, 1.0, 3.0, 4.0, 5.0,
                       5.0, 3.0, 4.0, 1.0, 5.0, 4.0, 1.0, 3.0, 4.0, 5.0 }).toSparse()));

RDD<Tuple2<Object, SparseVector>> points = sc.parallelize(svList).rdd();

ANNModel annModel = new ANN(20, "cosine")
    .setTables(1)
    .setSignatureLength(20)
    .setRandomSeed(3)
    .train(points, StorageLevel.MEMORY_AND_DISK());

JavaRDD<Tuple2<Object, Tuple2<Object, Object>[]>> neighbors = annModel.neighbors(3).toJavaRDD();
neighbors.foreach(f -> {
    for (int i = 0; i < f._2.length; i++) {
        System.out.println(f._1 + "=====" + f._2[i]._1 + "---" + f._2[i]._2);
    }
});

The output of the code is:
3=====null---null
3=====null---null
4=====null---null
4=====null---null
6=====null---null
6=====null---null

Spark 2.0/Scala 2.11 compatibility

We are upgrading to Spark 2.0 and require Scala 2.11 compatibility for our Spark code. I have been using some of the spark-neighbors functionality, which appears to only have Scala 2.10 compatibility. Any chance someone might be working on upgrading this?

build error for scala 2.11

I changed the Scala version to 2.11, and the sbt build failed with:
[info] Compiling 15 Scala sources to /data0/jd_ad/lisai/keyword/lsh/spark-neighbors-master/target/scala-2.11/classes...
[error] /data0/jd_ad/lisai/keyword/lsh/spark-neighbors-master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala:153: type arguments [Any] do not conform to class HashTableEntry's type parameter bounds [+S <: com.github.karlhigley.spark.neighbors.lsh.Signature[_]]
[error] indHashFunctions.map {
[error] ^
[error] /data0/jd_ad/lisai/keyword/lsh/spark-neighbors-master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala:153: type arguments [Any] do not conform to class HashTableEntry's type parameter bounds [+S <: com.github.karlhigley.spark.neighbors.lsh.Signature[_]]
[error] indHashFunctions.map {
[error] ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed
