saurfang / spark-knn Goto Github PK

k-Nearest Neighbors algorithm on Spark

License: Apache License 2.0

Scala 87.68% R 0.83% Shell 3.54% Python 7.95%

spark knn

spark-knn's Issues

PySpark interface

Do you think it is would be easy to add a Python interface for this module?
I might try to do this for my research but I am not sure yet what is the best way to support Python. How would the code be sent to Spark? My best idea is that it would require to add a Python egg with a wrapper of this module and a jar with the Scala source to the spark-submit command (since Python users usually don't use SBT).

KNNModel : Return distance with neighbors option in transform()

I've noticed that the dataset return by knnModel.transform(dataset) only contains the neighbors without their distance (which is actually available in the tree and thus really easy to fetch). Would it be possible to add an option so that the distance might be returned if asked ?
I'm actually implementing recommendation based on KNN, and having the possibility to weigh each recommendation based on it's distance would be really interesting.

PS: a compatibility with spark 2.0 would also be really appreciated

Support custom distance function

provide custom distance function instead of using fixed Euclidean distance, e.g.

def distance[T](point1:T , point2:T):Double

KNNClassifier : object not found

After loading the sample_libsvm_data.txt,
val knn=new KNNClassifier() gives an error saying KNNClassifer is not found.
How to use that line? Directly in command prompt ?
And where to save the spark-knn folder..?

Sparse Vectors?

It appears that spark-knn needs to transform dense vectors into their sparse form. This creates a limitation when using spark-knn for very wide, sparse datasets such as document-term matrices used in NLP.

Is there any interest in supporting sparse vectors within spark-knn?

How to get the exact nearest neighbors?

Hi,
The code commets said that we could get the exact nearest neighbors by tuning parameters. But I haven't figured out how to do it.
I tried to set tau to 0.0, but didn't get the exact result. I think the reason is that KNN.searchIndices of the topTree doesn't always return all the possible partition ids which contains nearest neighbors.
Of course, I could rewrite the KNN.searchIndices to make sure all possible partition ids are included. But I still want to know if there is a way to get exact nearest neighbors by just tuning parameters in the existing code.

Thanks!

KNN set tree size and sub tree size and leave size

while fitting training data, on what parameter does top tree size, leave size and sub tree leave size depends?

Model save support

I saved a model(KNNClassificationModel) using java serialization and when I use it later, I always get java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements.
on the dataframe output of the model.transform(inputDataFrame).

Is there a better way of saving and using model? like support for MLWritable/Saveable traits. In our use case, we create a model and use it later

value transform in class KNNRegression cannot be accessed in org.apache.spark.ml.regression.KNNRegression

Hi:

Here is my code:

package com.jd.anti.tool.ad

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.classification.KNNClassifier
import org.apache.spark.ml.regression.KNNRegression

object KNNForTest {
  def main(argv: Array[String]) = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("com.ad.KNNForTest")
    val sc = new SparkContext(conf)
    //read in raw label and features
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._
    val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

    val knn = new KNNRegression()
      .setTopTreeSize(training.count().toInt / 500)
      .setK(10)

    val knnModel = knn.fit(training)

    val predicted = knn.transform(training)
  }
}

And I add

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

libraryDependencies += "saurfang" % "spark-knn" % "0.2.0"

in my sbt build file.

What should I do to deal with this problem?

Thanks!

Got "Caused by: java.lang.ArrayStoreException" when running KNN

Hi,
After running KNN example to get nearest neighbor, and in this line
https://github.com/saurfang/spark-knn/blob/v0.3.0/spark-knn-core/src/test/scala/org/apache/spark/ml/knn/KNNSuite.scala#L40, I got "Caused by: java.lang.ArrayStoreException".

What should I do?

Thanks.

Problem with MNIST.scala example

Facing issue while running spark-knn-examples/src/main/scalacom/github/saurfang/spark/ml/knn/examples/MNIST.scala
I had ran the commands in spark-shell.
enivronment: spark: 2.1.0
scala: 2.11.8

scala> val pipeline = new Pipeline().setStages(Array(knn_KA)).fit(train)
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:122)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
... 52 elided

On looking at the dataset schema

scala> dataset.schema
res14: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))

it shows that features are of type org.apache.spark.mllib.linalg.VectorUDT
Could this be the reason for the error?
Is it only me who is getting error while running the example?

Note: data/mnist/mnist.bz2 has no content. Hence I took mnist.bz2 from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

Searching K nearest neighbors for each test sample in training set

Is there a way to extract the K nearest neighbors from Training samples from the KNN model in the Scala version?

issue with running using spark-sumbit

Can anybody help me with this?

I used the python related part on README.md :

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.ml.classification.KNNClassifier.
: java.lang.NoSuchMethodError: org.apache.spark.ml.param.shared.HasInputCols.$init$(Lorg/apache/spark/ml/param/shared/HasInputCols;)V
	at org.apache.spark.ml.classification.KNNClassifier.<init>(KNNClassifier.scala:23)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Algorithm fails on small data

As part our CICD pipeline, we have a daily build that runs on relatively small amounts of data. As part of this, we discovered an interesting bug; as part of the method estimateTau, there is the following line:

val y = DenseVector(estimators.map { case (_, d) => math.log(d) })

In this case, d is the average distance between points. We are finding that on the small data used in our daily build, beta can exceed 0. When this happens, yMax, which is defined as:

val yMax = breeze.linalg.max(y)

is below negative one, and subsequently used as the bufferSize.

Specifically, the following appears in the log:

ERROR KNN: Unable to estimate Tau with positive beta: 0.1577160047542901. This maybe because data is too small.
Setting to -1.3153582722102333 which is the maximum average distance we found in the sample.
This may leads to poor accuracy. Consider manually set bufferSize instead.
You can also try setting balanceThreshold to zero so only metric trees are built.

(this does not cause the code to stop, and it continues)

Exception in thread "main" java.lang.IllegalArgumentException: knn_2166a4d536d3 parameter bufferSize given invalid value -1.3153582722102333

This then causes an error and the pipeline stops.

From my understanding, very low average distances would always cause errors if beta exceeds 0.

Can not download the dependency

Hi saurfang,

I find the release here, but the

// https://mvnrepository.com/artifact/saurfang/spark-knn
libraryDependencies += "saurfang" % "spark-knn" % "0.3.0"

Not working for me.

Release version for Spark 2.4.X / Scala 2.12

It would be nice to get a release of this so that it can be used with Scala 2.12 and Spark 2.4.X.

Problem with example

Hi!
I've been studying your work but I have a problem with the example that you show in the main Readme.
It throws me a compiation error in toDF() method:
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

And it also throws compilation errors in the last two lines, in the fit and transform methods, with the same reason.

Could you take a look?
Thx

steps to builb the project against spark 1.6

offline use in pyspark2.3.3

Run problem in Spark 1.6.0

If I try run example in Spark 1.6.0 I get exception:

ERROR [apache.spark.executor.Executor:logError :95 ] - Exception in task 0.0 in stage 7.0 (TID 7)sg
java.lang.NoSuchMethodError: org.apache.spark.ml.classification.MultiClassSummarizer.add(D)Lorg/apache/spark/ml/classification/MultiClassSummarizer;
at org.apache.spark.ml.classification.KNNClassifier$$anonfun$2.apply(KNNClassifier.scala:91)
at org.apache.spark.ml.classification.KNNClassifier$$anonfun$2.apply(KNNClassifier.scala:89)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Scala version 2.10.6

Error "Parameter bufferSize given invalid value NaN."

Hello. I'm almost constantly getting the error "Parameter bufferSize given invalid value NaN." while training the KNN model. I suspect this might be caused by me incorrectly setting the setBufferSizeSampleSizes, although I have no idea what should be the correct values there. My dataset has about 3800 data points (it is a dataset that's supposed to be used only for testing). I'm trying various ranges, with big or small values, but none helps. Is there any guideline on how many and how big those samples should be? For now I am setting the bufferSize manually, instead of having it estimated.

Thanks!

TypeError: 'JavaPackage' object is not callable

when I try to run the test I get this error :

TypeError Traceback (most recent call last)
in ()
3
4
----> 5 knn = KNNClassifier(k=1, topTreeSize=1, topTreeLeafSize=1, subTreeLeafSize=1, bufferSizeSampleSize=[1, 2, 3]) # bufferSize=-1.0,
6 model = knn.fit(training_data)
7

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark_init_.pyc in wrapper(self, *args, **kwargs)
102 raise TypeError("Method %s forces keyword arguments." % func.name)
103 self._input_kwargs = kwargs
--> 104 return func(self, **kwargs)
105 return wrapper
106

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\ml\knn.py in init(self, featuresCol, labelCol, predictionCol, seed, topTreeSize, topTreeLeafSize, subTreeLeafSize, bufferSize, bufferSizeSampleSize, balanceThreshold, k, neighborsCol, maxNeighbors, rawPredictionCol, probabilityCol)
17 super(KNNClassifier, self).init()
18 self._java_obj = self._new_java_obj(
---> 19 "org.apache.spark.ml.knn.KNNClassifier", self.uid)
20
21 self.topTreeSize = Param(self, "topTreeSize", "number of points to sample for top-level tree")

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\ml\wrapper.pyc in _new_java_obj(java_class, *args)
61 java_obj = getattr(java_obj, name)
62 java_args = [_py2java(sc, arg) for arg in args]
---> 63 return java_obj(*java_args)
64
65 @staticmethod

TypeError: 'JavaPackage' object is not callable

Update spark package against 1.6

Can you re-publish the package to be built against Spark 1.6 ? We'd like to expose knn to sparklyr.

knn.fit(training) throws an exception

followed whatever was there
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
val knn = new KNNClassifier()
.setTopTreeSize(training.count().toInt / 500)
.setK(10)
1st error : TopTreeSize is invalid 0 (since total count of training sample is 100)
let say we set manually TreeSize as 1
then it throws an exception while running knn.fit(training)

java.util.NoSuchElementException: Failed to find a default value for inputCols
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:651)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
at org.apache.spark.ml.param.Params$class.$(params.scala:658)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:383)

A point is considered its' own neighbour

Hello,

I have noticed a weird behaviour. For some reason, the algorithm counts each point as its' own neighbour, i. e. when using KNN(1), the only neighbour the algorithm finds is the point itself. Can some bad setting of a parameter cause this? Or is it a "feature"?

Nearest Neighbor between two dataframes

Hi,
Thanks for the amazing work!
I have two dataframes, A has about 200 Million points and B has about 10 Million points, I want to find the nearest neighbor for every point in A from B, I want to do this preferably in Python, how can I achieve it using this library?

How to avoid stackoverflow error when recursively building too much trees?

I ran KNNClassifier on my local machine with 5000 rows data, and I got stackoverflow errors. The version of this KNN is v0.1.1. How to avoid this stackoverflow?

Run Error

I run the example, but get

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (10.0) must be on interval [0, 1]
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.util.random.BernoulliSampler.(RandomSampler.scala:149)
at org.apache.spark.rdd.RDD$$anonfun$sample$1.apply(RDD.scala:430)
at org.apache.spark.rdd.RDD$$anonfun$sample$1.apply(RDD.scala:425)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.sample(RDD.scala:425)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:317)
at org.apache.spark.ml.classification.KNNClassifier.train(KNNClassifier.scala:109)
at org.apache.spark.ml.classification.KNNClassifier.train(KNNClassifier.scala:24)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at ict.knn.JavaKNN.main(JavaKNN.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Exception occurred when initializing new KNNRegression()

val knn = new KNNRegression().setTopTreeSize(df.count().toInt / 500)
.setFeaturesCol("features")
.setPredictionCol("predicted")
.setK(1)

This throws the following exception,

Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.param.shared.HasInputCols$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

Install as SBT dependency

For cases then you build fat jar, it would be nice to be able to add this library as dependency.

How to use it in pyspark environment

spark.jars and spark.driver.extraClassPath can‘t help me import pyspark_knn

Does this implementation support hundreds of millions points?

java.lang.NoSuchMethodError:breeze.linalg.Vector$.scalaOf() Lbreeze/linalg/support/ScalaOf

with the same Scala version and breeze 0.12,this error occurred.
At MetricTree.scala:95

Check if the number of points to sample for top-level tree is less than the number of records in training dataset

Just faced the issue and the reason was that the number of points (defaults to 1000) was higher than the number of records in the training dataset. Perhaps obvious for ML practitioners, but I spent few minutes debugging to nail it down.

It'd be nice to know it before fitting a model or get a more user-friendly error message.

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (333.3333333333333) must be on interval [0, 1]
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:148)
	at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:495)
	at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:490)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.sample(RDD.scala:490)
	at org.apache.spark.ml.knn.KNN.fit(KNN.scala:387)

Getting K nearest neighbor for each data point

Would you please advise how to get K nearest neighbors for each data point using the python interface?
I believe this is possible based on " KNN itself is also exposed for advanced usage which returns arbitrary columns associated with found neighbors. "

"main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD

whenever i try to run KNN regression using direct main method I am getting this error "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD but when I try to run text cases it is working fine.as I read different articles i m getting most of suggestion as different versions but i tried by adding main class in
actual code itself after cloning

saurfang / spark-knn Goto Github PK

spark-knn's Issues

Recommend Projects

Recommend Topics

Recommend Org