saurfang / spark-knn Goto Github PK
View Code? Open in Web Editor NEWk-Nearest Neighbors algorithm on Spark
License: Apache License 2.0
k-Nearest Neighbors algorithm on Spark
License: Apache License 2.0
Do you think it is would be easy to add a Python interface for this module?
I might try to do this for my research but I am not sure yet what is the best way to support Python. How would the code be sent to Spark? My best idea is that it would require to add a Python egg with a wrapper of this module and a jar with the Scala source to the spark-submit command (since Python users usually don't use SBT).
I've noticed that the dataset return by knnModel.transform(dataset) only contains the neighbors without their distance (which is actually available in the tree and thus really easy to fetch). Would it be possible to add an option so that the distance might be returned if asked ?
I'm actually implementing recommendation based on KNN, and having the possibility to weigh each recommendation based on it's distance would be really interesting.
PS: a compatibility with spark 2.0 would also be really appreciated
provide custom distance function instead of using fixed Euclidean distance, e.g.
def distance[T](point1:T , point2:T):Double
After loading the sample_libsvm_data.txt,
val knn=new KNNClassifier() gives an error saying KNNClassifer is not found.
How to use that line? Directly in command prompt ?
And where to save the spark-knn folder..?
It appears that spark-knn
needs to transform dense vectors into their sparse form. This creates a limitation when using spark-knn
for very wide, sparse datasets such as document-term matrices used in NLP.
Is there any interest in supporting sparse vectors within spark-knn
?
Hi,
The code commets said that we could get the exact nearest neighbors by tuning parameters. But I haven't figured out how to do it.
I tried to set tau to 0.0, but didn't get the exact result. I think the reason is that KNN.searchIndices of the topTree doesn't always return all the possible partition ids which contains nearest neighbors.
Of course, I could rewrite the KNN.searchIndices to make sure all possible partition ids are included. But I still want to know if there is a way to get exact nearest neighbors by just tuning parameters in the existing code.
Thanks!
while fitting training data, on what parameter does top tree size, leave size and sub tree leave size depends?
I saved a model(KNNClassificationModel) using java serialization and when I use it later, I always get java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements.
on the dataframe output of the model.transform(inputDataFrame).
Is there a better way of saving and using model? like support for MLWritable/Saveable traits. In our use case, we create a model and use it later
Hi:
Here is my code:
package com.jd.anti.tool.ad
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.classification.KNNClassifier
import org.apache.spark.ml.regression.KNNRegression
object KNNForTest {
def main(argv: Array[String]) = {
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("com.ad.KNNForTest")
val sc = new SparkContext(conf)
//read in raw label and features
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
val knn = new KNNRegression()
.setTopTreeSize(training.count().toInt / 500)
.setK(10)
val knnModel = knn.fit(training)
val predicted = knn.transform(training)
}
}
And I add
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "saurfang" % "spark-knn" % "0.2.0"
in my sbt build file.
What should I do to deal with this problem?
Thanks!
Hi,
After running KNN example to get nearest neighbor, and in this line
https://github.com/saurfang/spark-knn/blob/v0.3.0/spark-knn-core/src/test/scala/org/apache/spark/ml/knn/KNNSuite.scala#L40, I got "Caused by: java.lang.ArrayStoreException".
What should I do?
Thanks.
Facing issue while running spark-knn-examples/src/main/scalacom/github/saurfang/spark/ml/knn/examples/MNIST.scala
I had ran the commands in spark-shell.
enivronment: spark: 2.1.0
scala: 2.11.8
scala> val pipeline = new Pipeline().setStages(Array(knn_KA)).fit(train)
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:122)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
... 52 elided
On looking at the dataset schema
scala> dataset.schema
res14: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
it shows that features are of type org.apache.spark.mllib.linalg.VectorUDT
Could this be the reason for the error?
Is it only me who is getting error while running the example?
Note: data/mnist/mnist.bz2 has no content. Hence I took mnist.bz2 from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
Is there a way to extract the K nearest neighbors from Training samples from the KNN model in the Scala version?
Can anybody help me with this?
I used the python related part on README.md :
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.ml.classification.KNNClassifier.
: java.lang.NoSuchMethodError: org.apache.spark.ml.param.shared.HasInputCols.$init$(Lorg/apache/spark/ml/param/shared/HasInputCols;)V
at org.apache.spark.ml.classification.KNNClassifier.<init>(KNNClassifier.scala:23)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
As part our CICD pipeline, we have a daily build that runs on relatively small amounts of data. As part of this, we discovered an interesting bug; as part of the method estimateTau, there is the following line:
val y = DenseVector(estimators.map { case (_, d) => math.log(d) })
In this case, d is the average distance between points. We are finding that on the small data used in our daily build, beta can exceed 0. When this happens, yMax, which is defined as:
val yMax = breeze.linalg.max(y)
is below negative one, and subsequently used as the bufferSize.
Specifically, the following appears in the log:
ERROR KNN: Unable to estimate Tau with positive beta: 0.1577160047542901. This maybe because data is too small.
Setting to -1.3153582722102333 which is the maximum average distance we found in the sample.
This may leads to poor accuracy. Consider manually set bufferSize instead.
You can also try setting balanceThreshold to zero so only metric trees are built.
(this does not cause the code to stop, and it continues)
Exception in thread "main" java.lang.IllegalArgumentException: knn_2166a4d536d3 parameter bufferSize given invalid value -1.3153582722102333
This then causes an error and the pipeline stops.
From my understanding, very low average distances would always cause errors if beta exceeds 0.
Hi saurfang,
I find the release here, but the
// https://mvnrepository.com/artifact/saurfang/spark-knn
libraryDependencies += "saurfang" % "spark-knn" % "0.3.0"
Not working for me.
It would be nice to get a release of this so that it can be used with Scala 2.12 and Spark 2.4.X.
Hi!
I've been studying your work but I have a problem with the example that you show in the main Readme.
It throws me a compiation error in toDF() method:
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
And it also throws compilation errors in the last two lines, in the fit and transform methods, with the same reason.
Could you take a look?
Thx
If I try run example in Spark 1.6.0 I get exception:
ERROR [apache.spark.executor.Executor:logError :95 ] - Exception in task 0.0 in stage 7.0 (TID 7)sg
java.lang.NoSuchMethodError: org.apache.spark.ml.classification.MultiClassSummarizer.add(D)Lorg/apache/spark/ml/classification/MultiClassSummarizer;
at org.apache.spark.ml.classification.KNNClassifier$$anonfun$2.apply(KNNClassifier.scala:91)
at org.apache.spark.ml.classification.KNNClassifier$$anonfun$2.apply(KNNClassifier.scala:89)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Scala version 2.10.6
Hello. I'm almost constantly getting the error "Parameter bufferSize given invalid value NaN." while training the KNN model. I suspect this might be caused by me incorrectly setting the setBufferSizeSampleSizes
, although I have no idea what should be the correct values there. My dataset has about 3800 data points (it is a dataset that's supposed to be used only for testing). I'm trying various ranges, with big or small values, but none helps. Is there any guideline on how many and how big those samples should be? For now I am setting the bufferSize manually, instead of having it estimated.
Thanks!
when I try to run the test I get this error :
TypeError Traceback (most recent call last)
in ()
3
4
----> 5 knn = KNNClassifier(k=1, topTreeSize=1, topTreeLeafSize=1, subTreeLeafSize=1, bufferSizeSampleSize=[1, 2, 3]) # bufferSize=-1.0,
6 model = knn.fit(training_data)
7
C:\spark-2.2.1-bin-hadoop2.7\python\pyspark_init_.pyc in wrapper(self, *args, **kwargs)
102 raise TypeError("Method %s forces keyword arguments." % func.name)
103 self._input_kwargs = kwargs
--> 104 return func(self, **kwargs)
105 return wrapper
106
C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\ml\knn.py in init(self, featuresCol, labelCol, predictionCol, seed, topTreeSize, topTreeLeafSize, subTreeLeafSize, bufferSize, bufferSizeSampleSize, balanceThreshold, k, neighborsCol, maxNeighbors, rawPredictionCol, probabilityCol)
17 super(KNNClassifier, self).init()
18 self._java_obj = self._new_java_obj(
---> 19 "org.apache.spark.ml.knn.KNNClassifier", self.uid)
20
21 self.topTreeSize = Param(self, "topTreeSize", "number of points to sample for top-level tree")
C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\ml\wrapper.pyc in _new_java_obj(java_class, *args)
61 java_obj = getattr(java_obj, name)
62 java_args = [_py2java(sc, arg) for arg in args]
---> 63 return java_obj(*java_args)
64
65 @staticmethod
TypeError: 'JavaPackage' object is not callable
Can you re-publish the package to be built against Spark 1.6 ? We'd like to expose knn to sparklyr.
followed whatever was there
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
val knn = new KNNClassifier()
.setTopTreeSize(training.count().toInt / 500)
.setK(10)
1st error : TopTreeSize is invalid 0 (since total count of training sample is 100)
let say we set manually TreeSize as 1
then it throws an exception while running knn.fit(training)
java.util.NoSuchElementException: Failed to find a default value for inputCols
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:651)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
at org.apache.spark.ml.param.Params$class.$(params.scala:658)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:383)
Hello,
I have noticed a weird behaviour. For some reason, the algorithm counts each point as its' own neighbour, i. e. when using KNN(1), the only neighbour the algorithm finds is the point itself. Can some bad setting of a parameter cause this? Or is it a "feature"?
Hi,
Thanks for the amazing work!
I have two dataframes, A has about 200 Million points and B has about 10 Million points, I want to find the nearest neighbor for every point in A from B, I want to do this preferably in Python, how can I achieve it using this library?
I ran KNNClassifier on my local machine with 5000 rows data, and I got stackoverflow errors. The version of this KNN is v0.1.1. How to avoid this stackoverflow?
I run the example, but get
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (10.0) must be on interval [0, 1]
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.util.random.BernoulliSampler.(RandomSampler.scala:149)
at org.apache.spark.rdd.RDD$$anonfun$sample$1.apply(RDD.scala:430)
at org.apache.spark.rdd.RDD$$anonfun$sample$1.apply(RDD.scala:425)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.sample(RDD.scala:425)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:317)
at org.apache.spark.ml.classification.KNNClassifier.train(KNNClassifier.scala:109)
at org.apache.spark.ml.classification.KNNClassifier.train(KNNClassifier.scala:24)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at ict.knn.JavaKNN.main(JavaKNN.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
val knn = new KNNRegression().setTopTreeSize(df.count().toInt / 500)
.setFeaturesCol("features")
.setPredictionCol("predicted")
.setK(1)
This throws the following exception,
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.param.shared.HasInputCols$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
For cases then you build fat jar, it would be nice to be able to add this library as dependency.
spark.jars and spark.driver.extraClassPath can‘t help me import pyspark_knn
with the same Scala version and breeze 0.12,this error occurred.
At MetricTree.scala:95
Just faced the issue and the reason was that the number of points (defaults to 1000
) was higher than the number of records in the training dataset. Perhaps obvious for ML practitioners, but I spent few minutes debugging to nail it down.
It'd be nice to know it before fitting a model or get a more user-friendly error message.
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (333.3333333333333) must be on interval [0, 1]
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:148)
at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:495)
at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:490)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.sample(RDD.scala:490)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:387)
Would you please advise how to get K nearest neighbors for each data point using the python interface?
I believe this is possible based on " KNN itself is also exposed for advanced usage which returns arbitrary columns associated with found neighbors. "
whenever i try to run KNN regression using direct main method I am getting this error "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD but when I try to run text cases it is working fine.as I read different articles i m getting most of suggestion as different versions but i tried by adding main class in
actual code itself after cloning
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.