Code Monkey home page Code Monkey logo

mean-shift-lsh's Introduction

Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing DNNMS-LSH Download

An upgraded version is available in Clustering4Ever repository

This scalable algorithm was created during an internship at Computer Science Laboratory (Laboratoire d'Informatique de Paris Nord, LIPN) at the University of Paris 13, with Lebbah Mustapha, Duong Tarn, Azzag Hanene and Beck Gaël. Its purpose is to provide an efficient distributed implementation to cluster large multivariate multidimensional data sets (Big Data) Nearest neighbor mean shift (NNMS) defines clusters in terms of locally density regions in the data density. The main advantages of NNMS are that it can detect automatically the number of clusters in the data set and detect non-ellipsoidal clusters, in contrast to k-means clustering. Exact nearest neighbors calculations in the standard NNMS prevent from being used on Big Data so we introduce approximate nearest neighbors via Locality Sensitive Hashing (LSH), which are based on random scalar projections of the data. To further improve the scalability, we implement NNMS-LSH in the distributed Spark/Scala ecosystem.

Linked publication

  • G. Beck, T. Duong, H. Azzag and M. Lebbah, "Distributed mean shift clustering with approximate nearest neighbours," 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, 2016, pp. 3110-3115. doi: 10.1109/IJCNN.2016.7727595
  • Tarn Duong, Gael Beck, Hanene Azzag, Mustapha Lebbah. Nearest neighbour estimators of density derivatives, with application to mean shift clustering. Pattern Recognition Letters. doi = http://dx.doi.org/10.1016/j.patrec.2016.06.021

Parameters

  • k is the number of neighbours to look at in order to compute centroid.
  • nbseg is the number of segments on which the data vectors are projected during LSH. Its value should usually be larger than 20, depending on the data set.
  • nbblocs1 is a crucial parameter as larger values give faster but less accurate LSH approximate nearest neighbors, and as smaller values give slower but more accurate approximations.
  • nbblocs2 as larger values give faster but less accurate LSH approximate nearest neighbors, and as smaller values give slower but more accurate approximations.
  • cmin is the threshold under which clusters with fewer than cmin members are merged with the next nearest cluster.
  • normalisation is a flag if the data should be first normalized (X-Xmin)/(Xmax-Xmin) before clustering.
  • w is a uniformisation constant for LSH.
  • ratioToStop is the percentage of data that need to not have converged in order to considere to stop gradient ascent iterations, set to 1.0 to disable it.
  • yStarIter is the maximum number of iterations in the gradient ascent in the mean shift update.
  • epsilon1 is the threshold under which we considere a mod to have converged.
  • epsilon2 is the threshold under which two final mean shift iterates are considered to be in the same cluster.
  • epsilon3 is the threshold under which two final clusters are considered to be the same.
  • nbLabelIter is the number of iteration for the labelisation step, it determines the number of final models

Usage

Multivariate multidimensional clustering

Unlike the image analysis which has a specific data pre-processing before the mean shift, for general multivariate multidimensional data sets, it is recommended to normalize data so that each variable has a comparable magnitude to the other variables to improve the performance in the distance matrix computation used to determine nearest neighbors.

Image analysis

To carry out image analysis, it is recommended to convert the usual color formats (e.g. RGB, CYMK) to the Luv* color space as the close values in the Luv*-space correspond more to visual perceptions of color proximity, as well adding the row and column indices (x,y). Each pixel is transformed to a 5-dimensional vector (x,y,L*, u*, v*) which is then input into the mean shift clustering.

  val sc = new SparkContext(conf)
  val meanShift = msLsh.MsLsh
  val data = sc.textFile("myData.csv")
  val parsedData = data.map(_.split(',').map(_.toDouble)).map(y => (Vectors.dense(y)))
                        .zipWithIndex
                        .map(_.swap)
  
  val models = meanShift.train(  sc,
                          parsedData,
                          k=60,
                          epsilon1=0.001,
                          epsilon2=0.05,
                          epsilon3=0.06,
                          ratioToStop=0.01,
                          yStarIter=10,
                          cmin=0,
                          normalisation=true,
                          w=1,
                          nbseg=100,
                          nbblocs1=50,
                          nbblocs2=50,
                          nbLabelIter=1)  
                          
  // Save result for an image as (ID, Vector, ClusterNumber)
  meanShift.saveImageAnalysis(models.head, "MyImageResultDirectory",1)

  // Save result as (ID, ClusterNumber)
  meanShift.savelabeling(models.head, "MyResultDirectory", 1)

  // Save centroids result as (NumCluster, cardinality, CentroidVector)
  meanShift.saveClusterInfo(models.head, "centroidDirectory")

Image segmentation example

The picture on top left corner is the #117 from Berkeley Segmentation Dataset and Benchmark repository. Others are obtained with :

  • nbblocs1 : 200 (top right) , 500 (bottom left), 1000 (bottom right)
  • nbblocs2 : 1
  • k : 50
  • epsilon2 : 0.05
  • epsilon3 : 0.05 // doesn't matter with nbblocs2 = 1
  • yStarIter : 10
  • nbLabelIter : 1
  • w : 1
  • cmin : 200

alt text

Maintenance

Please feel free to report any bugs, recommendations or requests to [email protected] to improve this algorithm.

mean-shift-lsh's People

Contributors

beckgael avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.