
cop-kmeans's People

Contributors

behrouz-babaki, kensuke-mitsuzawa


cop-kmeans's Issues

Extend the basic version: options for dealing with empty clusters

It is possible that in an iteration of the algorithm no data point is assigned to a centroid, which produces empty clusters. Since there is no mechanism for increasing the number of clusters, any such reduction in the number of clusters will persist until the algorithm converges.

In K-Means, we can avoid this situation by checking the number of clusters at each iteration. If there are x empty clusters, we introduce the x data points that are furthest from their assigned centroids as new centroids. This practice is used in the scikit-learn implementation of K-Means.

In COP-Kmeans, all must-linked points are assigned to the same cluster, so the scheme above has to be modified to take this into account. To do so, we can evaluate all must-link blocks and turn the blocks that yield the largest improvement into new clusters. The improvement is measured by the difference between the sum of distances to the old centroid and the sum of distances to the centroid of the block.
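The improvement measure can be sketched as follows (a minimal illustration, not the repository code; `block_improvement`, `dist`, and `centroid` are illustrative names, and points are assumed to be plain tuples):

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    # component-wise mean of a non-empty list of tuples
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def block_improvement(block_points, old_center):
    # improvement = (sum of distances to the old centroid)
    #             - (sum of distances to the block's own centroid)
    new_center = centroid(block_points)
    return (sum(dist(p, old_center) for p in block_points)
            - sum(dist(p, new_center) for p in block_points))
```

The must-link blocks with the largest `block_improvement` values would then be promoted to new clusters, one per empty cluster.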

Implement the code in C with the same functionality

The functions closest_cluster and compute_centers might take a long time to run. The loop that assigns instances to clusters can also incur a significant computational cost.

We should profile the code on a relatively large dataset (e.g. the digits dataset) and identify the critical parts. Then we have to implement those parts in C. To avoid extra dependencies, we are going to use ctypes.

Using C imposes a limit on the available data structures. We need to implement the sorting algorithm ourselves. But for keeping track of constraints while assigning instances to clusters, it is enough to have k boolean arrays that indicate which instances cannot be added to each cluster. These arrays are updated after each instance is assigned to a cluster.
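The planned data structure can be mocked up in Python before porting it to C (illustrative names; `cannot_link` is assumed to map an instance index to its cannot-link partners):

```python
def make_forbidden(k, n):
    # one boolean array per cluster: forbidden[c][i] is True when
    # instance i may no longer be added to cluster c
    return [[False] * n for _ in range(k)]

def record_assignment(i, c, forbidden, cannot_link):
    # after putting instance i into cluster c, every cannot-link
    # partner of i becomes forbidden in that cluster
    for j in cannot_link.get(i, ()):
        forbidden[c][j] = True
```

In C the same structure is a flat `char` array of size k*n, so the update after each assignment stays O(number of partners).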

ZeroDivisionError: float division by zero with kmpp initialiser

When using kmpp initialisation, the chances vector keeps decreasing until the following code line:

chances = [x/sum(chances) for x in chances]

raises a ZeroDivisionError because sum(chances) == 0, and execution is interrupted.

In my case, the following chances and sum(chances) values were computed across iterations:

> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 20
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0518003539299652, 0.0, 0.0, 1.367544467966324, 0.0, 0.0] 5.199923300196905
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0518003539299652, 0.0, 0.0, 0.0, 0.0, 0.0] 3.8323788322305807
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23905780015564948, 0.0, 0.0, 0.0, 0.0, 0.0] 1.0196362784562647
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23905780015564948, 0.0, 0.0, 0.0, 0.0, 0.0] 0.23905780015564948
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0

As can be seen, sum(chances) plummets to 0.0 at the end.
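One possible guard (an assumption about a fix, not the project's chosen solution) is to fall back to a uniform distribution when every chance is zero:

```python
def normalize(chances):
    # turn raw k-means++ chances into a probability distribution,
    # falling back to uniform when all chances have collapsed to zero
    total = sum(chances)
    if total == 0:
        return [1.0 / len(chances)] * len(chances)
    return [x / total for x in chances]
```

This keeps the seeding loop running; whether a uniform fallback is the right behaviour (versus stopping early with fewer seeds) is a design decision for the maintainers.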

Extend the basic version: options for initialization

Currently, the initialization is completely random. k-means++ provides an improved initialization mechanism, which we might be able to adapt for COP-Kmeans.

We cannot pick two must-linked points as two centroids; the algorithm should be modified to ensure this.
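A sketch of that modification (illustrative names, not the repository code: `chances` is the k-means++ sampling weight per point, and `ml` maps a point index to its must-link partners): once a point is chosen as a centroid, its whole must-link block is removed from the candidate pool.

```python
import random

def pick_centroid(chances, ml, rng=random):
    # draw an index with probability proportional to its chance
    total = sum(chances)
    r = rng.uniform(0, total)
    chosen = len(chances) - 1  # fallback for floating-point edge cases
    acc = 0.0
    for i, c in enumerate(chances):
        acc += c
        if r <= acc:
            chosen = i
            break
    # a must-linked partner of a centroid can never be another centroid
    chances[chosen] = 0.0
    for j in ml.get(chosen, ()):
        chances[j] = 0.0
    return chosen
```

Zeroing out whole blocks is also one way the chances vector can collapse to all zeros, which ties this issue to the ZeroDivisionError report above.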

Exception is raised when k > n

When the number of clusters is larger than the number of instances, a division-by-zero exception is raised. This error should be handled gracefully.
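A minimal guard for this case (an illustrative sketch; the actual fix and message may differ) is to validate the inputs before clustering starts:

```python
def check_inputs(n_points, k):
    # fail early with a clear message instead of a ZeroDivisionError
    # deep inside the clustering loop
    if k > n_points:
        raise ValueError(
            'cannot form %d clusters from %d data points' % (k, n_points))
```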

Issue in kmeans code

In file: try_kmeans.ipynb

In function: find_closest()

Here you have to update the min_dist variable inside the if-condition.
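The report presumably refers to a nearest-center loop like the following; since the notebook itself is not shown, this is a sketch of the corrected logic, where `min_dist` must be updated whenever a closer center is found:

```python
def find_closest(point, centers, dist):
    # return the index of the center nearest to point
    min_index, min_dist = 0, dist(point, centers[0])
    for i, c in enumerate(centers[1:], start=1):
        d = dist(point, c)
        if d < min_dist:
            min_dist = d   # the missing update: without it, every later
            min_index = i  # center is compared against the first one only
    return min_index
```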

Predict method

Hi,

Would it be possible to implement a predict method like the one in the sklearn implementation of KMeans?

If you think it is reasonable, I could try to implement it myself.

Algorithm fails without conflict constraint

Consider the following case:

We want to cluster x1, x2, x3 into two clusters,
with cannot-link constraints (x1, x3) and (x2, x3).

If we first assign x1 to cluster 1 and x2 to cluster 2,
the algorithm fails because there is no cluster left for x3 to be assigned to.
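The failure can be reproduced with a toy one-dimensional sketch of the greedy assignment step (a hypothetical helper, not the repository code): a valid clustering {x1, x2}, {x3} exists, but the assignment order makes it unreachable.

```python
def cop_assign(points, centers, cannot_link):
    # greedily assign each (name, x) point, in the given order, to the
    # nearest center whose cluster violates no cannot-link constraint
    clusters = [set() for _ in centers]
    for name, x in points:
        by_distance = sorted(range(len(centers)),
                             key=lambda c: abs(x - centers[c]))
        for c in by_distance:
            if not any((name, q) in cannot_link or (q, name) in cannot_link
                       for q in clusters[c]):
                clusters[c].add(name)
                break
        else:
            return None  # no feasible cluster left for this point
    return clusters
```

Assigning x3 first avoids the failure, which suggests that reordering the points or backtracking could make the assignment step more robust.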

What would be the right way to add predict to your COP-Kmeans implementation?

Hi Behrouz,

I am looking into your implementation (I am new to COP-Kmeans). I would like to train COP-Kmeans on a training set with constraints and then apply it to a test set.

Could you please advise what would be the right way to add a predict method to the model?
Where can I find the weights for each of the sample vector values? Are they part of the clusters or centers objects?

Many thanks for any input / direction,
eilalan
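On the weights question: COP-Kmeans keeps no per-sample weights; the fitted model is essentially the final list of cluster centers. A predict method could therefore assign each new sample to its nearest centroid, as scikit-learn's KMeans.predict does (a sketch, not part of this repository; the constraints would apply only at training time):

```python
import math

def predict(samples, centers):
    # label each sample with the index of the nearest learned center
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(range(len(centers)), key=lambda c: dist(s, centers[c]))
            for s in samples]
```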
