behrouz-babaki / cop-kmeans

A Python implementation of the COP-KMEANS algorithm

License: MIT License
It is possible that in an iteration of the algorithm, no data point is assigned to some centroid. This produces an empty cluster. Since there is no mechanism for increasing the number of clusters, each reduction in the number of clusters persists until the algorithm converges.
In k-means, we can avoid this situation by checking the number of clusters at each iteration. If there are x empty clusters, we introduce the x data points that are furthest from their assigned centroids as new centroids. This practice is used in the scikit-learn implementation of k-means.
In COP-KMeans, all must-linked points are assigned to the same cluster, so the scheme above must be modified to take this into account. To do so, we can evaluate all must-link blocks and turn the blocks that produce the largest amount of improvement into new clusters. The improvement is measured by the difference between the sum of distances to the old centroid and the sum of distances to the centroid of the block (see the sketch below).
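A rough sketch of this re-seeding step, assuming `ml_blocks` is a list of index arrays, one per must-link block (all names here are hypothetical, not the repository's code):

```python
import numpy as np

def reseed_empty_clusters(X, labels, centers, ml_blocks, n_clusters):
    # For every empty cluster, promote the must-link block whose move
    # yields the largest improvement: the gap between the block's summed
    # distance to its current centroid and to its own block centroid.
    occupied = set(labels)
    for c in (c for c in range(n_clusters) if c not in occupied):
        best_gain, best_block = -np.inf, None
        for block in ml_blocks:
            old_center = centers[labels[block[0]]]  # a block shares one label
            block_center = X[block].mean(axis=0)
            d_old = np.linalg.norm(X[block] - old_center, axis=1).sum()
            d_new = np.linalg.norm(X[block] - block_center, axis=1).sum()
            if d_old - d_new > best_gain:
                best_gain, best_block = d_old - d_new, block
        labels[best_block] = c
        centers[c] = X[best_block].mean(axis=0)
    return labels, centers
```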
The functions `closest_cluster` and `compute_centers` might take a lot of time to run, and the loop that assigns instances to clusters can also incur some computational cost.
We should profile the code on a relatively large dataset (e.g. the digits dataset) and identify the critical parts. Then we have to implement those parts in C. To avoid extra dependencies, we are going to use `ctypes`.
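As a sketch of the `ctypes` route (everything here is hypothetical: the shared library `fastparts.so`, the exported `closest_cluster` symbol, and its signature), the Python side would look roughly like:

```python
import ctypes

# Hypothetical: assumes a C library compiled as ./fastparts.so exporting
#   int closest_cluster(const double *point, const double *centers,
#                       int k, int dim);
lib = ctypes.CDLL("./fastparts.so")
lib.closest_cluster.argtypes = [
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_int,
    ctypes.c_int,
]
lib.closest_cluster.restype = ctypes.c_int

def closest_cluster_c(point, centers):
    dim, k = len(point), len(centers)
    flat = [x for center in centers for x in center]
    p = (ctypes.c_double * dim)(*point)          # marshal to C arrays
    c = (ctypes.c_double * (k * dim))(*flat)
    return lib.closest_cluster(p, c, k, dim)
```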
Using C imposes a limit on the available data structures. We need to implement the sorting algorithm ourselves. But for keeping track of constraints while assigning instances to clusters, it is enough to have k boolean arrays that indicate which instances cannot be added to each cluster. These arrays are updated after each instance is assigned to a cluster.
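Before porting to C, the bookkeeping can be prototyped in Python. A minimal sketch, assuming `cl_graph[i]` gives the cannot-link neighbours of instance `i` and `cluster_order(i)` yields cluster indices sorted by distance (both hypothetical):

```python
import numpy as np

def assign_all(order, cl_graph, cluster_order, n_clusters, n_points):
    # forbidden[c][i] is True once instance i may no longer join
    # cluster c because a cannot-link partner of i was assigned to c.
    forbidden = np.zeros((n_clusters, n_points), dtype=bool)
    labels = np.full(n_points, -1)
    for i in order:
        for c in cluster_order(i):
            if not forbidden[c, i]:
                labels[i] = c
                forbidden[c, list(cl_graph[i])] = True  # update after assignment
                break
        else:
            return None  # no feasible cluster for instance i
    return labels
```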
Sometimes the algorithm does not detect an inconsistency between the constraints.
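A minimal sketch of such a consistency check, assuming the constraints are given as lists of index pairs (the function name is hypothetical): take the transitive closure of the must-links with union-find, then reject any cannot-link whose endpoints fall in the same must-link component.

```python
def constraints_consistent(n, ml, cl):
    parent = list(range(n))

    def find(x):
        # find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in ml:               # transitive closure of must-links
        parent[find(a)] = find(b)
    # a cannot-link inside one must-link component is unsatisfiable
    return all(find(a) != find(b) for a, b in cl)
```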
When using `kmpp` initialisation, the `chances` vector keeps decreasing, until the point at which the following line:
chances = [x/sum(chances) for x in chances]
produces a `ZeroDivisionError` because `sum(chances) == 0`, and execution is broken.
In my case, the following `chances` and `sum(chances)` values were computed across iterations:
> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 20
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0518003539299652, 0.0, 0.0, 1.367544467966324, 0.0, 0.0] 5.199923300196905
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0518003539299652, 0.0, 0.0, 0.0, 0.0, 0.0] 3.8323788322305807
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23905780015564948, 0.0, 0.0, 0.0, 0.0, 0.0] 1.0196362784562647
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23905780015564948, 0.0, 0.0, 0.0, 0.0, 0.0] 0.23905780015564948
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0
As can be seen, `sum(chances)` drops to 0.0 at the end.
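A minimal guard for this, reusing the names from the snippet above (the surrounding function is hypothetical): fall back to a uniform distribution when every remaining weight is zero.

```python
def normalize_chances(chances):
    total = sum(chances)
    if total == 0:
        # all remaining weights vanished; avoid the ZeroDivisionError
        # by falling back to a uniform choice
        return [1.0 / len(chances)] * len(chances)
    return [x / total for x in chances]
```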
I used this algorithm on my dataset, but I found that points in the must-link set end up in two different clusters. Hoping for your reply, thanks.
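The violation can be checked directly on the output, assuming the `cop_kmeans` entry point shown in the repository's README (`clusters` is the returned label list; `X`, `must_link`, and `cannot_link` stand in for the user's data):

```python
from copkmeans.cop_kmeans import cop_kmeans

clusters, centers = cop_kmeans(dataset=X, k=5, ml=must_link, cl=cannot_link)
# every must-link pair should share a label in a valid output
violations = [(a, b) for a, b in must_link if clusters[a] != clusters[b]]
print(violations)
```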
Instead of comparing the clusterings obtained in consecutive iterations, use a tolerance and a maximum number of iterations to verify the convergence of the algorithm.
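A minimal sketch of such a stopping rule (function and parameter names are hypothetical; the defaults are arbitrary):

```python
import numpy as np

def converged(old_centers, new_centers, iteration, tol=1e-4, max_iter=300):
    # stop when the centers move less than `tol`, or when the
    # iteration budget is exhausted
    shift = np.linalg.norm(np.asarray(new_centers) - np.asarray(old_centers))
    return shift <= tol or iteration >= max_iter
```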
Currently, the initialization is completely random. There is an improved initialization mechanism in k-means++ which we might be able to adapt for COP-KMeans.
In particular, we cannot pick two must-linked points as two centroids; the algorithm should be modified to ensure this (a sketch follows below).
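A possible adaptation, sketched under the assumption that `ml_components[i]` gives the must-link component of point `i` (all names hypothetical): run the usual k-means++ distance-weighted sampling, but zero out the weight of any component that already contributed a seed.

```python
import random
import numpy as np

def kmpp_with_ml(X, k, ml_components):
    n = len(X)
    centers = [random.randrange(n)]
    for _ in range(k - 1):
        banned = {ml_components[c] for c in centers}
        # squared distance of each point to its nearest chosen seed
        d2 = np.min([np.sum((X - X[c]) ** 2, axis=1) for c in centers], axis=0)
        # must-linked points of an already-chosen seed get zero weight
        weights = [0.0 if ml_components[i] in banned else d2[i] for i in range(n)]
        total = sum(weights)
        if total == 0:
            break  # fewer feasible seeds than k
        r, acc = random.uniform(0, total), 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                centers.append(i)
                break
    return [X[c] for c in centers]
```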
When the number of clusters is larger than the number of instances, a division-by-zero exception is raised. Handle this error gracefully.
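A graceful guard could simply validate the inputs up front (the function name is hypothetical):

```python
def validate_inputs(dataset, k):
    # fail early with a clear message instead of letting an empty
    # cluster trigger a ZeroDivisionError later on
    if k > len(dataset):
        raise ValueError(
            f"k={k} exceeds the number of instances ({len(dataset)})"
        )
```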
In file: try_kmeans.ipynb
In function: find_closest()
Here you have to update the min_dist variable inside the if-condition.
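The fix looks roughly like this (the body below is a reconstruction, not the notebook's exact code): without the marked line, every later center is compared against the first distance computed.

```python
def find_closest(point, centers):
    min_dist, closest = float('inf'), -1
    for i, center in enumerate(centers):
        d = sum((p - c) ** 2 for p, c in zip(point, center))
        if d < min_dist:
            min_dist = d  # the missing update inside the if-condition
            closest = i
    return closest
```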
Hi,
Is it possible to implement a predict method, like in the sklearn implementation of KMeans?
If you think it is reasonable, I could try to implement it myself.
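A minimal sketch of what such a method could look like (unconstrained nearest-center assignment, as in sklearn's `KMeans.predict`; all names here are hypothetical):

```python
import numpy as np

def predict(X_new, centers):
    # assign each new point to its nearest learned center
    X_new = np.asarray(X_new)
    centers = np.asarray(centers)
    dists = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```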
Consider the following case: we want to cluster x1, x2, x3 into two clusters, with cannot-links (x1, x3) and (x2, x3). If we first assign x1 to cluster 1 and x2 to cluster 2, the algorithm fails because there is no cluster left that x3 can be assigned to.
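The case can be reproduced in a few lines, assuming the `cop_kmeans` signature from the README. Note that a feasible solution does exist here (put x1 and x2 in the same cluster), so this is a limitation of the greedy assignment order rather than an inconsistency in the constraints.

```python
from copkmeans.cop_kmeans import cop_kmeans

X = [[0.0], [0.1], [5.0]]  # x1, x2, x3
clusters, centers = cop_kmeans(dataset=X, k=2, ml=[], cl=[(0, 2), (1, 2)])
print(clusters)  # may be None when the assignment order leaves x3 stranded
```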
Hi Behrouz,
I am looking into your implementation (I am new to COP-KMeans). I would like to train COP-KMeans on a training set with constraints and then apply it to a test set.
Could you please advise what the right way would be to add a predict method to the model?
Also, where can I find the weights for each of the sample vector values? Are they part of the clusters or centers objects?
Many thanks for any input / direction,
eilalan