behrouz-babaki / cop-kmeans

A Python implementation of the COP-KMEANS algorithm

License: MIT License
It is possible that in an iteration of the algorithm, no data point is assigned to some centroid. This produces an empty cluster. Since there is no mechanism for increasing the number of clusters, each reduction in the number of clusters persists until the algorithm converges.
In k-means, we can avoid this situation by checking the number of clusters at each iteration. If there are x empty clusters, we introduce the x data points that are furthest from their assigned centroids as new centroids. This practice is used in the scikit-learn implementation of k-means.
In COP-KMeans, all must-linked points are assigned to the same cluster, so the scheme above must be modified to take this into account. To do so, we can evaluate all must-link blocks and turn the blocks that produce the largest amount of improvement into new clusters. The improvement is measured by the difference between the sum of distances to the old centroid and the sum of distances to the centroid of the block (see the sketch below).
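A rough sketch of this re-seeding step, assuming `ml_blocks` is a list of index arrays, one per must-link block (all names here are hypothetical, not the repository's code):

```python
import numpy as np

def reseed_empty_clusters(X, labels, centers, ml_blocks, n_clusters):
    # For every empty cluster, promote the must-link block whose move
    # yields the largest improvement: the gap between the block's summed
    # distance to its current centroid and to its own block centroid.
    occupied = set(labels)
    for c in (c for c in range(n_clusters) if c not in occupied):
        best_gain, best_block = -np.inf, None
        for block in ml_blocks:
            old_center = centers[labels[block[0]]]  # a block shares one label
            block_center = X[block].mean(axis=0)
            d_old = np.linalg.norm(X[block] - old_center, axis=1).sum()
            d_new = np.linalg.norm(X[block] - block_center, axis=1).sum()
            if d_old - d_new > best_gain:
                best_gain, best_block = d_old - d_new, block
        labels[best_block] = c
        centers[c] = X[best_block].mean(axis=0)
    return labels, centers
```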
The functions `closest_cluster` and `compute_centers` might take a lot of time to run, and the loop that assigns instances to clusters can also incur some computational cost.
We should profile the code on a relatively large dataset (e.g. the digits dataset) and identify the critical parts. Then we have to implement those parts in C. To avoid extra dependencies, we are going to use `ctypes`.
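As a sketch of the `ctypes` route (everything here is hypothetical: the shared library `fastparts.so`, the exported `closest_cluster` symbol, and its signature), the Python side would look roughly like:

```python
import ctypes

# Hypothetical: assumes a C library compiled as ./fastparts.so exporting
#   int closest_cluster(const double *point, const double *centers,
#                       int k, int dim);
lib = ctypes.CDLL("./fastparts.so")
lib.closest_cluster.argtypes = [
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_int,
    ctypes.c_int,
]
lib.closest_cluster.restype = ctypes.c_int

def closest_cluster_c(point, centers):
    dim, k = len(point), len(centers)
    flat = [x for center in centers for x in center]
    p = (ctypes.c_double * dim)(*point)          # marshal to C arrays
    c = (ctypes.c_double * (k * dim))(*flat)
    return lib.closest_cluster(p, c, k, dim)
```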
Using C imposes a limit on the available data structures. We need to implement the sorting algorithm ourselves. But for keeping track of constraints while assigning instances to clusters, it is enough to have k boolean arrays that indicate which instances cannot be added to each cluster. These arrays are updated after each instance is assigned to a cluster.
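Before porting to C, the bookkeeping can be prototyped in Python. A minimal sketch, assuming `cl_graph[i]` gives the cannot-link neighbours of instance `i` and `cluster_order(i)` yields cluster indices sorted by distance (both hypothetical):

```python
import numpy as np

def assign_all(order, cl_graph, cluster_order, n_clusters, n_points):
    # forbidden[c][i] is True once instance i may no longer join
    # cluster c because a cannot-link partner of i was assigned to c.
    forbidden = np.zeros((n_clusters, n_points), dtype=bool)
    labels = np.full(n_points, -1)
    for i in order:
        for c in cluster_order(i):
            if not forbidden[c, i]:
                labels[i] = c
                forbidden[c, list(cl_graph[i])] = True  # update after assignment
                break
        else:
            return None  # no feasible cluster for instance i
    return labels
```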
Sometimes the algorithm does not detect an inconsistency between the constraints.
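A minimal sketch of such a consistency check, assuming the constraints are given as lists of index pairs (the function name is hypothetical): take the transitive closure of the must-links with union-find, then reject any cannot-link whose endpoints fall in the same must-link component.

```python
def constraints_consistent(n, ml, cl):
    parent = list(range(n))

    def find(x):
        # find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in ml:               # transitive closure of must-links
        parent[find(a)] = find(b)
    # a cannot-link inside one must-link component is unsatisfiable
    return all(find(a) != find(b) for a, b in cl)
```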
When using `kmpp` initialisation, the `chances` vector keeps decreasing, until the point at which the following line:
chances = [x/sum(chances) for x in chances]
produces a `ZeroDivisionError` because `sum(chances) == 0`, and execution is broken.
In my case, the following `chances` and `sum(chances)` values were computed across iterations:
> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 20
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0518003539299652, 0.0, 0.0, 1.367544467966324, 0.0, 0.0] 5.199923300196905
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0518003539299652, 0.0, 0.0, 0.0, 0.0, 0.0] 3.8323788322305807
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.7805784783006152, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23905780015564948, 0.0, 0.0, 0.0, 0.0, 0.0] 1.0196362784562647
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23905780015564948, 0.0, 0.0, 0.0, 0.0, 0.0] 0.23905780015564948
> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0
As can be seen, `sum(chances)` drops to 0.0 at the end.
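A minimal guard for this, reusing the names from the snippet above (the surrounding function is hypothetical): fall back to a uniform distribution when every remaining weight is zero.

```python
def normalize_chances(chances):
    total = sum(chances)
    if total == 0:
        # all remaining weights vanished; avoid the ZeroDivisionError
        # by falling back to a uniform choice
        return [1.0 / len(chances)] * len(chances)
    return [x / total for x in chances]
```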
I used this algorithm on my dataset, but I found that points in the must-link set end up in two different clusters. Hoping for your reply, thanks.
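The violation can be checked directly on the output, assuming the `cop_kmeans` entry point shown in the repository's README (`clusters` is the returned label list; `X`, `must_link`, and `cannot_link` stand in for the user's data):

```python
from copkmeans.cop_kmeans import cop_kmeans

clusters, centers = cop_kmeans(dataset=X, k=5, ml=must_link, cl=cannot_link)
# every must-link pair should share a label in a valid output
violations = [(a, b) for a, b in must_link if clusters[a] != clusters[b]]
print(violations)
```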
Instead of comparing the clusterings obtained in consecutive iterations, use a tolerance and a maximum number of iterations to verify the convergence of the algorithm.
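A minimal sketch of such a stopping rule (function and parameter names are hypothetical; the defaults are arbitrary):

```python
import numpy as np

def converged(old_centers, new_centers, iteration, tol=1e-4, max_iter=300):
    # stop when the centers move less than `tol`, or when the
    # iteration budget is exhausted
    shift = np.linalg.norm(np.asarray(new_centers) - np.asarray(old_centers))
    return shift <= tol or iteration >= max_iter
```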
Currently, the initialization is completely random. There is an improved initialization mechanism in k-means++ which we might be able to adapt for COP-KMeans.
In particular, we cannot pick two must-linked points as two centroids; the algorithm should be modified to ensure this (a sketch follows below).
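A possible adaptation, sketched under the assumption that `ml_components[i]` gives the must-link component of point `i` (all names hypothetical): run the usual k-means++ distance-weighted sampling, but zero out the weight of any component that already contributed a seed.

```python
import random
import numpy as np

def kmpp_with_ml(X, k, ml_components):
    n = len(X)
    centers = [random.randrange(n)]
    for _ in range(k - 1):
        banned = {ml_components[c] for c in centers}
        # squared distance of each point to its nearest chosen seed
        d2 = np.min([np.sum((X - X[c]) ** 2, axis=1) for c in centers], axis=0)
        # must-linked points of an already-chosen seed get zero weight
        weights = [0.0 if ml_components[i] in banned else d2[i] for i in range(n)]
        total = sum(weights)
        if total == 0:
            break  # fewer feasible seeds than k
        r, acc = random.uniform(0, total), 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                centers.append(i)
                break
    return [X[c] for c in centers]
```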
When the number of clusters is larger than the number of instances, a division-by-zero exception is raised. Handle this error gracefully.
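A graceful guard could simply validate the inputs up front (the function name is hypothetical):

```python
def validate_inputs(dataset, k):
    # fail early with a clear message instead of letting an empty
    # cluster trigger a ZeroDivisionError later on
    if k > len(dataset):
        raise ValueError(
            f"k={k} exceeds the number of instances ({len(dataset)})"
        )
```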
In file: try_kmeans.ipynb
In function: find_closest()
Here you have to update the min_dist variable inside the if-condition.
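The fix looks roughly like this (the body below is a reconstruction, not the notebook's exact code): without the marked line, every later center is compared against the first distance computed.

```python
def find_closest(point, centers):
    min_dist, closest = float('inf'), -1
    for i, center in enumerate(centers):
        d = sum((p - c) ** 2 for p, c in zip(point, center))
        if d < min_dist:
            min_dist = d  # the missing update inside the if-condition
            closest = i
    return closest
```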
Hi,
Is it possible to implement a predict method, like in the sklearn implementation of KMeans?
If you think it is reasonable, I could try to implement it myself.
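A minimal sketch of what such a method could look like (unconstrained nearest-center assignment, as in sklearn's `KMeans.predict`; all names here are hypothetical):

```python
import numpy as np

def predict(X_new, centers):
    # assign each new point to its nearest learned center
    X_new = np.asarray(X_new)
    centers = np.asarray(centers)
    dists = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```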
Consider the following case: we want to cluster x1, x2, x3 into two clusters, with cannot-links (x1, x3) and (x2, x3). If we first assign x1 to cluster 1 and x2 to cluster 2, the algorithm fails because there is no cluster left that x3 can be assigned to.
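The case can be reproduced in a few lines, assuming the `cop_kmeans` signature from the README. Note that a feasible solution does exist here (put x1 and x2 in the same cluster), so this is a limitation of the greedy assignment order rather than an inconsistency in the constraints.

```python
from copkmeans.cop_kmeans import cop_kmeans

X = [[0.0], [0.1], [5.0]]  # x1, x2, x3
clusters, centers = cop_kmeans(dataset=X, k=2, ml=[], cl=[(0, 2), (1, 2)])
print(clusters)  # may be None when the assignment order leaves x3 stranded
```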
Hi Behrouz,
I am looking into your implementation (I am new to COP-KMeans). I would like to train COP-KMeans on a training set with constraints and then apply it to a test set.
Could you please advise what the right way would be to add a predict method to the model?
Also, where can I find the weights for each of the sample vector values? Are they part of the clusters or centers objects?
Many thanks for any input / direction,
eilalan