
classix's People

Contributors

chenxinye, guettel, kianmeng


classix's Issues

Avoid recomputation of PCA and distance computations in .explain()

The current implementation of .explain() performs superfluous computations which could be avoided:

(i) Currently, every call to .explain() computes the two leading principal components of the data. This should be done only once, when .fit(data) is called. For CLASSIX itself only the first principal component is needed, but there is no harm in computing the second component as well and storing it for .explain().

(ii) The current .explain(ind1,ind2) method computes a shortest path between the group centres, recomputing all the distances between them. This is unnecessary and can become very slow, especially for clusters with many points (e.g. in image segmentation). The distances between overlapping groups have already been computed in the merging phase, and all that is needed is the adjacency matrix of overlapping groups.

Both of these optimizations are implemented in the MATLAB code
https://github.com/nla-group/classix-matlab/blob/main/classix.m
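To illustrate point (ii), here is a minimal sketch (not the actual CLASSIX implementation) of finding a path between two groups with a plain breadth-first search, once the adjacency of overlapping groups from the merging phase is available; the `adjacency` structure is hypothetical:

```python
# Hedged sketch of point (ii): given the adjacency of overlapping groups
# (already known after the merging phase), a path between two groups can be
# found by BFS -- no distances need to be recomputed.
from collections import deque

def group_path(adjacency, start, goal):
    """Return a list of group indices from start to goal, or None if the
    groups lie in different connected components.

    adjacency: dict mapping each group index to the groups it overlaps with.
    """
    prev = {start: None}
    queue = deque([start])
    while queue:
        g = queue.popleft()
        if g == goal:
            path = []
            while g is not None:      # walk predecessors back to start
                path.append(g)
                g = prev[g]
            return path[::-1]
        for nb in adjacency.get(g, ()):
            if nb not in prev:
                prev[nb] = g
                queue.append(nb)
    return None
```

For example, with `adjacency = {0: [1], 1: [0, 2], 2: [1], 3: []}`, `group_path(adjacency, 0, 2)` returns `[0, 1, 2]` and `group_path(adjacency, 0, 3)` returns `None`.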

Allow indexing with data frame labels

Assume I have a data frame df like

Anna   0.3   -0.1   0.5
Bert   0.0   -0.2   0.7
Carl  -0.8   -0.1   0.2

where the first column is the index. It would be nice to be able to do

clx = CLASSIX(radius=0.5)
clx.fit(df)
clx.explain('Anna', 'Bert')

and get an output similar to

The data point 'Anna' is in group 0, which has been merged into cluster 0.
The data point 'Bert' is in group 1, which has been merged into cluster 1.
There is no path of overlapping groups between these clusters.

The table of groups could contain an additional column for the label:

-------------------------------------------------
 Group  Label   NrPts  Cluster  Coordinates  
   0    'Anna'  1      0        0.3   -0.1   0.5
   1    'Bert'  2      1        0.0   -0.2   0.7
-------------------------------------------------

The plot function could also use the labels instead of numerical point indices, but there should probably be an option to revert to numerical indices for the starting points in case the plot gets too cluttered.
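The label lookup itself could be quite simple: a sketch under the assumption that .fit() records the data frame's index at fit time (`label_to_idx` and `resolve` are hypothetical names, not part of the CLASSIX API):

```python
# Hypothetical label resolution: translate a row label (or a plain integer
# index) into the positional index that explain() uses internally.
labels = ["Anna", "Bert", "Carl"]            # df.index captured during fit
label_to_idx = {lab: i for i, lab in enumerate(labels)}

def resolve(key):
    # Strings are looked up as labels; integers pass through unchanged.
    return label_to_idx[key] if isinstance(key, str) else key
```

With this in place, explain('Anna', 'Bert') would internally become explain(0, 1).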

Failed installation

Some users cannot install CLASSIX via pip install classix and report installation errors. I told them that classix is also the name of another package on PyPI. We should instead recommend pip install ClassixClustering.

Cython fail on windows 11

When importing anything from classix e.g.

from classix import CLASSIX

I get the following output

Cython fail.
Cython fail.

I'm running Windows 11 and see this on both a pip installed version of classix and from doing python setup.py install from the GitHub repo.

However, everything else seems to be fine. Execution has fallen back to the pure Python versions, and it runs pretty quickly:

Using the standard Windows Python install (https://www.python.org/downloads/windows/) with the whole stack pip installed, the fit below on 2,000,000 10-dimensional points takes 11.2 seconds on my machine.

from sklearn import datasets
from classix import CLASSIX
import numpy as np
import matplotlib.pyplot as plt 

X, y = datasets.make_blobs(n_samples=2000000, centers=4, n_features=10, random_state=1) 

clx = CLASSIX(sorting='pca', verbose=0)
clx.fit(X)

I'm wondering what speed differences I'd see with the Cython version? Also, can I expect the results to be identical?

Adjust minPts without redoing aggregation

Our recommendation for choosing the hyperparameters is to start with radius=1, reduce it until the number of clusters is only slightly larger than expected, and then increase minPts until the right number of clusters is obtained.

Could we add a function like

clx.minPtsChange(minPts)

which just redoes the minPts phase without repeating the earlier phases?

Note that, if mergeTinyGroups==False, then the group merging also needs to be redone because it depends on minPts. Otherwise, it doesn't.
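A rough sketch of what such a function could do, assuming the cluster label array from the previous fit is still available (the reassignment here goes to the largest surviving cluster for illustration; the real method would reassign groups to the nearest sufficiently large cluster):

```python
# Hedged sketch of a minPts-only recomputation on a kept label array.
# Clusters with fewer than minPts points are dissolved and their points
# reassigned to a surviving cluster.
from collections import Counter

def redo_minpts(labels, minPts):
    counts = Counter(labels)
    survivors = {c for c, n in counts.items() if n >= minPts}
    if not survivors:                 # nothing survives; leave labels as-is
        return list(labels)
    fallback = max(survivors, key=lambda c: counts[c])
    return [l if l in survivors else fallback for l in labels]
```

For example, redo_minpts([0, 0, 0, 1, 1, 2], 2) dissolves cluster 2 (a single point) and returns [0, 0, 0, 1, 1, 0].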

Error using Scipy 1.8.0

This code works fine with numpy 1.22.2 and Scipy 1.7.3

from sklearn import datasets
from classix import CLASSIX

# Generate synthetic data
X, y = datasets.make_blobs(n_samples=2000000, centers=4, n_features=10, random_state=1) #data_med

# Employ CLASSIX clustering
clx = CLASSIX(sorting='pca', radius=0.5, verbose=1)
clx.fit(X)

but fails with Scipy 1.8.0:

CLASSIX(sorting='pca', radius=0.5, minPts=0, group_merging='distance')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [1], in <cell line: 9>()
      7 # Employ CLASSIX clustering
      8 clx = CLASSIX(sorting='pca', radius=0.5, verbose=1)
----> 9 clx.fit(X)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\classix\clustering.py:460, in CLASSIX.fit(self, data)
    457     self.data = (data - self._mu) / self._scl
    459 # aggregation
--> 460 self.agg_labels_, self.splist_, self.dist_nr = aggregate(data=self.data, sorting=self.sorting, tol=self.radius) 
    461 self.splist_ = np.array(self.splist_)
    463 self.clean_index_ = np.full(self.data.shape[0], True) # claim clean data indices

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\classix\aggregation_cm.pyx:44, in classix.aggregation_cm.aggregate()

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\classix\aggregation_cm.pyx:100, in classix.aggregation_cm.aggregate()

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\linalg\_eigen\_svds.py:269, in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors, solver, random_state, options)
    134 """
    135 Partial singular value decomposition of a sparse matrix.
    136 
   (...)
    265 
    266 """
    267 rs_was_None = random_state is None  # avoid changing v0 for arpack/lobpcg
--> 269 args = _iv(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors,
    270            solver, random_state)
    271 (A, k, ncv, tol, which, v0, maxiter,
    272  return_singular_vectors, solver, random_state) = args
    274 largest = (which == 'LM')

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\linalg\_eigen\_svds.py:63, in _iv(A, k, ncv, tol, which, v0, maxiter, return_singular, solver, random_state)
     60     raise ValueError(f"solver must be one of {solvers}.")
     62 # input validation/standardization for `A`
---> 63 A = aslinearoperator(A)  # this takes care of some input validation
     64 if not (np.issubdtype(A.dtype, np.complexfloating)
     65         or np.issubdtype(A.dtype, np.floating)):
     66     message = "`A` must be of floating or complex floating data type."

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\linalg\_interface.py:826, in aslinearoperator(A)
    822     return LinearOperator(A.shape, A.matvec, rmatvec=rmatvec,
    823                           rmatmat=rmatmat, dtype=dtype)
    825 else:
--> 826     raise TypeError('type not understood')

TypeError: type not understood

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) Error

Using this example:

from sklearn import datasets
import numpy as np
from classix import CLASSIX

X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=2, cluster_std=1, random_state=1)
clx = CLASSIX(sorting='pca', radius=0.15, group_merging='density', verbose=1, minPts=13, post_alloc=False)
clx.fit(X)

I am getting the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-569-bb587af68fc1> in <module>
      5 X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=2, cluster_std=1, random_state=1)
      6 clx = CLASSIX(sorting='pca', radius=0.15, group_merging='density', verbose=1, minPts=13, post_alloc=False)
----> 7 clx.fit(X)
      8 
      9 X

~/miniconda3/envs/ltf-analysis/lib/python3.8/site-packages/classix/clustering.py in fit(self, data)
    506             self.labels_ = copy.deepcopy(self.groups_)
    507         else:
--> 508             self.labels_ = self.clustering(
    509                 data=self.data,
    510                 agg_labels=self.groups_,

~/miniconda3/envs/ltf-analysis/lib/python3.8/site-packages/classix/clustering.py in clustering(self, data, agg_labels, splist, sorting, radius, method, minPts)
    724             # self.merge_groups = merge_pairs(self.connected_pairs_)
    725 
--> 726         self.merge_groups, self.connected_pairs_ = self.fast_agglomerate(data, splist, radius, method, scale=self.scale)
    727         maxid = max(labels) + 1
    728 

~/miniconda3/envs/ltf-analysis/lib/python3.8/site-packages/classix/merging.py in fast_agglomerate(data, splist, radius, method, scale)
    115             # den1 = splist[int(i), 2] / volume # density(splist[int(i), 2], volume = volume)
    116             for j in select_stps.astype(int):
--> 117                 sp2 = data[splist[j, 0]] # splist[int(j), 3:]
    118 
    119                 c2 = np.linalg.norm(data-sp2, ord=2, axis=-1) <= radius

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

python==3.8.12
classixclustering==0.6.5
numpy==1.22.0
scipy==1.7.3
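The traceback suggests that splist is a floating-point array, so splist[j, 0] is a float and NumPy refuses to use it as an index. A hedged guess at the minimal fix (the toy splist and data below are stand-ins, not the real CLASSIX structures):

```python
# Reproducing the failure pattern: indexing with a float raises IndexError,
# while an explicit int cast works.
import numpy as np

splist = np.array([[3.0, 0.15], [7.0, 0.15]])  # toy stand-in for splist
data = np.arange(20.0).reshape(10, 2)

j = 1
sp2 = data[int(splist[j, 0])]   # works: explicit integer index
```

Storing splist with an integer first column, or casting at the point of use as above, would avoid the error.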

Add groups_ property

clx.labels_ will return the cluster label of each data point.

It would be useful to have clx.groups_ return the group each data point belongs to.
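A minimal sketch of the proposed attribute, with illustrative names only (FitResult is not a CLASSIX class):

```python
# Alongside labels_ (the cluster of each point), a groups_ attribute would
# record the group each point was aggregated into.
class FitResult:
    def __init__(self, cluster_labels, group_labels):
        self.labels_ = cluster_labels   # existing behaviour
        self.groups_ = group_labels     # proposed addition

res = FitResult([0, 0, 1], [0, 1, 2])
```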

Access of timing information

It would be useful to have some way to access the runtimes of the individual parts of CLASSIX after the clustering has been computed. I think there are essentially five components:

  • t1_prepare: The initial data preparation, which mainly comprises data scaling and the computation of the first two principal axes.
  • t2_aggregate: This phase aggregates all data points into groups determined by the radius parameter of CLASSIX.
  • t3_merge: The computed groups will be merged into clusters when their group centers (starting points) are sufficiently close.
  • t4_minPts: Clusters with fewer than minPts points will be dissolved into their groups, and each of the groups will then be reassigned to a large enough cluster.
  • t5_finalize: Any cleanup activities.

See also the MATLAB version:
https://github.com/nla-group/classix-matlab/blob/main/demos/The_out_structure_returned_by_CLASSIX.md
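The five phases above could be timed with a small helper like the following sketch; the timing_ attribute name is hypothetical, chosen to echo the out structure of the MATLAB version:

```python
# Hedged sketch: wrap each phase call, recording its wall-clock duration
# under the phase name in a timing_ dict.
import time

class PhaseTimer:
    def __init__(self):
        self.timing_ = {}

    def record(self, name, func, *args, **kwargs):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        self.timing_[name] = time.perf_counter() - t0
        return result
```

Usage would look like `splist = timer.record("t2_aggregate", aggregate, data)`, after which `timer.timing_["t2_aggregate"]` holds that phase's runtime in seconds.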

group_merging=='none'

It would be helpful and intuitive to return the group labels as obtained by the aggregation phase when

group_merging.lower()=='none' or group_merging is None

In this case the cluster labels are just the group labels returned by aggregation.
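A sketch of the proposed check (resolve_labels is an illustrative helper, not part of the CLASSIX API):

```python
# With no merging requested, the cluster labels are simply the group labels
# returned by the aggregation phase.
def resolve_labels(group_merging, agg_labels):
    if group_merging is None or str(group_merging).lower() == "none":
        return agg_labels
    raise NotImplementedError("the usual merging phase would run here")
```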

ModuleNotFoundError: No module named 'numpy'

I am trying to install classix 0.7.4 from a requirements file into a venv, but unfortunately I am getting an error about numpy not being found, which makes little sense to me because numpy 1.22.4 is installed in one of the steps above. I was hoping someone here could help me figure out what is going wrong.

System: Debian 11 (bullseye)
Python version: 3.9.2
CLASSIX version: 0.7.4
Cython: 0.29.32

Steps to reproduce:

$ python3 -m venv venv
$ source venv/bin/activate
$ python3 -m pip install -r requirements.txt 
Collecting tqdm~=4.64.0
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting numpy~=1.22.4
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting pandas~=1.4.2
  Using cached pandas-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting seaborn~=0.11.2
  Using cached seaborn-0.11.2-py3-none-any.whl (292 kB)
Collecting matplotlib~=3.5.2
  Using cached matplotlib-3.5.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
Collecting tensorflow~=2.9.1
  Using cached tensorflow-2.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (511.8 MB)
Collecting scikit-learn~=1.1.1
  Using cached scikit_learn-1.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.8 MB)
Collecting nfstream~=6.5.1
  Using cached nfstream-6.5.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting tabulate~=0.8.9
  Using cached tabulate-0.8.10-py3-none-any.whl (29 kB)
Collecting missingno~=0.5.1
  Using cached missingno-0.5.1-py3-none-any.whl (8.7 kB)
Collecting scipy~=1.8.1
  Using cached scipy-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
Collecting cython~=0.29.32
  Using cached Cython-0.29.32-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (2.0 MB)
Collecting scapy~=2.4.5
  Using cached scapy-2.4.5.tar.gz (1.1 MB)
Collecting zstandard~=0.18.0
  Using cached zstandard-0.18.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Collecting protobuf~=3.19.4
  Using cached protobuf-3.19.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting pyclustering~=0.10.1.2
  Using cached pyclustering-0.10.1.2.tar.gz (2.6 MB)
Collecting classixclustering~=0.7.4
  Using cached classixclustering-0.7.4.tar.gz (629 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/XXX/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/setup.py'"'"'; __file__='"'"'/tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-ldj6pnm6
         cwd: /tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/setup.py", line 1, in <module>
        import numpy
    ModuleNotFoundError: No module named 'numpy'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/4d/0f/5a17e5d8045195d1a112b143a8143fff86e558d7cbeacad886d1b93be6db/classixclustering-0.7.4.tar.gz#sha256=d0f72deccb40ca9eb14905bb1a0f41787a824446eebac5a67a7ae59ec4c65342 (from https://pypi.org/simple/classixclustering/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement classixclustering~=0.7.4
ERROR: No matching distribution found for classixclustering~=0.7.4

The content of the requirements.txt looks as follows:

tqdm~=4.64.0
numpy~=1.22.4
pandas~=1.4.2
seaborn~=0.11.2
matplotlib~=3.5.2
tensorflow~=2.9.1
scikit-learn~=1.1.1
nfstream~=6.5.1
tabulate~=0.8.9
missingno~=0.5.1
scipy~=1.8.1
cython~=0.29.32
scapy~=2.4.5
zstandard~=0.18.0
protobuf~=3.19.4
pyclustering~=0.10.1.2
classixclustering~=0.7.4
umap-learn~=0.5.3

Any help resolving this issue would be appreciated, thanks.

Error will be reported when the feature dimension is greater than 3

Hi, Xinye!
Thank you for your excellent work!
I tried to apply your method to my own multi-dimensional dataset. I found that 'n_features' > 3 leads to an error (this also occurs on the synthetic dataset):


NameError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8868\670907302.py in
7 # Call CLASSIX
8 clx = CLASSIX(radius=0.5, verbose=0)
----> 9 clx.fit(X)

E:\Anaconda3\envs\DL-main\lib\site-packages\classix\clustering.py in fit(self, data)
488
489 # aggregation
--> 490 self.groups_, self.splist_, self.dist_nr = self.aggregate(data=self.data, sorting=self.sorting, tol=self.radius)
491 self.splist_ = np.array(self.splist_)
492

E:\Anaconda3\envs\DL-main\lib\site-packages\classix\aggregation.py in aggregate(data, sorting, tol)
82 sort_vals = [email protected](-1)
83 else:
---> 84 U1, s1, _ = svds(data, k=1, return_singular_vectors="u")
85 sort_vals = U1[:,0]*s1[0]
86

NameError: name 'svds' is not defined


from sklearn import datasets
from classix import CLASSIX

# Generate synthetic data
X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=25, random_state=1)

# Call CLASSIX
clx = CLASSIX(radius=0.5, verbose=0)
clx.fit(X)


Maybe I am using your method in the wrong way. Could you tell me how to get CLASSIX to work on multi-dimensional data?
Any advice would be appreciated!
Best wishes!
