nla-group / classix
Fast and explainable clustering in Python
Home Page: https://classix.readthedocs.io/en/stable/
License: MIT License
I am trying to install classix 0.7.4 from a requirements file into a venv, but unfortunately I am getting an error about numpy not being found. This makes no sense to me, because numpy 1.22.4 is installed in one of the steps above. I was hoping someone here could help me figure out what is going wrong.
System: Debian 11 (bullseye)
Python version: 3.9.2
CLASSIX version: 0.7.4
Cython: 0.29.32
Steps to reproduce:
$ python3 -m venv venv
$ source venv/bin/activate
$ python3 -m pip install -r requirements.txt
Collecting tqdm~=4.64.0
Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting numpy~=1.22.4
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting pandas~=1.4.2
Using cached pandas-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting seaborn~=0.11.2
Using cached seaborn-0.11.2-py3-none-any.whl (292 kB)
Collecting matplotlib~=3.5.2
Using cached matplotlib-3.5.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
Collecting tensorflow~=2.9.1
Using cached tensorflow-2.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (511.8 MB)
Collecting scikit-learn~=1.1.1
Using cached scikit_learn-1.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.8 MB)
Collecting nfstream~=6.5.1
Using cached nfstream-6.5.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting tabulate~=0.8.9
Using cached tabulate-0.8.10-py3-none-any.whl (29 kB)
Collecting missingno~=0.5.1
Using cached missingno-0.5.1-py3-none-any.whl (8.7 kB)
Collecting scipy~=1.8.1
Using cached scipy-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
Collecting cython~=0.29.32
Using cached Cython-0.29.32-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (2.0 MB)
Collecting scapy~=2.4.5
Using cached scapy-2.4.5.tar.gz (1.1 MB)
Collecting zstandard~=0.18.0
Using cached zstandard-0.18.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Collecting protobuf~=3.19.4
Using cached protobuf-3.19.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting pyclustering~=0.10.1.2
Using cached pyclustering-0.10.1.2.tar.gz (2.6 MB)
Collecting classixclustering~=0.7.4
Using cached classixclustering-0.7.4.tar.gz (629 kB)
ERROR: Command errored out with exit status 1:
command: /home/XXX/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/setup.py'"'"'; __file__='"'"'/tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-ldj6pnm6
cwd: /tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-a5h4ms0x/classixclustering_5529ebbd0bef4c0489673327bfb2a134/setup.py", line 1, in <module>
import numpy
ModuleNotFoundError: No module named 'numpy'
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/4d/0f/5a17e5d8045195d1a112b143a8143fff86e558d7cbeacad886d1b93be6db/classixclustering-0.7.4.tar.gz#sha256=d0f72deccb40ca9eb14905bb1a0f41787a824446eebac5a67a7ae59ec4c65342 (from https://pypi.org/simple/classixclustering/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement classixclustering~=0.7.4
ERROR: No matching distribution found for classixclustering~=0.7.4
The content of the requirements.txt looks as follows:
tqdm~=4.64.0
numpy~=1.22.4
pandas~=1.4.2
seaborn~=0.11.2
matplotlib~=3.5.2
tensorflow~=2.9.1
scikit-learn~=1.1.1
nfstream~=6.5.1
tabulate~=0.8.9
missingno~=0.5.1
scipy~=1.8.1
cython~=0.29.32
scapy~=2.4.5
zstandard~=0.18.0
protobuf~=3.19.4
pyclustering~=0.10.1.2
classixclustering~=0.7.4
umap-learn~=0.5.3
Any help resolving this issue would be appreciated, thanks.
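The classixclustering 0.7.4 sdist imports numpy at the top of its setup.py, so the egg_info step fails whenever numpy is not yet installed in the environment; pip does not guarantee that numpy from the same requirements file gets installed before the sdist's metadata is built. A common workaround (a sketch, assuming numpy and cython are the only build-time dependencies) is to install those in a separate step first:

```shell
# Workaround sketch: install the build dependencies before the requirements
# file, so classixclustering's setup.py can import numpy during egg_info.
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install "numpy~=1.22.4" "cython~=0.29.32"
python3 -m pip install -r requirements.txt
```

Whether this suffices depends on the package not declaring PEP 517 build isolation; for legacy setup.py builds, pip runs egg_info in the current environment, so the pre-installed numpy is visible.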
It would be helpful and intuitive to return the group labels obtained by the aggregation phase when
group_merging.lower()=='none' or group_merging is None
In this case, the cluster labels are simply the group labels returned by aggregation.
Assume I have a data frame df like
Anna 0.3 -0.1 0.5
Bert 0.0 -0.2 0.7
Carl -0.8 -0.1 0.2
where the first column is the index. It would be nice to be able to do
clx = CLASSIX(radius=0.5)
clx.fit(df)
clx.explain('Anna', 'Bert')
and get an output similar to
The data point 'Anna' is in group 0, which has been merged into cluster 0.
The data point 'Bert' is in group 1, which has been merged into cluster 1.
There is no path of overlapping groups between these clusters.
The table of groups could contain an additional column for the label:
-------------------------------------------------
Group Label NrPts Cluster Coordinates
0 'Anna' 1 0 0.3 -0.1 0.5
1 'Bert' 2 1 0.0 -0.2 0.7
-------------------------------------------------
The plot function could also use the labels instead of numerical point indices, but there should probably be an option to revert to numerical indices for the starting points if the plot gets too cluttered.
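As a sketch of how the label lookup could work internally (assuming CLASSIX kept a reference to the fitted DataFrame's index), pandas already provides the mapping between row labels and the positional indices CLASSIX uses:

```python
import pandas as pd

# Hypothetical data frame matching the example above
df = pd.DataFrame(
    [[0.3, -0.1, 0.5], [0.0, -0.2, 0.7], [-0.8, -0.1, 0.2]],
    index=["Anna", "Bert", "Carl"],
)

pos = df.index.get_loc("Anna")  # label -> positional index used internally
name = df.index[pos]            # positional index -> label for display
```

With this mapping, explain('Anna', 'Bert') could translate labels to positions on the way in and positions back to labels in the printed output.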
The current implementation of .explain() performs superfluous computations which could be avoided:
(i) Currently, every call to .explain() computes the two leading principal components of the data. This should be done only once, when .fit(data) is called. CLASSIX itself needs only the first principal component, but there is no harm in computing the second as well and storing it for .explain().
(ii) The current .explain(ind1,ind2) method computes a shortest path between the group centres, recomputing all the distances between them. This is unnecessary and can become very slow, especially for clusters with many points (e.g. in image segmentation). The distances between overlapping groups have already been computed in the merging phase, and all that is needed is the adjacency matrix of overlapping groups.
Both of these optimizations are implemented in the MATLAB code
https://github.com/nla-group/classix-matlab/blob/main/classix.m
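For (ii), the path search only needs the adjacency matrix of overlapping groups, which could be reused from the merging phase. A minimal sketch with scipy.sparse.csgraph, using hypothetical group centres and taking overlap to mean the centres are within 2*radius of each other:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

centres = np.array([[0.0, 0.0], [0.8, 0.0], [3.0, 0.0]])  # hypothetical
radius = 0.5

# groups overlap if their centres are within 2*radius of each other
d = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
overlap = (d <= 2 * radius) & ~np.eye(len(centres), dtype=bool)
adj = csr_matrix(overlap.astype(float))

# BFS over the overlap graph instead of recomputing pairwise distances
dist = shortest_path(adj, unweighted=True)
# dist[0, 2] is inf: no path of overlapping groups between groups 0 and 2
```

This only revisits the graph structure already discovered during merging, so the per-call cost is a BFS rather than a full distance recomputation.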
This code works fine with NumPy 1.22.2 and SciPy 1.7.3:
from sklearn import datasets
from classix import CLASSIX
# Generate synthetic data
X, y = datasets.make_blobs(n_samples=2000000, centers=4, n_features=10, random_state=1) #data_med
# Employ CLASSIX clustering
clx = CLASSIX(sorting='pca', radius=0.5, verbose=1)
clx.fit(X)
but fails with SciPy 1.8.0:
CLASSIX(sorting='pca', radius=0.5, minPts=0, group_merging='distance')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [1], in <cell line: 9>()
7 # Employ CLASSIX clustering
8 clx = CLASSIX(sorting='pca', radius=0.5, verbose=1)
----> 9 clx.fit(X)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\classix\clustering.py:460, in CLASSIX.fit(self, data)
457 self.data = (data - self._mu) / self._scl
459 # aggregation
--> 460 self.agg_labels_, self.splist_, self.dist_nr = aggregate(data=self.data, sorting=self.sorting, tol=self.radius)
461 self.splist_ = np.array(self.splist_)
463 self.clean_index_ = np.full(self.data.shape[0], True) # claim clean data indices
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\classix\aggregation_cm.pyx:44, in classix.aggregation_cm.aggregate()
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\classix\aggregation_cm.pyx:100, in classix.aggregation_cm.aggregate()
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\linalg\_eigen\_svds.py:269, in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors, solver, random_state, options)
134 """
135 Partial singular value decomposition of a sparse matrix.
136
(...)
265
266 """
267 rs_was_None = random_state is None # avoid changing v0 for arpack/lobpcg
--> 269 args = _iv(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors,
270 solver, random_state)
271 (A, k, ncv, tol, which, v0, maxiter,
272 return_singular_vectors, solver, random_state) = args
274 largest = (which == 'LM')
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\linalg\_eigen\_svds.py:63, in _iv(A, k, ncv, tol, which, v0, maxiter, return_singular, solver, random_state)
60 raise ValueError(f"solver must be one of {solvers}.")
62 # input validation/standardization for `A`
---> 63 A = aslinearoperator(A) # this takes care of some input validation
64 if not (np.issubdtype(A.dtype, np.complexfloating)
65 or np.issubdtype(A.dtype, np.floating)):
66 message = "`A` must be of floating or complex floating data type."
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\linalg\_interface.py:826, in aslinearoperator(A)
822 return LinearOperator(A.shape, A.matvec, rmatvec=rmatvec,
823 rmatmat=rmatmat, dtype=dtype)
825 else:
--> 826 raise TypeError('type not understood')
TypeError: type not understood
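One plausible cause (an assumption, not confirmed by the traceback alone): the compiled aggregation routine hands svds something that is not an ndarray, sparse matrix, or LinearOperator (e.g. a Cython typed memoryview), and SciPy 1.8's stricter aslinearoperator rejects it with 'type not understood'. Wrapping the input in np.asarray before the call would sidestep this:

```python
import numpy as np
from scipy.sparse.linalg import svds

X = np.random.default_rng(1).standard_normal((100, 10))
mv = memoryview(X)  # stand-in for a Cython typed memoryview

# svds(mv, k=1) raises TypeError under SciPy >= 1.8;
# converting back to an ndarray first works on both old and new SciPy
U1, s1, _ = svds(np.asarray(mv), k=1, return_singular_vectors="u")
```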
clx.labels_ will return the cluster label of each data point. It would be useful to have clx.groups_ return the group each data point belongs to.
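A sketch of the relationship, with hypothetical arrays: if groups_ recorded the aggregation group of each point and the merging phase produced a group-to-cluster map, labels_ would follow by composition:

```python
import numpy as np

group_of_point = np.array([0, 1, 1, 2, 0])  # hypothetical clx.groups_
cluster_of_group = np.array([0, 0, 1])      # hypothetical merging result
labels = cluster_of_group[group_of_point]   # reproduces clx.labels_
```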
Our recommendation for choosing the hyperparameters is to start with radius=1, reduce it until the number of clusters is only slightly larger than expected, and then increase minPts until the right number of clusters is obtained.
Could we add a function like
clx.minPtsChange(minPts)
which just redoes the minPts phase without redoing the rest?
Note that, if mergeTinyGroups==False, then the group merging also needs to be redone because it depends on minPts. Otherwise, it doesn't.
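A sketch of the first step such a method would need (a hypothetical helper, assuming mergeTinyGroups==True so merging need not be redone): find clusters smaller than minPts and mark their points for reallocation:

```python
import numpy as np

def tiny_cluster_mask(labels, minPts):
    """Return a boolean mask of points in clusters with fewer than minPts members."""
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    tiny = ids[counts < minPts]
    return np.isin(labels, tiny)

mask = tiny_cluster_mask([0, 0, 0, 1, 2, 2], minPts=2)
# only the singleton cluster 1 is marked for reallocation
```

Reallocating the marked points to the nearest surviving cluster would then complete the minPts phase without rerunning aggregation.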
Hi, Xinye!
Thank you for your excellent work!
I am trying to apply your method to my own multi-dimensional dataset. I found that 'n_features' > 3 leads to an error (this also occurs on the synthetic dataset):
NameError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8868\670907302.py in
7 # Call CLASSIX
8 clx = CLASSIX(radius=0.5, verbose=0)
----> 9 clx.fit(X)
E:\Anaconda3\envs\DL-main\lib\site-packages\classix\clustering.py in fit(self, data)
488
489 # aggregation
--> 490 self.groups_, self.splist_, self.dist_nr = self.aggregate(data=self.data, sorting=self.sorting, tol=self.radius)
491 self.splist_ = np.array(self.splist_)
492
E:\Anaconda3\envs\DL-main\lib\site-packages\classix\aggregation.py in aggregate(data, sorting, tol)
82 sort_vals = [email protected](-1)
83 else:
---> 84 U1, s1, _ = svds(data, k=1, return_singular_vectors="u")
85 sort_vals = U1[:,0]*s1[0]
86
NameError: name 'svds' is not defined
from sklearn import datasets
from classix import CLASSIX
X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=25, random_state=1)
clx = CLASSIX(radius=0.5, verbose=0)
clx.fit(X)
Maybe I am using your method in the wrong way. Could you tell me how to get CLASSIX to work on multi-dimensional data?
Any advice would be appreciated!
Best wishes!
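The NameError suggests the svds import inside classix failed silently (older classix versions import it conditionally). A quick sketch to check that the import and the >3-feature code path work in your environment:

```python
import numpy as np
from scipy.sparse.linalg import svds  # the NameError suggests this import failed

X = np.random.default_rng(1).standard_normal((5000, 25))
U1, s1, _ = svds(X, k=1, return_singular_vectors="u")
sort_vals = U1[:, 0] * s1[0]  # the sorting values the >3-feature branch computes
```

If this snippet fails, the SciPy installation (rather than the dataset's dimensionality) is the likely culprit.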
It would be useful to have some way to access the runtimes of the individual parts of CLASSIX after the clustering is computed. I think there are essentially five components.
See also the MATLAB version:
https://github.com/nla-group/classix-matlab/blob/main/demos/The_out_structure_returned_by_CLASSIX.md
I wanted to know if it is possible to add a temporal component to how each node is clustered: nodes with close timestamps would be clustered together, and this would take precedence over spatial proximity.
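Since CLASSIX clusters on Euclidean distance, one way to approximate this without changing the algorithm (a sketch, with an assumed weight w) is to append the timestamp as an extra, up-weighted feature so temporal proximity dominates the distance:

```python
import numpy as np

X_spatial = np.array([[0.0, 0.0], [5.0, 5.0]])  # hypothetical node positions
t = np.array([[10.0], [10.1]])                  # hypothetical timestamps
w = 100.0                                       # assumed temporal weight

X = np.hstack([X_spatial, w * t])  # feed this combined matrix to clx.fit(X)
```

The larger w is, the more a small timestamp difference outweighs spatial separation; a strict "time first, then space" precedence would still need a change inside the algorithm itself.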
When importing anything from classix, e.g.
from classix import CLASSIX
I get the following output
Cython fail.
Cython fail.
I'm running Windows 11 and see this both with a pip-installed version of classix and when running python setup.py install from the GitHub repo.
However, everything else seems to work fine. Execution has fallen back to the pure Python versions, and it seems to be working pretty quickly: using the standard Windows Python install (https://www.python.org/downloads/windows/) with the whole stack pip-installed, the fit below on 2,000,000 10-dimensional points takes 11.2 seconds on my machine.
from sklearn import datasets
from classix import CLASSIX
import numpy as np
import matplotlib.pyplot as plt
X, y = datasets.make_blobs(n_samples=2000000, centers=4, n_features=10, random_state=1)
clx = CLASSIX(sorting='pca', verbose=0)
clx.fit(X)
I'm wondering what speed difference I'd see with the Cython version. Also, can I expect the results to be identical?
Some users cannot install CLASSIX via pip install classix and report installation issues. I told them that classix is also the name of another piece of software; instead, we should use pip install ClassixClustering.
Using this example:
from sklearn import datasets
import numpy as np
from classix import CLASSIX
X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=2, cluster_std=1, random_state=1)
clx = CLASSIX(sorting='pca', radius=0.15, group_merging='density', verbose=1, minPts=13, post_alloc=False)
clx.fit(X)
I am getting the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-569-bb587af68fc1> in <module>
5 X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=2, cluster_std=1, random_state=1)
6 clx = CLASSIX(sorting='pca', radius=0.15, group_merging='density', verbose=1, minPts=13, post_alloc=False)
----> 7 clx.fit(X)
8
9 X
~/miniconda3/envs/ltf-analysis/lib/python3.8/site-packages/classix/clustering.py in fit(self, data)
506 self.labels_ = copy.deepcopy(self.groups_)
507 else:
--> 508 self.labels_ = self.clustering(
509 data=self.data,
510 agg_labels=self.groups_,
~/miniconda3/envs/ltf-analysis/lib/python3.8/site-packages/classix/clustering.py in clustering(self, data, agg_labels, splist, sorting, radius, method, minPts)
724 # self.merge_groups = merge_pairs(self.connected_pairs_)
725
--> 726 self.merge_groups, self.connected_pairs_ = self.fast_agglomerate(data, splist, radius, method, scale=self.scale)
727 maxid = max(labels) + 1
728
~/miniconda3/envs/ltf-analysis/lib/python3.8/site-packages/classix/merging.py in fast_agglomerate(data, splist, radius, method, scale)
115 # den1 = splist[int(i), 2] / volume # density(splist[int(i), 2], volume = volume)
116 for j in select_stps.astype(int):
--> 117 sp2 = data[splist[j, 0]] # splist[int(j), 3:]
118
119 c2 = np.linalg.norm(data-sp2, ord=2, axis=-1) <= radius
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
python==3.8.12
classixclustering==0.6.5
numpy==1.22.0
scipy==1.7.3
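The traceback points at data[splist[j, 0]] where splist is a float array, so the index itself is a float, and NumPy 1.22 rejects float indices. A minimal reproduction and the obvious fix (casting the index to int, an assumption based on the traceback rather than the classix source):

```python
import numpy as np

data = np.arange(12.0).reshape(4, 3)
splist = np.array([[0.0, 7.0], [2.0, 5.0]])  # float array holding row indices

# data[splist[0, 0]] raises IndexError: a float is not a valid index
sp2 = data[int(splist[0, 0])]  # casting the index restores valid behaviour
```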