christopherjenness / dbcv
Python implementation of Density-Based Clustering Validation
License: MIT License
In my program, your DBCV code returns NaN in some cases (sklearn's Calinski-Harabasz and Silhouette indices work well with this data: 3-dimensional, about 200-1000 points).
Your solution is interesting. Unfortunately, it is not scalable. I ran it on 200 two-dimensional points and it took almost 6 seconds. With thousands of points, I can no longer wait for it to finish.
Hello,
Thanks for this implementation of DBCV in Python. However, the results of this method don't match the reference implementation in Matlab by Moulavi et al.
This is partly because your implementation treats outliers as a cluster, but even fixing this leads to completely different results. The first example dataset of the reference implementation gives -0.2986 with your implementation, 0.5074 with your implementation after correcting the outlier handling, and 0.6149 with the reference implementation.
I think these quite significant differences discourage using this implementation in scientific contexts until this is fixed.
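For readers who want to try the outlier handling mentioned above, here is a minimal sketch of dropping noise points before scoring. It assumes noise is labelled -1 (the scikit-learn and hdbscan convention); `drop_noise` is a hypothetical helper, not part of this repository, and it is not claimed to reproduce the reference values quoted above.

```python
import numpy as np

def drop_noise(X, labels, noise_label=-1):
    """Return the non-noise rows of X, their labels, and the fraction kept."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    keep = labels != noise_label          # boolean mask of clustered points
    return X[keep], labels[keep], keep.mean()

# Toy data: two small clusters plus one point flagged as noise (-1).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, -9.0]])
labels = np.array([0, 0, 1, 1, -1])

X_kept, labels_kept, kept_fraction = drop_noise(X, labels)
print(X_kept.shape, labels_kept, kept_fraction)
```

As I read the paper, each cluster's validity is weighted by its size relative to the total number of objects, noise included. Under that reading, a score computed on the filtered data alone would be multiplied by `kept_fraction`; treat this as an assumption to check against the reference implementation.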
Hello! I'd like to hear your input on the best option for installing this package in an Anaconda environment.
I've tried this code in my Anaconda Prompt:
conda config --set ssl_verify false
pip install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV
with the following output:
fatal: unable to access 'https://github.com/christopherjenness/DBCV.git/': SSL certificate problem: self signed certificate in certificate chain
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
│ exit code: 128
╰─> See above for output.
Thanks in advance!
Thank you for publishing this DBCV implementation,
I would like to cite your implementation in addition to the original paper. Since GitHub now supports an official citation widget, I suggest implementing this widget or citing your repo in an unofficial way.
Hi @christopherjenness, I think it would be nice to have a requirements list to go through to install everything needed for the test file. Here is what I had to do to set that up on my system:
pip install -U scikit-learn
to install sklearn
pip install pytest
pip install hdbscan
or conda install -c conda-forge hdbscan
I also expected the test folder to provide an example of the code's application, not just assertions. I would add an example.py for that (e.g. using the code in the README).
Hello,
Is it possible to use this with a precomputed similarity matrix? I suppose I could set X to a dummy matrix of index values and use a distance function that does a simple matrix lookup?
Ross
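The workaround described above could look roughly like the sketch below. `D`, `X_index` and `lookup_distance` are hypothetical names; the sketch assumes the similarities have already been converted to distances, and it only shows the lookup itself, not a verified call into this package.

```python
import numpy as np

# Hypothetical precomputed, symmetric distance matrix for 4 samples.
D = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.5],
    [4.0, 3.0, 0.0, 0.5],
    [5.0, 4.5, 0.5, 0.0],
])

# Dummy "data": each row holds only the index of the sample it stands for.
X_index = np.arange(D.shape[0]).reshape(-1, 1)

def lookup_distance(a, b):
    """Treat a and b as one-element index rows and look the distance up in D."""
    return D[int(a[0]), int(b[0])]

print(lookup_distance(X_index[0], X_index[2]))  # 4.0

# Untested assumption: following the signature used elsewhere in these issues,
# the call would be DBCV(X_index, labels, dist_function=lookup_distance).
```

One caveat: if the implementation takes the dimensionality from the shape of X, the dummy matrix would report one feature, which would change the exponent in the core distance. That would need checking before trusting the score.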
HDBSCAN is inaccessible with pip, so conda is required. This is causing Travis CI issues:
The command "conda update --yes conda" failed and exited with 127 during .
When I run the code from the README, I don't get the score mentioned in the README. When I go back to commit b28e70a, it works as described.
If `np.shape(neighbors)[0]` is used instead of `np.shape(neighbors)[1]` (as it should be), the resulting index always has a low value (hardly ever a positive one), even when evaluating good clustering results such as the one obtained by running hdbscan on the noisy moons dataset (provided by the author).
Does anyone know why?
Originally posted by @onofricamila in #10 (comment)
I also got a negative DBCV score for a good clustering from hdbscan. Is this expected?
I 100% appreciate the care taken to make this package pip-installable from a well-organized GH repo, but I was surprised that I had to find the egg name in setup.py. It was a minor inconvenience, but adding an "Installation" section to the README with the following line would probably be very helpful for others too. Cheers!
pipenv install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV
Hello,
I am a newbie in the data science world. I want to use your DBCV library in my project, but I could not find how to import it into a conda environment.
Thanks,
I just tried to calculate DBCV for my HDBSCAN result (312 points) and it is taking forever. Looking at the code, it seems it may be rather simple to parallelize e.g. the computation of the mutual reachability graph, since that takes the most time so far... I might fork and make a pull request then.
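Before parallelizing, it may be worth checking whether the pairwise step can be vectorized. The sketch below assumes a full pairwise distance matrix `dist` and a vector of core distances `core` are already available; `mutual_reachability` is a hypothetical helper, not the function used in this repository.

```python
import numpy as np

def mutual_reachability(dist, core):
    """max(core[i], core[j], dist[i, j]) for every pair, without Python loops."""
    core = np.asarray(core, dtype=float)
    # Broadcasting builds the n x n matrix of max(core[i], core[j]).
    pair_core = np.maximum(core[:, None], core[None, :])
    return np.maximum(np.asarray(dist, dtype=float), pair_core)

# Toy inputs: 3 points with made-up distances and core distances.
dist = np.array([[0.0, 1.0, 6.0],
                 [1.0, 0.0, 2.0],
                 [6.0, 2.0, 0.0]])
core = np.array([1.5, 0.5, 3.0])

mrd = mutual_reachability(dist, core)
print(mrd)
```

Note that the diagonal of the result holds each point's own core distance rather than zero, so it should be ignored or zeroed before building a graph from it.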
DBCV is capable of handling noise assignments. This needs to be implemented.
Hi!
I am running the following code:
db = DBSCAN(eps=5, min_samples=9).fit(df)
labels = db.labels_
dbscan_score = DBCV(df, labels, dist_function=euclidean)
print(dbscan_score)
but I get the following error:
File "*\DBScan.py", line 68, in
dbscan_score = DBCV(df, labels, dist_function=euclidean)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 30, in DBCV
graph = _mutual_reach_dist_graph(X, labels, dist_function)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 113, in _mutual_reach_dist_graph
point_i = X[row]
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2927, in getitem
indexer = self.columns.get_loc(key)
File "C:\Python27\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
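The traceback suggests that `X[row]` is being evaluated on a pandas DataFrame, where an integer key is looked up among the column names rather than the row positions. A minimal sketch of that behaviour, and of converting to a NumPy array first, is below; the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with named columns, similar in spirit to `df` above.
df = pd.DataFrame({"x": [0.0, 0.1, 5.0], "y": [0.0, 0.2, 5.1]})

# Integer indexing on a DataFrame looks for a *column* named 0.
try:
    df[0]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# A NumPy array indexes by row position instead.
X = df.to_numpy()   # on older pandas versions, df.values plays the same role
first_row = X[0]
print(column_lookup_failed, first_row)
```

Under that assumption, passing the array instead of the DataFrame, i.e. `DBCV(df.to_numpy(), labels, dist_function=euclidean)`, would likely avoid the `KeyError`.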
Thank you very much for providing the code for the DBCV index.
I noticed that in the `_core_dist` function you define the number of neighbours (`n_neighbors`) as the dimensionality of the dataset, `np.shape(neighbors)[1]` (line 57 of DBCV.py). Shouldn't this be `np.shape(neighbors)[0]`?
Also, based on the formula of Moulavi et al. (Definition 1, Equation 3.1), shouldn't line 62 of your code be `core_dist = (numerator / (n_neighbors - 1)) ** (-1/n_features)`?
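For reference, here is a small sketch of the core distance as I read Definition 1: the sum of (1 / distance)^d over the other members of the cluster, divided by their count (n_i - 1), then raised to the power -1/d, where d is the number of features. `core_distance` is a hypothetical standalone function, not the repository's `_core_dist`, and it takes the other cluster members without the point itself.

```python
import numpy as np

def core_distance(point, others):
    """All-points core distance of `point` w.r.t. the other members of its cluster.

    `others` must not contain `point` itself, so len(others) equals n_i - 1.
    Duplicate points (distance 0) would cause a division by zero here.
    """
    point = np.asarray(point, dtype=float)
    others = np.asarray(others, dtype=float)
    n_neighbors, n_features = others.shape   # rows = neighbours, columns = dimensions
    dists = np.linalg.norm(others - point, axis=1)
    numerator = np.sum((1.0 / dists) ** n_features)
    return (numerator / n_neighbors) ** (-1.0 / n_features)

# Sanity check: if every neighbour is at distance 2, the core distance is 2.
point = np.array([0.0, 0.0])
others = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
value = core_distance(point, others)
print(value)
```

The sanity check illustrates why the row count matters: with the number of columns as the divisor, equidistant neighbours would no longer give back their common distance unless the cluster happened to have as many neighbours as dimensions.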
Hey! This is a great implementation of DBCV! Do you have any plans to release it on PyPI?
This is the original formula to calculate the core distance of a given object:
"KNN(o, i) be the distance between object o and its i-th nearest neighbor," says the paper.
So, my question is: shouldn't we divide 1 by the distance to the i-th nearest neighbor instead of the distance to the i-th element? The latter is what we are currently doing. Thx
Thank you for publishing this DBCV implementation. I believe, however, that there is an error in the logic. On page 842 of the paper, regarding the minimum spanning tree computations, the paper states:
"Based on the MRDs, a Minimum Spanning Tree (MST_MRD) is then built. This process is repeated for all the clusters in the partition, resulting in l minimum spanning trees, one for each cluster."
In this implementation, however, it appears that only one MST is being created for the entire data set: https://github.com/christopherjenness/DBCV/blob/master/DBCV/DBCV.py#L90
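A rough sketch of the per-cluster construction quoted above, using SciPy's `minimum_spanning_tree`. It assumes a full mutual reachability matrix `mrd` has already been computed; `per_cluster_msts` is a hypothetical helper and does not mirror the repository's code.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def per_cluster_msts(mrd, labels):
    """Build one MST per cluster from a full mutual reachability matrix."""
    labels = np.asarray(labels)
    trees = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        # Restrict the graph to the members of this cluster only.
        sub = np.asarray(mrd, dtype=float)[np.ix_(members, members)].copy()
        np.fill_diagonal(sub, 0.0)   # no self-edges
        # SciPy treats zero entries of a dense input as "no edge".
        trees[int(label)] = (members, minimum_spanning_tree(sub))
    return trees

# Toy matrix: a 3-point cluster and a 2-point cluster, far from each other.
mrd = np.array([
    [0.0, 1.0, 2.0, 9.0, 9.0],
    [1.0, 0.0, 3.0, 9.0, 9.0],
    [2.0, 3.0, 0.0, 9.0, 9.0],
    [9.0, 9.0, 9.0, 0.0, 0.5],
    [9.0, 9.0, 9.0, 0.5, 0.0],
])
labels = [0, 0, 0, 1, 1]

trees = per_cluster_msts(mrd, labels)
for label, (members, mst) in trees.items():
    print(label, members, mst.nnz, mst.sum())
```

Each tree has one edge fewer than its cluster has members, and no edge crosses between clusters, which is the difference from building a single MST over the whole data set.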