
dbcv's People

Contributors

christopherjenness, galeone


dbcv's Issues

nan in result

In my program, your DBCV code returns NaN in some cases; sklearn's Calinski-Harabasz and Silhouette indices work fine on the same data (3-dimensional, about 200-1000 points).

The execution time is very slow

Your solution is interesting. Unfortunately, it is not scalable: running it on 200 two-dimensional points takes almost 6 seconds, and for thousands of points it no longer finishes in a reasonable time.

Results don't match with reference implementation in Matlab

Hello,

Thanks for this implementation of DBCV in Python. However, the results from this method don't match the reference implementation in Matlab by Moulavi et al.
This is partly because your implementation treats outliers as a cluster, but even after fixing this the results are completely different. On the first example dataset of the reference implementation, your implementation gives -0.2986, your implementation with corrected outlier handling gives 0.5074, and the reference implementation gives 0.6149.

I think these significant differences discourage use of this implementation in scientific contexts until they are fixed.
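
(For anyone hitting the same mismatch: below is a minimal sketch of one way to exclude noise points before scoring, assuming noise is labelled -1 as in DBSCAN/HDBSCAN. The helper name is illustrative, not part of the package.)

import numpy as np
from scipy.spatial.distance import euclidean
from DBCV import DBCV

def dbcv_ignoring_noise(X, labels):
    # Keep only points assigned to a real cluster (noise is labelled -1)
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1
    return DBCV(X[mask], labels[mask], dist_function=euclidean)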

Issues with installation

Hello! I would like to hear your input on the best option for installing this package in an Anaconda environment.
I've tried the following in my Anaconda Prompt:

conda config --set ssl_verify false
pip install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV

with the following output:

  fatal: unable to access 'https://github.com/christopherjenness/DBCV.git/': SSL certificate problem: self signed certificate in certificate chain
  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
│ exit code: 128
╰─> See above for output.

Thanks in advance!

Incomplete requirement list to run tests

Hi @christopherjenness, I think it would be nice if there was a requirements list to go through to install everything needed for the test file. Here is what I had to do to set that up on my system:

pip install -U scikit-learn (to install sklearn)
pip install pytest
pip install hdbscan (or conda install -c conda-forge hdbscan)

I also expected the test folder to provide an example of the code's application rather than just the assertions; I would add an example.py for that (e.g. using the code in the README), along the lines of the sketch below.
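
A possible example.py, sketched under the assumption that the package is importable as DBCV and that hdbscan is installed (the dataset and parameters are illustrative):

# example.py - end-to-end usage sketch: cluster a toy dataset, then score it with DBCV
import hdbscan
from scipy.spatial.distance import euclidean
from sklearn.datasets import make_moons

from DBCV import DBCV

# Two noisy half-moons, similar in spirit to the data used in the tests
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Cluster with HDBSCAN and validate the resulting partition with DBCV
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
score = DBCV(X, labels, dist_function=euclidean)
print("DBCV score:", score)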

Using with precomputed similarity matrix?

Hello,

Is it possible to use this with a precomputed similarity matrix? I suppose I could set X to a dummy matrix of index values and use a distance function that does a simple matrix lookup?

Ross
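
Something along those lines could look like the following sketch. It assumes the precomputed matrix holds pairwise distances and that the distance function is only ever called on rows of X; file names and variables are illustrative. One caveat: the core-distance exponent in DBCV depends on the dimensionality of X, so a one-column dummy matrix changes that exponent.

import numpy as np
from DBCV import DBCV

# Precomputed (n, n) pairwise distance matrix and cluster labels for the same points
D = np.load("distances.npy")      # illustrative source; any (n, n) array works
labels = np.load("labels.npy")

# Dummy "feature matrix": each point is represented only by its own row index
X_dummy = np.arange(D.shape[0], dtype=float).reshape(-1, 1)

def lookup_dist(a, b):
    # Distance function that just looks the value up in the precomputed matrix
    return D[int(a[0]), int(b[0])]

score = DBCV(X_dummy, labels, dist_function=lookup_dist)
print(score)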

Travis CI - use conda

HDBSCAN is inaccessible with pip, so conda is required. This is causing Travis CI issues:

The command "conda update --yes conda" failed and exited with 127 during .

If `np.shape(neighbors)[0]` is taken instead of `np.shape(neighbors)[1]` (as it should be), the resulting index is almost always low (hardly ever positive), even when evaluating good clustering results such as the one obtained by running hdbscan on the noisy-moons dataset provided by the author.

Does anyone know why?

Originally posted by @onofricamila in #10 (comment)

I also got a negative DBCV score for a good clustering from hdbscan. Is this expected?

Add installation instructions to README.md

I 100% appreciate the care that went into making this package pip-installable from a well-organized GitHub repo, but I was surprised to find that I had to dig the egg name out of setup.py. It was a minor inconvenience, but adding an "Installation" section to the README with the following line would probably be very helpful for others too. Cheers!

pipenv install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV

How to import DBCV

Hello,
I am a newbie in data science. I want to use your DBCV library in my project, but I could not figure out how to import it in a conda environment.

Thanks,
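
For what it's worth, after installing from GitHub (e.g. with the pip command shown in the installation issues above), the import that matches the package layout seen in the tracebacks here would presumably be:

import numpy as np
from scipy.spatial.distance import euclidean
from DBCV import DBCV  # the DBCV package exposes a DBCV function of the same name

# Tiny illustrative input: X is an (n_samples, n_features) array, labels are cluster ids
X = np.random.rand(50, 3)
labels = np.random.randint(0, 2, size=50)

score = DBCV(X, labels, dist_function=euclidean)
print(score)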

Parallelization for speed improvement

I just wanted to calculate DBCV for my HDBSCAN result (312 points), and it is now taking forever. Looking into the code, it seems it may be rather simple to parallelize, e.g. the computation of the mutual reachability graph, since that is where most of the time is spent... I might fork and open a pull request then.
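
As a rough illustration (not the repo's actual code; helper names are made up), the pairwise mutual-reachability computation could be farmed out row by row with multiprocessing, where each entry is max(core_dist(i), core_dist(j), d(i, j)):

import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import euclidean

def _mutual_reach_row(args):
    # One row of the mutual-reachability matrix: max(core_i, core_j, d(i, j)) for all j
    i, X, core_dists = args
    dists = np.array([euclidean(X[i], X[j]) for j in range(len(X))])
    return np.maximum(np.maximum(core_dists[i], core_dists), dists)

def mutual_reach_matrix(X, core_dists, processes=4):
    # On Windows, call this under an `if __name__ == "__main__":` guard
    with Pool(processes) as pool:
        rows = pool.map(_mutual_reach_row, [(i, X, core_dists) for i in range(len(X))])
    return np.vstack(rows)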

Error running DBCV

Hi!

I am running the following code:
db = DBSCAN(eps=5, min_samples=9).fit(df)
labels = db.labels_
dbscan_score = DBCV(df, labels, dist_function=euclidean)
print(dbscan_score)

but I am getting the following error:
File "*\DBScan.py", line 68, in
dbscan_score = DBCV(df, labels, dist_function=euclidean)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 30, in DBCV
graph = _mutual_reach_dist_graph(X, labels, dist_function)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 113, in _mutual_reach_dist_graph
point_i = X[row]
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2927, in getitem
indexer = self.columns.get_loc(key)
File "C:\Python27\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
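
The KeyError looks like a pandas indexing issue rather than a DBCV bug: the code indexes X positionally (point_i = X[row] in the traceback), and for a DataFrame X[0] is a column lookup, hence KeyError: 0. A likely fix, sketched here on the assumption that df is the DataFrame from the snippet above, is to pass a plain NumPy array:

from scipy.spatial.distance import euclidean
from sklearn.cluster import DBSCAN
from DBCV import DBCV

X = df.values  # df is the DataFrame from the snippet above; df.to_numpy() also works on newer pandas
db = DBSCAN(eps=5, min_samples=9).fit(X)
dbscan_score = DBCV(X, db.labels_, dist_function=euclidean)
print(dbscan_score)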

Question about the Core Distance of an Object formula

Thank you very much for providing the code for the DBCV index.

I noticed that in the _core_dist function you define the number of neighbours (n_neighbors) to equal the dimensionality of the dataset, np.shape(neighbors)[1] (line 57 of DBCV.py). Shouldn't this have been np.shape(neighbors)[0]?

Also, based on the formula of Moulavi et al. (Definition 1, Equation 3.1), shouldn't line 62 of your code have been core_dist = (numerator / (n_neighbors - 1)) ** (-1/n_features)?
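
For reference, the all-points core distance from the paper (Definition 1, Equation 3.1), as best I can transcribe it, is

a_{\mathrm{pts}}\mathrm{coredist}(o) = \left( \frac{\sum_{i=2}^{n} \left( \frac{1}{\mathrm{KNN}(o, i)} \right)^{d}}{n - 1} \right)^{-1/d}

where n is the number of points in o's cluster, d is the dimensionality of the data, and KNN(o, i) is the distance from o to its i-th nearest neighbor.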

About apts core distance numerator calculation

This is the original formula to calculate the core distance of a given object:

[image: the core distance formula from the paper]

"KNN (o, i) be the distance between object o and its i th nearest neighbor.",

says the paper.

So, my question is: shouldn't we divide 1 by the distance to the i-th nearest neighbor, rather than by the distance to the i-th element, which is what the code currently does? Thanks.

[image: the current calculation in the code]

Minimum spanning tree for each cluster vs. entire data set?

Thank you for publishing this DBCV implementation. I believe, however, that there is an error in the logic. On page 842 of the paper, regarding the minimum spanning tree computations, the paper states:

Based on the MRDs, a Minimum Spanning Tree (MST_MRD) is then built. This process is repeated for all the clusters in the partition, resulting in l minimum spanning trees, one for each cluster.

In this implementation, however, it appears that only one MST is being created for the entire data set: https://github.com/christopherjenness/DBCV/blob/master/DBCV/DBCV.py#L90
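
For comparison, a per-cluster construction could look roughly like the following sketch (not the repo's code; it assumes a full mutual-reachability matrix and labels with noise marked as -1, and uses SciPy's minimum_spanning_tree):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def per_cluster_msts(mutual_reach, labels):
    # Build one MST per cluster from the full mutual-reachability matrix
    msts = {}
    for c in np.unique(labels):
        if c == -1:  # skip noise points, if labelled -1
            continue
        idx = np.where(labels == c)[0]
        sub = mutual_reach[np.ix_(idx, idx)]  # restrict the graph to this cluster
        msts[c] = minimum_spanning_tree(sub).toarray()
    return msts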
