scikit-learn-contrib / hdbscan Goto Github PK

View Code? Open in Web Editor NEW

2.7K 56.0 482.0 26.09 MB

A high performance implementation of HDBSCAN clustering.

Home Page: http://hdbscan.readthedocs.io/en/latest/

License: BSD 3-Clause "New" or "Revised" License

Python 3.36% Jupyter Notebook 94.45% Shell 0.02% TeX 0.02% Cython 2.15%

machine-learning machine-learning-algorithms clustering clustering-algorithm cluster-analysis clustering-evaluation

hdbscan's Introduction

scikit-learn-contrib

scikit-learn-contrib is a github organization for gathering high-quality scikit-learn compatible projects. It also provides a template for establishing new scikit-learn compatible projects.

Vision

With the explosion of the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install and easy-to-use high-quality machine learning software. With scikit-learn-contrib, users can install a project by pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

DenMune: Density-peak clustering using mutual nearest neighbors

A simple-but-efficient density-based clustering algorithm that can find clusters of arbitrary size, shapes and densities in two-dimensions. Higher dimensions are first reduced to 2-D using the t-sne. The algorithm relies on a single parameter K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by: Mohamed Abbas

lightning

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

skglm

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

imbalanced-learn

Python module to perform under sampling and over sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

polylearn

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

hdbscan

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

categorical-encoding

A library of sklearn compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner

boruta_py

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola

sklearn-pandas

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez

skope-rules

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

stability-selection

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens

metric-learn

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

hdbscan's People

Contributors

Stargazers

Watchers

Forkers

dgthoms rbkreisberg gaoch023 directorscut82 zachwill s0j0urn embaker evelynmitchell jc-healy chengwang88 lmillefiori xsongx lrargerich datafighter jjmaldonis h-krishna jgaw kod3r joskid jgooly boumer lepy andreamorichetta wangjiahong mrb1b0 arizzuto natural209x antoine-tran brunoalano maoting1223 linearregression zhang365947064 taalexander ddcamiu bratu-ionut lmcinnes glemaitre kmike raggnar dvro jfear rth chongxi gratefulbuaa john-min grej guillermogsjc drorhilman sandy4321 christopherjenness hanbman rocksailor kassenov ml-ai-nlp-ir mechaman lostroom lele94218 zugenliu alleyz-favorite jessiejamieson hangzhang hzhao16 blazs markdimi mikeanas lixun910 fatenamama codingafuture vardominator bhaskar24 lodemo craigmcnamara den-run-ai niravajmeri lgro sauln codeaudit resurgo-genetics savourylie maiikii ahirner m-dz cosecant-csc alxsoares duschang27 yiitozer-zz mammask helske gpfreitas x-z duke24k moi90 rakibalfahad vinayakraja carmenlok codes-kzhan yzy5630 jaidmin jungi21cc streetbees

hdbscan's Issues

Condensed Tree plot error with small cluster size

There are no problems as long as the number of points are greater than 73, counts below 74 fail with the following error. None of the other plotting functions have a problem. If I have time later I'll fix this, wouldn't use it on such small datasets except to inspect the internal tree structure.

KeyError Traceback (most recent call last)
in ()
----> 1 cnew.condensed_tree_.plot(select_clusters=True)

/usr/local/lib/python2.7/dist-packages/hdbscan/plots.pyc in plot(self, leaf_separation, cmap, select_clusters, label_clusters, axis, colorbar, log_size)
272 'Use get_plot_data to calculate the relevant data without plotting.')
273
--> 274 plot_data = self.get_plot_data(leaf_separation=leaf_separation, log_size=log_size)
275
276 if cmap != 'none':

/usr/local/lib/python2.7/dist-packages/hdbscan/plots.pyc in get_plot_data(self, leaf_separation, log_size)
148 current_size = np.log(current_size)
149
--> 150 cluster_bounds[c][CB_LEFT] = cluster_x_coords[c] * scaling - (current_size / 2.0)
151 cluster_bounds[c][CB_RIGHT] = cluster_x_coords[c] * scaling + (current_size / 2.0)
152 cluster_bounds[c][CB_BOTTOM] = cluster_y_coords[c]

KeyError: 73

Upgrading with pip fails

I just tried to update to 0.4 and it fails with the following error:

hdbscan/_hdbscan_linkage.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.

I tried on ubuntu 14.04 and OSX (El Capitan)

No way to use alternate metrics, like Mahalonobis

Your HDBSCAN implementation is impressive and can be vary useful in various fields. The implementation mentions lots of different metrics, but no way to specify keywords for certain metrics. This might be an issue with me not being a python expert. If that's the case, I apologize for wasting time.

Error while importing hdbscan

I am getting the following error while trying to do : import hdbscan

I am on a windows 7 machine and with a 64 bit python installation using conda

Error :

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-12-3f1c79fb1b69> in <module>()
----> 1 import hdbscan

c:\anaconda2\lib\site-packages\hdbscan-0.6.2-py2.7-macosx-10.5-x86_64.egg\hdbscan\__init__.py in <module>()

c:\anaconda2\lib\site-packages\hdbscan-0.6.2-py2.7-macosx-10.5-x86_64.egg\hdbscan\hdbscan_.py in <module>()

c:\anaconda2\lib\site-packages\hdbscan-0.6.2-py2.7-macosx-10.5-x86_64.egg\hdbscan\_hdbscan_linkage.py in <module>()

c:\anaconda2\lib\site-packages\hdbscan-0.6.2-py2.7-macosx-10.5-x86_64.egg\hdbscan\_hdbscan_linkage.py in __bootstrap__()

ImportError: DLL load failed: %1 is not a valid Win32 application.

min_cluster_size=1

Greetings,

First, thank you for the awesome library, I've found great success with it riding on top of a word2vec -> t-SNE pipeline for a new natural language processing project called words2map. I'm just a few days from publishing about our code, but am finding suddenly an unusual response around the min_cluster_size=1 edge case.

Specifically, here is a sample from my code:

        print vectors
        print vectors.shape
        clusters = HDBSCAN(min_cluster_size=1).fit_predict(vectors)

...which generates the following:

[[-6.45791257 -4.44567396]
 [-6.44261124 -4.46679372]
 [ 1.56015613  2.66220251]
 [-2.85201212 -6.22262758]
 [-2.09304712 -7.14152838]
 [-1.79593637 -5.40181427]
 [-2.85714682 -6.80640389]
 [ 4.3957132   0.9440575 ]
 [ 1.81048591  1.02327832]
 [ 5.16887682  6.7110603 ]
 [ 5.37719927  6.28808868]
 [ 0.11477621 -1.11785618]
 [ 5.04218427  6.9646349 ]
 [ 5.00133672 -1.09453604]
 [ 3.72641649 -7.25523152]
 [ 4.04915813 -6.63786898]
 [ 1.2898098  -0.4886845 ]
 [ 3.83732358 -6.94612688]
 [-5.76981671 -0.61409089]
 [-4.01244246 -0.87056213]
 [ 6.22234736  0.58425667]
 [-6.04886453  3.37668561]
 [ 5.72806943  0.43197586]
 [-5.69073762 -0.53585563]
 [-6.05174687  3.38048791]
 [-6.1931812  -7.1423447 ]
 [-6.58477538 -7.32116243]
 [-2.1166777  -6.3211911 ]
 [-7.34847743 -7.71195486]
 [ 2.35076966 -0.85595291]
 [ 1.84758212  0.54659307]
 [-5.19133828  0.36389009]
 [-1.31086939 -4.1921722 ]
 [ 1.79962948  1.58228398]
 [-8.89372872 -6.65828121]]
(35, 2)
Traceback (most recent call last):
  File "clusters.py", line 47, in <module>
    compute_clusters()
  File "clusters.py", line 25, in compute_clusters
    clusters = generate_clusters(topic_names, vectors_in_2D)
  File "/home/ubuntu/words2map/words2map.py", line 202, in generate_clusters
    clusters = HDBSCAN(min_cluster_size=1).fit_predict(vectors)
  File "/home/ubuntu/miniconda/envs/words2map/lib/python2.7/site-packages/hdbscan/hdbscan_.py", line 667, in fit_predict
    self.fit(X)
  File "/home/ubuntu/miniconda/envs/words2map/lib/python2.7/site-packages/hdbscan/hdbscan_.py", line 649, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/ubuntu/miniconda/envs/words2map/lib/python2.7/site-packages/hdbscan/hdbscan_.py", line 469, in hdbscan
    return _tree_to_labels(X, single_linkage_tree, min_cluster_size) + (result_min_span_tree,)
  File "/home/ubuntu/miniconda/envs/words2map/lib/python2.7/site-packages/hdbscan/hdbscan_.py", line 55, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(condensed_tree, stability_dict)
  File "hdbscan/_hdbscan_tree.pyx", line 477, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:9977)
  File "hdbscan/_hdbscan_tree.pyx", line 520, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:9812)
  File "hdbscan/_hdbscan_tree.pyx", line 371, in hdbscan._hdbscan_tree.do_labelling (hdbscan/_hdbscan_tree.c:7355)
  File "hdbscan/_hdbscan_tree.pyx", line 269, in hdbscan._hdbscan_tree.TreeUnionFind.union_ (hdbscan/_hdbscan_tree.c:5911)
  File "hdbscan/_hdbscan_tree.pyx", line 284, in hdbscan._hdbscan_tree.TreeUnionFind.find (hdbscan/_hdbscan_tree.c:6127)
IndexError: index 98 is out of bounds for axis 0 with size 98

I don't receive any error when setting min_cluster_size=2, and am having trouble figuring out why and what I might need to do try to debug further.

Otherwise everything has been great and I am very grateful for everything; I also recall making the min_cluster_size=1 parameter work previously, and so am not sure in what way this behavior may be data-dependent...

Cheers!

Only a single CPU core is used

Not sure if this is a bug or feature, but I have observed that on my Ubuntu 14.04 machine HDBSCAN only ever uses one core, some other cores also spike occasionally but 90% of the time it's just a single core at 100% and all the others 0%.

Is this algorithm not parallelizable? Or has it not been done yet?

AttributeError: 'NoneType' object has no attribute 'get_clusters'

Hi,
I was trying to run code with robust single linkage. Here is example.

import numpy as np
import sklearn.datasets as data
import hdbscan

ss = 100
moons, _ = data.make_moons( n_samples = ss, noise = 0.05 )
blobs, _ = data.make_blobs(
    n_samples = ss,
    centers=[( -0.75, 2.25 ), ( 1.0, 2.0 )],
    cluster_std = 0.25
)

test_data = np.vstack([moons, blobs])

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict( test_data )
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()

But received such warning and error message:

/usr/local/lib/python2.7/dist-packages/hdbscan/robust_single_linkage_.py:395: UserWarning: No single linkage tree was generated; try running fit first.
warn('No single linkage tree was generated; try running fit first.')
Traceback (most recent call last):
File "test_robustsinglelinkage.py", line 21, in
alt_labels = hierarchy.get_clusters(0.100, 5)
AttributeError: 'NoneType' object has no attribute 'get_clusters'

I'm not sure, but it seems that problem is in this line of code in file robust_single_linkage.py:
self.labels_, self._cluster_hierarchy = robust_single_linkage(X, **self.get_params())

but should be:
self.labels_, self._cluster_hierarchy_ = robust_single_linkage(X, **self.get_params())

Model persistence

Hello,

I would like to save a trained model using pickle and then use the predict method on new data points to predict the cluster membership probabilities for each new data point. So far, I found only the fit_predict method which is supposed to alter the model.

Do you think this feature (which might be useful for many I guess) is easy to implement?

Thanks a lot!

type mismatch error running examples

i tried running both examples on windows 8.1/python 3.4.3/anaconda 2.3.0/64-bit
i get:

Traceback (most recent call last):
File "C:/Users/eyalg/Dropbox/python/hdbscan-example/hdbscan-example.py", line 42, in
hdb = HDBSCAN(min_cluster_size=10).fit(X)
File "C:\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 322, in fit
self.min_spanning_tree) = hdbscan(X, **self.get_params())
File "C:\Anaconda3\lib\site-packages\hdbscan\hdbscan.py", line 231, in hdbscan
min_samples, metric, p)
File "C:\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 96, in _hdbscan_small_kdtree
cluster_list = get_clusters(condensed_tree, stability_dict)
File "hdbscan/_hdbscan_tree.pyx", line 225, in hdbscan._hdbscan_tree.get_clusters (hdbscan_hdbscan_tree.c:5268)
File "hdbscan/_hdbscan_tree.pyx", line 257, in hdbscan._hdbscan_tree.get_clusters (hdbscan_hdbscan_tree.c:5050)
File "hdbscan/_hdbscan_tree.pyx", line 192, in hdbscan._hdbscan_tree.bfs_from_cluster_tree (hdbscan_hdbscan_tree.c:4228)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

Problem with special input data

Hi, I have implemented the code successfully with the data provided by make_blobs.
However, when I change the input data with the array [[0],[0]....[0],[20]] which are a set of 0 and a 20(noise), the output (cluster.labels_) is a set of -1 which means that the input data are all noises.
I am wondering if there is any problem.

move to scikit-learn contrib?

Hey.
Scikit-learn recently created a "contrib" repo to house related projects:
https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/README.md

The idea is to collect scikit-learn compatible algorithms in a single place, and provide more exposure.
It would be great to have hdbscan there, if you want to move the repo.

Best,
Andy

Is there an R implementation of this algorithm?

This is probably a dumb question, but I found an R implementation of dbscan, so I'm hoping there's one of hdbscan.

Import error on Ubuntu - undefined symbol: PyFPE_jbuf

Hi,

I have followed the instructions on Github and have successfully (apparently) installed hdbscan. However, when I import hdbscan I receive the following error:

import hdbscan
Traceback (most recent call last):
File "", line 1, in
File "/home/soleilbleu/miniconda2/lib/python2.7/site-packages/hdbscan/init.py", line 1, in
from .hdbscan_ import HDBSCAN, hdbscan
File "/home/soleilbleu/miniconda2/lib/python2.7/site-packages/hdbscan/hdbscan_.py", line 43, in
from .plots import CondensedTree, SingleLinkageTree, MinimumSpanningTree
File "/home/soleilbleu/miniconda2/lib/python2.7/site-packages/hdbscan/plots.py", line 13, in
from sklearn.manifold import TSNE
File "/home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/manifold/init.py", line 6, in
from .isomap import Isomap
File "/home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/manifold/isomap.py", line 11, in
from ..decomposition import KernelPCA
File "/home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/decomposition/init.py", line 11, in
from .sparse_pca import SparsePCA, MiniBatchSparsePCA
File "/home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/decomposition/sparse_pca.py", line 9, in
from ..linear_model import ridge_regression
File "/home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/linear_model/init.py", line 17, in
from .coordinate_descent import (Lasso, ElasticNet, LassoCV, ElasticNetCV,
File "/home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py", line 29, in
from . import cd_fast
ImportError: /home/soleilbleu/.local/lib/python2.7/site-packages/sklearn/linear_model/cd_fast.so: undefined symbol: PyFPE_jbuf

Usage with images

Tried using the implementation as is for working with images, as was being done in sklearn using KMeans / MiniBatchKMeans / Meanshift clustering. But consistently run into MemoryError (for images as small as 200x200 as well). Here is a sample code -

import cv2
import numpy as np
import hdbscan

image = cv2.imread('/home/ubuntu/x.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))

clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
cluster_labels = clusterer.fit_predict(image)

Error stack trace :

MemoryError                               Traceback (most recent call last)
<ipython-input-12-b887e72bd6d5> in <module>()
----> 1 cluster_labels = clusterer.fit_predict(image)

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit_predict(self, X, y)
    338             cluster labels
    339         """
--> 340         self.fit(X)
    341         return self.labels_
    342 

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit(self, X, y)
    320          self._condensed_tree,
    321          self._single_linkage_tree,
--> 322          self._min_spanning_tree) = hdbscan(X, **self.get_params())
    323         return self
    324 

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in hdbscan(X, min_cluster_size, min_samples, metric, p, algorithm)
    235     else:
    236         return _hdbscan_large_kdtree(X, min_cluster_size, 
--> 237                                      min_samples, metric, p)
    238 
    239 class HDBSCAN(BaseEstimator, ClusterMixin):

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in _hdbscan_large_kdtree(X, min_cluster_size, min_samples, metric, p)
    107         p = 2
    108 
--> 109     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples)
    110 
    111     min_spanning_tree = mst_linkage_core_pdist(mutual_reachability_)

/home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2820)()

/home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2432)()

/usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
   1174 
   1175     m, n = s
-> 1176     dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
   1177 
   1178     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']

MemoryError:

Any obvious problem with the code? Or is this to be expected?

import hdbscan error in 0.5

in robust_single_linkage_.py I believe that line 24 should be changed from
from dist_metrics import DistanceMetric
to
from .dist_metrics import DistanceMetric
If you're busy I'll spend some time tomorrow morning getting my github working and try a pull request.

ImportError Traceback (most recent call last)
in ()
----> 1 import hdbscan

/Users/jchealy/anaconda/lib/python3.5/site-packages/hdbscan/init.py in ()
1 from .hdbscan_ import HDBSCAN, hdbscan
----> 2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage

/Users/jchealy/anaconda/lib/python3.5/site-packages/hdbscan/robust_single_linkage_.py in ()
22 from ._hdbscan_linkage import single_linkage, mst_linkage_core_cdist, label
23 from ._hdbscan_boruvka import KDTreeBoruvkaAlgorithm, BallTreeBoruvkaAlgorithm
---> 24 from dist_metrics import DistanceMetric
25 from ._hdbscan_reachability import mutual_reachability
26 from .plots import SingleLinkageTree

ImportError: No module named 'dist_metrics'

Different labeling results when using ball tree vs k-d tree for backing structure

Hi,

Thank you for your work on this library! I have been playing around with classifying some multivariate normal distributions, and noticed that the library outputs different results for some datasets when using a ball tree as the backing structure versus a k-d tree. I'm not familiar with classification algorithms, but should the results be different based on the backing data structure? I have included a set of code below, along with an array of 2D features loadable with numpy, that shows these differences.

numpy version: 1.9.2
hdbscan verison: 0.6.5
scikit-learn version: 0.17

File containing .npy array with features:
features.zip

Reproduction code, may need to edit 'features.npy' to point to path of unzipped file from above.

import numpy as np
import hdbscan
import matplotlib.pyplot as plt

features = np.load('features.npy')

ballClassifier = hdbscan.HDBSCAN(min_cluster_size=32, gen_min_span_tree = True, algorithm='boruvka_balltree')
ballClassifier.fit(features)
ballLabels = ballClassifier.labels_

kdClassifier = hdbscan.HDBSCAN(min_cluster_size=32, gen_min_span_tree = True, algorithm='boruvka_kdtree')
kdClassifier.fit(features)
kdLabels = kdClassifier.labels_

plt.figure()

for subIdx, classifier in enumerate([ballClassifier, kdClassifier]):
    labels = classifier.labels_
    plt.subplot(2, 2, 2*subIdx + 1)
    uniqueLabels = sorted(list(set(labels)))

    colors = plt.cm.Spectral(np.linspace(0, 1, max( len(uniqueLabels), 6)))

    for label, col in zip(uniqueLabels, colors):
        class_member_mask = (labels == label)

        xy = features[class_member_mask]
        plt.plot(xy[:, 0], xy[:, 1], '.', markerfacecolor=col,
                markeredgecolor=col, markersize=1)

    plt.subplot(2, 2, 2*subIdx + 2)
    classifier.condensed_tree_.plot()


plt.show()

Plot showing the ball tree output (top) vs k-d tree output (bottom):

Install hdbscan on windows

Installing hdbscan on windows is really really difficult, because of the various c++ compilers which has to match your python version exactly.

Is it possible for you to create a wheel package for hdbscan for python 3.5 x64?

Created using "pip wheel hdbscan", assuming wheel and hdbscan is installed.

Thanks in advance.

Division by zero with 'precomputed' matrix

Please see the log below:

lbl = hd.HDBSCAN(min_cluster_size=3,metric="precomputed").fit_predict(dst)
File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 549, in fit_predict
self.fit(X)
File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 531, in fit
self.min_spanning_tree) = hdbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan.py", line 379, in hdbscan
return tree_to_labels(X, single_linkage_tree, min_cluster_size) + (result_min_span_tree,)
File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan.py", line 53, in _tree_to_labels
labels, probabilities = get_clusters(condensed_tree, stability_dict)
File "hdbscan/_hdbscan_tree.pyx", line 466, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:9279)
File "hdbscan/_hdbscan_tree.pyx", line 508, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:9142)
File "hdbscan/_hdbscan_tree.pyx", line 418, in hdbscan._hdbscan_tree.get_probabilities (hdbscan/_hdbscan_tree.c:7703)
ZeroDivisionError: float division

'dst' is a Numpy array which encodes the distance matrix. I would appreciate if you would comment on this issue.

Memory usage of HDBSAN for moderate large dataset (0.1 to 1 million)?

Hi,

I used pip -install hdbscan on a 64-bit windows desktop in a python 2.7 environment. The packages works great with my test dataset (a few thousands of point in 3-dimensional space). However when I test on a subset of real data, ~ 50000 points, all memory (~12 GB) was occupied by python and my computer got frozen.
Could you please help me understand what's the memory consumption with respect to data points?

Thank you.

Can't install as hdbscan as a dependency to a package

I'm trying to install hdbscan as part of a package with the command:
virtualenv venv && source venv/bin/activate && pip install -e .

Where requirements.txt contains -e . and setup.py contains:

 install_requires=[
                        'cython==0.23.4',
                        'numpy==1.10.2',
                        'scikit-learn==0.17.0',
                        'hdbscan==0.7.2',
                        ...])

I'm getting the following error

Complete output from command python setup.py egg_info:
    hdbscan/setup.py:8: UserWarning: No module named Cython.Distutils
      warnings.warn(e.message)
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "hdbscan/setup.py", line 13, in <module>
        import numpy
    ImportError: No module named numpy

Doesn't seem to matter that pip installs cython and numpy first...

manual select cluster function

I am trying to add some manual selection on the condensed_tree plot. However, this function is really slow: clusterer.condensed_tree_.plot()

Is it majorly because the Matplotlib slowness or there is other issue? I appreciate any suggestion you give.

Buffer dtype mismatch, expected 'int64_t' but got 'int'

Hi,

I tried running the plot_hdbscan.py example, but it failed with an error:

  File "X:/somepath/example.py", line 44, in <module>
    hdb = HDBSCAN(min_cluster_size=10).fit(X)
  File "C:\Python27\lib\site-packages\hdbscan\hdbscan_.py", line 520, in fit
    self._min_spanning_tree) = hdbscan(X, **self.get_params())
  File "C:\Python27\lib\site-packages\hdbscan\hdbscan_.py", line 362, in hdbscan
    gen_min_span_tree)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
    return self.func(*args, **kwargs)
  File "C:\Python27\lib\site-packages\hdbscan\hdbscan_.py", line 162, in _hdbscan_boruvka_kdtree
    alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
  File "hdbscan/_hdbscan_boruvka.pyx", line 273, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan\_hdbscan_boruvka.c:4629)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'int'

I am not really sure how to proceed now. I think it might be a configuration issue, but the only thing that I think might be relevant is that I had to manually download/install VCForPython27.msi, as instructed by pip, and that I had to manually install cython, as pip install hdbscan kept failing with a cython related error and I figured that might be the issue. I remember reading that cython has to use the same version of C/C++ compiler that was used to compile python itself, but I'm not sure how to verify that is indeed the case (python seems to have used MSC v.1500 32 bit), I can just assume pip pointed me to the right distribution, i.e. VCForPython27.msi.

I'm on Windows 10, Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)] on win32, I have MSVC 2015 installed (if relevant), and pip freeze reports:

cycler==0.9.0
Cython==0.23.4
decorator==4.0.4
fastcluster==1.1.20
hcluster==0.2.0
hdbscan==0.6
matplotlib==1.5.0
numpy==1.9.3
Pillow==3.0.0
pyparsing==2.0.6
python-dateutil==2.4.2
pytz==2015.7
scikit-learn==0.17
scipy==0.16.1
six==1.10.0

which exceeds your requirements.txt. Numpy is with MLK, all libraries installed either through pip or from Christoph Goelke's binaries (http://www.lfd.uci.edu/~gohlke/pythonlibs/).

Any other ideas as to what might be wrong? Thanks!

min_cluster_size is actually "max noise size"

Hi,

Thank you for this amazing clustering algorithm and such easy-to-use library. However I think I've found a minor bug. min_cluster_size keyword actually stands for a maximum size, which is not considered a cluster. See example below:

data['Cluster2'] = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(data[['x', 'y', 'z']])
data['Cluster3'] = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(data[['x', 'y', 'z']])
gb2 = data.groupby('Cluster2')
l = np.nan
for n, cluster in gb2:
    l = np.nanmin([l, cluster.shape[0]])
print l
gb3 = data.groupby('Cluster3')
l = np.nan
for n, cluster in gb3:
    l = np.nanmin([l, cluster.shape[0]])
print l

Prints out
3.0
4.0

(I use the latest version available through pip)

boruvka joblib error

hdbscan 0.6.5, sklearn 0.17.0
calling HDBSCAN.fit() with algorithm=boruvka_kdtree or boruvka_balltree, i sometimes get this following error. it works fine with algorithm=prims_kdtree or prims_balltree

Traceback (most recent call last):
File "", line 1, in
File "c:\python2764\Lib\multiprocessing\forking.py", line 380, in main
prepare(preparation_data)
File "c:\python2764\Lib\multiprocessing\forking.py", line 495, in prepare
'parents_main', file, path_name, etc
...( references to my code calling HDBSCAN.fit() )...
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 531, in fit
self.min_spanning_tree) = hdbscan(X, **self.get_params())
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan.py", line 363, in hdbscan
gen_min_span_tree)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in call
return self.func(_args, *kwargs)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan.py", line 163, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
File "hdbscan/_hdbscan_boruvka.pyx", line 335, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.init (hdbscan_hdbscan_boruvka.c:4746)
File "hdbscan/_hdbscan_boruvka.pyx", line 364, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan_hdbscan_boruvka.c:5401)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 771, in call
n_jobs = self._initialize_pool()
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 518, in _initialize_pool
raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if name == 'main'". Please see the joblib documentation on Parallel for more information

Error in generating condensed tree plot

Hi,

I am getting the following error while generating clusterer.condensed_tree_.plot()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-1ce791da72a0> in <module>()
----> 1 clusterer.condensed_tree_.plot()

/home/intel/anaconda2/lib/python2.7/site-packages/hdbscan/plots.pyc in plot(self, leaf_separation, cmap, select_clusters, label_clusters, axis, colorbar, log_size)
    272                 'Use get_plot_data to calculate the relevant data without plotting.')
    273 
--> 274         plot_data = self.get_plot_data(leaf_separation=leaf_separation, log_size=log_size)
    275 
    276         if cmap != 'none':

/home/intel/anaconda2/lib/python2.7/site-packages/hdbscan/plots.pyc in get_plot_data(self, leaf_separation, log_size)
    112 
    113         for cluster in range(last_leaf, root - 1, -1):
--> 114             split = self._raw_tree[['child', 'lambda']]
    115             split = split[(self._raw_tree['parent'] == cluster) &
    116                           (self._raw_tree['child_size'] > 1)]

/home/intel/anaconda2/lib/python2.7/site-packages/numpy/core/_internal.pyc in _index_fields(ary, names)
    321     for name in names:
    322         if name not in dt.fields:
--> 323             raise ValueError("no field of name %s" % name)
    324 
    325     formats = [dt.fields[name][0] for name in names]

ValueError: no field of name lambda

Also I was wondering if it is possible to extract all clusters from condensed tree corresponding to all lambda levels with the ids of points belonging to them or representing parent child relationship between them. I have already tried clusterer.condensed_tree_.to_pandas() but I was not able to understand its output. I ran the test for 50 points and it generated 54 rows in the dataframe. That will mean 4 clusters if I am understanding it correctly.

But in these 4 clusters as well there should be a parent child relationship in a hierarchy. How to establish that ? Also, how to associate actual points data with the dataframe generated.

Sorry, If my post is not clear. Let me know and I'll provide more details if needed.

More formats on CondensedTree output

Support more output formats; ideally a network graph of only cluster nodes with each node containing all the points contained in that cluster (and the lambda birth value of the cluster as an attribute). Also an alternative numpy/pandas format that contains the same sort of information in a relatively accessible way.

Possible error with high dimensional sparse distanse matrix

Hi!
I am trying to use hdbscan with closeness matrix with shape (202599, 202599). 0 indicates that items are far away from each other.

Code:

  clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="precomputed")
  labels = clusterer.fit_predict(csr)

Error:

Traceback (most recent call last):
  File "clustering.py", line 9, in <module>
    labels = clusterer.fit_predict(csr)
  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 667, in fit_predict
    self.fit(X)
  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 649, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 435, in hdbscan
    gen_min_span_tree, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/memory.py", line 281, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 73, in _hdbscan_generic
    min_samples, alpha)
  File "hdbscan/_hdbscan_reachability.pyx", line 41, in hdbscan._hdbscan_reachability.mutual_reachability (hdbscan/_hdbscan_reachability.c:1564)
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 644, in partition
    a.partition(kth, axis=axis, kind=kind, order=order)```

Extract tree structure from clustering

I'm wondering if it is possible to get a tree-like structure of the resulting clustering. What I'm ultimately trying to do is to get a dendogram overlaid on top of a heat map (e.g. seaborn.clustermap)

From looking at the source code, it seems like there's a condensed tree attribute and that there's a _raw_tree attribute, is that what I should be looking at?

cc @ElDeveloper

Add something about parameter selection to the notebooks?

I really appreciate the work you've done with the "How HDBSCAN Works" notebook, but they don't really touch on the issue of parameter selection. You do mention at the end of the clustering comparison notebook that HDBSCAN is not that sensitive to the choice of min_samples, but that isn't much to go on.

If I'm not mistaken, the alpha and min_samples parameters are both "inherited" from robust single linkage clustering. But Chaudhuri and Dasgupta are not terribly forthcoming about what their alpha parameter actually does, apart from stating that "we'd like alpha as small as possible" (pg. 6, "Rates of convergence for the cluster tree"). So I'm left scratching my head as to how to select these.

Do you think you'd be able to talk about how to intelligently choose the core size (min_samples) and that alpha parameter? You also mention sensible defaults, but you don't mention what makes them sensible.

Unfortunately, I don't have access to Campello, et al.'s original paper, so if it's explained there I won't be able to see it. Dr. Campello mentions on his site that unformatted copies of his work are available, and I'm planning on reaching out to him for that purpose if I can't get access through my old university. But I still think it would be a great addition to these already very helpful notebooks (and it can't hurt the adoption of HDBSCAN itself as a "sensible default" for clustering applications).

Extracting the hierarchy

I am trying to extract the hierarchy to a pandas dataframe, following the "how HDBSCAN works" notebook, by invoking:
>>> clusterer.single_linkage_tree_.to_pandas()
and I get:

Traceback (most recent call last):
File "", line 1, in
File "build/bdist.linux-x86_64/egg/hdbscan/plots.py", line 520, in to_pandas
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 226, in init
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 363, in _init_dict
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5158, in _arrays_to_mgr
index = extract_index(arrays)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5206, in extract_index
raise ValueError('arrays must all be same length')
ValueError: arrays must all be same length

Am I doing something wrong?

It would also be great if there was a numpy array export capability out there in the form of scipy.cluster.hierarchy.linkage output; inspecting the above to_pandas method suggests that the returned dataframe would be of a different format --if I ever manage to get it right:

inspect.getmembers(clusterer.single_linkage_tree_.to_pandas)

Return a pandas dataframe representation of the single linkage tree.\n\n Each row of the dataframe corresponds to an edge in the tree.\n The columns of the dataframe are parent, left_child, \n right_child, distance and size.\n\n The parent, left_child and right_child are the ids of the\n parent and child nodes in the tree. Node ids less than the number \n of points in the original dataset represent individual points, while\n ids greater than the number of points are clusters.\n\n The distance value is the at which the child nodes merge to form\n the parent node.\n\n The size is the number of points in the parent node.\n

Array of cluster components and cluster labels

Hello!

Thank you for this amazing algorithm! I am trying to transition from DBSCAN (sklearn) to this version of hdbscan. I need some help figuring out this algorithm.

Using,DBSCAN from sklearn, I can get the clustered components with db.components_. Is there any way of doing this using hdbscan?

My intention is to create a 2d array with the cluster label on one column and the components on the other. Basically, I would like to look at each data point and its relevant cluster. What is the best way of doing this using hdbscan?

Thank you!

hdbscan.HDBSCAN().fit('group')

I know that the algorithm will fit very easily using a column from a full pandas dataframe, but is there an elegant solution for 'fitting' across categorical groups, either by using 'groupby.transform' or through iteration?

pip install error under python 3.5

I tried to pip install hdbscan in a fresh conda environment under python 3.5. I get an odd cython error.

(py35)MacBook-Pro-03:hdbscan jchealy$ pip install hdbscan
Collecting hdbscan
Downloading hdbscan-0.4.2.tar.gz (553kB)
100% |████████████████████████████████| 557kB 479kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "/private/var/folders/_k/3j8v2fk52w70jlrs965xk79c0000gr/T/pip-build-55_jvcgz/hdbscan/setup.py", line 4, in
from Cython.Distutils import build_ext
ImportError: No module named 'Cython'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/private/var/folders/_k/3j8v2fk52w70jlrs965xk79c0000gr/T/pip-build-55_jvcgz/hdbscan/setup.py", line 8, in <module>
    warnings.warn(e.message)
AttributeError: 'ImportError' object has no attribute 'message'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/_k/3j8v2fk52w70jlrs965xk79c0000gr/T/pip-build-55_jvcgz/hdbscan

More windows fun

Ok, I'm trying to get this up and running under windows 7 64-bit via the conda install.

Conda install was had no hitches but upon importing hdbscan I get an error.

import hdbscan

ImportError Traceback (most recent call last)
in ()
----> 1 import hdbscan

H:\Programs\Anaconda\lib\site-packages\hdbscan__init__.py in ()
----> 1 from .hdbscan_ import HDBSCAN, hdbscan
2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage

H:\Programs\Anaconda\lib\site-packages\hdbscan\hdbscan_.py in ()
27 check_array = check_arrays
28
---> 29 from ._hdbscan_linkage import (single_linkage,
30 mst_linkage_core,
31 mst_linkage_core_pdist,

ImportError: No module named 'hdbscan._hdbscan_linkage

I've got the appropriate .so files here:

Directory of H:\Programs\Anaconda\Lib\site-packages\hdbscan
12/06/2015 10:32 AM

.
12/06/2015 10:32 AM ..
12/06/2015 10:31 AM 0 dir
12/03/2015 10:23 PM 315,520 dist_metrics.so
12/03/2015 10:23 PM 24,265 hdbscan_.py
12/03/2015 10:23 PM 26,905 plots.py
12/03/2015 10:23 PM 14,579 robust_single_linkage_.py
12/03/2015 10:23 PM 373,264 _hdbscan_boruvka.so
12/03/2015 10:23 PM 247,144 _hdbscan_linkage.so
12/03/2015 10:23 PM 95,092 _hdbscan_reachability.so
12/03/2015 10:23 PM 324,792 _hdbscan_tree.so
12/03/2015 10:23 PM 118 init.py
12/06/2015 10:20 AM pycache
11 File(s) 1,421,679 bytes
3 Dir(s) 2,647,232,114,688 bytes free

I'm not a huge windows guys but I'll poke around and see if I can work it out.

Support streaming updating of clusters

In principle this is quite easy: keep the minimal spanning tree, and given a new point, compute core distance, and add an edge to the spanning tree to connect the point. Now run a min spanning tree correction algorithm -- there are several, and some are very cheap, that convert a near minimum spanning tree to a minimum spanning tree. With that done you can rerun condense tree and extract clusters.

Obviously batching up blocks of new points is going to be better than true streaming, but this should work in principle.

import hdbscan fails with UnicodeDecodeError

Hi,

I use Anaconda on Windows 10 x64 with Python 3.4 and I installed hdbscan with:

conda install cython
conda install scikit-learn
pip install hdbscan

When I import hdbscan I get the following error:

[pyenv] C:\temp\lakdjfklj>ipython
Python 3.4.5 |Continuum Analytics, Inc.| (default, Jul  5 2016, 14:53:07) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import hdbscan
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-3f1c79fb1b69> in <module>()
----> 1 import hdbscan

C:\temp\lakdjfklj\pyenv\lib\site-packages\hdbscan\__init__.py in <module>()
----> 1 from .hdbscan_ import HDBSCAN, hdbscan
      2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage

C:\temp\lakdjfklj\pyenv\lib\site-packages\hdbscan\hdbscan_.py in <module>()
     28     check_array = check_arrays
     29
---> 30 from ._hdbscan_linkage import (single_linkage,
     31                                mst_linkage_core,
     32                                mst_linkage_core_vector,

hdbscan/_hdbscan_linkage.pyx in init hdbscan._hdbscan_linkage (hdbscan\_hdbscan_linkage.c:20689)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 4: invalid start byte

nan probabilities

        model = hdbscan.HDBSCAN(metric="manhattan", min_cluster_size=min_samples)
        model.fit(data)

model.probabilities_ has several NaN values, what does that mean?
can i avoid it?

The case of unclustered data for HDBSCAN

Hi,

First, thank you for this awesome implementation of HDBSCAN. It is certainly one of the best and most generic clustering algorithm and having this high performance implementation is cool with SciKit.

My datasets are groups of large time series with the same length (between 2000 and 15000 time points and 20-200 series in a dataset). Some series are really similar and i want to regroup them. In such high dimensions and so few series, cluster are not spherical and i don''t know how many clusters there are. At first, i tried DBSCAN but i need to take count of variable density of my clusters that i saw with a dendrogram of hierarchical clustering. That's why i tried HDBSCAN now and it seems to be a good one for my problem.

However, i have datasets with only one cluster of series (at sight...) and HDBSCAN return more clusters than one. So i took a look at the implementation and i saw that you exclude the root of the condensed_tree so that the algorithm will never return only one cluster in your implementation.

I tried to edit it myself in the file_hdbscan_tree.pyx (i added the root, remplaced by zeros the NaN in births list and deleted the last child for which stability is not well-defined). I succeed to "unlock" the possibility but i experiment issues with some other datasets with several clusters for which the results of your version HDBSCAN are correct, the results are different with mine without reason (the root cluster is not stable enough but in my version many little clusters are created).

So here is my request : is it possible for you to add a parameter to choose when to exclude the root of the condensed_tree or not in the evaluation of the number of clusters ? I saw that you provided classes to make some choice but it's a bit difficult to deal with stability computations in the framework. I think my request is not a big deal to code.

I will continue to work on it to make a custom version where this case is allowed because i really need it but it should be great for other to have the choice for this particular case. In my opinion, stability of the root cluster makes sense and it must be considered as the others, but i understand why enforce the clustering.

Thank you again for this so cool implementation, i can provide you what i change to add this possibility if you wish,

Best,

C++ Memory error on Linux/Docker

First off, thanks for your work on the implementation of the algorithm, it's excellent. I do most of my development on OSX and HDBSCAN has worked like a charm.

However, I've recently been trying to deploy some models requiring the package on some EC2 instances and keep getting segmentation faults associated with the munmap_chunk() method. As an example, I've included the output of a test script that I've run either on my development machine, or on the Linux box:

Test Script

import numpy as np
import cython as ct
import scipy as sp
import sklearn as sk

import subprocess

# get machine info
proc_cpu = subprocess.Popen(['/bin/bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
cpu_info = proc_cpu.communicate('lscpu')[0]
proc_os = subprocess.Popen(['/bin/bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
os_info = proc_os.communicate('uname -a')[0]

# custom logger
from crunch.logger import Logger 
logger = Logger('test hdbscan')

# report info on current machine
logger.log.info('Current hardware configuration:\n%s' % cpu_info)
logger.log.info('Current os configuration:\n%s' % os_info)

# display requirements and current python distros
requirements = """
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16
"""
logger.log.info('Requirements for HDBSCAN:\n%s' % requirements)

current_config = """
cython: %s
numpy: %s
scipy: %s
scikit-learn: %s
""" % (ct.__version__, np.__version__, sp.__version__, sk.__version__)
logger.log.info('Current Python configuration:\n%s' % current_config)

# generate cluster data
np.random.seed(4711)  # for repeatability 
sample_size = 50
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[sample_size,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[sample_size,])
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[sample_size,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[sample_size,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[sample_size,])
f = 300. * (np.random.rand(sample_size,2) - .5)

# cast data as different dtypes (doesn't matter)
X = np.asarray(np.concatenate((a, b, c, d, e, f),), dtype=np.float32) 

# attempt to run clustering
logger.log.info('Running clustering...')
from hdbscan import HDBSCAN
cl = HDBSCAN(min_cluster_size=10)
classes = cl.fit_predict(X)
logger.log.info('Success!')

logger.log.info('Results:')
print classes

The output when run on OSX

/bin/bash: line 1: lscpu: command not found
INFO   | test hdbscan | Current hardware configuration:

INFO   | test hdbscan | Current os configuration:
Darwin crunch.local 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64

INFO   | test hdbscan | Requirements for HDBSCAN:
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16

INFO   | test hdbscan | Current Python configuration:
cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
scikit-learn: 0.17.1

INFO   | test hdbscan | Running clustering...
INFO   | test hdbscan | Success!
INFO   | test hdbscan | Results:
[ 3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
 -1 -1 -1 -1 -1  1  0 -1 -1 -1 -1  3 -1  1 -1  2 -1 -1 -1 -1 -1 -1 -1 -1  2

Great! Now let's try it on EC2...

The output when run on Linux

INFO   | test hdbscan | Current hardware configuration:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.058
BogoMIPS:              5000.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3

INFO   | test hdbscan | Current os configuration:
Linux 44f4d4de5575 3.13.0-71-generic #114-Ubuntu SMP Tue Dec 1 02:34:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

INFO   | test hdbscan | Requirements for HDBSCAN:
cython>=0.22
numpy>=1.9
scipy >= 0.9
scikit-learn>=0.16

INFO   | test hdbscan | Current Python configuration:
cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
scikit-learn: 0.17

INFO   | test hdbscan | Running clustering...
*** Error in `python': munmap_chunk(): invalid pointer: 0x00000000030cef80 ***

This likely isn't a problem with the HDBSCAN package per se, but perhaps how Linux vs OSX allocates/deallocates memory. Unfortunately, I do not have direct access to the Linux box/Docker containers in order to run Valgrind or the like in order to track down the error (they're deployed by a 3rd party service, i.e. the *** Error in 'python':...***).

One possible solution is to add a compilation flag that checks for memory size, though this may affect performance. Would love to hear your thoughts...

doubt about the labels array

What does -1 mean in the labels array after the method "fit()" is called?
I guess it's a value to indicate that the clustering algorithm wasn't able to assign the element to a cluster, is it right?

RobustSingleLinkage doesn't work

OK, a bigger issue now... I wanted to play a little with robust single linkage, but seems like it doesn't work:

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)

And I get this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-87-39b93d0a5ec8> in <module>()
----> 1 clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
      2 cluster_labels = clusterer.fit_predict(data)

AttributeError: 'module' object has no attribute 'RobustSingleLinkage'

Am I doing something wrong?

Prims algorithm is broken for large dimensional datasets

Run the Prims version of HDBSCAN on the digits dataset from sklearn and you will get awful results. Both Boruvka and the generic approach get it right. What's the problem here? Still to be determined.

soft clustering

This algorithm seems very efficient and promising in my application.

One issue: In the document is was said there is a soft clustering method that shows the membership strength, however, I only see attribute probabilities_. labels_ and outlier_scores_. And the probability with label -1 is always 0.

The problem is that the algo always gives lots of -1 cluster that don't belong to any clusters but clearly there are lots of points in -1 cluster should belongs to some other cluster.

A "Predict" Function

Is it possible to integrate a "predict" function, which allows for the prediction of labels on new data? Currently, to do this task I'm using a nearest neighbor search but that's particularly inefficient. I might work on building this for myself from your codebase if my schedule is forgiving!

Pip Fails to install

Hi,

Linux here, Python 2.7 with the 8.1.2 version of pip! The setup process seems to fails to find a file named: '_hdbscan_tree.c'. This happens with pip or manual install.

Any help appreciated.

conda install issue

Hi there. I'm using hdbscan via a pip install hdbscan successfully in a Linux 64 bit Anaconda Python 3.4 environment (cheers!).

However - if I try to use the conda installer against what I think is your channel, I get a problem:

$ conda install -c https://conda.anaconda.org/lmcinnes hdbscan
<success>
...
$ ipython
...
In [1]: import hdbscan
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-3f1c79fb1b69> in <module>()
----> 1 import hdbscan

/home/ian/anaconda/envs/scratch/lib/python3.4/site-packages/hdbscan/__init__.py in <module>()
----> 1 from .hdbscan_ import HDBSCAN, hdbscan
      2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage

/home/ian/anaconda/envs/scratch/lib/python3.4/site-packages/hdbscan/hdbscan_.py in <module>()
     27     check_array = check_arrays
     28 
---> 29 from ._hdbscan_linkage import (single_linkage,
     30                                mst_linkage_core,
     31                                mst_linkage_core_pdist,

ImportError: /home/ian/anaconda/envs/scratch/lib/python3.4/site-packages/hdbscan/_hdbscan_linkage.so: invalid ELF header

I've switched back to the pip version, I'd love to keep this in conda if possible.

cluster selection based on stability is really sensitive to the selection of fall_off_size

Thanks for providing this amazing library. I learnt a lot from your implementation. Among all clustering scheme I've tried, this is one of the best so far.

I spent a while to study your amazing Cython implementation and the idea behind. Minimum spanning tree is really amazingly informative. However, in practical use, I am always feeling there is a problem on the automatic cluster selection. It is just too sensitive to the selection of fall_off_size(min_sample). Besides, when two obvious distinguish clusters are connected by very few noisy points in between, it is more likely they would be put together. I understand mutual distance is used to address that, but automatic selection based on stability seems to bring the quality down.

Before the automatic selection I think everything is perfect, you do have a minimum spanning tree to cut, a single linkage tree to do any migration, split or merge. I do feel there would be some room to improve or even replace the condense tree for automatic selection of cluster.

AssertionError fitting clusters

I was trying your implementation with point samples (around 6,000 points) gathered from a shapefile. I read the points and transformed them to a numpy array as required by the library. When I try:

 clusterer.fit(points_sample)

I get the following error:

Traceback (most recent call last):

  File "<ipython-input-35-c48bcd0dd0bd>", line 1, in <module>
    clusterer.fit(blobs)

  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 489, in fit
    self._min_spanning_tree) = hdbscan(X, **self.get_params())

  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 358, in hdbscan
    p, gen_min_span_tree)

  File "/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.py", line 152, in _hdbscan_large_kdtree
    assert(len(candidates) > 0)

Then I tried running the example from the notebook and raisng the number of samples (for simplicity, I just used the blobs). When I raised the number of samples up to 4,000 I got the exact same error.

I also tried changing the min_cluster_size parameter from 5 up to 50 and still get the same error (both with my data and with the blobs).

Maybe I'm failing to see something?

Fresh linux install: limits.h not found

Hi guys,

I had a minor issue with getting hdbscan to compile on a fresh linux mint install. I figured I'd document it here in to save other folks a bit of hassle.

The meat of the error looked like:
/usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed/limits.h:168:61: fatal error: limits.h: No such file or directory

This was resolved via:
sudo apt-get install g++

How to create clusters using haversine formula

I have some sample long/lat data i would like to have clustered, but i cannot seem to get any meaningful data out of using HDBSCAN. I'm surely doing something wrong.

hdb = HDBSCAN(min_cluster_size=3, metric='haversine').fit(sample_data)
hdb_labels = hdb.labels_
n_clusters_hdb_ = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)

It gives me two clusters with my sample data of long lats(50 long/lats), rest is noise. How exactly do i get the measured distance between the points?

Say i want my distance between points to be ~500 meters, how exactly can i extract that from the clustered data?