d-chambers / dbscan1d

An efficient 1D implementation of the DBSCAN clustering algorithm

License: GNU Lesser General Public License v3.0

Python 80.93% Jupyter Notebook 19.07%

machine-learning clustering dbscan-algorithm python

dbscan1d's Issues

dbscan metric

Hi there,
I noticed that `metric` is not accepted by the constructor, which makes the class incompatible with the official DBSCAN interface. A call like this should be supported:

dbs = DBSCAN1D(eps=1.0, min_samples=10, metric='euclidean')

I have looked into the code and I can see from here:

    def _get_is_core(self, ar):
        """ Determine if each point is a core. """
        mineps = np.searchsorted(ar, ar - self.eps, side="left")
        maxeps = np.searchsorted(ar, ar + self.eps, side="right")
        core = (maxeps - mineps) >= self.min_samples
        return core

This is equivalent to using the Euclidean distance in one dimension, abs(p1 - p2). Is this correct?
Would there be any point in supporting other distances in one dimension? There are not many, but (p1 - p2)^2, for example, would be useful to have.

Maybe raise an exception for any unsupported distance?
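To make the question concrete, here is a minimal sketch of how a metric check could look. It assumes that in one dimension the Euclidean, Manhattan, and Chebyshev distances all reduce to abs(p1 - p2), and that a squared distance threshold (p1 - p2)^2 <= eps is the same as abs(p1 - p2) <= sqrt(eps). The names `eps_for_metric` and `is_core` are hypothetical helpers written for this example; they are not part of dbscan1d.

```python
import numpy as np


def eps_for_metric(eps, metric="euclidean"):
    """Convert eps under a 1D metric into an equivalent abs-difference eps.

    Hypothetical helper for illustration only, not part of dbscan1d.
    """
    if metric in ("euclidean", "manhattan", "chebyshev"):
        # In one dimension these all reduce to abs(p1 - p2).
        return float(eps)
    if metric == "sqeuclidean":
        # (p1 - p2)^2 <= eps  <=>  abs(p1 - p2) <= sqrt(eps)
        return float(np.sqrt(eps))
    raise ValueError(f"unsupported metric for 1D data: {metric!r}")


def is_core(sorted_ar, eps, min_samples):
    """Same searchsorted idea as _get_is_core, on an already sorted array."""
    lo = np.searchsorted(sorted_ar, sorted_ar - eps, side="left")
    hi = np.searchsorted(sorted_ar, sorted_ar + eps, side="right")
    return (hi - lo) >= min_samples


ar = np.array([0.0, 1.0, 2.0, 10.0])
abs_eps = eps_for_metric(4.0, metric="sqeuclidean")
core = is_core(ar, abs_eps, min_samples=3)
```

With this approach the squared distance needs no separate code path: only eps is rescaled, and anything else raises a `ValueError`.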

Possible bug in label vs core counts

Hi,
I found an issue when comparing the number of labeled points with the number of core points.
If you try this:


import numpy as np
from sklearn.datasets import make_blobs

from dbscan1d.core import DBSCAN1D

# make_blobs draws from NumPy's global RNG when random_state is not given,
# so seed NumPy (not the stdlib random module) for reproducibility
np.random.seed(0)

for x in range(1, 100):
  # make blobs to test clustering
  X, y = make_blobs(1_000_000, centers=2, n_features=1)

  # init dbscan object
  dbs = DBSCAN1D(eps=.5, min_samples=4)

  # get labels for each point
  labels = dbs.fit_predict(X)
  core_pts = dbs.core_sample_indices_

  # number of core points vs. number of points with a cluster label (>= 0)
  core_size = core_pts[0].size
  label_size = np.where(labels >= 0)[0].size

  if core_size != label_size:
    print('Total points %d' % len(X))
    print('Cluster ID: %s' % np.unique(labels))
    print('Total noise %s' % np.where(labels < 0)[0].size)
    print('Total core %s' % np.where(labels >= 0)[0].size)
    print('Total core points %d' % core_pts[0].size)

Very often you will find this situation:

Total points 1000000
Cluster ID: [-1  0  1]
Total noise 1
Total core 999999
Total core points 999998

The labels do not match the core point counts.
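One possible explanation, assuming standard DBSCAN semantics: a border point lies within eps of a core point, so it receives a cluster label, but it has fewer than min_samples neighbours and is therefore not a core point itself. In that case the number of labeled points can legitimately exceed the number of core points. The sketch below uses only NumPy on a small hand-made array (the values are made up for illustration) and does not call dbscan1d.

```python
import numpy as np

eps, min_samples = 0.5, 4
# Sorted 1D data: a dense group, one point on its edge, and one far outlier.
ar = np.array([0.0, 0.1, 0.2, 0.3, 0.75, 5.0])

# Core points: at least min_samples points (including itself) within eps.
lo = np.searchsorted(ar, ar - eps, side="left")
hi = np.searchsorted(ar, ar + eps, side="right")
core = (hi - lo) >= min_samples

# Labeled points: within eps of at least one core point (core or border).
core_vals = ar[core]
lo_c = np.searchsorted(core_vals, ar - eps, side="left")
hi_c = np.searchsorted(core_vals, ar + eps, side="right")
labeled = hi_c > lo_c

border = labeled & ~core
noise = ~labeled

print("core:", core.sum(), "labeled:", labeled.sum(), "noise:", noise.sum())
```

Here 0.75 has only two points within eps, so it is not core, yet it is 0.45 away from the core point 0.3 and so belongs to the cluster. If the mismatch in the report above comes from border points, it would be expected behaviour rather than a bug.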

Inconsistent labels values when there is only one cluster

First and foremost, thank you for this efficient implementation of the algorithm for 1D data.
That said, I noticed that the labelling seems to be inconsistent: when there is only one cluster, the points belonging to it are sometimes labeled 1 and sometimes 0. Why does this happen, and could it also happen when there is more than one cluster?
A reproducible example follows:
label_is_zero=[86400.0,86400.0,86400.0,86401.0,86399.0,86400.0,86401.0,86399.0,86400.0,86400.0,86400.0,86402.0,86398.0,86401.0,86399.0,86401.0,86400.0,86399.0,86399.0,86401.0,86399.0,86401.0,86399.0,86402.0,86399.0,86400.0,86401.0,86401.0]
label_is_one=[46823, 46818, 46816, 46816, 46819]
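Until the cause is understood, one workaround is to renumber the cluster labels after fitting so that they always start at 0. The sketch below assumes the usual convention that -1 marks noise and any label >= 0 is a cluster id. `relabel_consecutive` is a hypothetical helper for this example, not a dbscan1d function, and it only normalizes the output; it does not explain why the labels differ.

```python
import numpy as np


def relabel_consecutive(labels):
    """Map cluster ids to 0..k-1 (ordered by original id), keeping -1 as noise.

    Hypothetical helper for illustration only, not part of dbscan1d.
    """
    labels = np.asarray(labels)
    out = np.full(labels.shape, -1, dtype=int)
    clustered = labels >= 0
    if clustered.any():
        # return_inverse gives, for each element, its index among the unique ids
        _, inverse = np.unique(labels[clustered], return_inverse=True)
        out[clustered] = inverse
    return out


single_cluster = relabel_consecutive([1, 1, -1, 1])
two_clusters = relabel_consecutive([2, 0, -1, 2])
```

With this, a single cluster always comes out labeled 0 regardless of the id the clustering step assigned.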
