d-chambers / dbscan1d
An efficient 1D implementation of the DBSCAN clustering algorithm
License: GNU Lesser General Public License v3.0
Hi there,
I noticed that the constructor does not accept a metric argument, which makes it incompatible with the official DBSCAN interface. A call like the following should be supported:
dbs = DBSCAN1D(eps=1.0, min_samples=10, metric='euclidean')
I have looked into the code and found this:
def _get_is_core(self, ar):
    """ Determine if each point is a core. """
    mineps = np.searchsorted(ar, ar - self.eps, side="left")
    maxeps = np.searchsorted(ar, ar + self.eps, side="right")
    core = (maxeps - mineps) >= self.min_samples
    return core
That is equivalent to the Euclidean distance in one dimension, abs(p1 - p2). Is this correct?
Would there be any point in supporting other distances in one dimension? There are not many, but the squared distance (p1 - p2)^2, for example, would be useful to have.
Maybe raise an exception for any unsupported metric?
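To make the question concrete, here is a minimal sketch (not dbscan1d's own code) that compares the searchsorted window count with a brute-force abs(p1 - p2) <= eps count on a small sorted array. It also shows that thresholding the squared distance at eps**2 selects the same neighbours, so a squared metric would only amount to rescaling eps. The names window_counts, brute_force_counts and check_metric are hypothetical helpers introduced for illustration, and the ValueError behaviour is only one possible way to reject unsupported metrics.

```python
import numpy as np


def window_counts(sorted_ar, eps):
    """Count points within eps of each point via the searchsorted window."""
    lo = np.searchsorted(sorted_ar, sorted_ar - eps, side="left")
    hi = np.searchsorted(sorted_ar, sorted_ar + eps, side="right")
    return hi - lo


def brute_force_counts(ar, eps):
    """Count points with abs(p1 - p2) <= eps using all pairwise distances."""
    dist = np.abs(ar[:, None] - ar[None, :])
    return (dist <= eps).sum(axis=1)


def check_metric(metric):
    """Hypothetical validation: accept only the Euclidean metric."""
    if metric.lower() != "euclidean":
        raise ValueError(f"Unsupported metric for 1D data: {metric!r}")
    return "euclidean"


# The array must be sorted for searchsorted to be valid.
ar = np.sort(np.array([0.0, 1.0, 2.0, 2.0, 5.0, 9.0, 10.0]))
eps = 1.5

counts = window_counts(ar, eps)
brute = brute_force_counts(ar, eps)
# Squared distance compared against eps**2 picks the same neighbours.
squared = ((ar[:, None] - ar[None, :]) ** 2 <= eps ** 2).sum(axis=1)

print(counts)                          # neighbour counts, each point counts itself
print(np.array_equal(counts, brute))   # window count agrees with abs distance
print(np.array_equal(counts, squared)) # and with the squared-distance threshold
```

Note that each count includes the point itself, which matches the way min_samples is compared in the snippet above.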
Hi,
I found an issue when comparing the number of labeled points with the number of core points.
If you try this:
from sklearn.datasets import make_blobs
from dbscan1d.core import DBSCAN1D
import numpy as np
import random

# note: make_blobs draws from NumPy's RNG, so this seed does not fix the blobs
random.seed(0)
for x in range(1, 100):
    # make blobs to test clustering
    X, y = make_blobs(1_000_000, centers=2, n_features=1)
    # init dbscan object
    dbs = DBSCAN1D(eps=.5, min_samples=4)
    # get labels for each point
    labels = dbs.fit_predict(X)
    core_pts = dbs.core_sample_indices_
    core_size = core_pts[0].size
    # number of points assigned to any cluster (label >= 0)
    label_size = np.where(labels >= 0)[0].size
    if core_size != label_size:
        print('Total points %d' % len(X))
        print('Cluster ID: %s' % np.unique(labels))
        print('Total noise %s' % np.where(labels < 0)[0].size)
        print('Total core %s' % np.where(labels >= 0)[0].size)
        print('Total core points %d' % core_pts[0].size)
Very often you will find this situation:
Total points 1000000
Cluster ID: [-1 0 1]
Total noise 1
Total core 999999
Total core points 999998
The number of labeled points does not match the number of core points.
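One possible explanation, under the standard DBSCAN definitions, is border points: a point that is not itself a core point but lies within eps of one still receives that cluster's label. The sketch below is a NumPy-only illustration of that definition and not dbscan1d's implementation; core_and_border is a hypothetical helper, and the small array is made-up data chosen so that exactly one border point appears.

```python
import numpy as np


def core_and_border(sorted_ar, eps, min_samples):
    """Return boolean masks (is_core, is_border) for a sorted 1D array."""
    lo = np.searchsorted(sorted_ar, sorted_ar - eps, side="left")
    hi = np.searchsorted(sorted_ar, sorted_ar + eps, side="right")
    # A core point has at least min_samples points (itself included) within eps.
    is_core = (hi - lo) >= min_samples
    # Prefix sums of the core mask give the number of core points in any window.
    core_prefix = np.concatenate([[0], np.cumsum(is_core)])
    core_in_window = core_prefix[hi] - core_prefix[lo]
    # A border point is not core but has at least one core point within eps.
    is_border = ~is_core & (core_in_window > 0)
    return is_core, is_border


ar = np.array([0.0, 0.4, 0.8, 1.7, 10.0])  # already sorted
is_core, is_border = core_and_border(ar, eps=1.0, min_samples=3)

n_core = int(is_core.sum())
n_labeled = int((is_core | is_border).sum())
n_noise = int((~is_core & ~is_border).sum())

print("core points   :", n_core)
print("labeled points:", n_labeled)
print("noise points  :", n_noise)
```

Here the point at 1.7 has only two points within eps, so it is not core, but it is within eps of the core point at 0.8 and would be labeled. If that is what happens in the run above, a labeled count one larger than the core count is expected rather than a bug.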
First and foremost, thank you for this efficient implementation of the algorithm for 1D data.
That said, I noticed that the labelling seems to be inconsistent: when there is only one cluster, the points belonging to it are sometimes labeled 1 and sometimes 0. Why does this happen, and could it also happen when there is more than one cluster?
A reproducible example follows:
label_is_zero=[86400.0,86400.0,86400.0,86401.0,86399.0,86400.0,86401.0,86399.0,86400.0,86400.0,86400.0,86402.0,86398.0,86401.0,86399.0,86401.0,86400.0,86399.0,86399.0,86401.0,86399.0,86401.0,86399.0,86402.0,86399.0,86400.0,86401.0,86401.0]
label_is_one=[46823, 46818, 46816, 46816, 46819]
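Whatever the cause turns out to be, code that only needs stable ids could renumber the cluster labels after the fact. This is a minimal sketch of such a workaround, assuming noise is marked with a negative label as in the output of the previous report; normalize_labels is a hypothetical helper and not part of dbscan1d, and the label arrays are made up for illustration.

```python
import numpy as np


def normalize_labels(labels):
    """Renumber non-negative cluster labels to 0..k-1, keeping noise as -1."""
    labels = np.asarray(labels)
    out = np.full(labels.shape, -1, dtype=int)
    clustered = labels >= 0
    # np.unique sorts the distinct labels; inverse maps each to its rank.
    _, inverse = np.unique(labels[clustered], return_inverse=True)
    out[clustered] = inverse
    return out


single = normalize_labels([1, 1, -1, 1])       # one cluster labeled 1
several = normalize_labels([3, 1, -1, 3, 1])   # two clusters with gaps in ids

print(single)
print(several)
```

The renumbering follows the sorted order of the original ids, so it removes gaps and offsets but does not by itself guarantee that a given physical cluster keeps the same id across runs.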