eren-ck / st_dbscan
ST-DBSCAN: Simple and effective tool for spatial-temporal clustering
License: MIT License
Hello, thanks for a nice and straightforward implementation of the ST-DBSCAN algorithm!
I ended up looking at your source and tried to understand it myself. What you essentially do is build a distance matrix in which every pair that does not meet the temporal eps has its distance set to twice the spatial eps, and then call sklearn's DBSCAN on that distance matrix.
For this block of code in st_dbscan.py:
time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric=self.metric))
euc_dist = squareform(pdist(X[:, 1:], metric=self.metric))
# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)
db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(dist)
You called squareform twice to form two square matrices for the computed time and spatial distances. dist is then a third square matrix with the same dimensions as time_dist and euc_dist. This means three relatively large matrices are held in memory at once, depending of course on the data size. My datasets all have > 50000 rows, which results in approx. 50000 by 50000 matrices; my data is float64, so each matrix takes over 16GB of memory, and the algorithm breaks unless the data is processed in chunks.
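The memory figure above can be checked with a few lines of arithmetic. This is only a rough sketch: the helper names square_bytes and condensed_bytes are made up for illustration, and the numbers count raw array storage for float64 (8 bytes per entry), ignoring any temporaries that NumPy or SciPy may allocate.

```python
def square_bytes(n, itemsize=8):
    """Bytes needed for a dense n x n matrix (itemsize=8 for float64)."""
    return n * n * itemsize


def condensed_bytes(n, itemsize=8):
    """Bytes needed for the condensed form, which stores only the
    n * (n - 1) / 2 pairwise distances above the diagonal."""
    return n * (n - 1) // 2 * itemsize


n = 50_000
square_gb = square_bytes(n) / 1e9        # one square matrix, in GB
condensed_gb = condensed_bytes(n) / 1e9  # one condensed vector, in GB

print(f"square:    {square_gb:.2f} GB")
print(f"condensed: {condensed_gb:.2f} GB")
```

For 50000 rows, one square float64 matrix is 20 GB, so three of them held at once would need about 60 GB.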
What I would do is this:
time_dist = pdist(X[:, 0].reshape(n, 1), metric=self.metric)
euc_dist = pdist(X[:, 1:], metric=self.metric)
# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)
db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(squareform(dist))
In this case, only one square matrix is needed.
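A quick way to check that the two variants agree is to run both on random data and compare the results. The sketch below assumes NumPy and SciPy are installed; the function names filtered_square and filtered_condensed and the random test data are made up for illustration, and the DBSCAN call itself is left out because both variants would pass it the same matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def filtered_square(X, eps1, eps2, metric="euclidean"):
    """Variant from the quoted block: three square matrices are built."""
    n = X.shape[0]
    time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric=metric))
    euc_dist = squareform(pdist(X[:, 1:], metric=metric))
    return np.where(time_dist <= eps2, euc_dist, 2 * eps1)


def filtered_condensed(X, eps1, eps2, metric="euclidean"):
    """Suggested variant: filter in condensed form, expand once at the end."""
    n = X.shape[0]
    time_dist = pdist(X[:, 0].reshape(n, 1), metric=metric)
    euc_dist = pdist(X[:, 1:], metric=metric)
    dist = np.where(time_dist <= eps2, euc_dist, 2 * eps1)
    return squareform(dist)


# Hypothetical data: column 0 is time, columns 1-2 are x/y coordinates.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 100, 200), rng.normal(size=(200, 2))])

a = filtered_square(X, eps1=0.5, eps2=10)
b = filtered_condensed(X, eps1=0.5, eps2=10)
same = np.array_equal(a, b)
print(same)
```

The diagonal is zero in both variants: in the square version the time distance of a point to itself is 0, which always passes the eps2 filter, and squareform fills the diagonal with zeros when expanding a condensed vector.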
Hi there,
I was reading the original paper and your implementation. Correct me if I am wrong, but there is no provision for the density factor, so there will be issues identifying adjacent clusters.
Unless this is implemented in the standard DBSCAN algorithm in sklearn, but I can't find any info there either.
First, thanks for this implementation of the clustering method.
I was trying to use it with spatial attributes (x, y), where each location has a time series rather than a single value.
I understood from the paper that this was possible.
But from the demo and the comment here
st_dbscan/src/st_dbscan/st_dbscan.py
Line 75 in ace025a
Hi! Thanks so much for this implementation.
I wanted some guidance on how to use a distance metric other than the default Euclidean.
I have data with multiple features and wanted to use another distance metric, such as Mahalanobis.
Would the implementation be as follows?
st_dbscan = ST_DBSCAN(eps1 = 0.4, eps2 = 5, min_samples = 5, metric = 'mahalanobis')
I did try the above, but got a "Singular matrix" error. However, when I checked the correlation, it seemed to be OK.
Also, if I wanted to use a different weighting for each of the features when calculating the distance, how should I go about it?
Would be grateful if you could please help out.
Thanks
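On the weighting question, one generic approach (not something this library documents) is to multiply each feature column by the square root of its weight before clustering. The plain Euclidean distance on the scaled data then equals the weighted Euclidean distance on the original data. The sketch below assumes NumPy; the helper name scale_by_weights and the sample values are made up for illustration. Going by the X[:, 0] / X[:, 1:] split in the code quoted in an earlier issue, the scaling would presumably apply only to the columns after the time column.

```python
import numpy as np


def scale_by_weights(features, weights):
    """Scale each feature column by sqrt(weight) so that plain Euclidean
    distance on the result equals weighted Euclidean distance on the input."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("weights must be non-negative")
    return np.asarray(features, dtype=float) * np.sqrt(w)


# Two hypothetical feature vectors and per-feature weights.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])
weights = [4.0, 1.0, 0.25]

scaled = scale_by_weights(np.vstack([a, b]), weights)

# Euclidean distance after scaling ...
plain = np.linalg.norm(scaled[0] - scaled[1])
# ... matches sqrt(sum(w_i * (a_i - b_i)^2)) computed directly.
direct = np.sqrt(np.sum(np.asarray(weights) * (a - b) ** 2))

print(plain, direct)
```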
Does it work on higher-dimensional data such as 3D (x, y, z)? I tested with 3D data and everything seemed to be fine; I just want you to confirm.
By the way, I've been looking for a C++ implementation of ST-DBSCAN as well, but with no luck. Do you know of a C++ version?
What would be the units of the values in eps1 = 0.05, eps2 = 10? Are they m/km or s/min?
Hello, eren-ck!
When I use "st_dbscan.fit_frame_split", I set
st_dbscan.fit_frame_split(data, frame_size = 500)
Sometimes it goes well, sometimes it goes wrong:
the length of the labels doesn't equal the length of the data.
How should I solve this problem? Thank you!
I tested 10000 records of random data with frame_size = 50 and got:
138 if not frame_size > 0.0 or not frame_overlap > 0.0 or frame_size < frame_overlap:
139 raise ValueError(
--> 140 'frame_size, frame_overlap not correctly configured.')
141
142 # unique time points
ValueError: frame_size, frame_overlap not correctly configured.
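The check on line 138 of the traceback can be reproduced in isolation to see which argument combinations it rejects. In this sketch the condition and error message are copied from the traceback, while the function names check_frame_config and is_valid and the sample values are made up for illustration; the frame_overlap value actually in effect when the error above was raised is not shown in the report.

```python
def check_frame_config(frame_size, frame_overlap):
    """Standalone copy of the condition shown in the traceback."""
    if not frame_size > 0.0 or not frame_overlap > 0.0 or frame_size < frame_overlap:
        raise ValueError('frame_size, frame_overlap not correctly configured.')


def is_valid(frame_size, frame_overlap):
    """Return True if the pair passes the check, False if it raises."""
    try:
        check_frame_config(frame_size, frame_overlap)
    except ValueError:
        return False
    return True


# Hypothetical combinations of (frame_size, frame_overlap).
results = {
    (50, 10): is_valid(50, 10),    # overlap smaller than frame: accepted
    (50, 50): is_valid(50, 50),    # equal values: accepted
    (50, 100): is_valid(50, 100),  # overlap larger than frame: rejected
    (50, 0): is_valid(50, 0),      # non-positive overlap: rejected
}
print(results)
```

So with frame_size = 50, the check fails whenever the overlap is zero, negative, or greater than 50.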
Hi there,
First, congrats on the great implementation of ST-DBSCAN. I'm evaluating it for research on spatiotemporal clustering of meteorological data (with multiple features).
However, I have a question on your implementation:
An example of a data point would be the following, where 0 is the first timestep, 300 is a specific city, x is the latitude, y is the longitude, feature 1 is precipitation, feature 2 is temperature, and feature n is solar radiation:
[0, 300, -23.5505, -46.6333, 1271.001, 28.763, 17.971]
Thanks a lot!
Roberto
Hey! As mentioned in #7, there seem to be edge cases where the labels computed by fit_frame_split() don't match the row count of the X fed to it. Not quite sure what's causing it at first glance!
The error in question:
ValueError: Length of values (17465) does not match length of index (17612)
The use in question (it is sorted by timestamp ascending before it goes in):
clustering = ST_DBSCAN(eps1=0.25, eps2=250, min_samples=10).fit_frame_split(sub_df.loc[:, ["timestamp","x","y"]].values, 2000)
sub_df["cluster"] = clustering.labels
Attached is the subset CSV of ordered timestamp/x/y data that yielded this for me. Timestamp is unix_millis; x/y are in an arbitrary space for particle data for a side project.
Currently looking into a temporary rewrite of it for the memory constraints I'm fighting with (I turned here because with fit(), some very large (>100k) position datasets that are only a couple hundred MB in Pandas turned out via memory_profile to cause up to a 6.8 GB increment in memory use, which eats heap and crashes smaller workers on my compute cluster, etc. Probably the darn matrices becoming not-so-sparse).