eren-ck / st_dbscan

ST-DBSCAN: Simple and effective tool for spatial-temporal clustering

License: MIT License

Python 100.00%

clustering dbscan-clustering spatio-temporal spatio-temporal-analysis

st_dbscan's Issues

Usage of squareform

Hello, thanks for a nice and straightforward implementation of the ST-DBSCAN algorithm!

I ended up reading your source to understand it myself. Essentially, you build a distance matrix in which every pair of points that fails the temporal eps has its distance set to twice the spatial eps, and then you call sklearn's DBSCAN on that distance matrix.

For this block of code in st_dbscan.py:

time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric=self.metric))
euc_dist = squareform(pdist(X[:, 1:], metric=self.metric))

# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(dist)

You call squareform twice to form two square matrices, one for the time distances and one for the spatial distances. dist is then a third square matrix with the same dimensions as time_dist and euc_dist. This means three large matrices are held in memory at once; how large depends on the data size. My datasets all have more than 50000 rows, which gives matrices of approximately 50000 by 50000. My data is float64, so each matrix takes over 16 GB of memory, and the algorithm breaks unless I process the data in chunks.
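The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the helper names `dense_matrix_bytes` and `condensed_bytes` are made up for this example and are not part of st_dbscan.

```python
def dense_matrix_bytes(n_rows: int, itemsize: int = 8) -> int:
    """Bytes needed for one dense n_rows x n_rows matrix (float64 by default)."""
    return n_rows * n_rows * itemsize


def condensed_bytes(n_rows: int, itemsize: int = 8) -> int:
    """Bytes needed for the condensed vector returned by pdist: n*(n-1)/2 entries."""
    return n_rows * (n_rows - 1) // 2 * itemsize


n = 50_000
square = dense_matrix_bytes(n)
condensed = condensed_bytes(n)

print(f"one square matrix:    {square / 1e9:.1f} GB")
print(f"one condensed vector: {condensed / 1e9:.1f} GB")
print(f"three square matrices: {3 * square / 1e9:.1f} GB")
```

A condensed vector stores each pair once and omits the diagonal, so it needs roughly half the memory of the corresponding square matrix.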

What I would do is this:

time_dist = pdist(X[:, 0].reshape(n, 1), metric=self.metric)
euc_dist = pdist(X[:, 1:], metric=self.metric)

# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(squareform(dist))

In this case, only one square matrix is needed, because the filtering is done on the condensed vectors that pdist returns.
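A quick way to check that the two variants produce the same input for DBSCAN is to compare them on a small synthetic dataset. The sketch below uses made-up random data and plain variables `eps1` and `eps2` in place of the class attributes; it does not call st_dbscan itself.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n = 200
# Column 0 is time, the remaining columns are spatial coordinates (synthetic).
X = np.column_stack([rng.uniform(0, 100, n), rng.normal(size=(n, 2))])
eps1, eps2 = 0.5, 10.0

# Variant 1: expand to square matrices first, then filter (three n x n arrays).
time_sq = squareform(pdist(X[:, [0]], metric="euclidean"))
euc_sq = squareform(pdist(X[:, 1:], metric="euclidean"))
dist_square = np.where(time_sq <= eps2, euc_sq, 2 * eps1)

# Variant 2: filter the condensed vectors, then expand once (one n x n array).
time_c = pdist(X[:, [0]], metric="euclidean")
euc_c = pdist(X[:, 1:], metric="euclidean")
dist_condensed = squareform(np.where(time_c <= eps2, euc_c, 2 * eps1))

same = bool(np.array_equal(dist_square, dist_condensed))
print("identical distance matrices:", same)
```

The second variant still holds three condensed vectors while filtering, but each of those is about half the size of a square matrix, and the single square matrix passed to DBSCAN is unavoidable with metric='precomputed' on a dense array.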

density factor implementation

Hi there,
I was reading the original paper and your implementation. Correct me if I am wrong, but there seems to be no provision for the density factor, so there will be issues identifying adjacent clusters.
That is, unless this is implemented in the standard DBSCAN algorithm in sklearn, but I can't find any information about it there either.

Provide time series before the spatial attributes

First, thanks for this implementation of this clustering method.
I was trying to use it with spatial attributes (x, y), where for each location I have a time series (and not only one value).
I understood from the paper that this was possible.

But from the demo and the comment in st_dbscan/src/st_dbscan/st_dbscan.py (line 75 in ace025a), "X : 2D numpy array with", I understand that only one value can be provided as the temporal feature (and not a complete time series). Am I wrong?
Thanks again
Ronan

Another distance metric

Hi! Thanks so much for this implementation.
I would like some guidance on how to use a distance metric other than the default Euclidean one.
I have data with multiple features and want to use another distance metric, such as Mahalanobis.
Would the usage be as follows?
st_dbscan = ST_DBSCAN(eps1 = 0.4, eps2 = 5, min_samples = 5, metric = 'mahalanobis')

I did try the above, but got a "Singular matrix" error. However, when I checked the correlation between the features, it seemed to be fine.
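One possible explanation, offered only as a hypothesis: when no inverse covariance matrix is supplied, SciPy's Mahalanobis distance inverts the covariance of the data it is given, and that covariance can be singular even when pairwise correlations look unremarkable (for example, when one column is an exact linear combination of several others, or a column is constant). The sketch below checks for this with NumPy; the function name `covariance_is_singular` and the data are made up for illustration.

```python
import numpy as np


def covariance_is_singular(features) -> bool:
    """Return True if the feature covariance matrix is rank deficient."""
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    return bool(np.linalg.matrix_rank(cov) < cov.shape[0])


rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = rng.normal(size=100)
d = a + b + c  # exact linear combination of the other three columns

independent = np.column_stack([a, b, c])
dependent = np.column_stack([a, b, c, d])

print(covariance_is_singular(independent))
print(covariance_is_singular(dependent))
```

Note also that the code quoted in the "Usage of squareform" issue applies the same metric to the single time column, so that column's variance enters the computation as well.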

Also, if I wanted to give each feature a different weight when calculating the distance, how should I go about it?
I would be grateful for any help.

Thanks
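On the weighting question, one generic approach that needs no change to the library is to rescale the feature columns before clustering: multiplying column j by the square root of its weight makes the ordinary Euclidean distance equal to the weighted Euclidean distance. The sketch below demonstrates the identity with NumPy; `apply_feature_weights` is a hypothetical helper, not an st_dbscan function, and it should be applied only to the non-time columns, since the time column is thresholded separately by eps2.

```python
import numpy as np


def apply_feature_weights(features, weights):
    """Scale each column by sqrt(weight) so Euclidean distance becomes weighted."""
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return features * np.sqrt(weights)


p = np.array([0.0, 0.0, 0.0])
q = np.array([3.0, 4.0, 1.0])
weights = [1.0, 1.0, 4.0]  # illustrative weights

scaled = apply_feature_weights(np.vstack([p, q]), weights)
d_scaled = float(np.linalg.norm(scaled[0] - scaled[1]))

# Weighted Euclidean distance computed directly from its definition.
d_direct = float(np.sqrt(np.sum(np.asarray(weights) * (p - q) ** 2)))

print(d_scaled, d_direct)
```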

No issue... Just some questions.

Does it work on higher-dimensional data, such as 3D (x, y, z)? I tested it with 3D data and everything seemed to be fine. I just want you to confirm.

By the way, I've been looking for a C++ implementation of ST-DBSCAN as well, but with no luck. Do you know of one?

Units or metrics

What would be the units of the values in eps1 = 0.05, eps2 = 10? Are they m/km or s/min?
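Judging from the code quoted in the "Usage of squareform" issue, eps2 is compared directly with distances computed from the time column and eps1 with distances computed from the remaining columns, so each threshold appears to be in whatever unit the corresponding input columns use. The sketch below, with made-up timestamps, shows how converting milliseconds to seconds lets a threshold of 10 be read as 10 seconds.

```python
import numpy as np

timestamps_ms = np.array([0, 4_000, 9_000, 30_000])  # illustrative values
t_seconds = timestamps_ms / 1000.0

eps2_seconds = 10
# Pairwise absolute time gaps, in seconds.
gaps = np.abs(t_seconds[:, None] - t_seconds[None, :])
within = gaps <= eps2_seconds

print(gaps)
print(within)
```

The same reasoning applies to the spatial columns: coordinates in metres make eps1 a distance in metres, while scaled or normalised coordinates make it unitless.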

wrong when use "st_dbscan.fit_frame_split"

Hello, eren-ck!
When I use "st_dbscan.fit_frame_split", I call
st_dbscan.fit_frame_split(data, frame_size = 500)
Sometimes it works and sometimes it goes wrong:
the length of the labels does not equal the length of the data.
How should I solve this problem? Thank you!

ValueError: frame_size, frame_overlap not correctly configured

I tested 10000 records of random data with frame_size = 50 and got the following error:

    138         if not frame_size > 0.0 or not frame_overlap > 0.0 or frame_size < frame_overlap:
    139             raise ValueError(
--> 140                 'frame_size, frame_overlap not correctly configured.')
    141 
    142         # unique time points

ValueError: frame_size, frame_overlap not correctly configured.
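The condition shown in the traceback can be reproduced in isolation to see which combinations it rejects. The function below is a stand-alone copy of that one check, written for illustration; how frame_overlap gets its value when it is not passed explicitly is not visible in the traceback, so it is treated here as a plain argument.

```python
def frame_config_is_valid(frame_size, frame_overlap) -> bool:
    """Mirror of the check at line 138 of the traceback, returning a bool instead of raising."""
    rejected = (
        not frame_size > 0.0
        or not frame_overlap > 0.0
        or frame_size < frame_overlap
    )
    return not rejected


print(frame_config_is_valid(50, 10))  # overlap smaller than the frame
print(frame_config_is_valid(50, 0))   # zero overlap
print(frame_config_is_valid(50, 60))  # overlap larger than the frame
```

According to this check alone, the error is raised when the overlap is zero or negative, or when it exceeds frame_size.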

Using the model for multiple features

Hi there,

First, congrats on the great implementation of ST-DBSCAN. I'm evaluating it for research on spatiotemporal clustering of meteorological data (with multiple features).
However, I have a question on your implementation:

How should I organize the inputs to the model for multiple features for each data point? As far as I understood from the documentation, it should be:
[[time_step1, data_point_ID, x, y, feature 1, feature 2, feature n],[time_step2, data_point_ID, x, y, feature 1, feature 2, feature n]]

An example data point would be the following: at the first time step, 300 is the ID of a specific city, x is latitude, y is longitude, feature 1 is precipitation, feature 2 is temperature, and feature n is solar radiation:
[0, 300, -23.5505, -46.6333, 1271.001, 28.763, 17.971]
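Going only by the code quoted in the "Usage of squareform" issue, column 0 is treated as time and every remaining column feeds a single distance that is compared with eps1. Under that reading, an ID column would be mixed into the distance, so a layout without it may be preferable. The sketch below is an assumption-laden illustration: the second row is invented, and `ID_COLUMN` is a name made up for this example.

```python
import numpy as np

# Rows laid out as [time_step, data_point_ID, x, y, feature 1, feature 2, feature n].
rows = np.array([
    [0, 300, -23.5505, -46.6333, 1271.001, 28.763, 17.971],
    [1, 300, -23.5505, -46.6333, 1190.250, 27.900, 18.400],  # invented second time step
])

ID_COLUMN = 1
# Drop the ID so that only time, coordinates, and features remain.
X = np.delete(rows, ID_COLUMN, axis=1)

time_part = X[:, 0]      # compared against eps2 in the quoted code
feature_part = X[:, 1:]  # compared against eps1 in the quoted code

print(X.shape, time_part, feature_part.shape)
```

Because all non-time columns share one threshold, features on very different scales (such as precipitation and latitude here) would likely need rescaling before a single eps1 is meaningful.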

Thanks a lot!

Roberto

fit_frame_split - ValueError: Length of values does not match length of index

Hey! As mentioned in #7, there seem to be edge cases where the labels computed by fit_frame_split() don't match the row count of X fed to it. Not quite sure what's causing it at first glance!

The error in question:

ValueError: Length of values (17465) does not match length of index (17612)

The use in question (it is sorted by timestamp ascending before it goes in):

clustering = ST_DBSCAN(eps1=0.25, eps2=250, min_samples=10).fit_frame_split(sub_df.loc[:, ["timestamp","x","y"]].values, 2000)
sub_df["cluster"] = clustering.labels

Attached is the subset CSV of ordered timestamp/x/y data that yielded this for me. Timestamp is unix_millis, x/y are in an arbitrary space for particle data for a side project.

I'm currently looking into a temporary rewrite of it because of the memory constraints I'm fighting with. I turned to fit_frame_split() because with fit(), some very large (>100k rows) position datasets that are only a couple hundred MB in Pandas turned out, according to memory_profile, to cause up to a 6.8 GB increment in memory use. That eats heap and crashes the smaller workers on my compute cluster, probably because the matrices become not-so-sparse.

ST_DBSCAN_2024_03_14.csv
