Comments (5)
Hi @whaowhao, and thanks for trying this out! For sure, your use case doesn't look like one of the best fits for FISHDBC; you may want to try HDBSCAN (https://hdbscan.readthedocs.io/). FISHDBC should work better with very high dimensionality and/or custom, expensive distance functions. I wonder if you can share what you're applying this algorithm to...
That said,
1) You can probably improve your performance by using a fast, vectorized distance function (the vectorized parameter, undocumented, I know, should help). Say your dataset is an n*m numpy array where n is the number of elements and m is the dimension; you could define something like

    from scipy.spatial.distance import cdist
    def d(i, js):
        return cdist(data[i:i+1], data[js])[0]

and then pass it to FISHDBC with vectorized=True (add items by their indexes in your dataset, not by their values). You might speed this up a bit more, e.g. by implementing the distance function with numba, if I recall correctly.
2) Are you running out of RAM? There may be memory issues there.
3) How are you running the algorithms? If you're just calling add() several times without ever calling update(), you're postponing substantial work, and it's not very surprising that clusterer.cluster() spends a lot of time in update().
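To illustrate the vectorized calling convention described above, here is a minimal sketch using plain numpy (the `data` array and its size are placeholders; only the `d(i, js)` signature comes from the discussion above):

```python
import numpy as np

# Toy stand-in for the n*m dataset: 100 elements of dimension 20.
rng = np.random.default_rng(0)
data = rng.random((100, 20))

def d(i, js):
    # Vectorized distance: Euclidean distance from element i to every
    # element whose index appears in js, computed in one numpy call.
    return np.linalg.norm(data[js] - data[i], axis=1)

# One call returns a whole vector of distances at once.
dists = d(3, [0, 1, 2])
print(dists.shape)  # (3,)
```

The point of the vectorized form is that each call amortizes Python overhead over many distance computations, instead of paying it once per pair.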
from flexible-clustering.
Thank you Matteo!
I am aiming to use FISHDBC to cluster 10-50 million 20-dimensional text embeddings. While HDBSCAN* can indeed handle an array of 2 million x 20 faster than FISHDBC, I believe it would run out of memory when scaling to 10 million. That's why I chose FISHDBC for this task.
The clustering finished right after I made this post: for an array of size 2 million x 20, it took ~3 hours in the building phase and ~4 hours in the cluster phase.
Besides using vectorized=True, do you have other suggestions for scaling to a 50 million x 20 array?
Well, I'd need you to answer my question #3 :) Are you adding elements with .add() or .update()? update regularly calls update_mst, and if you're not using it you should call update_mst at regular intervals yourself to limit memory usage.
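The pattern described here, consolidating at regular intervals rather than deferring all work to cluster(), might look like the following sketch. The helper function, the batch size, and the `_Stub` class are illustrative assumptions; only `add()` and `update_mst()` come from the discussion above:

```python
# Sketch: add points with periodic MST consolidation to cap memory growth.
# `clusterer` stands in for a FISHDBC instance.
def add_with_consolidation(clusterer, points, every=10_000):
    for count, p in enumerate(points, start=1):
        clusterer.add(p)
        if count % every == 0:
            clusterer.update_mst()  # fold pending work into the MST now
    clusterer.update_mst()          # consolidate the final partial batch

# Minimal stub, only to demonstrate the calling pattern.
class _Stub:
    def __init__(self):
        self.added = 0
        self.mst_updates = 0
    def add(self, p):
        self.added += 1
    def update_mst(self):
        self.mst_updates += 1

stub = _Stub()
add_with_consolidation(stub, range(25_000), every=10_000)
print(stub.added, stub.mst_updates)  # 25000 3
```

Smaller values of `every` should bound memory more tightly at the cost of more frequent consolidation passes; that is the speed/memory trade-off discussed below.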
Provide plentiful RAM and cross your fingers :) I'd be very interested in knowing which dataset size you manage to handle, on which machine and with what timing, if you can disclose that :)
Another, tangential, thing: FISHDBC was created so that embeddings aren't needed: if you have a (dis)similarity function that works directly on the raw data, maybe try it out (and let me know if you find interesting results)!
Gotcha, currently I am using .add() for all data points, and then calling .cluster() after all points are added.
I will try calling update_mst to see the speed/memory trade-offs. A side question: would calling update_mst make the overall pipeline slower?
I have ~600 GB of RAM; hopefully that'd be enough :)
Regarding embeddings: unfortunately, edit distance working directly on the text doesn't work as well as Euclidean distance on text embeddings.
If, as I expect, memory usage keeps growing linearly with the data set size, you should definitely be fine with 600 GB of RAM.
I expect some kind of speed/memory trade-off depending on how frequently you call update_mst, but I wouldn't expect much loss in terms of speed. I didn't have such a big machine to play with, so I guess you're on your own there. It looks like your computation is split almost half-and-half between MST maintenance and the HNSW part, which is interesting.
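For anyone wanting to check how their own run splits its time, Python's standard cProfile module can break a pipeline down by function. This is a generic sketch: the two stand-in functions are placeholders, not FISHDBC internals.

```python
import cProfile
import io
import pstats

# Placeholder workloads; in a real run these would be the MST-maintenance
# and HNSW-insertion phases of the clustering pipeline.
def mst_part():
    return sum(i * i for i in range(50_000))

def hnsw_part():
    return sum(i * i for i in range(50_000))

def pipeline():
    mst_part()
    hnsw_part()

pr = cProfile.Profile()
pr.enable()
pipeline()
pr.disable()

# Render per-function cumulative times into a string.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats()
report = s.getvalue()
print("mst_part" in report and "hnsw_part" in report)  # True
```

Sorting by cumulative time makes the high-level split between phases visible at the top of the report.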