Comments (5)
Hi @whaowhao, and thanks for trying this out! For sure, your use case doesn't look like one of the best fits for FISHDBC; you may want to try HDBSCAN (https://hdbscan.readthedocs.io/). FISHDBC should work better with very high dimensionality and/or custom, expensive distance functions. I wonder if you can share what you're applying this algorithm to...
That said,
1) You can probably improve your performance by using a fast, vectorized distance function (the vectorized parameter, undocumented, I know, should help). Say your dataset is an n*m numpy array where n is the number of elements and m is the dimension; you could define something like

    from scipy.spatial.distance import cdist
    def d(i, js):
        return cdist(data[i:i+1], data[js])[0]

and then pass it to FISHDBC with vectorized=True (add items by their indexes in your dataset, not by their values). You might speed this up a bit more, e.g. by implementing the distance function with numba, if I recall correctly.
2) Are you running out of RAM? There may be memory issues there.
3) How are you running the algorithms? If you're just calling add() several times without ever calling update(), you're postponing substantial work, and it's not very surprising that clusterer.cluster() spends a lot of time in update().
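To illustrate the vectorized calling convention described above, here is a minimal sketch using plain numpy (the `data` array and its size are placeholders; only the `d(i, js)` signature comes from the discussion above):

```python
import numpy as np

# Toy stand-in for the n*m dataset: 100 elements of dimension 20.
rng = np.random.default_rng(0)
data = rng.random((100, 20))

def d(i, js):
    # Vectorized distance: Euclidean distance from element i to every
    # element whose index appears in js, computed in one numpy call.
    return np.linalg.norm(data[js] - data[i], axis=1)

# One call returns a whole vector of distances at once.
dists = d(3, [0, 1, 2])
print(dists.shape)  # (3,)
```

The point of the vectorized form is that each call amortizes Python overhead over many distance computations, instead of paying it once per pair.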
from flexible-clustering.
Thank you Matteo!
I am aiming to use FISHDBC to cluster 10-50 million 20-dimensional text embeddings. While HDBSCAN* can indeed handle an array of 2 million x 20 faster than FISHDBC, I believe it would run out of memory when scaling to 10 million. That's why I chose FISHDBC for this task.
The clustering finished right after I made this post: for an array of size 2 million x 20, it took ~3 hours in the building phase and ~4 hours in the cluster phase.
Besides using vectorized=True, do you have other suggestions for scaling to a 50 million x 20 array?
Well, I'd need you to answer my question #3 :) Are you adding elements with .add() or .update()? update regularly calls update_mst, and if you're not using it you should call update_mst at regular intervals yourself to limit memory usage.
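The pattern described here, consolidating at regular intervals rather than deferring all work to cluster(), might look like the following sketch. The helper function, the batch size, and the `_Stub` class are illustrative assumptions; only `add()` and `update_mst()` come from the discussion above:

```python
# Sketch: add points with periodic MST consolidation to cap memory growth.
# `clusterer` stands in for a FISHDBC instance.
def add_with_consolidation(clusterer, points, every=10_000):
    for count, p in enumerate(points, start=1):
        clusterer.add(p)
        if count % every == 0:
            clusterer.update_mst()  # fold pending work into the MST now
    clusterer.update_mst()          # consolidate the final partial batch

# Minimal stub, only to demonstrate the calling pattern.
class _Stub:
    def __init__(self):
        self.added = 0
        self.mst_updates = 0
    def add(self, p):
        self.added += 1
    def update_mst(self):
        self.mst_updates += 1

stub = _Stub()
add_with_consolidation(stub, range(25_000), every=10_000)
print(stub.added, stub.mst_updates)  # 25000 3
```

Smaller values of `every` should bound memory more tightly at the cost of more frequent consolidation passes; that is the speed/memory trade-off discussed below.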
Provide plentiful RAM and cross your fingers :) I'd be very interested in knowing which dataset size you manage to handle, on which machine and with what timing, if you can disclose that :)
Another, tangential, thing: FISHDBC was created so that embeddings aren't needed: if you have a (dis)similarity function that works directly on the raw data, maybe try it out (and let me know if you find interesting results)!
Gotcha, currently I am using .add() for all data points, and then calling .cluster() after all points are added.
I will try calling update_mst to see the speed/memory trade-offs. A side question: would calling update_mst make the overall pipeline slower?
I have ~600 GB of RAM; hopefully that'd be enough :)
Regarding embeddings: unfortunately, edit distance working directly on the text doesn't work as well as Euclidean distance on text embeddings.
If, as I expect, memory usage keeps growing linearly with the data set size, you should definitely be fine with 600 GB of RAM.
I expect some kind of speed/memory trade-off depending on how frequently you call update_mst, but I wouldn't expect much loss in terms of speed. I didn't have such a big machine to play with, so I guess you're on your own there. It looks like your computation is split almost half-and-half between MST maintenance and the HNSW part, which is interesting.
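For anyone wanting to check how their own run splits its time, Python's standard cProfile module can break a pipeline down by function. This is a generic sketch: the two stand-in functions are placeholders, not FISHDBC internals.

```python
import cProfile
import io
import pstats

# Placeholder workloads; in a real run these would be the MST-maintenance
# and HNSW-insertion phases of the clustering pipeline.
def mst_part():
    return sum(i * i for i in range(50_000))

def hnsw_part():
    return sum(i * i for i in range(50_000))

def pipeline():
    mst_part()
    hnsw_part()

pr = cProfile.Profile()
pr.enable()
pipeline()
pr.disable()

# Render per-function cumulative times into a string.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats()
report = s.getvalue()
print("mst_part" in report and "hnsw_part" in report)  # True
```

Sorting by cumulative time makes the high-level split between phases visible at the top of the report.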