
Comments (5)

matteodellamico avatar matteodellamico commented on May 30, 2024

Hi @whaowhao, and thanks for trying this out! Your use case doesn't look like one of the best fits for FISHDBC; you may want to try HDBSCAN (https://hdbscan.readthedocs.io/). FISHDBC should work better with very high dimensionality and/or custom, expensive distance functions. I wonder if you can share what you're applying this algorithm to...

That said,

  1. You can probably improve your performance by using a fast, vectorized distance function (the vectorized parameter--undocumented, I know--should help). Say your dataset is an n*m numpy array, where n is the number of elements and m is the dimension; you could define something like

from scipy.spatial.distance import cdist

def d(i, js):
    # distances from data[i] to each of the rows data[js]
    return cdist(data[[i]], data[js])[0]

and then pass it to FISHDBC with vectorized=True (add items by their indexes in your dataset, not by their values). You might speed this up a bit more, e.g. by implementing the distance function with numba, if I recall correctly.

  2. Are you running out of RAM? There may be memory issues there.

  3. How are you running the algorithms? If you're just calling add() several times without ever calling update(), you're postponing substantial work, and it's not very surprising that clusterer.cluster() spends a lot of time in update().
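A minimal, self-contained sketch of point 1's vectorized distance idea: the `data` array and the `d(i, js)` signature follow the snippet above, and the check simply verifies that the batched `cdist` call agrees with one-at-a-time `euclidean` calls on a toy dataset.

```python
import numpy as np
from scipy.spatial.distance import cdist, euclidean

# Toy stand-in for the real dataset: n points in m dimensions.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 20))

def d(i, js):
    """Vectorized distance: data[i] against every row data[js] at once."""
    return cdist(data[[i]], data[js])[0]

# Sanity check: the batched result matches one-at-a-time euclidean().
js = [3, 7, 42]
batched = d(0, js)
scalar = [euclidean(data[0], data[j]) for j in js]
assert np.allclose(batched, scalar)
```

The point of the batched form is that each call computes a whole row of distances in C-level loops instead of paying Python call overhead per pair.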

from flexible-clustering.

whaowhao avatar whaowhao commented on May 30, 2024

Thank you Matteo!

I am aiming to use FISHDBC to cluster 10-50 million 20-dimensional text embeddings. While HDBSCAN* can indeed handle an array of 2 million x 20 faster than FISHDBC, it would (I believe) run out of memory when scaling to 10 million. That's why I chose FISHDBC for this task.

The clustering finished right after I made this post: for an array of size 2 million x 20, it took ~3 hours in the building phase and ~4 hours in the clustering phase.

Besides using vectorized=True, do you have other suggestions for scaling to 50 million x 20?


matteodellamico avatar matteodellamico commented on May 30, 2024

Well, I'd need you to answer my question #3 :) Are you adding elements with .add() or .update()? update() regularly calls update_mst(), and if you're not using it you should call update_mst() at regular intervals yourself to limit memory usage.
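The add()/update_mst() names come from this thread; flexible-clustering's actual signatures may differ. The clusterer class and batch size below are purely hypothetical stand-ins, just to illustrate the "flush every k additions" pattern being suggested.

```python
# Hypothetical stand-in for the FISHDBC clusterer: only the batching
# pattern matters here, not this stub's internals.
class FakeClusterer:
    def __init__(self):
        self.pending = []      # items added since the last MST update
        self.mst_updates = 0   # how many times update_mst() ran

    def add(self, x):
        self.pending.append(x)

    def update_mst(self):
        self.pending.clear()   # the real call folds pending work into the MST
        self.mst_updates += 1

clusterer = FakeClusterer()
BATCH = 1000  # tune for your own speed/memory trade-off

for i in range(5000):
    clusterer.add(i)
    if (i + 1) % BATCH == 0:
        clusterer.update_mst()  # bound memory by flushing periodically

clusterer.update_mst()  # flush the tail before clustering
```

The trade-off: smaller batches keep the backlog of pending work (and memory) low, at the cost of more frequent maintenance passes.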

Provide plentiful RAM and cross your fingers :) I'd be very interested in knowing which dataset size you manage to handle, on which machine, and with which timings, if you can disclose that :)

Another, lateral, thing: FISHDBC was created to not need embeddings: if you have a (dis)similarity function that works directly on the raw data, maybe try it out (and let me know if you find interesting results)!


whaowhao avatar whaowhao commented on May 30, 2024

Gotcha. Currently I am using .add() for all data points, and then calling .cluster() after all points are added.
I will try calling update_mst to see the speed/memory trade-offs. A side question: would calling update_mst make the overall pipeline slower?
I have ~600 GB of RAM; hopefully that'd be enough :)
Regarding embeddings: unfortunately, edit distance applied directly to the text doesn't work as well as Euclidean distance on text embeddings.


matteodellamico avatar matteodellamico commented on May 30, 2024

If, as I expect, memory usage keeps growing linearly with the data set size, you should definitely be fine with 600 GB of RAM.

I expect some kind of speed/memory trade-off depending on the frequency with which you call update_mst, but I wouldn't expect much loss in terms of speed. I didn't have such a big machine to play on, so I guess you're on your own there. Looks like your computation is split almost half-half between MST maintenance and the HNSW part, which is interesting.
