flexible-clustering's Introduction

Flexible clustering

A project for scalable hierarchical clustering, thanks to a Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering algorithm (FISHDBC, for friends).

This package lets you use an arbitrary dissimilarity function you write (or reuse from somebody else's work!) to cluster your data.

Please see the paper at https://arxiv.org/abs/1910.07283

Dependencies

Installation

python3 setup.py install

A project allowing scalable hierarchical clustering, thanks to an approximated version of OPTICS, on arbitrary data and distance measures.

Quickstart

See the HDBSCAN documentation for the meaning of the values returned by the cluster method. There are plenty of configuration options, inherited from HNSW and HDBSCAN, but the only compulsory argument is a dissimilarity function between arbitrary data elements:

import flexible_clustering

clusterer = flexible_clustering.FISHDBC(my_dissimilarity)
for elem in my_data:
    clusterer.add(elem)
labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster()

for elem in some_new_data: # support cheap incremental clustering
    clusterer.add(elem)
# new clustering according to the newly available data
labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster()

Make sure to run everything from outside the source directory, so that Python does not import the package from the source tree instead of the installed copy.
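As an illustration of what the `my_dissimilarity` argument in the quickstart might look like, here is a minimal sketch of a Jaccard dissimilarity between collections of tokens. The function name `jaccard_dissimilarity` and the toy data are assumptions made for this example; they are not part of the package.

```python
def jaccard_dissimilarity(a, b):
    """Jaccard dissimilarity between two iterables treated as sets.

    Returns 1 - |A & B| / |A | B|, a value between 0.0 (identical sets)
    and 1.0 (disjoint sets).
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty collections are treated as identical (an assumption
        # of this sketch, chosen to avoid dividing by zero).
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical toy data: each element is a set of tokens.
toy_data = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"car", "bus", "train"},
]

# Overlapping sets are closer than disjoint ones.
print(jaccard_dissimilarity(toy_data[0], toy_data[1]))
print(jaccard_dissimilarity(toy_data[0], toy_data[2]))
```

A function like this could then be passed to the constructor in place of `my_dissimilarity`, for example `flexible_clustering.FISHDBC(jaccard_dissimilarity)`, with the elements added one by one as in the quickstart.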

Demo/Example

See the fishdbc_example.py file for a more complete example (running it requires matplotlib).

Want More Info?

Send me an email at [email protected]. I'll improve the docs as and if people use this.

Author

Matteo Dell'Amico

BSD 3-clause; see the LICENSE file.

flexible-clustering's Issues

Pre-processing recommendations?

Hello!

First of all, bravo on making hierarchical density clustering applicable to big data! I'm considering applying your algorithm to a set of ~40 million 200-dimensional document embeddings.

However, I have a question that isn't strictly related to FISHDBC. I struggled with the curse of dimensionality (distance uniformity) with HDBSCAN, and I suspect I will also have trouble here. Do you know of a remedy (dimensionality reduction, a different metric, etc.) that can be efficiently applied to such a large volume of data?

Clustering 2 Million 20-d vectors using Euclidean distance takes too long during clustering phase

Hi Matteo,

Thanks for the great work! I am applying FISHDBC to cluster 2 million 20-dimensional data points using Euclidean distance. Contrary to what I expected from reading the paper, the build phase took 3 hours (which is reasonable), but the clustering phase (calling clusterer.cluster()) has been running for more than 4 hours and has not finished.

Shouldn't the clustering phase take much less time than the build phase (as in the Household dataset runtimes in Table 8)? Do you have a hunch as to what is going on?

Thank you :)

Fitted model to pickle

Hey Matteo,

Thanks for the paper and the implementation of FISHDBC. Great work!
I’ve tried to use it in production, but I had problems pickling the fitted model. Have you tried exporting the fitted model?

Best

First issue!

Hi! I just stumbled upon your work. We've been using hdbscan on a clustering problem that suffers from scalability issues. I've been trying to get your package to work, but some issues have occurred.

First, in setup.py I needed to add:

import numpy
...
setup(
    ...,
    include_dirs=[numpy.get_include()],
)

for the installation to complete successfully.

There is still an issue, though. When I tried to run your example, I got this error:

ModuleNotFoundError: No module named 'flexible_clustering.unionfind'

Is there a solution to this?

And thank you so much for your work, looks really promising!
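A missing compiled submodule like this can be a sign that Python is importing the package from the source checkout rather than from the installed location (compare the Quickstart note about running from outside the source directory). Whether that is the cause of the error above is an assumption. The following minimal sketch, using only the standard library, shows one way to check where a module would be loaded from; the helper name `locate` is hypothetical.

```python
import importlib.util


def locate(module_name):
    """Return the location a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return None
    return None if spec is None else spec.origin


# Module names taken from the error message above.
for name in ("flexible_clustering", "flexible_clustering.unionfind"):
    print(name, "->", locate(name))
```

If the first name resolves to a path inside the source checkout, or the second one prints `None`, running the script from a different working directory after installing the package may be worth trying.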
