matteodellamico / flexible-clustering

Clustering for arbitrary data and dissimilarity function

License: BSD 3-Clause "New" or "Revised" License

Python 95.57% Cython 4.43%

clustering non-metric streaming-data flexible-clustering

flexible-clustering's Introduction

Flexible clustering

A project for scalable hierarchical clustering, thanks to a Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering algorithm (FISHDBC, for friends).

This package lets you use an arbitrary dissimilarity function you write (or reuse from somebody else's work!) to cluster your data.

Please see the paper at https://arxiv.org/abs/1910.07283

Dependencies

Python 3
Cython
hdbscan: https://github.com/scikit-learn-contrib/hdbscan
scipy: https://www.scipy.org/

Installation

python3 setup.py install

A project allowing scalable hierarchical clustering, thanks to an approximated version of OPTICS, on arbitrary data and distance measures.

Quickstart

See the HDBSCAN documentation for the meaning of the values returned by the cluster method. There are plenty of configuration options, inherited from HNSW and HDBSCAN, but the only compulsory argument is a dissimilarity function between arbitrary data elements:

import flexible_clustering

clusterer = flexible_clustering.FISHDBC(my_dissimilarity)
for elem in my_data:
    clusterer.add(elem)
labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster()

for elem in some_new_data: # support cheap incremental clustering
    clusterer.add(elem)
# new clustering according to the newly available data
labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster()
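
The labels returned by cluster() can be turned into per-cluster groups of elements. The following is a minimal sketch, assuming that labels follow the HDBSCAN convention (-1 marks noise) and that the i-th label refers to the i-th element added; group_by_label is a hypothetical helper, not part of the package:

```python
from collections import defaultdict


def group_by_label(elements, labels, noise_label=-1):
    """Group elements by cluster label.

    Hypothetical helper: assumes labels[i] belongs to elements[i]
    and that noise_label (-1 by HDBSCAN convention) marks noise.
    Returns (clusters, noise), where clusters maps label -> list.
    """
    clusters = defaultdict(list)
    noise = []
    for elem, label in zip(elements, labels):
        label = int(label)  # also accepts NumPy integer labels
        if label == noise_label:
            noise.append(elem)
        else:
            clusters[label].append(elem)
    return dict(clusters), noise


# Toy labels standing in for the first value returned by clusterer.cluster()
clusters, noise = group_by_label(["a", "b", "c", "d"], [0, 0, -1, 1])
```

With real data, the same call would take the list of elements passed to clusterer.add, in insertion order, together with the returned labels.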

Make sure to run everything from outside the source directory, to avoid confusing Python's import path.
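
As an illustration of the one compulsory argument, here is a minimal sketch of a dissimilarity function for non-numeric data. It uses the Jaccard distance between collections of tokens; the name jaccard_dissimilarity and the toy documents are assumptions made for this example, and any function that takes two elements and returns a number would play the same role:

```python
def jaccard_dissimilarity(a, b):
    """Jaccard distance between two iterables, treated as sets.

    Returns 0.0 for identical sets and 1.0 for disjoint ones.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Toy data: each element is a tuple of tokens
docs = [("red", "apple"), ("green", "apple"), ("blue", "sky")]
d_close = jaccard_dissimilarity(docs[0], docs[1])  # share one token
d_far = jaccard_dissimilarity(docs[0], docs[2])    # share nothing
```

Such a function could then be passed to the constructor as in the quickstart, for example flexible_clustering.FISHDBC(jaccard_dissimilarity), with each tuple added through clusterer.add.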

Demo/Example

See the fishdbc_example.py file for a fuller example (running it requires matplotlib).

Want More Info?

Send me an email at [email protected]. I'll improve the docs as and if people use this.

Author

Matteo Dell'Amico

Copyright

BSD 3-clause; see the LICENSE file.

flexible-clustering's Issues

Pre-processing recommendations?

Hello!

First of all, bravo on making hierarchical density clustering applicable to big data! I'm considering applying your algorithm to a set of ~40 million 200-dimensional document embeddings.

However, I have a question that isn't strictly related to FISHDBC. I struggled with the curse of dimensionality (distance uniformity) with HDBSCAN, and I suspect I will also have trouble here. Do you know of a remedy (dimensionality reduction, a different metric, etc.) that can be efficiently applied to such a large volume of data?

Clustering 2 Million 20-d vectors using Euclidean distance takes too long during clustering phase

Hi Matteo,

Thanks for the great work! I am applying FISHDBC to cluster 2 million 20-dimensional data points using Euclidean distance. Contrary to what I expected from reading the paper, the build phase took 3 hours (which is reasonable), but the clustering phase (calling clusterer.cluster()) has been running for more than 4 hours and is still going.

Shouldn't the clustering phase take much less time than the build phase (as shown in the Table 8 household dataset runtime)? Do you have a hunch as to what is going on?

Thank you :)

Fitted model to pickle

Hey Matteo,

Thanks for the paper and the implementation of FISHDBC. Great work!
I tried to use it in production but had trouble pickling the fitted model. Have you tried exporting a fitted model?

Best

First issue!

Hi! I just stumbled upon your work. We've been using hdbscan on a clustering problem that suffers from scalability problems. I've been trying to put your package to work, but some issues have occurred.

First, in setup.py I needed to include:

import numpy
...
setup(...,
      include_dirs=[numpy.get_include()])

for the installation to complete successfully.

There is still an issue, though. Trying to run your example, I stumbled upon this error:

ModuleNotFoundError: No module named 'flexible_clustering.unionfind'

Is there a solution to this?

And thank you so much for your work, looks really promising!
