GIT: Clustering Based on Graph of Intensity Topology

This repository contains the implementation code for the paper:
GIT: Clustering Based on Graph of Intensity Topology

Brief introduction

Accuracy, Robustness to noise and scale, Interpretability, Speed, and Ease of use (ARISE) are crucial requirements of a good clustering algorithm. However, achieving all of these goals simultaneously is challenging, and most advanced approaches focus on only some of them.

To address all of these aspects together, we propose a novel clustering algorithm, namely GIT (Clustering Based on Graph of Intensity Topology). GIT considers both local and global data structures: it first forms local clusters based on intensity peaks of samples, and then estimates the global topological graph (topo-graph) between these local clusters. We use the Wasserstein Distance between the predicted and prior class proportions to automatically cut noisy edges in the topo-graph and merge connected local clusters into final clusters. We then compare GIT with seven competing algorithms on five synthetic datasets and nine real-world datasets.
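To make the proportion-matching idea more concrete, here is a minimal sketch of how a Wasserstein distance between predicted and prior class proportions could be used to choose among candidate clusterings. It is not the implementation in git_cluster/. The function names, the sorting of proportions by size, and the use of cluster ranks as the histogram support are all assumptions made for illustration.

```python
from collections import Counter


def class_proportions(labels):
    """Return cluster proportions sorted from largest to smallest."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def proportion_distance(predicted, prior):
    """1-D Wasserstein-1 distance between two proportion vectors.

    Assumption: both vectors are histograms over ranks 0, 1, 2, ...,
    and the shorter one is zero-padded. On a shared integer grid, W1 is
    the sum of absolute differences between the cumulative sums.
    """
    size = max(len(predicted), len(prior))
    p = list(predicted) + [0.0] * (size - len(predicted))
    q = list(prior) + [0.0] * (size - len(prior))
    distance, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_p += p_i
        cdf_q += q_i
        distance += abs(cdf_p - cdf_q)
    return distance


def pick_best_labeling(candidates, prior):
    """Pick the candidate labeling whose proportions best match the prior."""
    return min(
        candidates,
        key=lambda labels: proportion_distance(class_proportions(labels), prior),
    )


if __name__ == "__main__":
    # Hypothetical candidates, e.g. produced by cutting edges at different thresholds.
    candidates = [
        [0, 0, 0, 0, 0, 1],  # one dominant cluster
        [0, 0, 0, 1, 1, 1],  # two balanced clusters
        [0, 0, 1, 1, 2, 2],  # three balanced clusters
    ]
    print(pick_best_labeling(candidates, prior=[0.5, 0.5]))
```

In this sketch, each candidate labeling stands for one way of cutting the topo-graph, and the candidate whose cluster sizes are closest to the expected proportions is kept.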

The pipeline is shown below:


We show the process of clustering on toy datasets as follows:

Overview

  • git_cluster/ contains the core algorithm.
  • dataloaders/ contains dataloader classes for different datasets.
  • utils/ includes measurements and plot tools for understanding clustering results.
  • 1-Accuracy/ ...
  • 2-Speed/ ...
  • 3-Robustness/ ...
  • 5-Dimension_reduction/ ...

Installation

Install from release code

Install the latest version from the GitHub repository via:

pip install git+https://github.com/gaozhangyang/GIT

Install from source code

Build setup.py and install GIT:

python setup.py build
python setup.py install

Try to import the package:

from git_cluster import GIT
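If the import fails, it can help to check whether the package is visible to the current interpreter before debugging further. The following is a small sketch using only the standard library; the helper name is_installed is hypothetical and not part of this repository.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # A result of False usually means the install went into a different environment.
    print("git_cluster importable:", is_installed("git_cluster"))
```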

Usage

We provide quick_start.ipynb as an example; users can refer to this notebook.

We first read the data through toy_dataloader:

from dataloaders import Toy_DataLoader as DataLoader
X, Y_true = DataLoader(name='circles').load()

Then, instantiate the GIT class and choose a suitable value for the hyperparameter k:

from git_cluster import GIT
git = GIT(k=10)

The predicted labels are available through the fit_predict method:

Y_pred = git.fit_predict(X)
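Once Y_pred is available, a natural next step is to compare it with Y_true. The repository ships its own measurements in utils/; the block below is an independent, dependency-free sketch of the Adjusted Rand Index, written here only for illustration. The function name adjusted_rand_index is an assumption and does not refer to any function in this repository.

```python
from collections import Counter


def _pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand Index between two labelings of the same samples.

    The score is 1.0 for identical partitions (label names do not matter)
    and close to 0.0 for random agreement.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    total_pairs = _pairs(len(y_true))
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical

    # Pair counts inside contingency cells, true classes, and predicted clusters.
    sum_cells = sum(_pairs(c) for c in Counter(zip(y_true, y_pred)).values())
    sum_true = sum(_pairs(c) for c in Counter(y_true).values())
    sum_pred = sum(_pairs(c) for c in Counter(y_pred).values())

    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same partition with swapped label names.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

With the variables from the steps above, the call would be adjusted_rand_index(Y_true, Y_pred).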

Reproduce the results in our paper

Dependencies

  • python >= 3.7
  • hdbscan==0.8.26
  • pandas==1.3.4
  • plotly==5.3.1
  • scikit-learn==0.23.2
  • scipy==1.7.1
  • numpy==1.20.0
  • pydpc==0.1.3
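Because the results depend on these pinned versions, it can be useful to verify an environment before running the notebooks. The sketch below compares installed versions against the pins listed above. It assumes Python 3.8 or newer, since importlib.metadata is not available in 3.7, and the helper name check_pins is hypothetical.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "hdbscan": "0.8.26",
    "pandas": "1.3.4",
    "plotly": "5.3.1",
    "scikit-learn": "0.23.2",
    "scipy": "1.7.1",
    "numpy": "1.20.0",
    "pydpc": "0.1.3",
}


def check_pins(pinned):
    """Map each package to (wanted, found, matches); found is None if missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```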

Reproducing steps

(1) We provide a conda environment file. Users can reproduce the environment with the following commands:

  conda env create -f environment.yml
  conda activate git_cluster

(2) We compare our method with various clustering methods, among which Quickshift++, DPA, and Spectacl require extra installation. Users should follow the installation instructions in their official repositories.

We recommend users run the following commands:

  • To install Quickshift++:
pip install git+https://github.com/google/quickshift
  • To install DPA:
pip install git+https://github.com/mariaderrico/DPA
  • To install Spectacl:
pip install git+https://bitbucket.org/Sibylse/spectacl/src/master/

(3) Open the jupyter notebooks in 1-Accuracy, 2-Speed, 3-Robustness, 5-Dimension_reduction.

Citation

If you find this code or idea useful, please cite our work:

@article{gao2021git,
  title={Git: Clustering Based on Graph of Intensity Topology},
  author={Gao, Zhangyang and Lin, Haitao and Tan, Cheng and Wu, Lirong and Li, Stan and others},
  journal={arXiv preprint arXiv:2110.01274},
  year={2021}
}

Contact

If you have any questions, feel free to contact us through the following emails or GitHub issues.

E-mails

[email protected] or [email protected]
