
Parallel Flexible Clustering

Master Thesis in Computer Science - December 2023

- Thesis Student: Edoardo Pastorino, 5169595

- Supervisors: Matteo Dell'Amico, Daniele D'Agostino

- Reviewer: Vito Paolo Pastore

Introduction:

The thesis presents the parallelisation of a state-of-the-art clustering algorithm, FISHDBC. This objective has been achieved by improving the main data structures and components of the algorithm: HNSW, a graph-based data structure used in approximate nearest neighbour search; the MST, a tree that spans all the vertices of a graph while minimising the total weight of its edges; and HDBSCAN clustering, designed to perform robust, density-based clustering of data points. My contribution is based on a lock-free strategy, made feasible because FISHDBC provides an approximate solution, and it achieves good performance figures. It is worth noting that the Parallel Flexible Clustering algorithm is written entirely in Python, without dependencies on other languages. This is an important feature that makes it user friendly and highly customisable, considering that user-defined distance metrics for computing similarity among data are mostly written in this language.
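Since user-defined distance metrics are the main extensibility point, a metric is just a plain Python callable that takes two items and returns a non-negative number. As a purely illustrative sketch (this is not the repository's code; the project installs the optimised levenshtein package instead), a pure-Python Levenshtein distance could look like:

```python
# Hypothetical example of a user-defined distance metric: a plain Python
# callable taking two items and returning a non-negative number.
# Illustrative only; the project relies on the optimised `levenshtein`
# package for text data.
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    # Classic two-row dynamic programming over edit operations.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]
```

Any callable with this two-argument shape can serve as a distance, which is what makes a pure-Python implementation easy to customise.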

Project Structure:


.
└── parallel-flexible-clustering-master
    ├── parallel_flexible_clustering
    │   ├── __init__.py
    │   ├── create_text_dataset.py
    │   ├── parallel_fishdbc_example.py
    │   ├── fishdbc.py
    │   ├── hnsw_parallel.py
    │   ├── hnsw.py
    │   ├── parallel_hnsw_example.py
    │   └── unionfind.c
    ├── setup.cfg
    └── setup.py

Installation & Execution:

This section gives the basic instructions to install everything you need to execute the FISHDBC algorithm on a Linux machine. First of all, download or clone the GitHub repository:

git clone https://github.com/edo-pasto/Parallel-FISHDBC.git

Secondly, I strongly suggest creating a virtual environment, using either conda or pip as package manager (the following instructions use conda, but the process with pip is very similar):

conda create --name myenv python=3.10.12

After creating the environment (preferably with Python version 3.10.12 or 3.11.3), activate it with:

conda activate myenv

Now that the preliminary operations are done, we can install all the dependencies the algorithm needs. These first packages can be installed with a single conda install command:

conda install numpy scipy pandas scikit-learn numba matplotlib   

The package providing the Levenshtein distance should be installed from a different conda channel:

conda install -c conda-forge levenshtein

Cython is also installed with a different command:

conda install -c anaconda cython

The last package to install is the one associated with HDBSCAN, this time installed with pip (this is not possible with conda):

pip install hdbscan

The last installation step is to execute the setup.py file in the following way:

python3 setup.py install

Now we will see how to run experiments by launching the parallel_fishdbc_example.py file. To execute a classical run of the parallel FISHDBC, type:

python3 parallel_fishdbc_example.py --dataset blob --nitems 10000 --centers 5 --distance euclidean --parallel 16

where you can specify:

- --dataset: the input data set (blob or text);

- --nitems: the number of items in the input data set;

- --centers: the number of the data set's centroids;

- --distance: the distance to use, depending on the type of data (euclidean, sqeuclidean, minkowski and cosine for blob data; levenshtein for text data);

- --parallel: the number of processes to use, from 1 to 16 (--parallel 1 runs the original single-process FISHDBC, --parallel > 1 the multi-process version).
If you want to execute the parallel FISHDBC with text data as input you can write:

python3 parallel_fishdbc_example.py --dataset text --distance levenshtein --nitems 1000 --centers 10 --parallel 16
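The flags above map naturally onto Python's standard argparse module. The following is a hypothetical sketch of how such a command line could be declared (the option names are taken from the examples above; the repository's actual parser may differ):

```python
# Hypothetical sketch of the command-line interface described above,
# built with argparse; the repository's actual parser may differ.
import argparse

parser = argparse.ArgumentParser(description="Parallel FISHDBC example")
parser.add_argument("--dataset", choices=["blob", "text"], default="blob")
parser.add_argument("--nitems", type=int, default=10000)
parser.add_argument("--centers", type=int, default=5)
parser.add_argument("--distance", default="euclidean",
                    choices=["euclidean", "sqeuclidean", "minkowski",
                             "cosine", "levenshtein"])
parser.add_argument("--parallel", type=int, default=1,
                    choices=range(1, 17),  # 1 = single process, up to 16
                    metavar="{1..16}")

# Parse the same flags used in the text-data example above.
args = parser.parse_args(["--dataset", "text", "--distance", "levenshtein",
                          "--nitems", "1000", "--centers", "10",
                          "--parallel", "16"])
print(args.parallel)  # -> 16
```

Declaring choices lets argparse reject unsupported distances or process counts before any clustering work starts.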

If, instead, you want to execute the original single-process FISHDBC, you can write:

python3 parallel_fishdbc_example.py --dataset blob --nitems 1000 --centers 5 --distance euclidean 

You can check the accuracy of the final parallel (but also the single-process) FISHDBC clustering by enabling the --test option:

python3 parallel_fishdbc_example.py --dataset blob --nitems 10000 --centers 5 --distance euclidean --parallel 16 --test True

It is also possible to execute an example of only the parallel HNSW construction (again with the possibility of performing some accuracy tests thanks to the --test True option):

python3 parallel_hnsw_example.py --dataset blob --distance euclidean --parallel 16 --nitems 10000 --centers 5

Of course, you can also execute the parallel HNSW on textual data (but in this case you cannot perform any kind of test):

python3 parallel_hnsw_example.py --dataset text --distance levenshtein --nitems 1000 --centers 10 --parallel 16
