data-analysis's Introduction

Hierarchical Clustering with kmedioids, kmeans, and kmeanscpp

This repository contains a Python program for performing hierarchical clustering using various clustering algorithms such as kmedioids, kmeans, and kmeanscpp. The program allows you to build, load, and search hierarchical clustering structures based on the specified algorithms and parameters.

Overview

The program builds a hierarchy of clusters using one of several clustering algorithms: kmedioids, kmeans, or kmeanscpp. It can build a hierarchical clustering structure, load an existing structure, and search a structure for the clusters that best match input data.

Getting Started

Prerequisites

  • networkx
  • matplotlib
  • numpy
  • argparse
  • loguru

Installation

  1. Clone this repository:

    git clone https://github.com/DoYouEvenStackSmash/data-analysis.git
    cd data-analysis/src/python-processing
  2. Install the required dependencies using pip:

    pip install -r requirements.txt

Usage

The program supports three main operations: building hierarchical clustering, loading existing clustering, and searching for clusters. You can use the command-line interface to perform these operations.

Build Hierarchical Clustering

To build hierarchical clustering with different clustering parameters, use the following command:

python3 clustering_driver.py build -i example_2_2.npy -k 3 -R 30 -C 45 -o output

Replace example_2_2.npy with the path to your input data file. The -k, -R, and -C flags set the number of clusters, the number of iterations, and the cutoff value, respectively. The -o flag is optional and names the output for the saved hierarchical clustering structure. With -o output, the build produces three files: output_tree_hierarchy.json, output_tree_data_list.npy, and output_tree_node_vals.npy.
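To make the roles of -k, -R, and -C more concrete, here is a minimal sketch of one way a hierarchy can be built: run a flat k-means on the data, then recurse into each resulting cluster until a node holds no more than a cutoff number of points. This is not the code in clustering_driver.py. The function names, the dictionary node layout, and the reading of the cutoff as a maximum leaf size are all assumptions made for illustration, and plain Python lists stand in for the .npy input.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def assign(points, centroids):
    """Group every point under its nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        groups[nearest].append(p)
    return groups

def kmeans(points, k, iterations, rng):
    """Plain Lloyd's k-means; returns the non-empty clusters it ends with."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = assign(points, centroids)
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:  # converged before the iteration budget ran out
            break
        centroids = updated
    return [g for g in assign(points, centroids) if g]

def build_tree(points, k, iterations, cutoff, rng):
    """Recursively split `points` (hypothetical node layout: centroid, children, points)."""
    node = {"centroid": mean(points), "children": [], "points": []}
    if len(points) <= max(cutoff, k):  # assumed meaning of the cutoff: maximum leaf size
        node["points"] = points
        return node
    clusters = kmeans(points, k, iterations, rng)
    if len(clusters) < 2:  # the data could not be split any further
        node["points"] = points
        return node
    node["children"] = [build_tree(g, k, iterations, cutoff, rng) for g in clusters]
    return node

def leaves(node):
    """Collect all leaf nodes below (and including) `node`."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Synthetic stand-in for an input file: three 2-D blobs of 40 points each.
rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for cx, cy in centers for _ in range(40)]

tree = build_tree(data, k=3, iterations=30, cutoff=45, rng=rng)
print([len(leaf["points"]) for leaf in leaves(tree)])
```

The keyword values mirror the flags in the command above (k=3, iterations=30, cutoff=45). Every input point ends up in exactly one leaf, so the leaf sizes printed at the end sum to the size of the input.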

Load Hierarchical Clustering

To load an existing hierarchical clustering structure, use the following command:

python3 clustering_driver.py load -t existing_tree.json -G

Replace existing_tree.json with the path to the generated JSON file containing the hierarchy. The JSON file includes a resources field that lists the support files needed to build the tree. The -G flag builds the tree as an adjacency list and serializes it as tree_representation.graphml.
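If you want to check which support files a hierarchy file points to before loading it, the resources field can be read with the standard json module. The sketch below assumes the hierarchy is a JSON object with resources at the top level. The layout inside resources is not documented here, so the keys in the stand-in file (data_list, node_vals) are made up for the demo, as is the helper name read_resources.

```python
import json
import os
import tempfile

def read_resources(hierarchy_path):
    """Return the `resources` entry of a hierarchy JSON file, or None if it is absent."""
    with open(hierarchy_path, "r", encoding="utf-8") as fh:
        hierarchy = json.load(fh)
    return hierarchy.get("resources")

# Demo on a stand-in file. The keys inside "resources" are hypothetical;
# inspect a real *_tree_hierarchy.json to see the actual layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output_tree_hierarchy.json")
    stand_in = {
        "resources": {
            "data_list": "output_tree_data_list.npy",
            "node_vals": "output_tree_node_vals.npy",
        }
    }
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(stand_in, fh)
    resources = read_resources(demo_path)

print(resources)
```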

Search Hierarchical Clustering

To search for clusters in an existing hierarchical clustering structure, use the following command:

python3 clustering_driver.py search -t existing_tree.json -M exampleM_2_2.npy -G

Replace existing_tree.json with the path to the JSON file containing the hierarchy and exampleM_2_2.npy with the path to the large input data file. The -G flag is optional and generates a graph from the tree data.
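Conceptually, searching a cluster hierarchy can be done by starting at the root and repeatedly stepping into the child whose centroid is closest to the query, until a leaf is reached. The sketch below shows that greedy descent on a tiny hand-built tree. The node layout (centroid, children, points) and the function names are assumptions for illustration and may differ from what the search command does internally.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(node, query):
    """Follow the closest child centroid at each level and return the leaf reached."""
    while node["children"]:
        node = min(node["children"], key=lambda child: sq_dist(query, child["centroid"]))
    return node

def make_leaf(centroid, points):
    """Build a leaf node in the hypothetical layout used by this sketch."""
    return {"centroid": centroid, "children": [], "points": points}

# A two-leaf tree standing in for a loaded hierarchy.
tree = {
    "centroid": (2.5, 2.5),
    "points": [],
    "children": [
        make_leaf((0.0, 0.0), [(-0.1, 0.2), (0.1, -0.2)]),
        make_leaf((5.0, 5.0), [(4.8, 5.1), (5.2, 4.9)]),
    ],
}

query = (4.0, 4.5)
hit = search(tree, query)
best = min(hit["points"], key=lambda p: sq_dist(p, query))
print(hit["centroid"], best)
```

A greedy descent like this visits only one branch per level, which is what makes tree search fast, but it is approximate: the true nearest point can sit in a sibling branch that was never visited.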

Likelihood evaluation (Experimental)

To compute the likelihoods of reference data given input data, use the following command:

python3 clustering_driver.py likelihood -t existing_tree.json -M some_input.npy -G

This runs several variants of the likelihood evaluation and writes the results of each computation to CSV files.

Contributing

Contributions are welcome! If you have any ideas or improvements, please feel free to open issues or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

data-analysis's People

Contributors

doyouevenstacksmash

data-analysis's Issues

incorrect likelihoods?

Currently I calculate likelihoods by doing the following:

def likelihood(omega, m, N_pix, noise=1):

image

The paper suggests another which... is a little harder to reason about considering we don't have any kind of temporal ordering in the data as far as I know
image

Maybe we're supposed to use some numerical technique for the posterior integration overall. As it stands, the likelihood seems incorrect.

What to do?
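One thing that may help while debugging is to evaluate the likelihood in log space, so that products over many pixels do not underflow. The sketch below is not the function from the issue, whose body is not shown here. It assumes independent Gaussian pixel noise with standard deviation noise, and treats omega and m as a flattened image and a flattened reference of N_pix values each, which is a guess based on the parameter names. The helper log_sum_exp is likewise hypothetical and only shows how per-reference log-likelihoods can be summed stably.

```python
import math

def log_likelihood(omega, m, N_pix, noise=1.0):
    """Log of an isotropic Gaussian likelihood of `omega` given `m` (assumed model)."""
    if len(omega) != N_pix or len(m) != N_pix:
        raise ValueError("omega and m must both contain N_pix values")
    sq_residual = sum((o - r) ** 2 for o, r in zip(omega, m))
    norm = -0.5 * N_pix * math.log(2.0 * math.pi * noise ** 2)
    return norm - sq_residual / (2.0 * noise ** 2)

def log_sum_exp(values):
    """Stable log(sum(exp(v))) for combining log-likelihoods without underflow."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

image = [0.0, 1.0, 0.0, 1.0]
references = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
log_liks = [log_likelihood(image, ref, N_pix=4) for ref in references]
print(log_liks, log_sum_exp(log_liks))
```

Under this model, an image identical to its reference gives the largest possible value, -0.5 * N_pix * log(2 * pi * noise**2), which is a quick sanity check for any implementation.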

k means clustering produces empty clusters

k means empty cluster
https://github.com/DoYouEvenStackSmash/data-analysis/blob/4d42a33719890f4ec7719a1dd8bc50e4dc9e18d2/src/python-processing/kmeans.py#L71C4-L71C4

early stop condition:

# TODO: Investigate Estop behavior

Currently it is possible for the number of centroids to decrease during iterations of k means. This is permitted and triggers an early stop condition, which greatly improves performance but possibly at the cost of soundness. More investigation is required to determine whether this matters for good overall clustering and tree search.
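For comparison, one common alternative to letting the centroid count shrink is to re-seed any empty cluster during the update step, for example with the point that lies farthest from every surviving centroid. The sketch below shows a single update step of that kind. It is not the code in kmeans.py, and the function names are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def assign(points, centroids):
    """Group every point under its nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        groups[nearest].append(p)
    return groups

def update_keep_k(points, centroids):
    """One k-means update that re-seeds empty clusters instead of dropping them."""
    groups = assign(points, centroids)
    updated = [mean(g) if g else None for g in groups]
    for i, centroid in enumerate(updated):
        if centroid is None:
            # Centroids that currently exist, including any re-seeded earlier in this loop.
            live = [c for c in updated if c is not None]
            # Re-seed with the point that is worst served by the existing centroids.
            updated[i] = max(points, key=lambda p: min(sq_dist(p, c) for c in live))
    return updated

points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
centroids = [(0.0, 0.5), (10.0, 0.5), (100.0, 100.0)]  # the third centroid attracts no points
new_centroids = update_keep_k(points, centroids)
print(new_centroids)
```

In this example the third centroid starts far from all the data and receives no points, so it is moved onto the outlier (5, 8) and the number of centroids stays at three. Whether keeping k fixed actually improves the resulting tree compared with stopping early is the open question raised above.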
