
bhc's Introduction

bhc

A modern approach to the Bayesian hierarchical clustering algorithm.

Bayesian Hierarchical Clustering (BHC) is an agglomerative tree-based method for identifying underlying population structures ("clusters"). BHC was introduced by K. Heller and Z. Ghahramani as a way to approximate the more computationally intensive infinite Gaussian mixture model.

The advantage of these models over their counterparts is that the number of clusters does not need to be specified ex ante. Instead, the Bayesian paradigm allows for regularized flexibility via a prior placed on the cluster concentration parameter $\alpha$. This module is built using object-oriented programming (OOP) in Python so that it can be used in a similar manner to the popular scikit-learn library. As with many Bayesian methods, the increased flexibility of BHC comes at a computational cost and with an increased risk of poor results due to misspecified priors.


Algorithm Description

The core of the BHC algorithm relies on a Bayesian hypothesis test in which two alternatives are compared:

  1. $H_1$ is the hypothesis that two clusters $D_i$ and $D_j$ were generated from the same distribution $p(x | \theta)$, with the prior distribution for $\theta$ being $p(\theta | \beta)$. Writing $D_k = D_i \cup D_j$ for the merged data, the marginal likelihood of the merged cluster under this hypothesis is $p(D_k|H_1)$. The posterior probability of this hypothesis is:

$$ r_k = \frac{\pi_k p(D_k|H_1)}{\pi_k p(D_k|H_1) + (1 - \pi_k) p(D_i|T_i)p(D_j|T_j)} $$

Note that $\pi_k$ is the prior probability of a merge occurring for clusters $i$ and $j$. This makes the denominator of this expression the Bayesian evidence for the merged data $D_k$.

  2. $H_2$ is the hypothesis that the two clusters $D_i$ and $D_j$ were generated from two independent distributions and therefore should not be joined together as cluster $D_k$. The likelihood of the data under $H_2$ is calculated as $p(D_k|H_2) = p(D_i|T_i)p(D_j|T_j)$, where $T_i$ and $T_j$ are the subtrees (subclusters) being examined.
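The posterior $r_k$ above multiplies several marginal likelihoods, which can underflow for even moderately sized clusters. The following is a minimal sketch of evaluating it in log space. The function name `log_merge_posterior` and its arguments are illustrative assumptions, not part of the bhc API.

```python
import math


def log_merge_posterior(log_pi_k, log_ml_h1, log_ml_i, log_ml_j):
    """Return (log r_k, log evidence) for one candidate merge.

    Hypothetical helper, not part of bhc. Arguments are natural logs of:
      log_pi_k  : the prior merge probability pi_k, with 0 < pi_k < 1
      log_ml_h1 : p(D_k | H_1)
      log_ml_i  : p(D_i | T_i)
      log_ml_j  : p(D_j | T_j)
    """
    # log(1 - pi_k), computed from log(pi_k)
    log_one_minus_pi = math.log1p(-math.exp(log_pi_k))

    # Numerator: pi_k * p(D_k | H_1)
    log_merged = log_pi_k + log_ml_h1
    # Second denominator term: (1 - pi_k) * p(D_i | T_i) * p(D_j | T_j)
    log_split = log_one_minus_pi + log_ml_i + log_ml_j

    # log(exp(a) + exp(b)) without leaving log space
    hi = max(log_merged, log_split)
    log_evidence = hi + math.log(
        math.exp(log_merged - hi) + math.exp(log_split - hi)
    )
    return log_merged - log_evidence, log_evidence


log_r, log_ev = log_merge_posterior(
    math.log(0.5), math.log(0.2), math.log(0.5), math.log(0.2)
)
```

The second return value is the log of the denominator, which is the quantity a parent node would reuse as $p(D_k|T_k)$ when it is itself considered for a merge.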

At each step, all pairs of existing clusters are compared, and the pair with the highest posterior merge probability $r_k$ is joined into a new cluster.
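The greedy pairwise comparison can be sketched as a loop that scores every pair and merges the best one until a single cluster remains. This is an illustration only: `best_merge`, `greedy_agglomerate`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in based on the distance between cluster means, not the posterior $r_k$.

```python
from itertools import combinations


def best_merge(clusters, merge_score):
    """Return (score, i, j) for the pair of cluster ids with the highest score."""
    return max(
        (merge_score(clusters[i], clusters[j]), i, j)
        for i, j in combinations(sorted(clusters), 2)
    )


def greedy_agglomerate(points, merge_score):
    """Merge the best-scoring pair repeatedly; return the merge history."""
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        score, i, j = best_merge(clusters, merge_score)
        # The merged cluster gets a fresh id; its children are removed.
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, next_id, score))
        next_id += 1
    return merges


def toy_score(a, b):
    # Placeholder score (NOT r_k): closer cluster means score higher.
    return -abs(sum(a) / len(a) - sum(b) / len(b))


merges = greedy_agglomerate([0.0, 0.1, 5.0], toy_score)
```

In a BHC setting, `merge_score` would be replaced by a function returning $r_k$ for the candidate pair. Scoring every pair at every step is quadratic in the number of clusters per step, which is one source of the computational cost mentioned earlier.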


Using bhc

The distribution families in bhc take the same keyword arguments as those found in scipy.stats for ease of use.

The Multivariate Normal-inverse Wishart family

family="normal_inv_gamma" takes the following kwargs for params:

params = {
    "multivariate_normal": {"mean": [a vector], "cov": [a matrix]},
    # r is a scaling factor on the prior precision of the mean
    "invwishart": {"df": [an integer], "scale": [a matrix], "r": [a scalar]},
}
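As a concrete illustration, the placeholders above could be filled in as follows for two-dimensional data. The values are arbitrary, and `check_params` is a hypothetical helper (not part of bhc) that only verifies the shapes agree. Whether bhc expects plain lists or NumPy arrays is not stated here, so plain lists are an assumption, as is the requirement that `r` be positive.

```python
dim = 2  # dimensionality of the data (illustrative)

params = {
    "multivariate_normal": {
        "mean": [0.0, 0.0],                # prior mean, length dim
        "cov": [[1.0, 0.0], [0.0, 1.0]],   # dim x dim covariance
    },
    "invwishart": {
        "df": dim + 2,                     # integer degrees of freedom
        "scale": [[1.0, 0.0], [0.0, 1.0]], # dim x dim scale matrix
        "r": 1.0,                          # scaling on prior precision of the mean
    },
}


def check_params(params):
    """Hypothetical helper: return the dimension if all shapes are consistent."""
    mvn, iw = params["multivariate_normal"], params["invwishart"]
    d = len(mvn["mean"])
    for name, matrix in (("cov", mvn["cov"]), ("scale", iw["scale"])):
        if len(matrix) != d or any(len(row) != d for row in matrix):
            raise ValueError(f"{name} must be a {d} x {d} matrix")
    if iw["df"] < d:
        raise ValueError("df must be at least the dimension of scale")
    if iw["r"] <= 0:  # assumed constraint: a precision scaling should be positive
        raise ValueError("r must be positive")
    return d


dimension = check_params(params)
```

Note that `r` is not a scipy.stats keyword; it appears only in this module's `invwishart` entry.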

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invwishart.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html


Development Plan

  1. OOP

  2. Need to create an all-encompassing structure to contain all components - think sklearn style models/objects.

  3. Need to create a cluster object suitable for a "union-find" structure/algorithm.

  4. data types:

    4.a) multivariate Gaussian

    4.b) Dirichlet-multinomial

  5. Pytest

  6. Follow PEP 8: https://www.python.org/dev/peps/pep-0008/

  7. Use sphinx for documentation: https://www.sphinx-doc.org/en/master/
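Item 3 of the plan mentions a cluster object suited to a "union-find" structure. A minimal sketch of such a structure over integer point indices is shown below. The class name `ClusterForest` and its methods are assumptions for illustration and do not describe the eventual bhc design.

```python
class ClusterForest:
    """Minimal union-find (disjoint-set) over point indices 0..n_points-1."""

    def __init__(self, n_points):
        self.parent = list(range(n_points))  # each point starts as its own root
        self.size = [1] * n_points           # cluster size, valid at roots

    def find(self, i):
        """Return the root (cluster representative) of point i."""
        while self.parent[i] != i:
            # Path halving: point i at its grandparent to flatten the tree.
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        """Merge the clusters containing i and j; return the new root."""
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return ri
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ri] < self.size[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri
        self.size[ri] += self.size[rj]
        return ri


forest = ClusterForest(4)
forest.union(0, 1)
forest.union(2, 3)
```

A structure like this answers "which cluster does this point belong to?" cheaply after each merge. It does not store the merge tree itself, so per-node quantities such as marginal likelihoods would need to be kept alongside it.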

bhc's People

Contributors

samvoisin


bhc's Issues

Documentation

Currently the documentation in the broad project README file is sub-par. This must be updated when the package structure is determined for v.0.
