A modern approach to the Bayesian hierarchical clustering algorithm.
Bayesian Hierarchical Clustering (BHC) is an agglomerative tree-based method for identifying underlying population structures ("clusters"). BHC was introduced by K. Heller and Z. Ghahramani as a way to approximate the more computationally intensive infinite Gaussian mixture model.
The advantage of these models over traditional agglomerative methods is that the number of clusters does not need to be specified ex ante. Instead, the Bayesian paradigm allows for regularized flexibility via a prior placed on the cluster concentration parameter.
The core of the BHC algorithm relies on a Bayesian hypothesis test in which two alternatives are compared:
- $H_1$ is the hypothesis that two clusters $D_i$ and $D_j$ were generated from the same distribution $p(x | \theta)$ with the prior distribution for $\theta$ being $p(\theta | \beta)$. The probability of clusters $i$ and $j$ being generated from the same distribution is defined as $p(D_k|H_1)$. The posterior for this hypothesis is

$$r_k = \frac{\pi_k \, p(D_k|H_1)}{\pi_k \, p(D_k|H_1) + (1 - \pi_k) \, p(D_i|T_i) \, p(D_j|T_j)}$$

where $\pi_k$ is the prior probability of the merge, built up recursively from the concentration parameter $\alpha$. Note that the denominator is the marginal probability of the data under the tree, $p(D_k|T_k)$, which weighs both hypotheses.
- $H_2$ is the hypothesis that the two clusters $D_i$ and $D_j$ were generated from two independent distributions and therefore should not be joined together as cluster $D_k$. The probability of $H_2$ is calculated as $p(D_k|H_2) = p(D_i|T_i)p(D_j|T_j)$, where $T_i$ and $T_j$ are the subclusters being examined.
At each step, every pair of existing clusters is compared, and the pair with the highest posterior merge probability is joined.
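To make the recursion concrete, here is a minimal, self-contained sketch of the merge criterion in log space, following the Heller & Ghahramani recursions. The `Node` container, the `make_leaf`/`posterior_merge`/`best_merge` helpers, and the deliberately simple unit-variance Gaussian marginal likelihood are illustrative assumptions, not the package's actual API:

```python
# Illustrative sketch of the BHC merge criterion (not this package's actual API).
# Notation follows the Heller & Ghahramani recursions, computed in log space.
import itertools
from dataclasses import dataclass

import numpy as np
from scipy.special import gammaln


@dataclass
class Node:
    data: np.ndarray     # points in this subtree, shape (n, dim)
    log_d: float         # log d_k from the pi_k recursion (log alpha at a leaf)
    log_p_tree: float    # log p(D_k | T_k)


def log_marginal_likelihood(data: np.ndarray) -> float:
    """log p(D | H_1) for x ~ N(mu, I) with mu ~ N(0, I) -- a deliberately simple stand-in."""
    n, dim = data.shape
    s = data.sum(axis=0)
    return (-0.5 * n * dim * np.log(2 * np.pi)
            - 0.5 * dim * np.log(n + 1)
            - 0.5 * (np.sum(data ** 2) - s @ s / (n + 1)))


def make_leaf(x: np.ndarray, alpha: float) -> Node:
    """Wrap a single data point as a leaf node."""
    point = np.atleast_2d(x)
    return Node(point, np.log(alpha), log_marginal_likelihood(point))


def posterior_merge(left: Node, right: Node, alpha: float) -> tuple[float, Node]:
    """Return (r_k, merged node) for joining two existing clusters."""
    data = np.vstack([left.data, right.data])
    n_k = data.shape[0]

    # pi_k = alpha * Gamma(n_k) / d_k   with   d_k = alpha * Gamma(n_k) + d_i * d_j
    log_alpha_gamma = np.log(alpha) + gammaln(n_k)
    log_d = np.logaddexp(log_alpha_gamma, left.log_d + right.log_d)
    log_pi = log_alpha_gamma - log_d
    log_one_minus_pi = left.log_d + right.log_d - log_d

    # p(D_k | T_k) = pi_k * p(D_k | H_1) + (1 - pi_k) * p(D_i | T_i) * p(D_j | T_j)
    log_ml = log_marginal_likelihood(data)
    log_p_tree = np.logaddexp(log_pi + log_ml,
                              log_one_minus_pi + left.log_p_tree + right.log_p_tree)

    r_k = float(np.exp(log_pi + log_ml - log_p_tree))   # posterior merge probability
    return r_k, Node(data, log_d, log_p_tree)


def best_merge(clusters: list[Node], alpha: float):
    """Pick the pair of current clusters with the highest posterior merge probability."""
    return max(((*posterior_merge(a, b, alpha), (a, b))
                for a, b in itertools.combinations(clusters, 2)),
               key=lambda c: c[0])
```

Starting from leaves built with `make_leaf`, repeatedly joining the winning pair and replacing it with the merged node builds the full tree; merges with $r_k < 0.5$ mark natural places to cut it into clusters.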
The distribution families in `bhc` take the same keyword arguments as those found in `scipy.stats` for ease of use. `family="normal_inv_gamma"` takes the following kwargs for `params`:
```python
params = {
    "multivariate_normal": {"mean": [a vector], "cov": [a matrix]},
    # "r" is a scaling factor on the prior precision of the mean
    "invwishart": {"df": [an integer], "scale": [a matrix], "r": [a scalar]},
}
```
See the corresponding `scipy.stats` documentation:

- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invgamma.html#scipy.stats.invgamma
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
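To make the placeholder shapes concrete, an illustrative two-dimensional prior specification (arbitrary numeric values, using the key layout from the example above) could look like:

```python
import numpy as np

# Illustrative values only, following the key layout shown above.
params = {
    "multivariate_normal": {
        "mean": np.zeros(2),   # prior mean of the cluster mean
        "cov": np.eye(2),      # prior covariance of the cluster mean
    },
    "invwishart": {
        "df": 4,               # degrees of freedom
        "scale": np.eye(2),    # scale matrix of the inverse-Wishart prior
        "r": 1.0,              # scaling factor on the prior precision of the mean
    },
}
```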
- OOP
  - Need to create an all-encompassing structure to contain all components - think sklearn-style models/objects.
  - Need to create a `cluster` object suitable for a "union-find" structure/algorithm (see the sketch after this list).
- Data types:
  - multivariate Gaussian
  - Dirichlet-multinomial
- Pytest
- Use Sphinx for documentation: https://www.sphinx-doc.org/en/master/
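As a rough sketch of the union-find idea for the `cluster` object, each data point starts as its own cluster, and an accepted merge links one root under another with path compression on lookup. The class name and fields below are hypothetical, not the package's current API:

```python
# Hypothetical union-find style cluster structure (illustrative, not the package's API).
class Cluster:
    def __init__(self, point_index: int):
        self.parent = self           # each cluster starts as its own root
        self.points = [point_index]  # indices of the data points in this cluster

    def find(self) -> "Cluster":
        """Return the root cluster, compressing the path along the way."""
        node = self
        while node.parent is not node:
            node.parent = node.parent.parent  # path halving
            node = node.parent
        return node

    def union(self, other: "Cluster") -> "Cluster":
        """Merge two clusters and return the surviving root."""
        root_a, root_b = self.find(), other.find()
        if root_a is root_b:
            return root_a
        # Union by size: attach the smaller root under the larger one.
        if len(root_a.points) < len(root_b.points):
            root_a, root_b = root_b, root_a
        root_b.parent = root_a
        root_a.points.extend(root_b.points)
        return root_a
```

In the BHC loop this would let each accepted merge be recorded in near-constant time while still recovering which original points belong to the surviving cluster.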