A modern approach to the Bayesian hierarchical clustering algorithm.
Bayesian Hierarchical Clustering (BHC) is an agglomerative tree-based method for identifying underlying population structures ("clusters"). BHC was introduced by K. Heller and Z. Ghahramani as a way to approximate the more computationally intensive infinite Gaussian mixture model.
The advantage of these models over traditional agglomerative methods is that the number of clusters does not need to be specified ex ante. Instead, the Bayesian paradigm allows for regularized flexibility via a prior placed on the cluster concentration parameter.
The core of the BHC algorithm relies on a Bayesian hypothesis test in which two alternatives are compared:
- $H_1$ is the hypothesis that two clusters $D_i$ and $D_j$ were generated from the same distribution $p(x | \theta)$ with the prior distribution for $\theta$ being $p(\theta | \beta)$. The probability of clusters $i$ and $j$ being generated from the same distribution is defined as $p(D_k|H_1)$. The posterior for this hypothesis is

$$r_k = \frac{\pi_k \, p(D_k|H_1)}{\pi_k \, p(D_k|H_1) + (1 - \pi_k) \, p(D_i|T_i) \, p(D_j|T_j)}$$

where $\pi_k$ is the prior probability of the merge, built up recursively from the concentration parameter $\alpha$. Note that the denominator is the marginal probability of the data under the tree, $p(D_k|T_k)$, which weighs both hypotheses.
- $H_2$ is the hypothesis that the two clusters $D_i$ and $D_j$ were generated from two independent distributions and therefore should not be joined together as cluster $D_k$. The probability of $H_2$ is calculated as $p(D_k|H_2) = p(D_i|T_i)p(D_j|T_j)$, where $T_i$ and $T_j$ are the subclusters being examined.
At each step, every pair of existing clusters is compared, and the pair with the highest posterior merge probability is joined.
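To make the recursion concrete, here is a minimal, self-contained sketch of the merge criterion in log space, following the Heller & Ghahramani recursions. The `Node` container, the `make_leaf`/`posterior_merge`/`best_merge` helpers, and the deliberately simple unit-variance Gaussian marginal likelihood are illustrative assumptions, not the package's actual API:

```python
# Illustrative sketch of the BHC merge criterion (not this package's actual API).
# Notation follows the Heller & Ghahramani recursions, computed in log space.
import itertools
from dataclasses import dataclass

import numpy as np
from scipy.special import gammaln


@dataclass
class Node:
    data: np.ndarray     # points in this subtree, shape (n, dim)
    log_d: float         # log d_k from the pi_k recursion (log alpha at a leaf)
    log_p_tree: float    # log p(D_k | T_k)


def log_marginal_likelihood(data: np.ndarray) -> float:
    """log p(D | H_1) for x ~ N(mu, I) with mu ~ N(0, I) -- a deliberately simple stand-in."""
    n, dim = data.shape
    s = data.sum(axis=0)
    return (-0.5 * n * dim * np.log(2 * np.pi)
            - 0.5 * dim * np.log(n + 1)
            - 0.5 * (np.sum(data ** 2) - s @ s / (n + 1)))


def make_leaf(x: np.ndarray, alpha: float) -> Node:
    """Wrap a single data point as a leaf node."""
    point = np.atleast_2d(x)
    return Node(point, np.log(alpha), log_marginal_likelihood(point))


def posterior_merge(left: Node, right: Node, alpha: float) -> tuple[float, Node]:
    """Return (r_k, merged node) for joining two existing clusters."""
    data = np.vstack([left.data, right.data])
    n_k = data.shape[0]

    # pi_k = alpha * Gamma(n_k) / d_k   with   d_k = alpha * Gamma(n_k) + d_i * d_j
    log_alpha_gamma = np.log(alpha) + gammaln(n_k)
    log_d = np.logaddexp(log_alpha_gamma, left.log_d + right.log_d)
    log_pi = log_alpha_gamma - log_d
    log_one_minus_pi = left.log_d + right.log_d - log_d

    # p(D_k | T_k) = pi_k * p(D_k | H_1) + (1 - pi_k) * p(D_i | T_i) * p(D_j | T_j)
    log_ml = log_marginal_likelihood(data)
    log_p_tree = np.logaddexp(log_pi + log_ml,
                              log_one_minus_pi + left.log_p_tree + right.log_p_tree)

    r_k = float(np.exp(log_pi + log_ml - log_p_tree))   # posterior merge probability
    return r_k, Node(data, log_d, log_p_tree)


def best_merge(clusters: list[Node], alpha: float):
    """Pick the pair of current clusters with the highest posterior merge probability."""
    return max(((*posterior_merge(a, b, alpha), (a, b))
                for a, b in itertools.combinations(clusters, 2)),
               key=lambda c: c[0])
```

Starting from leaves built with `make_leaf`, repeatedly joining the winning pair and replacing it with the merged node builds the full tree; merges with $r_k < 0.5$ mark natural places to cut it into clusters.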
The distribution families in `bhc` take the same keyword arguments as those found in `scipy.stats` for ease of use. `family="normal_inv_gamma"` takes the following kwargs for `params`:
```python
params = {
    "multivariate_normal": {"mean": [a vector], "cov": [a matrix]},
    # "r" is a scaling factor on the prior precision of the mean
    "invwishart": {"df": [an integer], "scale": [a matrix], "r": [a scalar]},
}
```
See the corresponding `scipy.stats` documentation:

- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invgamma.html#scipy.stats.invgamma
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
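To make the placeholder shapes concrete, an illustrative two-dimensional prior specification (arbitrary numeric values, using the key layout from the example above) could look like:

```python
import numpy as np

# Illustrative values only, following the key layout shown above.
params = {
    "multivariate_normal": {
        "mean": np.zeros(2),   # prior mean of the cluster mean
        "cov": np.eye(2),      # prior covariance of the cluster mean
    },
    "invwishart": {
        "df": 4,               # degrees of freedom
        "scale": np.eye(2),    # scale matrix of the inverse-Wishart prior
        "r": 1.0,              # scaling factor on the prior precision of the mean
    },
}
```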
- OOP
  - Need to create an all-encompassing structure to contain all components - think sklearn-style models/objects.
  - Need to create a `cluster` object suitable for a "union-find" structure/algorithm (see the sketch after this list).
- Data types:
  - multivariate Gaussian
  - Dirichlet-multinomial
- Pytest
- Use Sphinx for documentation: https://www.sphinx-doc.org/en/master/
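As a rough sketch of the union-find idea for the `cluster` object, each data point starts as its own cluster, and an accepted merge links one root under another with path compression on lookup. The class name and fields below are hypothetical, not the package's current API:

```python
# Hypothetical union-find style cluster structure (illustrative, not the package's API).
class Cluster:
    def __init__(self, point_index: int):
        self.parent = self           # each cluster starts as its own root
        self.points = [point_index]  # indices of the data points in this cluster

    def find(self) -> "Cluster":
        """Return the root cluster, compressing the path along the way."""
        node = self
        while node.parent is not node:
            node.parent = node.parent.parent  # path halving
            node = node.parent
        return node

    def union(self, other: "Cluster") -> "Cluster":
        """Merge two clusters and return the surviving root."""
        root_a, root_b = self.find(), other.find()
        if root_a is root_b:
            return root_a
        # Union by size: attach the smaller root under the larger one.
        if len(root_a.points) < len(root_b.points):
            root_a, root_b = root_b, root_a
        root_b.parent = root_a
        root_a.points.extend(root_b.points)
        return root_a
```

In the BHC loop this would let each accepted merge be recorded in near-constant time while still recovering which original points belong to the surviving cluster.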