Common segmentation algorithms and distance metrics

There are many ways to divide your data into subgroups. This is a short overview of some of those methods, with a few comments on each. The second table lists several methods for calculating pairwise distances between records in your data.

There are many nuances to each of the methods described here – they are just a quick DuckDuckGo search away on Stack Overflow and in myriad textbooks. Towards Data Science has a great post with animations. There are also many interactive applications that let you see clustering in action, e.g. the Stanford class, or ClusterEng, which can use your own data or sample (genomic) data.

Lastly, real-world segmentation rarely looks like textbook examples. To quote this excellent blogpost:

If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions?

| Name | Input | Particularly useful for | Pros | Cons | R function | Example |
|---|---|---|---|---|---|---|
| K-means | Continuous variables only (standardized) | Continuous data sets; quick exploration of data. Can be used on already-transformed data (e.g. unit activities in neural networks, principal components). | Simple, widely used, many tools to compute the optimal number of clusters. | Variables need to be standardized; clusters need to span a similar size in space and be roughly spherical. Breaks down in so many ways. | `stats::kmeans` | Maibach et al.: Translating Health Psychology into Effective Health Communication: The American Healthstyles Audience Segmentation Project |
| K-medoids | Dissimilarity matrix | Data with some outliers (more robust than k-means); mixed data that can be converted to a distance matrix. | Works with any distance matrix. Slightly more robust than k-means. Cluster centres are actual exemplars from the data. | Like k-means, not a model-based method, so no metrics of model fit. | `cluster::pam` | Kenyon et al.: Classification of incidence and prevalence of certain sexually transmitted infections by world regions |
| Decision tree | Any mix of variable types | Segmentations with one (specific or composite) outcome variable to guide the process; quick test of the most strongly predictive variables. | Segment construction is very transparent and splitting rules are explicit. | Leaves should be profiled to enrich the description of each segment beyond the few variables in the tree. Population is split by the most predictive variables only, so a relevant and reliable outcome is required. Requires additional work to achieve a reliable tree. | `rpart::rpart` | Ozgulbas et al.: Financial profiling of public hospitals: an application by data mining |
| Latent class analysis | Categorical variables only | Data sets that are categorical or have been converted to categorical. | One of the few methods that can deal with categorical variables. Results are simple to interpret (just class probabilities). As a mixture model, it outputs metrics to evaluate the optimal number of classes. | Information is lost when continuous distributions are discretized. | `poLCA::poLCA` | Kemperman et al.: Heterogeneity in Urban Park Use of Aging Visitors: A Latent Class Analysis |
| Gaussian mixture models | Multivariate Gaussian-distributed data | Continuous data where you can reasonably assume subpopulations have Gaussian distributions on each variable (fits a distribution described by a mean and standard deviation). | Where k-means assumes spherical clusters, GMMs can fit any ellipse-shaped data. The mclust package is very powerful for cluster selection, dimensionality reduction, and other useful functions. | Inaccurate if a subgroup has non-normal distributions on some variables, which is often the case in real-world data. | `mclust::Mclust` | Raaijmakers et al.: Consumer segmentation based on health-related motive orientations and fruit and vegetable consumption |
| Hierarchical clustering | Dissimilarity matrix | Quick explorations; visualizing clustering solutions with dendrograms. | Versatile technique; the dendrogram gives useful intuition on the number of clusters. | Commits to joining the most adjacent records without reconsidering groupings, so it might require e.g. k-means to refine the clusters. | `stats::hclust` | Martins et al.: Clustering symptoms of non-severe malaria in semi-immune Amazonian patients |
| TwoStep clustering | Continuous and categorical variables | Large data (solves some memory issues); hands-off solution. | Similar to hierarchical clustering: sequentially groups records that are similar, then performs hierarchical clustering across these groups. SPSS reports the relevance of each variable to the clustering solution. Does the heavy lifting from start to finish, including suggesting the optimal solution. | Order of records matters for the final solution. Bit of a black box. SPSS only. | SPSS-native; possibly also available in the `prcr` package or the deprecated `birch` package | Helm et al.: Subgrouping outpatients of an environmental medicine unit using SCL-90-R and cluster analysis |
| Manual segmentation | Based on existing knowledge or eyeballing distributions; set manual cut-offs | Situations in which you have strong priors on what is important and common sense can tell you what a relevant subgroup is (e.g. in ecommerce, everyone who hasn't been to the shop in 30+ days). | Fast, leverages domain knowledge, easy to implement, not subject to the vagaries of algorithms. | Will never be better than your intuition, and the human mind can only take a handful of variables into account. | — | Any intervention where e.g. an age cut-off is set manually |
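
To make the table above a bit more concrete, here is a minimal sketch of two of these methods (k-means and a Gaussian mixture model via mclust) run on a small made-up data frame; the column names, number of clusters, and parameter values are purely illustrative.

```r
set.seed(42)

# made-up toy data: three illustrative continuous variables
df <- data.frame(
  age    = rnorm(200, mean = 45, sd = 12),
  spend  = rexp(200, rate = 1 / 50),
  visits = rpois(200, lambda = 6)
)

# k-means: standardize first, as noted in the table
df_z <- scale(df)
km   <- stats::kmeans(df_z, centers = 3, nstart = 25)
table(km$cluster)                # segment sizes

# Gaussian mixture model: Mclust selects the number of components by BIC
library(mclust)
gmm <- Mclust(df_z)
summary(gmm)                     # chosen model and number of components
head(gmm$classification)         # segment assignment per record
```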

Calculating distances between records

Some of the more common clustering methods – including partitioning around medoids (PAM) and hierarchical clustering – work with "distance" or "dissimilarity" matrices (as does k-nearest neighbours). These are square matrices whose number of rows and columns equals the number of records in your data. Each element represents the dissimilarity between two records. The diagonal is 0 (each record is identical to itself), and the unit of the off-diagonal elements depends on the method used to calculate the distance.
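
As a minimal sketch of what such a matrix looks like, here is a made-up three-record data frame and its Euclidean distance matrix:

```r
# tiny made-up data frame with three records
df <- data.frame(x = c(1, 2, 10), y = c(1, 3, 9))

# dist() returns a compact lower-triangle object; as.matrix() expands it
# to the full square matrix described above
d <- stats::dist(scale(df), method = "euclidean")
m <- as.matrix(d)

dim(m)   # 3 x 3: one row and one column per record
diag(m)  # all zeros: each record has zero distance to itself
```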

More advanced applications calculate distance in some abstract space, such as the feature space spanned by the activations of all neurons in the final layer of a neural net, or in a space generated for word embeddings.

There are many such methods and there is no need to cover them all. Some of the more important and common ones are described here.

| Name | Input | Particularly useful for | Pros | Cons | R function |
|---|---|---|---|---|---|
| Euclidean distance | Continuous variables only, each on the same scale (e.g. standard deviations after standardization) | Run-of-the-mill clustering with continuous variables | Literally the shortest distance between two points in the space of the variables. | All distances approximate the same value in high-dimensional space (there is a great Stack Exchange post on this), so be careful if you have many variables. | `stats::dist(df, method = "euclidean")` |
| Manhattan distance | As Euclidean | Working with k-medoids, high-dimensional data, and data with outliers | Less influenced by outliers than Euclidean distance. Less bad in high-dimensional settings. | | `stats::dist(df, method = "manhattan")` |
| Gower distance | Any mix of variables | Mixed data frames: Gower distance uses an appropriate distance metric on each variable independently, then combines them across variables. No standardization needed. | Hugely flexible; allows virtually any data set to be segmented by PAM and hierarchical clustering. | Still need to consider outliers and skewed distributions, as with most other methods. | `cluster::daisy(df, metric = "gower")` |
| Random forest distance | Any mix of variables | Taking a raw data frame and brute-forcing it into a distance matrix by building thousands of decision trees and seeing how often each pair of records ends up in the same leaf (or in leaves close to one another) | Rather elegant one-size-fits-all approach. Can be used for both directed and undirected distance calculation (by setting the target of each tree to either a random variable or an actual outcome variable). R's randomForest package is very good. | Takes substantial computation and has some knobs to tweak. | `randomForest::randomForest(x = df, ntree = 10000, proximity = TRUE, oob.prox = TRUE)` |
| Correlation-based distance | Continuous variables on the same scale only | Clustering based on patterns of values rather than the values themselves: places individuals close together that exhibit a similar pattern (e.g. high-high-low-low). | Helps you avoid some of the pitfalls of other distance metrics in high dimensions. | Depending on your variables, the actual values might be much more important than the pattern of responses. Only works if you can get every variable onto the same scale (e.g. z-score first). | `r <- cor(t(df))`, then `dist <- as.dist(1 - r)` |
| Cosine (dis)similarity | Continuous variables on the same scale | High-dimensional data; popular in natural language processing | Reduces an arbitrarily dimensional space down to a single number without some of the unexpected issues of Euclidean distance. | Not sure. | `cos_angle <- lsa::cosine(t(df))`, then `dist <- as.dist(1 - cos_angle)` (note that cos_angle can be negative, so this distance ranges from 0 to 2) |
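
To tie a few of these metrics back to the clustering methods in the first table, here is a minimal sketch on made-up data; the column names are illustrative and the cluster and lsa packages are assumed to be installed.

```r
library(cluster)

# made-up mixed data frame: numeric and categorical columns
df_mixed <- data.frame(
  income  = c(20, 25, 90, 95, 40),
  region  = factor(c("north", "north", "south", "south", "east")),
  married = factor(c("yes", "yes", "no", "no", "yes"))
)

# Gower distance copes with the mixed column types directly, and the
# resulting dissimilarity matrix feeds straight into PAM (k-medoids)
d_gower <- cluster::daisy(df_mixed, metric = "gower")
pam_fit <- cluster::pam(d_gower, k = 2)
pam_fit$clustering

# Correlation- and cosine-based distances need all-numeric data on one scale,
# so use a separate made-up matrix of z-scored variables (records in rows)
mat <- scale(matrix(rnorm(5 * 4), nrow = 5))

r      <- cor(t(mat))          # record-by-record correlation
d_corr <- as.dist(1 - r)       # 0 = identical pattern; maximum is 2

cos_angle <- lsa::cosine(t(mat))        # cosine similarity between records
d_cosine  <- as.dist(1 - cos_angle)     # non-negative, ranges from 0 to 2
```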
