Jenks Natural Breaks Optimization - a 1D classification method to minimise in-class variance or L1 rounding error.
From Wikipedia: Jenks natural breaks classification method is a data clustering method designed to determine the best arrangement of values into different classes. This is done by seeking to minimize each class’s average deviation from the class mean, while maximizing each class’s deviation from the means of the other groups. In other words, the method seeks to reduce the variance within classes and maximize the variance between classes.
Jenks Classification is an iterative optimization process. We start as follows:

- Define `n` (arbitrary) initial classes. We use a maximum entropy method: the data array is sorted and split into `n` equal chunks; the boundaries of these chunks are taken as the initial `breaks` for the `n` classes.
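This initialisation can be sketched as follows (a minimal sketch, not the package's code; `initial_breaks` is a hypothetical helper name):

```julia
# Hypothetical helper: sort the data and split it into n equal chunks,
# returning the sorted array and the index of the first element of each class.
function initial_breaks(data::AbstractVector, n::Integer)
    N = length(data)
    sorted = sort(data)
    breaks = [round(Int, (k-1)*N/n) + 1 for k in 1:n]   # class k starts at breaks[k]
    return sorted, breaks
end

sorted, breaks = initial_breaks(randn(10_000), 5)
# breaks == [1, 2001, 4001, 6001, 8001]
```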
Then loop over:

1. For each class, calculate the sum of deviations `DEV` of all data points within that class from the class mean. This deviation can be linear (L1-norm), i.e. `abs(xi - μ)`, or quadratic (L2-norm), i.e. `(xi - μ)^2`.
2. Calculate the sum of all `DEV`s from all classes and record that value, which is used to assess convergence. For linear deviations this is the average rounding error (or quantization error), which should decrease over the iterations; for squared deviations it is essentially the in-class variance, from which the goodness of variance fit can be calculated, which should approach 1.0.
3. For each class, compare the `DEV` of that class to the next class's `DEV`:
    - 3.1 if larger: the `break` between that class and the neighbouring class is shifted by `s` so that the class with the smaller `DEV` grows, i.e. so that it contains more data points;
    - 3.2 else: shift by `s` in the other direction.
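The per-class deviation computation described above can be sketched as follows (a sketch under assumptions, not the package's implementation; `class_deviations` is a hypothetical name):

```julia
using Statistics: mean

# Hypothetical helper computing the summed deviation DEV of every class.
# breaks[k] is the index (into the sorted data) of the first element of class k.
# L=1 gives linear (L1) deviations, L=2 quadratic (L2) deviations.
function class_deviations(sorted::AbstractVector, breaks::Vector{Int}; L::Int=1)
    n, N = length(breaks), length(sorted)
    DEV = zeros(n)
    for k in 1:n
        i1 = k < n ? breaks[k+1] - 1 : N      # last index of class k
        c = @view sorted[breaks[k]:i1]
        μ = mean(c)
        DEV[k] = L == 1 ? sum(abs, c .- μ) : sum(abs2, c .- μ)
    end
    return DEV    # record sum(DEV) each iteration to assess convergence
end

class_deviations(collect(1.0:10.0), [1, 6])   # two classes: 1:5 and 6:10
# → [6.0, 6.0], since each class deviates by 2+1+0+1+2 from its mean
```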
The tricky bit is how to define `s`, which is technically a flux of data points from one class to the other (hence it's called `flux` in the function arguments), i.e. how much to change the class boundaries in each iteration. We use the following approaches:

1. The flux scales with the size of the 'donating' class: for a constant flux, a class with `N` members passes a certain fraction (for `flux=0.1` this is 10%) to the adjacent class.
2. The flux is decreased by `fluxadjust` (a multiplicative factor in the function arguments) if the previous flux direction was opposite, i.e. in the previous iteration the same two classes exchanged data points in the opposite direction. This helps convergence.
    - 2.1 The flux can also be increased by `fluxadjust` when the previous flux direction was the same. This accelerates convergence.
3. The flux cannot be larger than the size of the donating class. This guarantees that every class always contains at least one member and avoids overlapping classes.
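The flux rules above can be sketched as follows (hypothetical code with assumed names; in particular we assume `fluxadjust > 1`, so that multiplying grows and dividing shrinks the flux):

```julia
# Hypothetical sketch of one flux update between two neighbouring classes.
# N_donor: size of the donating class; flux: current fraction to pass on;
# fluxadjust: multiplicative adjustment factor (assumed > 1);
# same_direction: whether the previous exchange went the same way.
function flux_step(N_donor::Int, flux::Real, fluxadjust::Real, same_direction::Bool)
    # rules 2 and 2.1: shrink the flux after a direction reversal, grow it otherwise
    flux = same_direction ? flux * fluxadjust : flux / fluxadjust
    # rule 1: the number of donated points scales with the donating class size
    s = round(Int, flux * N_donor)
    # rule 3: never donate the whole class, so every class keeps at least one member
    s = min(s, N_donor - 1)
    return s, flux
end

flux_step(100, 0.1, 2.0, true)    # → (20, 0.2): flux grew, 20 points move
flux_step(100, 0.1, 2.0, false)   # → (5, 0.05): flux shrank after a reversal
```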
Specify the number of classes `n` and provide some data `data`, then

```julia
using Jenks

data = randn(10_000)
n = 5
JR = JenksClassification(n,data)
```
After completion the struct `JR` contains the break indices `JR.breaks`, the bounds `JR.bounds`, the number of elements per class `JR.n_in_class`, and a few other result parameters.
```julia
julia> JR.n_in_class
5-element Array{Int64,1}:
 1281
 2564
 2729
 2241
 1185
```
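Assuming `JR.bounds` holds the `n+1` class edges in ascending order (an assumption to check against the actual struct), new values can be assigned to a class with a binary search:

```julia
# classify is a hypothetical helper, not part of the package.
# bounds: assumed vector of n+1 ascending class edges; returns the class index of x,
# clamped so that out-of-range values fall into the first or last class.
classify(x, bounds) = clamp(searchsortedlast(bounds, x), 1, length(bounds) - 1)

bounds = [-3.0, -0.8, -0.1, 0.5, 1.2, 3.5]   # hypothetical edges for n = 5
classify(0.0, bounds)    # → 3, since -0.1 ≤ 0.0 < 0.5
```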
A comprehensive example can be found in this notebook.
In the REPL do

```julia
] add https://github.com/milankl/Jenks.jl
```