m-py / anticlust
Subset partitioning via anticlustering
License: Other
Right now I have three functions that implement an exchange algorithm: two specialized functions that are speed-optimized for maximizing the k-means and cluster editing objectives, respectively, and one generic version that can maximize any objective function.
This means there is a lot of redundant code. It would be desirable to merge the three functions.
The difficulty in merging is that each of the three functions needs different data structures that are generated and updated throughout the exchange method. I need to test whether it is possible to merge the functions in a reasonable way despite this difficulty.
As the exchange method is now the default algorithm (and is very strongly recommended over random sampling), it seems excessive to keep a parallelize option for random sampling; remove it. This also means removing the argument seed, which is only relevant for making parallel random sampling reproducible. Removing two arguments from anticlustering() is good because it has too many right now, and this change will clean up the code base in general.
Since the new variables are appended very early to the input data when objective = "kplus" in anticlustering(), preclustering (i.e., matching()) also uses these variables, which does not make sense and should be fixed.
@unDocUMeantIt provided an algorithm for anticlustering that is based on efficiently finding preclusters. From his function centroid_anticlustering(), read out these preclusters and use them as a backend in the balanced_clustering() function when method = "heuristic". My tests indicate that this clustering heuristic is faster and better than any that are currently implemented. This function should also be called when preclustering = TRUE in the anticlustering() function.
This means that I will be able to remove the following functions from the code base: equal_sized_kmeans(), greedy_balanced_k_clustering(), greedy_matching(), and any lower-level functions that are only called from within these functions.
There is no reason for features to be standardized within the anticlustering() function; users could do it with a call to scale() before calling anticlustering(). I am not even sure if standardization makes much sense in the context of anticlustering (or at least I have yet to see any advantages).
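As a minimal sketch of what standardizing outside the function could look like, the following uses only base R. The choice of the iris columns is an assumption made for illustration, and the commented-out anticlustering() call at the end only indicates where the standardized data would go.

```r
# Standardize each feature to mean 0 and standard deviation 1 (base R)
features <- iris[, c("Sepal.Length", "Sepal.Width")]
standardized <- scale(features)

# Sanity check: column means are numerically 0, standard deviations are 1
round(colMeans(standardized), 10)
apply(standardized, 2, sd)

# The standardized matrix would then be passed on by the user, e.g.:
# anticlustering(standardized, K = 3)
```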
The preclustering argument should accept a preclustering vector as input, not only TRUE/FALSE. If the input is TRUE, the preclustering is computed within the function anticlustering().
If the preclustering argument accepts a clustering vector, this allows more flexibility in combining different methods (e.g., exact matching as preclustering, combined with a random sampling heuristic for anticlustering).
This makes the R session crash quite reliably (at least once in about every ten attempts):
library(anticlust)
N <- 100
K <- N/2
cannot_link <- c(1, rep(2:(N-1), each = 2), N)
cannot_link <- matrix(cannot_link, ncol = 2, byrow = TRUE)
cannot_link <- rbind(cannot_link, t(apply(cannot_link, 1, rev)))
mat <- matrix(1, nrow = N, ncol = N)
mat[cannot_link] <- -1
anticlustering(mat, K = K, objective = "dispersion")
I get
*** caught segfault ***
address (nil), cause 'unknown'
Dear Developers,
First of all, thank you for developing such a useful package in R.
I have run into an issue applying anticlustering to a large data set (N = 295k) consisting of 5 variables (2 numeric, 3 categorical).
Numeric variables: age and duration
Categorical variables: gender (2 levels), riskzone (42 levels) and language (4 levels).
set.seed(98772)
sample_tbl <-
sample_tbl %>%
mutate(group = anticlustering(sample_tbl[,c("age","duration")],
K = 2,
categories = sample_tbl[,c("gender", "riskzone", "language")],
objective = "kplus",
standardize = TRUE))
After a few seconds I am running into an error:
Error: segfault from C stack overflow
Do you have any input/strategy on how to overcome this issue?
Thanks in advance
Apparently, with the optimized "local-updating" version of k-means anticlustering, the objective is incorrectly updated when the group sizes are unequal. Better results are obtained when recomputing the entire objective during each iteration. Reproducible example:
library(anticlust)
features <- schaper2019[, 3:6]
K <- 3
init <- sample(rep(1:3, nrow(schaper2019) * c(1/4, 1/4, 1/2)))
anticlusters <- anticlustering(
features,
K = init,
objective = variance_objective,
categories = schaper2019$room
)
mean_sd_tab(features, anticlusters)
# rating_consistent rating_inconsistent syllables frequency
# 1 "4.49 (0.24)" "1.10 (0.07)" "3.42 (1.10)" "18.33 (2.43)"
# 2 "4.49 (0.25)" "1.10 (0.07)" "3.42 (0.72)" "18.29 (2.24)"
# 3 "4.49 (0.25)" "1.10 (0.06)" "3.42 (0.94)" "18.31 (2.49)"
anticlusters <- anticlustering(
features,
K = init,
objective = "variance",
categories = schaper2019$room
)
mean_sd_tab(features, anticlusters)
# rating_consistent rating_inconsistent syllables frequency
# 1 "4.46 (0.24)" "1.11 (0.07)" "3.79 (1.10)" "19.75 (2.83)"
# 2 "4.51 (0.26)" "1.11 (0.06)" "2.96 (0.75)" "17.38 (1.74)"
# 3 "4.50 (0.24)" "1.10 (0.07)" "3.46 (0.82)" "18.06 (2.13)"
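For comparison purposes, the k-means criterion (sum of squared Euclidean distances between each element and its cluster centroid) can be recomputed from scratch without any assumption about group sizes. The following base R sketch does this; kmeans_criterion() is a hypothetical helper name chosen for illustration and is not a function of the anticlust package.

```r
# Recompute the k-means criterion from scratch: the sum of squared Euclidean
# distances between each element and the centroid of its cluster.
# Works for unequal group sizes because every centroid uses its own count.
kmeans_criterion <- function(features, clusters) {
  features <- as.matrix(features)
  # Per-cluster column sums divided by per-cluster counts give the centroids
  centers <- rowsum(features, clusters) / as.vector(table(clusters))
  # Look up each element's centroid by cluster label and sum squared deviations
  sum((features - centers[as.character(clusters), , drop = FALSE])^2)
}

# Tiny worked example with unequal group sizes (2 and 3 elements)
x <- cbind(c(0, 2, 10, 11, 12))
groups <- c(1, 1, 2, 2, 2)
kmeans_criterion(x, groups)
```

A locally updated objective could be checked against such a from-scratch value after every exchange while debugging.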
Now that the exchange method is the default option for anticlustering, it is desirable that the distance objective is computed faster. Instead of recomputing all distances by cluster, do something like the following:
[i, j] is TRUE whenever the elements i and j are part of the same cluster. After a swap, swap the columns and rows for the elements i and j (because they just exchange their cluster partners), but also set the entries [i, j] and [j, i] to FALSE (exchange partners are not part of the same cluster).
sum.
Add the possibility to include categorical constraints when method = "ilp".
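The same-cluster matrix bookkeeping described above for the distance objective can be sketched in base R as follows. All function names here are hypothetical and chosen for illustration. The sketch assumes the objective is the sum of within-cluster pairwise distances and that the two exchanged elements come from different clusters.

```r
# Logical matrix: entry [i, j] is TRUE if elements i and j share a cluster
same_cluster_matrix <- function(clusters) {
  same <- outer(clusters, clusters, "==")
  diag(same) <- FALSE  # an element is not its own partner
  same
}

# Update the matrix after elements i and j (from different clusters) swap
update_after_swap <- function(same, i, j) {
  idx <- seq_len(nrow(same))
  idx[c(i, j)] <- c(j, i)
  same <- same[idx, idx]  # swap rows and columns i and j
  same[i, j] <- FALSE     # exchange partners are not in the same cluster
  same[j, i] <- FALSE
  same
}

# Sum of within-cluster pairwise distances; each pair appears twice
within_cluster_distance_sum <- function(distances, same) {
  sum(distances[same]) / 2
}

set.seed(1)
features <- matrix(rnorm(24), ncol = 2)
distances <- as.matrix(dist(features))
clusters <- rep(1:3, times = 4)

same <- same_cluster_matrix(clusters)

# Elements 1 and 2 are in clusters 1 and 2; exchange them
swapped <- clusters
swapped[c(1, 2)] <- clusters[c(2, 1)]
same_updated <- update_after_swap(same, 1, 2)

# The updated matrix agrees with one rebuilt from the new assignment
identical(same_updated, same_cluster_matrix(swapped))
within_cluster_distance_sum(distances, same_updated)
```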
Thanks for a great package - it has been fantastic for balancing stimuli sets in complex experiments. I was wondering if there's any way to include a variable with NA values conditional on a categorical variable. At the moment NA values are not permitted (understandable). Something like:
library(tidyverse)
library(anticlust)
df <- mtcars |>
mutate(hp = ifelse(vs == 0, NA, hp)) |>
select(mpg, disp, hp, vs)
anticlustering(
df[,1:3],
K = c(9, 9, 9, 5),
objective = "variance",
categories = df$vs
)
Right now, my best idea is to do the clustering separately for each category (in this case, each level of vs) and then combine the data into the final groups, but I was wondering if there are any other ways.
Cheers -
Maybe, at some point, anticlustering() should also be callable in a way similar to the following:
anticlustering(
iris,
numeric_vars = c(Sepal.Length, Sepal.Width),
categorical_vars = Species,
K = 3
)
That is, the first argument is a generic data argument that takes the entire data frame that users work with, and users then specify only the column names to select numeric and categorical variables. It would probably just require adding the arguments numeric_vars and categorical_vars to anticlustering(), testing whether they exist, and then using non-standard evaluation to extract the relevant data from the first argument. This would also be better integrated into a tidyverse workflow. All of this does not make sense if the data input is a distance matrix, which still has to be supported.
Currently, we would have to use the following, which may be less appealing to users:
anticlustering(
iris[, c("Sepal.Length", "Sepal.Width")],
categories = iris$Species,
K = 3
)
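The non-standard evaluation step mentioned above could look roughly like the following base R sketch, which borrows the column-selection trick used by subset(). The helper names select_columns() and extract_inputs() are hypothetical. The sketch only extracts the two inputs and leaves the actual call to anticlustering() as a comment.

```r
# Evaluate an unquoted column selection such as c(Sepal.Length, Sepal.Width)
# against the column positions of `data` (the approach used by subset())
select_columns <- function(data, vars, enclos) {
  positions <- as.list(seq_along(data))
  names(positions) <- names(data)
  data[, eval(vars, positions, enclos), drop = FALSE]
}

# Hypothetical front end: split a data frame into features and categories
extract_inputs <- function(data, numeric_vars, categorical_vars) {
  caller <- parent.frame()
  features <- select_columns(data, substitute(numeric_vars), caller)
  categories <- NULL
  if (!missing(categorical_vars)) {
    categories <- select_columns(data, substitute(categorical_vars), caller)
  }
  list(features = features, categories = categories)
}

inputs <- extract_inputs(
  iris,
  numeric_vars = c(Sepal.Length, Sepal.Width),
  categorical_vars = Species
)
names(inputs$features)

# The extracted pieces would then be handed over, e.g.:
# anticlustering(inputs$features, K = 3, categories = inputs$categories)
```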
For my application of anticlust it would be very useful if the assignment of individual elements to clusters could be fixed or constrained a priori in anticlustering(). Instead of considering all K clusters for the constrained element, the algorithm would consider only a specific subset of clusters.
My use case is the assignment of versions of a psychological test to school classes during field testing. A small subset of classes have asked to use or not use a specific version; I still want to balance the covariates (averaged student characteristics) between versions across all classes taking these constraints into account.
A list of possible cluster memberships would be a straightforward way of specifying the constraints. Empty (NULL) list elements could denote unconstrained cluster selection. For example, with four elements and three clusters, the following list would specify unconstrained cluster selection for elements 1 and 2, constrain element 3 to cluster 2, and allow only clusters 2 or 3 for element 4:
list(
NULL, # unconstrained assignment for element 1
c(1, 2, 3), # unconstrained assignment for element 2 (since we only have 3 clusters)
2, # element 3 fixed to cluster 2
c(2, 3) # element 4 constrained to clusters 2 or 3
)
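To make the proposed format concrete, here is a minimal base R sketch of how such a list could be checked against a cluster assignment, and how an exchange step could test whether a swap respects the constraints. The helper names are hypothetical, and nothing here is part of the anticlust package.

```r
# Constraint list from the example: NULL means "any cluster is allowed"
allowed <- list(NULL, c(1, 2, 3), 2, c(2, 3))

# May `element` be placed in `cluster`?
is_allowed <- function(element, cluster, allowed) {
  candidates <- allowed[[element]]
  is.null(candidates) || cluster %in% candidates
}

# Does a complete assignment respect every constraint?
satisfies_constraints <- function(clusters, allowed) {
  all(vapply(
    seq_along(clusters),
    function(i) is_allowed(i, clusters[i], allowed),
    logical(1)
  ))
}

# Would exchanging the clusters of elements i and j stay feasible?
swap_is_allowed <- function(clusters, i, j, allowed) {
  is_allowed(i, clusters[j], allowed) && is_allowed(j, clusters[i], allowed)
}

clusters <- c(1, 3, 2, 3)
satisfies_constraints(clusters, allowed)   # feasible assignment
swap_is_allowed(clusters, 1, 3, allowed)   # would move element 3 to cluster 1
swap_is_allowed(clusters, 2, 4, allowed)   # both elements stay in cluster 3
```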
Maybe this is already possible somehow but I was unable to figure out how. Also of course, it may well be that this is not possible to implement for some reason. But I still thought it worthwhile to signal that there is demand for this feature (if only from me…).
Finally, thank you for the anticlust package!
Internally, an augmented data set is passed to anticlustering(), and preclustering is then conducted on the basis of the "normal" features plus the additional k-plus variables, which does not make sense. Therefore, kplus_anticlustering() needs to perform preclustering itself before calling anticlustering(). Calling anticlustering(..., objective = "kplus", preclustering = TRUE) works correctly, however (but this is reduced in its functionality because it only considers means and variances and not higher-order moments).
In the following cases, the restriction to equal group sizes is not needed and can be dropped (that is, allow group sizes to deviate by 1):
I received this question via email and share with permission. It is similar to #46 regarding the inclusion of constraints on the cluster membership of items:
I have been using anticlust to assign subjects to groups and the library has been performing very well for me. One use case that I haven’t found a clean solution to is when I need to increase the sample size after I’ve already assigned some subjects to groups. Is there a way to do that? For example, if I have three groups (A, B, C) of 10 subjects and I find that I need to add 10 more subjects in a second round of experiments, is there a way to run anticlustering() with the previous 10 subjects already assigned to A, B, C and have them considered when I add the second round of ten to each group?
What I currently do is just run anticlustering() on the second group as if it were independent and try to make the final assignments manually. Not terribly hard to do, so it’s not a huge issue for me if there isn’t a way to do so (i.e., I wouldn’t make a feature request), but I thought I would ask if there is a method that already exists.
The BILS heuristic sometimes does not return a partition that has an optimal value of the dispersion, even if it is initialized with a partition that has the optimal value (which contradicts the logic of the Pareto set, which must contain a partition if it has the best value on one criterion).
Reproducible example:
data <- structure(c(2L, 2L, 3L, 5L, 1L, 3L, 3L, 2L, 5L, 1L, 4L, 4L, 1L,
3L, 4L, 4L, 1L, 5L, 3L, 4L, 2L, 3L, 2L, 3L, 3L, 1L, 5L, 4L, 4L,
5L, 3L, 2L, 4L, 5L, 2L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 2L,
2L, 4L, 1L, 5L, 5L, 3L, 3L, 5L, 1L, 4L, 2L, 5L, 4L, 5L, 1L, 2L,
3L, 1L, 1L, 3L, 2L, 4L, 5L, 3L, 4L, 5L, 3L, 1L, 5L, 2L, 4L, 2L,
1L, 5L, 2L, 5L, 1L, 1L, 2L, 4L, 2L, 1L, 1L, 1L, 4L, 1L, 3L, 2L,
1L, 1L, 5L, 5L, 4L, 4L, 4L, 5L, 4L, 1L, 3L, 5L, 4L, 2L, 1L, 4L,
1L, 3L, 1L, 3L, 3L, 2L, 3L, 4L, 2L, 1L, 5L, 3L, 4L, 5L, 5L, 4L,
1L, 1L, 4L, 3L, 5L, 2L, 1L, 4L, 4L, 4L, 3L, 2L, 2L, 3L, 5L, 4L,
3L, 3L, 1L, 5L, 5L, 1L, 1L, 1L, 5L, 5L, 4L, 2L, 2L, 4L, 2L, 1L,
3L, 5L, 3L, 1L, 2L, 4L, 4L, 1L, 5L, 1L, 4L, 3L, 4L, 5L, 5L, 4L,
3L, 3L, 2L, 5L, 5L, 1L, 3L, 2L, 3L, 4L, 2L, 5L, 3L, 3L, 2L, 2L,
4L, 2L, 1L, 4L, 1L, 5L, 2L, 5L, 2L, 2L, 3L, 2L, 3L, 3L, 1L, 1L,
5L, 1L, 5L, 1L, 2L, 1L, 3L, 3L, 4L, 2L, 4L, 3L, 1L, 3L, 4L, 2L,
5L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 4L, 5L, 2L, 3L, 1L, 5L, 3L, 2L,
1L, 4L, 4L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 5L,
1L, 3L, 2L, 2L, 4L, 4L, 4L, 5L, 5L, 4L, 4L, 2L, 5L, 2L, 2L, 4L,
5L, 3L, 3L, 2L, 2L, 1L, 3L, 5L, 3L, 5L, 1L, 2L, 4L, 3L, 5L, 5L,
5L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 3L, 1L,
2L, 3L, 4L, 4L, 3L, 4L, 2L, 3L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 5L,
1L, 1L, 1L, 2L, 2L, 4L, 1L, 2L, 1L, 3L, 3L, 1L, 4L, 3L, 5L, 2L,
4L, 2L, 2L, 1L, 1L, 3L, 5L, 5L, 1L, 4L, 2L, 3L, 3L, 2L, 5L, 4L,
1L, 4L, 3L, 5L, 5L, 4L, 5L, 1L, 5L, 4L, 5L, 5L, 5L, 3L, 4L, 5L,
5L, 4L, 4L, 3L, 3L, 4L, 1L, 4L, 2L, 2L, 4L, 1L, 1L, 2L, 4L, 5L,
3L, 1L, 3L, 3L, 2L, 4L, 1L, 3L, 5L, 5L, 5L, 2L, 5L, 5L, 1L, 5L,
1L, 2L, 1L, 1L, 2L, 4L, 5L, 2L, 2L, 2L, 4L, 5L, 2L, 3L, 1L, 4L,
3L, 3L, 3L, 2L, 4L, 4L, 2L, 3L, 1L, 4L, 1L, 1L, 4L, 3L, 5L, 2L,
5L, 2L, 4L, 2L, 2L, 4L, 4L, 1L, 3L, 1L, 3L, 3L, 3L, 5L, 2L, 1L,
5L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 5L, 5L, 2L, 5L, 2L, 3L, 1L,
3L, 3L, 5L, 5L, 2L, 4L, 3L, 5L, 1L, 1L, 5L, 3L, 2L, 5L, 4L, 1L,
5L, 5L, 1L, 1L, 5L, 4L, 5L, 4L, 5L, 5L, 1L, 2L, 5L, 1L, 5L, 4L,
3L, 4L, 3L, 1L, 1L, 1L, 5L, 1L, 4L, 5L, 2L, 1L, 4L, 5L, 3L, 1L,
4L, 4L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 5L, 3L, 4L, 3L, 2L, 2L, 4L,
3L, 2L, 4L, 4L, 5L, 5L, 1L, 5L, 3L, 2L, 1L, 1L, 3L, 2L, 2L, 3L,
5L, 5L, 5L, 4L, 1L, 2L, 4L, 5L, 2L, 4L, 1L, 5L, 4L, 5L, 2L, 5L,
4L, 1L, 2L, 2L, 2L, 5L, 5L, 3L, 2L, 2L, 3L, 3L, 3L, 4L, 1L, 5L,
2L, 1L, 1L, 1L, 5L, 1L, 2L, 4L, 2L, 5L, 2L, 2L, 5L, 4L, 3L, 5L,
3L, 4L, 1L, 4L, 2L, 1L, 5L, 3L, 4L, 4L, 1L), dim = c(120L, 5L
))
# optimal_dispersion(data, K = K)$dispersion # 2.236068
opt_groups <- c(1, 1, 4, 4, 5, 3, 2, 2, 1, 2, 4, 4, 5, 2, 1, 3, 2, 3, 3, 3,
1, 2, 1, 1, 1, 1, 3, 3, 2, 4, 1, 4, 2, 1, 2, 3, 1, 4, 1, 4, 2,
4, 3, 2, 3, 4, 5, 1, 5, 4, 1, 3, 3, 2, 5, 2, 1, 2, 5, 3, 5, 4,
5, 3, 5, 5, 2, 2, 5, 5, 1, 5, 2, 2, 4, 4, 3, 4, 3, 4, 1, 1, 2,
3, 5, 1, 5, 5, 2, 3, 4, 5, 1, 2, 2, 5, 4, 5, 4, 3, 5, 4, 4, 3,
3, 2, 3, 1, 1, 1, 2, 3, 5, 3, 5, 4, 4, 5, 4, 5)
set.seed(12345)
bils_groups <- bicriterion_anticlustering(data, K = opt_groups, R = c(1, 0))
dispersion_objective(data, opt_groups)
# [1] 2.236068
apply(bils_groups, 1, FUN = function(x) dispersion_objective(data, x))
# 1 2 3 5 6
# 1.414214 1.414214 1.414214 1.414214 1.732051