anticlust's People

Contributors

einglasrotwein, m-py, manalama, olivroy, unDocUMeantIt


anticlust's Issues

Merge generic and specialized exchange methods

Right now I have three functions that implement an exchange algorithm: two specialized functions that are speed-optimized for maximizing the k-means and cluster editing objectives, respectively, and one generic version that can maximize any objective function.

This means there is a lot of redundant code; it would be desirable to merge the three functions.

The difficulty is that each of the three functions requires different data structures that are generated and updated throughout the exchange method. I need to test whether the functions can be merged in a reasonable way despite this.
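For reference, a minimal sketch of what the generic variant does (not the package's actual implementation): it takes a user-supplied objective that receives the data and the clustering vector, and it recomputes the full objective for every candidate swap, which is exactly the recomputation the specialized versions avoid:

# One full exchange pass: for each element, try swapping its cluster with every
# element from a different cluster and keep the swap that improves the
# objective the most.
exchange_pass <- function(data, clusters, objective) {
  for (i in seq_len(nrow(data))) {
    current <- objective(data, clusters)
    best_gain <- 0
    best_j <- NA
    for (j in which(clusters != clusters[i])) {
      candidate <- clusters
      candidate[c(i, j)] <- candidate[c(j, i)]  # swap the cluster labels of i and j
      gain <- objective(data, candidate) - current
      if (gain > best_gain) {
        best_gain <- gain
        best_j <- j
      }
    }
    if (!is.na(best_j)) {
      clusters[c(i, best_j)] <- clusters[c(best_j, i)]
    }
  }
  clusters
}

# Example call with the exported variance objective (schaper2019 has 96 rows):
# exchange_pass(schaper2019[, 3:6], sample(rep(1:3, 32)), variance_objective)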

Remove argument `parallelize` from `anticlustering()`

As the exchange method is now the default algorithm (and is strongly recommended over random sampling), it seems excessive to keep a parallelize option for random sampling, so remove it. This also means removing the argument seed, which is only relevant for making parallel random sampling reproducible. Removing two arguments from anticlustering() is good because it has too many right now, and this change will also clean up the code base in general.

Preclustering is broken when objective = "kplus"

Since the additional k-plus variables are appended to the input data very early when objective = "kplus" in anticlustering(), preclustering (i.e., matching()) also uses these variables, which does not make sense and should be fixed.

Replace current preclustering functions

@unDocUMeantIt provided an algorithm for anticlustering that is based on efficiently finding preclusters. From his function centroid_anticlustering(), read out these preclusters and use them as a backend in the balanced_clustering() function when method = "heuristic". My tests indicate that this clustering heuristic is faster and better than any that are currently implemented. This function should also be called when preclustering = TRUE in the anticlustering() function.

This means that I will be able to remove the following functions from the code base: equal_sized_kmeans(), greedy_balanced_k_clustering(), greedy_matching() and any lower level functions that are only called from within these functions.

Remove the argument `standardize`

There is no reason to standardize features within the anticlustering() function; users can do it themselves with a call to scale() before calling anticlustering(). I am not even sure that standardization makes much sense in the context of anticlustering (or at least I have yet to see any advantages).
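For illustration, what users can do instead, using the package's example data (a sketch; scale() standardizes the features before they are passed on):

library(anticlust)
features <- scale(schaper2019[, 3:6])  # standardize beforehand instead of standardize = TRUE
groups <- anticlustering(features, K = 3, objective = "variance")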

Argument `preclustering` should accept a preclustering vector

The preclustering argument should accept a preclustering vector as input, not only TRUE/FALSE. If the input is TRUE, the preclustering is computed within anticlustering(), as before.

If the preclustering argument accepts a clustering vector, this allows more flexibility in combining different methods (e.g., exact matching as preclustering, combined with a random sampling heuristic for anticlustering).
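The requested interface might look like this (hypothetical usage; at the moment preclustering only accepts TRUE/FALSE):

library(anticlust)
features <- schaper2019[, 3:6]
preclusters <- matching(features, p = 2)  # preclusters computed upfront, here via matching()
groups <- anticlustering(features, K = 2, preclustering = preclusters)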

Maximizing dispersion can crash when using default algorithm

This makes the R session crash quite reliably (at least once every ten attempts):

library(anticlust)

N <- 100
K <- N/2

cannot_link <- c(1, rep(2:(N-1), each = 2), N)
cannot_link <- matrix(cannot_link, ncol = 2, byrow = TRUE)
cannot_link <- rbind(cannot_link, t(apply(cannot_link, 1, rev)))
mat <- matrix(1, nrow = N, ncol = N)
mat[cannot_link] <- -1
anticlustering(mat, K = K, objective = "dispersion")

I get

 *** caught segfault ***
address (nil), cause 'unknown'

anticlust stops due to large N

Dear Developers,

First of all, thank you for developing such a useful package in R.

I've run into an issue applying anticlustering to a large data set (N = 295k) consisting of 5 variables (2 numeric, 3 categorical).

Numeric variables: age and duration
Categorical variables: gender (2 levels), riskzone (42 levels) and language (4 levels).

set.seed(98772)
sample_tbl <- sample_tbl %>%
  mutate(group = anticlustering(
    sample_tbl[, c("age", "duration")],
    K = 2,
    categories = sample_tbl[, c("gender", "riskzone", "language")],
    objective = "kplus",
    standardize = TRUE
  ))

After a few seconds I am running into an error:

Error: segfault from C stack overflow

Do you have any input/strategy on how to overcome this issue?

Thanks in advance
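One direction that might help (a sketch added here, not from the original thread): fast_anticlustering() is intended for large data sets because it restricts the number of exchange partners per element via k_neighbours. It optimizes the k-means objective rather than k-plus, and it is assumed here that its categories argument handles the three categorical variables the same way anticlustering() does:

library(anticlust)
library(dplyr)

set.seed(98772)
sample_tbl <- sample_tbl %>%
  mutate(group = fast_anticlustering(
    scale(sample_tbl[, c("age", "duration")]),  # standardize manually
    K = 2,
    k_neighbours = 10,  # each element only considers its 10 nearest neighbours as exchange partners
    categories = sample_tbl[, c("gender", "riskzone", "language")]
  ))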

K-means optimization is incorrect for unequal group sizes

Apparently, with the optimized "local-updating" version of k-means anticlustering, the objective is incorrectly updated when the group sizes are unequal. Better results are obtained when recomputing the entire objective during each iteration. Reproducible example:

library(anticlust)

features <- schaper2019[, 3:6]

K <- 3
init <- sample(rep(1:3, nrow(schaper2019) * c(1/4, 1/4, 1/2)))

anticlusters <- anticlustering(
  features,
  K = init,
  objective = variance_objective,
  categories = schaper2019$room
)

mean_sd_tab(features, anticlusters)
# rating_consistent rating_inconsistent syllables     frequency     
# 1 "4.49 (0.24)"     "1.10 (0.07)"       "3.42 (1.10)" "18.33 (2.43)"
# 2 "4.49 (0.25)"     "1.10 (0.07)"       "3.42 (0.72)" "18.29 (2.24)"
# 3 "4.49 (0.25)"     "1.10 (0.06)"       "3.42 (0.94)" "18.31 (2.49)"

anticlusters <- anticlustering(
  features,
  K = init,
  objective = "variance",
  categories = schaper2019$room
)

mean_sd_tab(features, anticlusters)
# rating_consistent rating_inconsistent syllables     frequency     
# 1 "4.46 (0.24)"     "1.11 (0.07)"       "3.79 (1.10)" "19.75 (2.83)"
# 2 "4.51 (0.26)"     "1.11 (0.06)"       "2.96 (0.75)" "17.38 (1.74)"
# 3 "4.50 (0.24)"     "1.10 (0.07)"       "3.46 (0.82)" "18.06 (2.13)"

Speed-optimize exchange method for objective = "distance"

Now that the exchange method is the default option for anticlustering, it is desirable that the distance objective is computed faster. Instead of recomputing all within-cluster distances from scratch, do something like the following (a rough sketch follows the list):

  1. Store the distance matrix and use indexing to read the relevant distance after each swap
  2. To read the relevant distances, store a boolean matrix where the entry [i,j] is TRUE whenever the elements i and j are part of the same cluster. After a swap, swap the columns and rows for the elements i and j (because they just exchange their cluster partners), but also set the entries [i, j] and [j, i] to FALSE (exchange partners are not part of the same cluster).
  3. To compute the objective, use the boolean matrix (with a restriction to the upper or lower triangular part) on the distance matrix and call sum.
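A plain R sketch of this book-keeping (the real implementation would live in the optimized exchange code; the boolean matrix corresponds to step 2, the objective computation to step 3):

library(anticlust)

distances <- as.matrix(dist(schaper2019[, 3:6]))
clusters <- sample(rep(1:3, length.out = nrow(distances)))

# Boolean matrix: TRUE whenever elements i and j are in the same cluster
same_cluster <- outer(clusters, clusters, "==")
diag(same_cluster) <- FALSE

# Objective: sum of within-cluster distances, restricted to the upper triangle
objective <- sum(distances[same_cluster & upper.tri(distances)])

# Update of the boolean matrix after swapping elements i and j between clusters
update_after_swap <- function(same_cluster, i, j) {
  same_cluster[c(i, j), ] <- same_cluster[c(j, i), ]  # i and j inherit each other's cluster partners
  same_cluster[, c(i, j)] <- same_cluster[, c(j, i)]
  same_cluster[i, j] <- same_cluster[j, i] <- FALSE   # exchange partners end up in different clusters
  same_cluster
}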

Accommodating NA values conditional on a categorical variable

Thanks for a great package - it has been fantastic for balancing stimuli sets in complex experiments. I was wondering if there's any way to include a variable with NA values conditional on a categorical variable. At the moment NA values are not permitted (understandable). Something like:

library(tidyverse)
library(anticlust)
df <- mtcars |> 
  mutate(hp = ifelse(vs == 0, NA, hp)) |> 
  select(mpg, disp, hp, vs)

anticlustering(
  df[,1:3],
  K = c(9, 9, 9, 5),
  objective = "variance",
  categories = df$vs
)

Right now, my best idea is to do the anticlustering separately for each category (in this case, each level of vs) and then combine the data into the final groups, but I was wondering if there are any other ways.
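For what it's worth, a rough sketch of that per-category workaround, continuing from the df created above (not an anticlust feature; within each level the group sizes are made as equal as possible, so they will not match the original K = c(9, 9, 9, 5)):

df$group <- NA_integer_
for (level in unique(df$vs)) {
  idx <- df$vs == level
  feats <- df[idx, c("mpg", "disp", "hp")]
  feats <- feats[, colSums(is.na(feats)) == 0, drop = FALSE]  # drop variables that are NA in this level
  n_level <- sum(idx)
  sizes <- rep(n_level %/% 4, 4) + (seq_len(4) <= n_level %% 4)  # 4 groups, sizes as equal as possible
  df$group[idx] <- anticlustering(feats, K = sizes, objective = "variance")
}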

Cheers -

Non-standard evaluation

Maybe, at some point, anticlustering() should also be callable similarly to the following way:

anticlustering(
  iris,
  numeric_vars = c(Sepal.Length, Sepal.Width),
  categorical_vars = Species,
  K = 3
)

That is, the first argument would be a generic data argument containing the entire data frame that users work with, and users would then specify only the column names that select the numeric and categorical variables. It would probably just require adding the arguments numeric_vars and categorical_vars to anticlustering(), testing whether they were specified, and then using non-standard evaluation to extract the relevant data from the first argument. This would also integrate better with a tidyverse workflow. All of this does not make sense if the data input is a distance matrix, which still has to be supported.

Currently, we would have to use the following, which may be less appealing to users:

anticlustering(
  iris[, c("Sepal.Length", "Sepal.Width")],
  categories = iris$Species,
  K = 3
)

Feature request: fix/constrain cluster assignment in anticlustering()

For my application of anticlust it would be very useful if assignment of individual elements to clusters could be fixed or constrained a priori in anticlustering(). Instead of considering all K clusters for the constrained element, the algorithm would consider only a specific subset of clusters.

My use case is the assignment of versions of a psychological test to school classes during field testing. A small subset of classes have asked to use or not use a specific version; I still want to balance the covariates (averaged student characteristics) between versions across all classes taking these constraints into account.

A list of possible cluster memberships would be a straightforward way of specifying the constraints. Empty (NULL) list elements could denote unconstrained cluster selection. For example, with four elements and three clusters, the following list would specify unconstrained cluster selection for elements 1 and 2, constrain element 3 to cluster 2, and allow only clusters 2 or 3 for element 4:

list(
  NULL,       # unconstrained assignment for element 1
  c(1, 2, 3), # unconstrained assignment for element 2 (since we only have 3 clusters)
  2,          # element 3 fixed to cluster 2
  c(2, 3)     # element 4 constrained to clusters 2 or 3
)
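For instance, an exchange algorithm could use such a list to veto illegal swaps; a hypothetical helper (not part of anticlust) might look like this:

# Check whether swapping elements i and j between their current clusters
# respects the constraint list (NULL means unconstrained)
swap_allowed <- function(constraints, clusters, i, j) {
  ok <- function(element, new_cluster) {
    allowed <- constraints[[element]]
    is.null(allowed) || new_cluster %in% allowed
  }
  ok(i, clusters[j]) && ok(j, clusters[i])
}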

Maybe this is already possible somehow but I was unable to figure out how. Also of course, it may well be that this is not possible to implement for some reason. But I still thought it worthwhile to signal that there is demand for this feature (if only from me…).

Finally, thank you for the anticlust package!

kplus_anticlustering() does not correctly work with preclustering = TRUE

Internally, an augmented data set is passed to anticlustering(), and preclustering is then conducted on the basis of the "normal" features plus the additional k-plus variables, which does not make sense. Therefore, kplus_anticlustering() needs to perform the preclustering itself before calling anticlustering(). Calling anticlustering(..., objective = "kplus", preclustering = TRUE), however, works correctly (but it is reduced in functionality because it only considers means and variances and not higher-order moments).

Equal group size

In the following cases, the restriction to equal group sizes is not needed and can be dropped (that is: allow group sizes to deviate by 1):

  • unrestricted random sampling
  • categorical random sampling
  • (not for preclustered random sampling)

Adding Elements to Existing Groups

I received this question via email and share it with permission. It is similar to #46 regarding the inclusion of constraints on the cluster membership of items:

I have been using anticlust to assign subjects to groups and the library has been performing very well for me. One use case that I haven’t found a clean solution to is when I need to increase the sample size after I’ve already assigned some subjects to groups. Is there a way to do that? For example, if I have three groups (A, B, C) of 10 subjects and I find that I need to add 10 more subjects in a second round of experiments, is there a way to run anticlustering() with the previous 10 subjects already assigned to A, B, C and have them considered when I add the second round of ten to each group?

What I currently do is just run anticlustering() on the second group as if it were independent and try to make the final assignments manually. Not terribly hard to do, so it’s not a huge issue for me if there isn’t a way to do so (i.e., I wouldn’t make a feature request), but I thought I would ask if there is a method that already exists.

BILS heuristic sometimes discards optimal partition from pareto set

The BILS heuristic sometimes does not return a partition with the optimal value of the dispersion, even if it is initialized with a partition that has the optimal value. This contradicts the logic of the Pareto set: a partition that has the best value on one criterion must be part of the Pareto set.

Reproducible example:

library(anticlust)

data <- structure(c(2L, 2L, 3L, 5L, 1L, 3L, 3L, 2L, 5L, 1L, 4L, 4L, 1L, 
                    3L, 4L, 4L, 1L, 5L, 3L, 4L, 2L, 3L, 2L, 3L, 3L, 1L, 5L, 4L, 4L, 
                    5L, 3L, 2L, 4L, 5L, 2L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 
                    2L, 4L, 1L, 5L, 5L, 3L, 3L, 5L, 1L, 4L, 2L, 5L, 4L, 5L, 1L, 2L, 
                    3L, 1L, 1L, 3L, 2L, 4L, 5L, 3L, 4L, 5L, 3L, 1L, 5L, 2L, 4L, 2L, 
                    1L, 5L, 2L, 5L, 1L, 1L, 2L, 4L, 2L, 1L, 1L, 1L, 4L, 1L, 3L, 2L, 
                    1L, 1L, 5L, 5L, 4L, 4L, 4L, 5L, 4L, 1L, 3L, 5L, 4L, 2L, 1L, 4L, 
                    1L, 3L, 1L, 3L, 3L, 2L, 3L, 4L, 2L, 1L, 5L, 3L, 4L, 5L, 5L, 4L, 
                    1L, 1L, 4L, 3L, 5L, 2L, 1L, 4L, 4L, 4L, 3L, 2L, 2L, 3L, 5L, 4L, 
                    3L, 3L, 1L, 5L, 5L, 1L, 1L, 1L, 5L, 5L, 4L, 2L, 2L, 4L, 2L, 1L, 
                    3L, 5L, 3L, 1L, 2L, 4L, 4L, 1L, 5L, 1L, 4L, 3L, 4L, 5L, 5L, 4L, 
                    3L, 3L, 2L, 5L, 5L, 1L, 3L, 2L, 3L, 4L, 2L, 5L, 3L, 3L, 2L, 2L, 
                    4L, 2L, 1L, 4L, 1L, 5L, 2L, 5L, 2L, 2L, 3L, 2L, 3L, 3L, 1L, 1L, 
                    5L, 1L, 5L, 1L, 2L, 1L, 3L, 3L, 4L, 2L, 4L, 3L, 1L, 3L, 4L, 2L, 
                    5L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 4L, 5L, 2L, 3L, 1L, 5L, 3L, 2L, 
                    1L, 4L, 4L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 5L, 
                    1L, 3L, 2L, 2L, 4L, 4L, 4L, 5L, 5L, 4L, 4L, 2L, 5L, 2L, 2L, 4L, 
                    5L, 3L, 3L, 2L, 2L, 1L, 3L, 5L, 3L, 5L, 1L, 2L, 4L, 3L, 5L, 5L, 
                    5L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 
                    2L, 3L, 4L, 4L, 3L, 4L, 2L, 3L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 5L, 
                    1L, 1L, 1L, 2L, 2L, 4L, 1L, 2L, 1L, 3L, 3L, 1L, 4L, 3L, 5L, 2L, 
                    4L, 2L, 2L, 1L, 1L, 3L, 5L, 5L, 1L, 4L, 2L, 3L, 3L, 2L, 5L, 4L, 
                    1L, 4L, 3L, 5L, 5L, 4L, 5L, 1L, 5L, 4L, 5L, 5L, 5L, 3L, 4L, 5L, 
                    5L, 4L, 4L, 3L, 3L, 4L, 1L, 4L, 2L, 2L, 4L, 1L, 1L, 2L, 4L, 5L, 
                    3L, 1L, 3L, 3L, 2L, 4L, 1L, 3L, 5L, 5L, 5L, 2L, 5L, 5L, 1L, 5L, 
                    1L, 2L, 1L, 1L, 2L, 4L, 5L, 2L, 2L, 2L, 4L, 5L, 2L, 3L, 1L, 4L, 
                    3L, 3L, 3L, 2L, 4L, 4L, 2L, 3L, 1L, 4L, 1L, 1L, 4L, 3L, 5L, 2L, 
                    5L, 2L, 4L, 2L, 2L, 4L, 4L, 1L, 3L, 1L, 3L, 3L, 3L, 5L, 2L, 1L, 
                    5L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 5L, 5L, 2L, 5L, 2L, 3L, 1L, 
                    3L, 3L, 5L, 5L, 2L, 4L, 3L, 5L, 1L, 1L, 5L, 3L, 2L, 5L, 4L, 1L, 
                    5L, 5L, 1L, 1L, 5L, 4L, 5L, 4L, 5L, 5L, 1L, 2L, 5L, 1L, 5L, 4L, 
                    3L, 4L, 3L, 1L, 1L, 1L, 5L, 1L, 4L, 5L, 2L, 1L, 4L, 5L, 3L, 1L, 
                    4L, 4L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 5L, 3L, 4L, 3L, 2L, 2L, 4L, 
                    3L, 2L, 4L, 4L, 5L, 5L, 1L, 5L, 3L, 2L, 1L, 1L, 3L, 2L, 2L, 3L, 
                    5L, 5L, 5L, 4L, 1L, 2L, 4L, 5L, 2L, 4L, 1L, 5L, 4L, 5L, 2L, 5L, 
                    4L, 1L, 2L, 2L, 2L, 5L, 5L, 3L, 2L, 2L, 3L, 3L, 3L, 4L, 1L, 5L, 
                    2L, 1L, 1L, 1L, 5L, 1L, 2L, 4L, 2L, 5L, 2L, 2L, 5L, 4L, 3L, 5L, 
                    3L, 4L, 1L, 4L, 2L, 1L, 5L, 3L, 4L, 4L, 1L), dim = c(120L, 5L
                    ))
# optimal_dispersion(data, K = 5)$dispersion # 2.236068
opt_groups <- c(1, 1, 4, 4, 5, 3, 2, 2, 1, 2, 4, 4, 5, 2, 1, 3, 2, 3, 3, 3, 
  1, 2, 1, 1, 1, 1, 3, 3, 2, 4, 1, 4, 2, 1, 2, 3, 1, 4, 1, 4, 2, 
  4, 3, 2, 3, 4, 5, 1, 5, 4, 1, 3, 3, 2, 5, 2, 1, 2, 5, 3, 5, 4, 
  5, 3, 5, 5, 2, 2, 5, 5, 1, 5, 2, 2, 4, 4, 3, 4, 3, 4, 1, 1, 2, 
  3, 5, 1, 5, 5, 2, 3, 4, 5, 1, 2, 2, 5, 4, 5, 4, 3, 5, 4, 4, 3, 
  3, 2, 3, 1, 1, 1, 2, 3, 5, 3, 5, 4, 4, 5, 4, 5)

set.seed(12345)
bils_groups <- bicriterion_anticlustering(data, K = opt_groups, R = c(1, 0))
dispersion_objective(data, opt_groups)
# [1] 2.236068
apply(bils_groups, 1, FUN = function(x) dispersion_objective(data, x))
#        1        2        3        5        6 
# 1.414214 1.414214 1.414214 1.414214 1.732051 
