mlr3cluster's Introduction

mlr3cluster

Package website: release | dev

Cluster analysis for mlr3.

mlr3cluster is an extension package for cluster analysis within the mlr3 ecosystem. It is the successor to the clustering capabilities of mlr2.

Installation

Install the latest release from CRAN:

install.packages("mlr3cluster")

Install the development version from GitHub:

# install.packages("pak")
pak::pak("mlr-org/mlr3cluster")

Feature Overview

The current version of mlr3cluster contains:

  • A selection of 24 clustering learners that represent a wide variety of clusterers: partitional, hierarchical, fuzzy, etc.
  • A selection of 4 performance measures
  • Two built-in tasks to get started with clustering

Also, the package is integrated with mlr3viz which enables you to create great visualizations with just one line of code!
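
As a rough illustration of the mlr3viz integration, the sketch below plots a k-means prediction with `autoplot()`. It assumes mlr3viz is installed along with whatever plotting packages the chosen plot type needs; `centers = 3L` is an arbitrary choice for the example.

```r
library(mlr3)
library(mlr3cluster)
library(mlr3viz)

task = tsk("usarrests")
learner = lrn("clust.kmeans", centers = 3L)  # 3 clusters chosen arbitrarily
prediction = learner$train(task)$predict(task)

# One call produces the plot; the task supplies the feature values.
plot = autoplot(prediction, task)
print(plot)
```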

Cluster Analysis

Cluster Learners

| Key | Label | Packages |
|---|---|---|
| clust.MBatchKMeans | Mini Batch K-Means | ClusterR |
| clust.SimpleKMeans | K-Means (Weka) | RWeka |
| clust.agnes | Agglomerative Hierarchical Clustering | cluster |
| clust.ap | Affinity Propagation Clustering | apcluster |
| clust.bico | BICO Clustering | stream |
| clust.birch | BIRCH Clustering | stream |
| clust.cmeans | Fuzzy C-Means Clustering Learner | e1071 |
| clust.cobweb | Cobweb Clustering | RWeka |
| clust.dbscan | Density-Based Clustering | dbscan |
| clust.dbscan_fpc | Density-Based Clustering with fpc | fpc |
| clust.diana | Divisive Hierarchical Clustering | cluster |
| clust.em | Expectation-Maximization Clustering | RWeka |
| clust.fanny | Fuzzy Analysis Clustering | cluster |
| clust.featureless | Featureless Clustering | |
| clust.ff | Farthest First Clustering | RWeka |
| clust.hclust | Agglomerative Hierarchical Clustering | stats |
| clust.hdbscan | HDBSCAN Clustering | dbscan |
| clust.kkmeans | Kernel K-Means | kernlab |
| clust.kmeans | K-Means | stats, clue |
| clust.mclust | Gaussian Mixture Models Clustering | mclust |
| clust.meanshift | Mean Shift Clustering | LPCM |
| clust.optics | OPTICS Clustering | dbscan |
| clust.pam | Partitioning Around Medoids | cluster |
| clust.xmeans | X-means | RWeka |

Cluster Measures

| Key | Label | Packages |
|---|---|---|
| clust.ch | Calinski Harabasz | fpc |
| clust.dunn | Dunn | fpc |
| clust.silhouette | Silhouette | cluster |
| clust.wss | Within Sum of Squares | fpc |

Example

library(mlr3)
library(mlr3cluster)

task = tsk("usarrests")
learner = lrn("clust.kmeans")
learner$train(task)
prediction = learner$predict(task = task)
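
The prediction can then be inspected and scored. The following is a minimal sketch, assuming the `cluster` package (needed by `clust.silhouette`) is installed; `centers = 3L` is an illustrative value, not a recommendation. Cluster measures need the task as well as the prediction, so it is passed explicitly.

```r
library(mlr3)
library(mlr3cluster)

task = tsk("usarrests")
learner = lrn("clust.kmeans", centers = 3L)
learner$train(task)
prediction = learner$predict(task)

# Cluster sizes from the hard assignments.
table(prediction$partition)

# Score with one of the measures listed above.
prediction$score(msr("clust.silhouette"), task = task)
```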

More Resources

Check out the blogpost for a more detailed introduction to the package. Also, mlr3book has a section on clustering.

Future Plans

  • Add more learners and measures
  • Integrate the package with mlr3pipelines (work in progress)

If you have any questions, feedback or ideas, feel free to open an issue here.

mlr3cluster's People

Contributors

be-marc, damirpolat, dependabot[bot], github-actions[bot], henrifnk, m-muecke, mb706, mllg, pat-s, sebffischer

mlr3cluster's Issues

Affinity Propagation cannot be constructed without having the package installed

  s = p_uty(default = apcluster::negDistMat(r = 2L), tags = c("required", "train")),
  p = p_uty(custom_check = function(x) {
    if (test_numeric(x)) {
      return(TRUE)
    } else {
      stop("`p` needs to be a numeric vector")
    }
  }, default = NA, tags = "train"),
  q = p_dbl(lower = 0L, upper = 1L, tags = "train"),
  maxits = p_int(lower = 1L, default = 1000L, tags = "train"),
  convits = p_int(lower = 1L, default = 100L, tags = "train"),
  lam = p_dbl(lower = 0.5, upper = 1L, default = 0.9, tags = "train"),
  includeSim = p_lgl(default = FALSE, tags = "train"),
  details = p_lgl(default = FALSE, tags = "train"),
  nonoise = p_lgl(default = FALSE, tags = "train"),
  seed = p_int(tags = "train")
)
ps$values = list(s = apcluster::negDistMat(r = 2L))

Learners should be constructable without having their packages installed.
This is important because otherwise the mlr_learners -> Dictionary conversion fails when someone does not have the apcluster package installed.
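
One way to picture the requested behaviour is to defer every `apcluster::` lookup until training. The sketch below is plain base R and not the mlr3cluster implementation; `make_ap_learner` and `default_similarity` are hypothetical names used only for illustration.

```r
# Hypothetical stand-in for a learner constructor (not the mlr3cluster code).
make_ap_learner = function() {
  list(
    packages = "apcluster",
    # The default similarity is wrapped in a function, so `apcluster::` is
    # only evaluated when training actually calls it.
    default_similarity = function() {
      if (!requireNamespace("apcluster", quietly = TRUE)) {
        stop("Package 'apcluster' must be installed to train this learner")
      }
      apcluster::negDistMat(r = 2L)
    }
  )
}

# Construction never touches the apcluster namespace.
learner = make_ap_learner()
learner$packages
```

Because R evaluates a function body only when the function is called, constructing the object works on a machine without apcluster, and the missing package is reported only at training time.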

Make labels consistent and add potential clustering type

Currently the naming of the labels is inconsistent. Some have "clustering" or "learner" appended and some don't.
It would also be nice to have some sort of labelling for the clustering type:

  • Deterministic (hard clustering):

    • Hierarchical classification methods: Agglomerative & divisive methods
    • Optimal partitions: k-means & variants
  • Density-based methods: DBSCAN & variants

  • Probabilistic (fuzzy clustering):

    • Model-based methods: Mixed distribution models
    • fuzzy k-means & variants

    This is done with properties. The roxygen documentation (man pages) is not correct for this, and some entries require updating.
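
To see the inconsistency and the existing properties side by side, labels and properties can be read directly from constructed learners. This is a small sketch, assuming a version of mlr3 in which learners expose a `$label` field; the three keys are an arbitrary sample from the learner table.

```r
library(mlr3)
library(mlr3cluster)

keys = c("clust.kmeans", "clust.hclust", "clust.cmeans")

# Collect label and properties for each learner key.
info = lapply(keys, function(key) {
  learner = lrn(key)
  data.frame(
    key = key,
    label = learner$label,
    properties = paste(learner$properties, collapse = ", ")
  )
})
overview = do.call(rbind, info)
print(overview)
```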

dbscan causes terminal to crash

Description

learner$predict(task) causes the terminal to crash when eps is much lower than optimal. Tested in the default console, Radian, and Posit Cloud.

Reproducible example

Steps to reproduce:

library(mlr3)
library(mlr3cluster)

task = mlr_tasks$get("usarrests")
learner = lrn("clust.dbscan", eps = 0.5, minPts = 5) # optimal eps between 30 and 40
learner$train(task)
preds = learner$predict(task = task)
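
To judge how far `eps = 0.5` is from a workable value, a common heuristic is to look at each point's distance to its k-th nearest neighbour. The sketch below uses only base R and assumes that `tsk("usarrests")` holds the same four columns as the built-in `USArrests` data set; `k = 5` simply mirrors `minPts` from the report.

```r
# USArrests ships with base R (package 'datasets').
x = as.matrix(USArrests)
d = as.matrix(dist(x))  # pairwise Euclidean distances on unscaled features

k = 5
# After sorting a row, position 1 is the point itself (distance 0),
# so position k + 1 is the distance to the k-th nearest neighbour.
knn_dist = apply(d, 1, function(row) sort(row)[k + 1])

summary(knn_dist)
```

If every k-nearest-neighbour distance is far above `eps`, no point has enough neighbours to be a core point and everything is labelled as noise.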

Session info:

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlr3cluster_0.1.8 mlr3_0.15.0

loaded via a namespace (and not attached):
 [1] compiler_4.2.2       DEoptimR_1.0-11      mlr3misc_0.11.0      class_7.3-20         tools_4.2.2          prabclus_2.3-2       digest_0.6.31
 [8] mclust_6.0.0         uuid_1.1-0           jsonlite_1.8.4       checkmate_2.1.0      clue_0.3-64          lattice_0.20-45      rlang_1.1.0
[15] cli_3.6.1            parallel_4.2.2       cluster_2.1.4        globals_0.16.2       fpc_2.2-10           stats4_4.2.2         diptest_0.76-0
[22] grid_4.2.2           nnet_7.3-18          robustbase_0.95-0    data.table_1.14.8    listenv_0.9.0        R6_2.5.1             flexmix_2.3-19
[29] parallelly_1.35.0    kernlab_0.9-32       lgr_0.4.4            backports_1.4.1      codetools_0.2-18     modeltools_0.2-23    palmerpenguins_0.1.1
[36] MASS_7.3-58.1        future_1.32.0        paradox_0.11.1       crayon_1.5.2

library error

I installed mlr3cluster via install.packages("mlr3cluster") without error. However, an error occurs when I load it with library():

Error: package or namespace load failed for ‘mlr3cluster’:
 .onLoad failed in loadNamespace() for 'mlr3cluster', details:
  call: rbindlist(l, use.names, fill, idcol)
  error: Item 2 has 6 columns, inconsistent with item 1 which has 7 columns. To fill missing columns use fill=TRUE.

I have tried removing and reinstalling the package, but to no avail. I also tried installing from source from GitHub; it still does not work, and a different error is raised:
Error: package or namespace load failed for ‘mlr3cluster’ in get(Info[i, 1], envir = env):
 lazy-load database 'D:/R-4.1.0/library/mlr3cluster/R/mlr3cluster.rdb' is corrupt
In addition: Warning message:
In get(Info[i, 1], envir = env) : internal error -3 in R_decompress1
How could I fix it?

Warning: `predict_MBatchKMeans()` was deprecated in ClusterR 1.3.0.

Getting the following warning in the tests:

Warning (test_mlr_learners_clust_mbatchkmeans.R:38:7): Learner properties are respected
`predict_MBatchKMeans()` was deprecated in ClusterR 1.3.0.
i Beginning from version 1.4.0, if the fuzzy parameter is TRUE the function 'predict_MBatchKMeans' will return only the probabilities, whereas currently it also returns the hard clusters

[Feature request] Add Gaussian Mixture Models (Hard and fuzzy clustering)

Consider adding Gaussian Mixture Models (GMM) with the option to do both hard and fuzzy clustering. See e.g., https://brilliant.org/wiki/gaussian-mixture-model/.

An alternative implementation is available, for example, in the ClusterR package (https://cran.r-project.org/web/packages/ClusterR/vignettes/the_clusterR_package.html).

Having this algorithm in mlr3cluster would complete the set of fundamental clustering algorithms and thus remove the need to rely on the routines of another package when applying GMM.
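
The learner table in the README lists `clust.mclust` (Gaussian Mixture Models Clustering, backed by mclust). As a quick check of what that learner offers, the sketch below constructs it and prints its label and predict types; whether a probabilistic predict type is listed depends on the installed version, so no particular output is assumed here.

```r
library(mlr3)
library(mlr3cluster)

learner = lrn("clust.mclust")

learner$label
# Inspect which predict types the learner supports before relying on one.
learner$predict_types
```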

Measures that rely on Tasks do not work for Pipelines

Simple Example:

library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

task = tsk("usarrests")
# one scale -> k-means graph per number of centers
kmeans_centers = lapply(1:10, function(x) po("scale") %>>% lrn("clust.kmeans", centers = x))
design = benchmark_grid(
  tasks = task,
  learners = kmeans_centers,
  resamplings = rsmp("insample")
)
bmr = benchmark(design)
bmr$score(msr("clust.wss"))$clust.wss

This produces output like

[1] 355807.82 114846.81 81862.19 79208.07 70152.06 68255.12 68148.43 63241.63 54304.11 43632.32

The WSS values are obviously too high to have been computed on scaled data.

The problem lies in MeasureClustInternal, which takes the "raw" task, without any preprocessing, to obtain the features.
I think this is probably an issue that only mlr3cluster suffers from, as all other measures depend only on the predictions ...?

private = list(
  .score = function(prediction, task, ...) {
    X = as.matrix(task$data(rows = prediction$row_ids))
    if (!is.double(X)) { # clusterCrit does not convert lgls/ints
      storage.mode(X) = "double"
    }
    intCriteria(X, prediction$partition, self$crit)[[1L]]
  }
)

This could be avoided if there were generic access to the preprocessed task in the pipeline.
In that case, one could replace the task in the function with the learner itself.
The problem is that when I inspect the state of a trained pipeline, the stored preprocessed tasks are empty...
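
Until the measure can see the preprocessed data, one possible workaround is to run the scaling step by hand and pass the scaled task to the measure explicitly. This is a sketch of that idea, not a fix for the benchmark case; it assumes the `fpc` package (needed by `clust.wss`) is installed, and `centers = 3L` is an arbitrary choice.

```r
library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

task = tsk("usarrests")

# Train the scaling PipeOp on its own and keep the transformed task.
task_scaled = po("scale")$train(list(task))[[1L]]

learner = lrn("clust.kmeans", centers = 3L)
learner$train(task_scaled)
prediction = learner$predict(task_scaled)

# The measure now computes WSS on the scaled features.
prediction$score(msr("clust.wss"), task = task_scaled)
```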

Suggest adding two clustering algorithms: consistency clustering and non-negative matrix factorization.

I am a bioinformatics PhD and I really appreciate your mlr3cluster package. This package provides many unsupervised clustering algorithms. However, I regret to find that the two most commonly used algorithms in bioinformatics analysis, consistency clustering and non-negative matrix factorization, are not included in this package. These two algorithms are widely used in the medical and biological fields. If these two algorithms are added to the package, the application scope will be greatly increased. I also hope to cite this package in my upcoming doctoral thesis. Thank you very much.

Release mlr3cluster 0.1.6

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

Using graphs with clustering tasks

Hey guys,

What's wrong with this task? I can get it working with an included task (e.g., task = mlr_tasks$get("usarrests")), but the reprex below fails:

# metapackage
library(mlr3verse)

# task creation
task = TaskClust$new(
    id = "cars",
    backend = subset(
        mtcars,
        select = c(
            mpg,
            cyl,
            hp
        )
    )
)

# learner
learner = lrn("clust.kmeans")

# graph
graph = po("scale") %>>%
    po("learner", learner)

# convert graph to learner
glrn = as_learner(graph)
#> Error in .__GraphLearner__initialize(self = self, private = private, super = super, : 'graph' output type not 'Prediction' (or compatible with it)

SO question
