Comments (4)
My first guess is that in many places we need information stored in the model.frame or the model.matrix. While the model.frame is usually stored within the model results, this is not the case for the model.matrix. Therefore, it may be worthwhile to compute the model.matrix once and attach it (at least temporarily) so that it can be reused between the different steps. This needs to be tested.
from broom.helpers.
I've never used it personally, but perhaps the memoise package could be useful. It caches results so that the calculation is performed only on the first call, and each subsequent call retrieves the result from the cache.
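As a rough illustration of that idea (not code from this thread), the sketch below wraps a function with memoise::memoise() and counts how often the underlying computation actually runs. The names slow_model_matrix, cached_model_matrix and n_calls are hypothetical, and the built-in mtcars data stands in for a real dataset.

```r
library(memoise)

# Hypothetical counter: tracks how often the real computation runs
n_calls <- 0

# Hypothetical "expensive" function: builds a model matrix from a formula string
slow_model_matrix <- function(formula_chr) {
  n_calls <<- n_calls + 1
  stats::model.matrix(stats::as.formula(formula_chr), data = mtcars)
}

# memoise() returns a version of the function that caches results by argument
cached_model_matrix <- memoise::memoise(slow_model_matrix)

mm1 <- cached_model_matrix("mpg ~ wt + factor(cyl)")  # computed
mm2 <- cached_model_matrix("mpg ~ wt + factor(cyl)")  # served from the cache

n_calls              # number of times the underlying function ran
identical(mm1, mm2)  # both calls return the same matrix
```

The cache key here is a plain character string, which keeps the example simple. Whether memoising on a whole model object would behave as well is not something this sketch checks.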
Please have a look at #255, which implements caching of the model frame and the model matrix on the model object. Setting the argument model_matrix_attr = FALSE deactivates this caching.
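For readers unfamiliar with the approach, here is a minimal base R sketch of caching a model matrix as an attribute of the model object. It is an assumption-laden illustration, not the code from #255: attach_model_matrix(), get_model_matrix() and the attribute name "cached_model_matrix" are all hypothetical.

```r
# Hypothetical helpers, NOT the broom.helpers implementation:
# compute the model matrix once and store it as an attribute of the model.
attach_model_matrix <- function(model) {
  attr(model, "cached_model_matrix") <- stats::model.matrix(model)
  model
}

# Return the cached matrix when present, otherwise recompute it.
get_model_matrix <- function(model) {
  cached <- attr(model, "cached_model_matrix")
  if (!is.null(cached)) {
    return(cached)
  }
  stats::model.matrix(model)
}

m <- lm(mpg ~ wt + factor(cyl), data = mtcars)
m_cached <- attach_model_matrix(m)

# Both paths give the same matrix; only the second one recomputes it
identical(get_model_matrix(m_cached), get_model_matrix(m))
```

The trade-off is memory: the cached matrix travels with the model object, which is presumably why an option to switch the caching off is useful for very large models.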
@ddsjoberg I didn't use memoise to avoid additional dependencies. Do you think it would be better to rely on it?
See below a quick benchmark before and after. @lucasxteixeira using this new version and skipping the computation of the number of observations per modality (i.e. add_n = FALSE), you can save some time (run time divided by more than 4 in this example).
library(broom.helpers)
set.seed(42)
size <- 5000000
df <- data.frame(
  y = rnorm(size),
  x1 = rnorm(size),
  x2 = sample(c(TRUE, FALSE), size, replace = TRUE),
  x3 = factor(sample(c("A", "B", "C"), size, replace = TRUE))
)
m <- lm(y ~ x1 + x2 + x3, data = df)
# without caching model matrix
s0 <- m |>
  tidy_and_attach(model_matrix_attr = FALSE)
s1 <- s0 |>
  tidy_identify_variables()
s2 <- s1 |>
  tidy_add_contrasts()
s3 <- s2 |>
  tidy_add_reference_rows()
s4 <- s3 |>
  tidy_add_estimate_to_reference_rows()
s5 <- s4 |>
  tidy_add_term_labels()
s6 <- s5 |>
  tidy_add_header_rows()
s7 <- s6 |>
  tidy_add_n()
b <- microbenchmark::microbenchmark(
  m |> tidy_and_attach(),
  s0 |> tidy_identify_variables(),
  s1 |> tidy_add_contrasts(),
  s2 |> tidy_add_reference_rows(),
  s3 |> tidy_add_estimate_to_reference_rows(),
  s4 |> tidy_add_term_labels(),
  s5 |> tidy_add_header_rows(),
  s6 |> tidy_add_n(),
  times = 5
)
b
#> Unit: milliseconds
#>                                     expr       min        lq       mean
#>                       tidy_and_attach(m) 1308.1122 1342.4657 1511.52028
#>              tidy_identify_variables(s0)  891.9426  892.2176 1013.81794
#>                   tidy_add_contrasts(s1)  906.9803  950.8952 1199.12334
#>              tidy_add_reference_rows(s2) 1740.8123 1914.9015 2304.18500
#>  tidy_add_estimate_to_reference_rows(s3)    2.4704    2.5295    2.83704
#>                 tidy_add_term_labels(s4) 1770.3298 1904.9071 2170.40976
#>                 tidy_add_header_rows(s5)   18.3213   21.1508   21.79842
#>                           tidy_add_n(s6) 4908.2289 5417.0192 5593.61400
#>     median        uq       max neval
#>  1369.7221 1452.1779 2085.1235     5
#>  1003.9495 1091.5755 1189.4045     5
#>  1124.5302 1486.5224 1526.6886     5
#>  2298.6875 2490.7164 3075.8073     5
#>     2.9398    3.0409    3.2046     5
#>  2081.1680 2132.5292 2963.1147     5
#>    21.2039   21.7075   26.6086     5
#>  5766.6404 5923.1059 5953.0756     5
# with caching model matrix
s0 <- m |>
  tidy_and_attach()
s1 <- s0 |>
  tidy_identify_variables()
s2 <- s1 |>
  tidy_add_contrasts()
s3 <- s2 |>
  tidy_add_reference_rows()
s4 <- s3 |>
  tidy_add_estimate_to_reference_rows()
s5 <- s4 |>
  tidy_add_term_labels()
s6 <- s5 |>
  tidy_add_header_rows()
s7 <- s6 |>
  tidy_add_n()
b <- microbenchmark::microbenchmark(
  m |> tidy_and_attach(),
  s0 |> tidy_identify_variables(),
  s1 |> tidy_add_contrasts(),
  s2 |> tidy_add_reference_rows(),
  s3 |> tidy_add_estimate_to_reference_rows(),
  s4 |> tidy_add_term_labels(),
  s5 |> tidy_add_header_rows(),
  s6 |> tidy_add_n(),
  times = 5
)
b
#> Unit: milliseconds
#>                                     expr       min        lq       mean
#>                       tidy_and_attach(m) 1183.0690 1260.9537 1618.52904
#>              tidy_identify_variables(s0)   28.6777   28.7928   37.62974
#>                   tidy_add_contrasts(s1)   35.9631   36.2232   51.41974
#>              tidy_add_reference_rows(s2)  202.8618  216.1722  247.23770
#>  tidy_add_estimate_to_reference_rows(s3)    2.0402    2.0497    2.89266
#>                 tidy_add_term_labels(s4)  115.9453  133.5421  144.57326
#>                 tidy_add_header_rows(s5)   17.4916   21.6246   24.74484
#>                           tidy_add_n(s6) 1734.7273 2098.8633 2124.85670
#>     median        uq       max neval
#>  1736.3493 1894.6854 2017.5878     5
#>    35.5844   47.5241   47.5697     5
#>    44.9223   64.5268   75.4633     5
#>   226.8496  294.4951  295.8098     5
#>     2.5728    3.2854    4.5152     5
#>   138.2370  139.0601  196.0818     5
#>    26.8611   28.5671   29.1798     5
#>  2193.1916 2199.2863 2398.2150     5
# overall gain
microbenchmark::microbenchmark(
  tidy_plus_plus(m, model_matrix_attr = FALSE),
  tidy_plus_plus(m),
  tidy_plus_plus(m, add_n = FALSE),
  tidy_plus_plus(m, add_n = FALSE, model_matrix_attr = FALSE),
  times = 5
)
#> Unit: seconds
#>                                                         expr       min
#>                 tidy_plus_plus(m, model_matrix_attr = FALSE) 11.806868
#>                                            tidy_plus_plus(m)  5.100999
#>                             tidy_plus_plus(m, add_n = FALSE)  2.494101
#>  tidy_plus_plus(m, add_n = FALSE, model_matrix_attr = FALSE)  6.815652
#>         lq      mean    median        uq       max neval
#>  12.237099 12.544947 12.771944 12.951536 12.957286     5
#>   5.302496  7.109005  5.343877  5.869818 13.927835     5
#>   2.565271  2.959569  2.737820  2.961719  4.038936     5
#>   7.302845 12.897389 13.297690 18.419446 18.651313     5
Created on 2024-07-01 with reprex v2.1.0
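To make the overall gain easier to read, the median timings reported in the last benchmark can be turned into speed-up ratios relative to the uncached default. This is only a small arithmetic sketch on the numbers printed above; the vector names are hypothetical labels, and with 5 repetitions the ratios are indicative at best.

```r
# Median run times (seconds) copied from the overall benchmark output
medians <- c(
  no_cache      = 12.771944,  # tidy_plus_plus(m, model_matrix_attr = FALSE)
  cache         = 5.343877,   # tidy_plus_plus(m)
  cache_no_n    = 2.737820,   # tidy_plus_plus(m, add_n = FALSE)
  no_cache_no_n = 13.297690   # tidy_plus_plus(m, add_n = FALSE, model_matrix_attr = FALSE)
)

# Speed-up of each variant relative to the run without caching
speedup <- medians[["no_cache"]] / medians
round(speedup, 2)
```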
Thank you! It was a very fast solution.
The performance improved quite a bit.