tidymodels / agua

Create and evaluate models using 'tidymodels' and 'h2o'

License: Other

R 100.00%

agua's Introduction

agua

agua enables users to fit, optimize, and evaluate models via H2O using tidymodels syntax. Most users will not have to use agua directly; the features can be accessed via the new parsnip computational engine 'h2o'.

There are two main components in agua:

A new parsnip engine, 'h2o', for many models; see Get started for a complete list.
Infrastructure for the tune package.

When fitting a parsnip model, the data are passed to the h2o server directly. For tuning, the data are passed once and instructions are given to h2o.grid() to process them.

This work is based on @stevenpawley’s h2oparsnip package. Additional work was done by Qiushi Yan for his 2022 summer internship at Posit.

Installation

The CRAN version of the package can be installed via

install.packages("agua")

You can also install the development version of agua using:

# install.packages("pak")
pak::pak("tidymodels/agua")

Examples

The following code demonstrates how to create a single model on the h2o server and how to make predictions.

library(tidymodels)
library(agua)
library(h2o)
tidymodels_prefer()

# Start the h2o server before running models
h2o_start()

# Demonstrate fitting parsnip models: 
# Specify the type of model and the h2o engine 
spec <-
  rand_forest(mtry = 3, trees = 1000) %>%
  set_engine("h2o") %>%
  set_mode("regression")

# Fit the model on the h2o server
set.seed(1)
mod <- fit(spec, mpg ~ ., data = mtcars)
mod
#> parsnip model object
#> 
#> Model Details:
#> ==============
#> 
#> H2ORegressionModel: drf
#> Model ID:  DRF_model_R_1656520956148_1 
#> Model Summary: 
#>   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
#> 1            1000                     1000              285914         4
#>   max_depth mean_depth min_leaves max_leaves mean_leaves
#> 1        10    6.70600         10         27    18.04100
#> 
#> 
#> H2ORegressionMetrics: drf
#> ** Reported on training data. **
#> ** Metrics reported on Out-Of-Bag training samples **
#> 
#> MSE:  4.354249
#> RMSE:  2.086684
#> MAE:  1.657823
#> RMSLE:  0.09848976
#> Mean Residual Deviance :  4.354249

# Predictions
predict(mod, head(mtcars))
#> # A tibble: 6 × 1
#>   .pred
#>   <dbl>
#> 1  20.9
#> 2  20.8
#> 3  23.3
#> 4  20.4
#> 5  17.9
#> 6  18.7

# When done
h2o_end()

Before using the 'h2o' engine, users need to run agua::h2o_start() or h2o::h2o.init() to start the h2o server, which will be storing data, models, and other values passed from the R session.

There are also several package vignettes.

Code of Conduct

Please note that the agua project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

agua's Issues

setup parsnip engine docs

For all models, not just those with engines in parsnip, we create engine-specific documentation in parsnip/man/rmd. Details are here. We should add docs for the h2o engines.

use `parallelism` in `h2o.grid()`

We need to find a way to specify parallelism in h2o.grid() and allow parallel model building. One possible solution is to use control_grid(parallel_over) and add a condition for that here.

@topepo

Cannot run tuning example

I am unable to run the code from the model tuning vignette here. When doing so, I get the following error when running tune_grid:

Error in get(x, envir = ns, inherits = FALSE) : 
  object 'tune_grid_loop_iter_h2o' not found
7. get(x, envir = ns, inherits = FALSE)
6. utils::getFromNamespace(x = "tune_grid_loop_iter_h2o", ns = "agua")
5. fn_tune_grid_loop(resamples, grid, workflow, metrics, control, rng)
4. tune_grid_loop(resamples = resamples, grid = grid, workflow = workflow, metrics = metrics, control = control, rng = rng)
3. tune_grid_workflow(object, resamples = resamples, grid = grid, metrics = metrics, pset = param_info, control = control)
2. tune_grid.workflow(lm_wflow, resamples = cv_splits, grid = grid, control = control_grid(save_pred = TRUE))
1. tune_grid(lm_wflow, resamples = cv_splits, grid = grid, control = control_grid(save_pred = TRUE))

Any thoughts?

enable GitHub actions

Once the repo is public, let's set up usethis::use_tidy_github_actions().

auto_ml() model type

We'd need to add a model definition to parsnip (with a default engine of h2o) and add the rest in agua.

Not sure what the main arguments should be (max number of models?).

Unused argument error while tuning

The problem

I cannot seem to tune with h2o; I keep getting an "unused argument" error. I was using an example that used keras (which works fine), and I just switched the engine from keras to h2o, expecting it to work as well. It didn't.

To track it down, I decided to run the code from https://agua.tidymodels.org/articles/tune.html

which is given below:

copied R code

library(tidymodels)
library(agua)
library(ggplot2)
theme_set(theme_bw())
doParallel::registerDoParallel()
h2o_start()
data(ames)

set.seed(4595)
data_split <- ames %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test <- testing(data_split)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)

ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_ns(Longitude, deg_free = tune("long df")) %>%
  step_ns(Latitude, deg_free = tune("lat df"))

lm_mod <- linear_reg(penalty = tune()) %>%
  set_engine("h2o")

lm_wflow <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(ames_rec)

grid <- lm_wflow %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = 5)

ames_res <- tune_grid(
  lm_wflow,
  resamples = cv_splits,
  grid = grid,
  control = control_grid(save_pred = TRUE,
    backend_options = agua_backend_options(parallelism = 5))
)

ames_res

The output is:

Tuning results

10-fold cross-validation using stratification

There were issues with some computations:

Error(s) x10: Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimiz...

Run show_notes(.Last.tune.result) for more information.

Follow up on the suggestion

If I run show_notes …, this is the output:

unique notes:
───────────────────────────────────────────────────────────────────────────────────────────────────────
Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimize", "maximize"), c("numeric", "numeric")), list(c("penalty", "deg_free", "deg_free"), c("penalty", "long df", "lat df"), c("model_spec", "recipe", "recipe"), c("linear_reg", "step_ns", "step_ns"), c("main", "ns_PCP7q", "ns_33KwK"), list(list("double", list(-10, 0), c(TRUE, TRUE), list("log-10", function (x)
log(x, base), function (x)
base^x, function (x, n = n_default)
{
raw_rng <- suppressWarnings(range(x, na.rm = TRUE))
if (any(!is.finite(raw_rng))) {
return(numeric())
}
rng <- log(raw_rng, base = base)
min <- floor(rng[1])
max <- ceiling(rng[2])
if (max == min) {
return(base^min)
}
by <- floor((max - min)/n) + 1
breaks <- base^seq(min, max, by = by)
relevant_breaks <- base^rng[1] <= breaks & breaks <= base^rng[2]
if (sum(relevant_breaks) >= (n - 2)) {
return(breaks)
}
while (by > 1) {
by <- by - 1
breaks <- base^seq(min, max, by = by)

external parallel processing

h2o parallelizes internally by multithreading the training of an individual model.

We could also use R's external parallelization (via foreach or futures) to send more models to the h2o server at the same time.

We could also use both approaches.

Right now, when using multicore, it just works. For PSOCK clusters, it does not. It produces the error that it cannot find the h2o server.

Can we create a helper that will set up PSOCK clusters so that we can use them? We would need to experiment to find out what the worker processes are missing. It might be as simple as loading the h2o package in each.
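
A minimal sketch of what such a helper might look like, assuming the only things each worker lacks are the loaded h2o package and a connection to the server that is already running. The name setup_h2o_psock() and its arguments are hypothetical, and the idea that calling h2o.init(startH2O = FALSE) in a worker is enough to attach to the existing server is an untested assumption.

```r
library(parallel)

# Hypothetical helper (not part of agua): create a PSOCK cluster whose
# workers each load h2o and connect to an already-running h2o server.
setup_h2o_psock <- function(n_workers = 2, ip = "localhost", port = 54321) {
  cl <- makePSOCKcluster(n_workers)
  # Send the connection details to every worker's global environment
  clusterExport(cl, c("ip", "port"), envir = environment())
  clusterEvalQ(cl, {
    library(h2o)
    # Assumption: this attaches to the existing server instead of starting one
    h2o.init(ip = ip, port = port, startH2O = FALSE)
    NULL
  })
  cl
}
```

The returned cluster could then be registered with a foreach backend, for example doParallel::registerDoParallel(cl), and released with parallel::stopCluster(cl) when tuning is finished.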

allow other preprocessors

This line restricts h2o engines from being tuned unless there is a recipe. There are two other types of preprocessors so we should generalize this. There's probably code in tune to do this already.

case weights

The h2o functions take weights in the argument weights_column that is described as "Column with observation weights".
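
As a rough illustration of how a wrapper could pass case weights through, the sketch below calls h2o.glm() with weights_column set to the name of a column in the training frame. The wrapper name fit_weighted_glm() and its arguments are hypothetical, and it assumes an h2o server is already running.

```r
# Hypothetical wrapper (not part of agua): fit a GLM in which one column of
# `data` holds the observation weights. Assumes a running h2o server.
fit_weighted_glm <- function(data, outcome, weights) {
  frame <- h2o::as.h2o(data)
  # Every column except the outcome and the weights is used as a predictor
  predictors <- setdiff(colnames(data), c(outcome, weights))
  h2o::h2o.glm(
    x = predictors,
    y = outcome,
    training_frame = frame,
    weights_column = weights
  )
}
```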

[New Functionality]: Add explainability/interpretability functions from h2o.

Hi,

Thanks for bringing h2o capabilities to tidymodels!

h2o already includes various functions that help with model interpretation/explainability for binary classification and regression models:

h2o.shap_summary_plot()
h2o.shap_explain_row_plot()
h2o.pd_multi_plot()
h2o.pd_plot()
h2o.ice_plot()

These functions can also be applied to an h2o.automl() object.

All the available h2o functionality is documented here.

Thanks!
Carlos.

use pkgdown

Once the repo is public, let's use usethis::use_pkgdown(). I already made a CNAME entry so we should be able to use agua.tidymodels.org.

on.exit() for h2o tuning module

At the top of the iteration function we should run h2o.no_progress() and then use an on.exit() to:

run h2o.show_progress()
run a function that removes the model id's that were created.
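
The pattern described above might be sketched as follows. The function name tune_iter_sketch() and its fit_fn argument are hypothetical stand-ins for the real iteration function, and the sketch assumes that every model created during the iteration can be removed by passing its model id to h2o.rm().

```r
# Hypothetical skeleton (not agua's actual iteration function).
# `fit_fn` is a placeholder for whatever fits a model on the h2o server.
tune_iter_sketch <- function(fit_fn) {
  h2o::h2o.no_progress()
  model_ids <- character()

  # Runs on normal return and on error: restore the progress bars and
  # delete the models that this iteration created on the server.
  on.exit({
    h2o::h2o.show_progress()
    if (length(model_ids) > 0) {
      h2o::h2o.rm(model_ids)
    }
  }, add = TRUE)

  model <- fit_fn()
  # Record the id so the exit handler can clean it up
  model_ids <- c(model_ids, model@model_id)

  # ... compute and return predictions/metrics here; this sketch returns nothing
  invisible(NULL)
}
```

Because on.exit() expressions are evaluated in the function's frame when it exits, the handler sees whatever ids were collected up to the point of failure.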

validation set for xgboost and mlp

Mirror the api for xgboost where we have a validation arg (default = 0) that splits off some data in the wrapper to supply as a validation frame.

Release agua 0.1.0

First release:

usethis::use_cran_comments()
Update (aspirational) install instructions in README
Proofread Title: and Description:
Check that all exported functions have @return and @examples
Check that Authors@R: includes a copyright holder (role 'cph')
Check licensing of included files
Review https://github.com/DavisVaughan/extrachecks

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

todos after h2o August release

A reminder of todos after h2o's CRAN release (3.36.1.2)

change tune functions to use strategy = 'Sequential', also remove this line from tune
update relevant parts in the vignette discussing parallel processing with h2o.grid
add tuning benchmark
use new progress functions
display threshold for classification models (if available)
discuss if explainability functions in #31 should be added

Internal functions used in tune_grid_loop_iter_h2o

Internal functions used in tune_grid_loop_iter_h2o that may need to be exported or carried to agua:

setup for parallel processing

~~tune:::load_namespace~~

finalize and fit workflows when looping parameters

~~tune:::catch_and_log~~
tune:::forge_from_workflow
~~workflows:::.fit_pre~~

formatting functions for predictions

~~parsnip~~

compute metrics

~~tune::outcome_names~~
~~tune:::estimate_metrics~~

Interaction terms are ignored

The training wrapper functions (e.g., h2o_train_glm) do not receive interaction terms from the formula.

library(agua)
#> Loading required package: parsnip
h2o_start()

linear_mod <- linear_reg(penalty = 0.1) |> 
  set_engine("h2o") |> 
  fit(mpg ~ wt * cyl, data = mtcars)

linear_mod$fit@parameters$x
#> [1] "wt"  "cyl"

Created on 2022-06-22 by the reprex package (v2.0.1)
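
For comparison, base R's model.matrix() shows which terms the formula mpg ~ wt * cyl implies. The interaction column wt:cyl is the one missing from the predictors that reach h2o in the reprex above. This only illustrates the expected terms; it is not a fix.

```r
# Base R expands `wt * cyl` into both main effects plus their interaction
expected_terms <- colnames(model.matrix(mpg ~ wt * cyl, data = mtcars))
expected_terms
#> [1] "(Intercept)" "wt"          "cyl"         "wt:cyl"
```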

dials activation values

Since h2o uses different values for activation functions, we can

take the values that are consistent with current tidymodels engines (e.g. "tanh") and translate them inside of h2o_train_mlp() to be what h2o expects (e.g. "Tanh").
Also, we could expand what dials has as possible values to include others in h2o that tidymodels does not currently have (or just fail if the value is not in our current list).
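
A minimal sketch of the first option, taking the stricter "fail if the value is not in our current list" route. The lookup table and the function name translate_activation() are hypothetical. Only the "tanh" to "Tanh" pair comes from the text above; the "relu" to "Rectifier" mapping is an assumption about which h2o value corresponds.

```r
# Hypothetical lookup from tidymodels-style names to the names h2o expects.
# "tanh" -> "Tanh" is from the issue; "relu" -> "Rectifier" is an assumption.
h2o_activation_map <- c(tanh = "Tanh", relu = "Rectifier")

translate_activation <- function(activation) {
  # Values already written in h2o's vocabulary pass through unchanged
  if (activation %in% h2o_activation_map) {
    return(activation)
  }
  if (!activation %in% names(h2o_activation_map)) {
    stop(
      "Unsupported activation for the h2o engine: '", activation, "'",
      call. = FALSE
    )
  }
  h2o_activation_map[[activation]]
}

translate_activation("tanh")
#> [1] "Tanh"
```

A translation step like this would sit inside h2o_train_mlp(), so users keep writing the same activation names they use with other engines.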

Error for `h2o_start()` without java installed

When I run h2o_start() without things installed/configured correctly, I do get

> h2o_start()
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.

but then it just hangs there. It would be nice if that threw an error instead.
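
One way this could be approached is a pre-flight check that runs before the server is started. The sketch below is an assumption about how such a check might look, not agua's implementation; java_is_usable() and h2o_start_checked() are hypothetical names. It runs java -version rather than only looking for the executable, because the message above suggests that a java launcher can be on the path without a runtime behind it.

```r
# Hypothetical pre-flight check: TRUE only if `java -version` exits cleanly
java_is_usable <- function() {
  java <- unname(Sys.which("java"))
  if (!nzchar(java)) {
    return(FALSE)
  }
  # With stdout/stderr discarded, system2() returns the exit status
  status <- suppressWarnings(
    system2(java, "-version", stdout = FALSE, stderr = FALSE)
  )
  isTRUE(status == 0L)
}

# Hypothetical wrapper: fail fast instead of hanging when Java is missing
h2o_start_checked <- function() {
  if (!java_is_usable()) {
    stop(
      "No working Java runtime was found; install Java before starting h2o.",
      call. = FALSE
    )
  }
  agua::h2o_start()
}
```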

Error segfault

Getting the following error when fitting a drf model:
Warning: stack imbalance in 'as.environment', 249 then 246
*** caught segfault ***
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
address 0x7fcfd18a4e7a, cause 'invalid permissions'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'