
agua's Introduction

agua


agua enables users to fit, optimize, and evaluate models via H2O using tidymodels syntax. Most users will not have to use agua directly; the features can be accessed via the new parsnip computational engine 'h2o'.

There are two main components in agua:

  • A new parsnip engine, 'h2o', for many models; see Get started for a complete list.

  • Infrastructure for the tune package.

When fitting a parsnip model, the data are passed to the h2o server directly. For tuning, the data are passed once and instructions are given to h2o.grid() to process them.

This work is based on @stevenpawley’s h2oparsnip package. Additional work was done by Qiushi Yan for his 2022 summer internship at Posit.

Installation

The CRAN version of the package can be installed via

install.packages("agua")

You can also install the development version of agua using:

require(pak)
pak::pak("tidymodels/agua")

Examples

The following code demonstrates how to create a single model on the h2o server and how to make predictions.

library(tidymodels)
library(agua)
library(h2o)
tidymodels_prefer()
# Start the h2o server before running models
h2o_start()

# Demonstrate fitting parsnip models: 
# Specify the type of model and the h2o engine 
spec <-
  rand_forest(mtry = 3, trees = 1000) %>%
  set_engine("h2o") %>%
  set_mode("regression")

# Fit the model on the h2o server
set.seed(1)
mod <- fit(spec, mpg ~ ., data = mtcars)
mod
#> parsnip model object
#> 
#> Model Details:
#> ==============
#> 
#> H2ORegressionModel: drf
#> Model ID:  DRF_model_R_1656520956148_1 
#> Model Summary: 
#>   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
#> 1            1000                     1000              285914         4
#>   max_depth mean_depth min_leaves max_leaves mean_leaves
#> 1        10    6.70600         10         27    18.04100
#> 
#> 
#> H2ORegressionMetrics: drf
#> ** Reported on training data. **
#> ** Metrics reported on Out-Of-Bag training samples **
#> 
#> MSE:  4.354249
#> RMSE:  2.086684
#> MAE:  1.657823
#> RMSLE:  0.09848976
#> Mean Residual Deviance :  4.354249

# Predictions
predict(mod, head(mtcars))
#> # A tibble: 6 × 1
#>   .pred
#>   <dbl>
#> 1  20.9
#> 2  20.8
#> 3  23.3
#> 4  20.4
#> 5  17.9
#> 6  18.7

# When done
h2o_end()

Before using the 'h2o' engine, users need to run agua::h2o_start() or h2o::h2o.init() to start the h2o server, which will be storing data, models, and other values passed from the R session.

There are several package vignettes including:

Code of Conduct

Please note that the agua project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

agua's People

Contributors

gvelasq, hfrick, qiushiyan, simonpcouch, topepo


agua's Issues

setup parsnip engine docs

For all models, not just those with engines in parsnip, we create engine-specific documentation in parsnip/man/rmd. Details are here. We should add docs for the h2o engines.

use `parallelism` in `h2o.grid()`

We need to find a way to specify parallelism in h2o.grid() and allow parallel model building. One possible solution is to use control_grid(parallel_over) and add a condition for that here.

@topepo

Cannot run tuning example

I am unable to run the code from the model tuning vignette here. When doing so, I get the following error when running tune_grid:

Error in get(x, envir = ns, inherits = FALSE) :
  object 'tune_grid_loop_iter_h2o' not found
7. get(x, envir = ns, inherits = FALSE)
6. utils::getFromNamespace(x = "tune_grid_loop_iter_h2o", ns = "agua")
5. fn_tune_grid_loop(resamples, grid, workflow, metrics, control, rng)
4. tune_grid_loop(resamples = resamples, grid = grid, workflow = workflow,
     metrics = metrics, control = control, rng = rng)
3. tune_grid_workflow(object, resamples = resamples, grid = grid,
     metrics = metrics, pset = param_info, control = control)
2. tune_grid.workflow(lm_wflow, resamples = cv_splits, grid = grid,
     control = control_grid(save_pred = TRUE))
1. tune_grid(lm_wflow, resamples = cv_splits, grid = grid, control = control_grid(save_pred = TRUE))

Any thoughts?

auto_ml() model type

We'd need to add a model definition to parsnip (with a default engine of h2o) and add the rest in agua.

Not sure what the main arguments should be (max number of models?).

Unused argument error while tuning

The problem

I cannot seem to tune with h2o; I keep getting an "unused argument" error. I was using an example that used keras (which works fine), and I just switched the engine from keras to h2o, expecting it to work as well. It didn't.

To track it down, I decided to run the code from https://agua.tidymodels.org/articles/tune.html

which is given below:

copied R code

library(tidymodels)
library(agua)
library(ggplot2)
theme_set(theme_bw())
doParallel::registerDoParallel()
h2o_start()
data(ames)

set.seed(4595)
data_split <- ames %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test <- testing(data_split)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)

ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_ns(Longitude, deg_free = tune("long df")) %>%
  step_ns(Latitude, deg_free = tune("lat df"))

lm_mod <- linear_reg(penalty = tune()) %>%
  set_engine("h2o")

lm_wflow <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(ames_rec)

grid <- lm_wflow %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = 5)

ames_res <- tune_grid(
  lm_wflow,
  resamples = cv_splits,
  grid = grid,
  control = control_grid(save_pred = TRUE,
    backend_options = agua_backend_options(parallelism = 5))
)

ames_res

The output is :

Tuning results

10-fold cross-validation using stratification

There were issues with some computations:

  • Error(s) x10: Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimiz...

Run show_notes(.Last.tune.result) for more information.

Follow up on the suggestion

If I run show_notes …, this is the output:

unique notes:
───────────────────────────────────────────────────────────────────────────────────────────────────────
Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimize", "maximize"), c("numeric", "numeric")), list(c("penalty", "deg_free", "deg_free"), c("penalty", "long df", "lat df"), c("model_spec", "recipe", "recipe"), c("linear_reg", "step_ns", "step_ns"), c("main", "ns_PCP7q", "ns_33KwK"), list(list("double", list(-10, 0), c(TRUE, TRUE), list("log-10", function (x)
log(x, base), function (x)
base^x, function (x, n = n_default)
{
raw_rng <- suppressWarnings(range(x, na.rm = TRUE))
if (any(!is.finite(raw_rng))) {
return(numeric())
}
rng <- log(raw_rng, base = base)
min <- floor(rng[1])
max <- ceiling(rng[2])
if (max == min) {
return(base^min)
}
by <- floor((max - min)/n) + 1
breaks <- base^seq(min, max, by = by)
relevant_breaks <- base^rng[1] <= breaks & breaks <= base^rng[2]
if (sum(relevant_breaks) >= (n - 2)) {
return(breaks)
}
while (by > 1) {
by <- by - 1
breaks <- base^seq(min, max, by = by)

external parallel processing

h2o parallelizes internally by multithreading the training of an individual model.

We could also use R's external parallelization (via foreach or futures) to send more models to the h2o server at the same time.

We could also use both approaches.

Right now, when using multicore, it just works. For PSOCK clusters, it does not. It produces the error that it cannot find the h2o server.

Can we create a helper that will set up PSOCK clusters so that we can use them? We would need to experiment to find out what the worker processes are missing. It might be as simple as loading the h2o package in each.
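As a starting point for that experiment, a helper could create the PSOCK cluster and attach packages on every worker. This is a minimal sketch, not part of agua: make_worker_cluster() and attach_packages() are hypothetical names, and whether attaching h2o alone is enough for the workers to find the server is exactly the open question above.

```r
library(parallel)

# Runs on each worker: attach the requested packages and report success.
attach_packages <- function(pkgs) {
  vapply(
    pkgs,
    function(p) suppressWarnings(require(p, character.only = TRUE, quietly = TRUE)),
    logical(1)
  )
}

# Hypothetical helper: start a PSOCK cluster and attach `pkgs` on every worker.
# Stops the cluster and raises an error if any package cannot be attached.
make_worker_cluster <- function(n_workers, pkgs = "h2o") {
  cl <- makePSOCKcluster(n_workers)
  attached <- unlist(clusterCall(cl, attach_packages, pkgs))
  if (!all(attached)) {
    stopCluster(cl)
    stop("Could not attach all packages on every worker.", call. = FALSE)
  }
  cl
}
```

A cluster built this way could then be registered with a foreach backend, for example doParallel::registerDoParallel(cl), and released with parallel::stopCluster(cl) when done.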

allow other preprocessors

This line restricts h2o engines from being tuned unless there is a recipe. There are two other types of preprocessors so we should generalize this. There's probably code in tune to do this already.

case weights

The h2o functions take weights in the argument weights_column that is described as "Column with observation weights".

Release agua 0.1.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::cloud_check()
  • Update cran-comments.md
  • git push
  • Draft blog post
  • Slack link to draft blog in #open-source-comms

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • git push
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

[New Functionality]: Add explainability/interpretability functions from h2o.

Hi,

Thanks for bringing h2o capabilities to tidymodels!

h2o already includes various functions to help with model interpretation/explainability for binary classification and regression models:

  • h2o.shap_summary_plot()
  • h2o.shap_explain_row_plot()
  • h2o.pd_multi_plot()
  • h2o.pd_plot()
  • h2o.ice_plot()

These functions can also be applied to an h2o.automl() object.

All the available h2o functionality is documented here

Thanks!
Carlos.

use pkgdown

Once the repo is public, let's use usethis::use_pkgdown(). I already made a CNAME entry so we should be able to use agua.tidymodels.org.

on.exit() for h2o tuning module

At the top of the iteration function we should run h2o.no_progress() and then use an on.exit() to:

  • run h2o.show_progress()
  • run a function that removes the model IDs that were created.
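The pattern described above can be sketched as follows. This is an illustration under stated assumptions, not agua's implementation: iterate_sketch() and fit_fn are hypothetical names, and the three h2o calls are passed in as defaults so the cleanup logic can be tried without a running server. h2o.no_progress(), h2o.show_progress(), and h2o.rm() are existing h2o functions.

```r
# Hypothetical iteration function: silence h2o progress output, do the work,
# and always restore progress bars and remove created models on exit.
iterate_sketch <- function(fit_fn,
                           quiet = h2o::h2o.no_progress,
                           restore = h2o::h2o.show_progress,
                           remove = h2o::h2o.rm) {
  model_ids <- character()
  quiet()
  on.exit({
    restore()
    # Only call the removal function if at least one model was created.
    if (length(model_ids) > 0) remove(model_ids)
  }, add = TRUE)

  # `fit_fn` is a placeholder that trains a model and returns its h2o model id.
  model_ids <- c(model_ids, fit_fn())
  invisible(model_ids)
}
```

Because the cleanup is registered with on.exit(), it also runs when fit_fn() fails partway through, so progress bars are restored even after an error.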

Release agua 0.1.0

First release:

Prepare for release:

  • git pull
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • git push
  • Draft blog post
  • Slack link to draft blog in #open-source-comms

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • git push
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

todos after h2o August release

A reminder of todos after h2o's CRAN release (3.36.1.2)

  • change tune functions to use strategy = 'Sequential', and remove this line from tune

  • update relevant parts in the vignette discussing parallel processing with h2o.grid

  • add tuning benchmark

  • use new progress functions

  • display threshold for classification models (if available)

  • discuss if explainability functions in #31 should be added

Internal functions used in tune_grid_loop_iter_h2o

Internal functions used in tune_grid_loop_iter_h2o that may need to be exported or carried to agua:

setup for parallel processing

  • tune:::load_namespace

finalize and fit workflows when looping parameters

  • tune:::catch_and_log
  • tune:::forge_from_workflow
  • workflows:::.fit_pre

formatting functions for predictions

compute metrics

  • tune::outcome_names
  • tune:::estimate_metrics

Interaction terms are ignored

The training wrapper functions (e.g., h2o_train_glm) did not receive possible interaction terms.

library(agua)
#> Loading required package: parsnip
h2o_start()

linear_mod <- linear_reg(penalty = 0.1) |> 
  set_engine("h2o") %>% 
  fit(mpg ~ wt * cyl, data = mtcars)

linear_mod$fit@parameters$x
#> [1] "wt"  "cyl"

Created on 2022-06-22 by the reprex package (v2.0.1)
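For comparison, base R's formula machinery expands the same formula into three predictor columns, including the interaction. The snippet below is only a reference point for what the terms look like; it does not show how agua should pass them to h2o.

```r
# Expand the formula with base R to list the terms a fit would normally use.
design <- model.matrix(mpg ~ wt * cyl, data = mtcars)
predictors <- setdiff(colnames(design), "(Intercept)")
predictors
#> [1] "wt"     "cyl"    "wt:cyl"
```

The h2o fit above reports only "wt" and "cyl", so the "wt:cyl" column is the one that goes missing.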

dials activation values

Since h2o uses different values for activation functions, we can

  • take the values that are consistent with current tidymodels engines (e.g. "tanh") and translate them inside of h2o_train_mlp() to be what h2o expects (e.g. "Tanh").
  • Also, we could expand what dials has as possible values to include others in h2o that tidymodels does not currently have (or just fail if the value is not in our current list).
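The first option, with the "just fail" fallback from the second, could look like the sketch below. translate_activation() and h2o_activation_map are hypothetical names; only the "tanh" to "Tanh" pair comes from the text above, and the other entries are assumed pairings with h2o's activation names.

```r
# Hypothetical lookup from tidymodels-style names to the names h2o expects.
# Only tanh -> "Tanh" is taken from the discussion; the rest are assumptions.
h2o_activation_map <- c(
  tanh   = "Tanh",
  relu   = "Rectifier",
  maxout = "Maxout"
)

# Translate a single activation value, failing for values not in the list.
translate_activation <- function(x, map = h2o_activation_map) {
  if (!is.character(x) || length(x) != 1) {
    stop("`x` must be a single character value.", call. = FALSE)
  }
  if (!x %in% names(map)) {
    stop("Activation '", x, "' is not supported by this sketch.", call. = FALSE)
  }
  map[[x]]
}

translate_activation("tanh")
#> [1] "Tanh"
```

Keeping the mapping in one named vector means that expanding the supported values later is a one-line change.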

Error for `h2o_start()` without java installed

When I run h2o_start() without things installed/configured correctly, I do get

> h2o_start()
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.

but then it just hangs there. It would be nice if that threw an error instead.
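One way to fail early would be to probe for a working Java before trying to start the server. This is a minimal sketch with hypothetical helper names (java_available(), check_java()); where such a check would live in agua or h2o is an assumption. It runs `java -version` rather than only looking for the binary, since the message above suggests a java command can exist without a usable runtime behind it.

```r
# Returns TRUE only if `java -version` runs and exits with status 0.
java_available <- function(java = "java") {
  status <- tryCatch(
    suppressWarnings(system2(java, "-version", stdout = FALSE, stderr = FALSE)),
    error = function(e) 127L
  )
  identical(as.integer(status), 0L)
}

# Hypothetical guard to call before starting the h2o server.
check_java <- function(java = "java") {
  if (!java_available(java)) {
    stop(
      "No working Java runtime was found; install Java before starting h2o.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Calling check_java() before h2o_start() would turn the hang into an immediate error.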

Error segfault

Getting the following error when fitting a drf model:
Warning: stack imbalance in 'as.environment', 249 then 246
*** caught segfault ***
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
address 0x7fcfd18a4e7a, cause 'invalid permissions'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'

use h2o::with_no_h2o_progress

In the very latest h2o version, this function is documented and exported. It should stop the progress bars and some other output.

edit: @ledell @tomasfryda Was the function supposed to be exported (since it is documented)?
