aglm's Introduction

What is this?

Accurate Generalized Linear Model (AGLM) is defined as a regularized GLM that applies feature transformations based on a discretization of numerical features and specific coding methodologies for dummy variables. More details can be found in our paper.

2021/6/6: Our paper has won the Charles A. Hachemeister Prize.

Installation

# The simplest way:
install.packages("aglm")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("kkondo1981/aglm", ref="develop")

Usage

After installing aglm, see the package help as follows.

library(aglm)
?"aglm-package"

aglm's People

Contributors

kazuzowo, kkondo1981

aglm's Issues

Formula input

Could aglm accept a formula input, as glmnetUtils::glmnet does?
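One way such a front end might work is to let base R evaluate the formula and then hand the resulting pieces to the existing (x, y) interface. The sketch below uses only base R; `formula_to_xy` is a hypothetical helper name, not part of aglm or glmnetUtils, and it ignores interaction terms and offsets.

```r
# Hypothetical helper (not part of aglm): split a formula and a data frame
# into the (x, y) pair that an interface such as aglm(x, y, ...) expects.
formula_to_xy <- function(formula, data) {
  mf <- model.frame(formula, data = data)  # evaluates the formula, drops NA rows
  y <- model.response(mf)                  # left-hand side of the formula
  x <- mf[, -1, drop = FALSE]              # remaining columns, types preserved
  list(x = x, y = y)
}

xy <- formula_to_xy(mpg ~ wt + hp, data = mtcars)
str(xy$x)
```

The result could then be passed along as aglm(xy$x, xy$y), assuming the usual data-frame input.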

An additional option of plot.AccurateGLM

Since we can get the coefficients for a particular lambda with either coef(aglm(x, y,...), s = lambda) or coef(cv.aglm(x, y,...), s = lambda), could plot.AccurateGLM likewise produce the plots for a particular lambda via plot(aglm(x, y,...), s = lambda) or plot(cv.aglm(x, y,...), s = lambda)?

predict.aglm should accept a cv.aglm object

predict.glmnet accepts a cv.glmnet object, but predict.aglm does not yet seem to accept a cv.aglm object. See the example code below.
This is a problem because the glmnet package recommends against usage like glmnet_pred1 in the example and recommends usage like glmnet_pred2 instead.

Preamble starts

library(MASS) # For Boston
library(glmnet)
library(aglm)

Read data

xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.

Split data into train and test

n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y

Preamble ends. No new information so far

predict.glmnet accepts a cv.glmnet object:

set.seed(2018)
glmnet_CV <- cv.glmnet(as.matrix(x), y)
glmnet_lambda <- glmnet_CV$lambda.min
glmnet_model <- glmnet(as.matrix(x), y, lambda = glmnet_lambda)
glmnet_pred1 <- predict(glmnet_model, newx = as.matrix(newx), type = "response")
cat("RMSE: ", sqrt(mean((y_true - glmnet_pred1)^2)), "\n")
#RMSE: 3.67907: Not recommended
glmnet_pred2 <- predict(glmnet_CV, s = glmnet_lambda,
newx = as.matrix(newx), type = "response")
cat("RMSE: ", sqrt(mean((y_true - glmnet_pred2)^2)), "\n")
#RMSE: 3.678105: Recommended

predict.aglm doesn't accept a cv.aglm object:

set.seed(2018)
aglm_CV <- cv.aglm(x, y)
aglm_lambda <- aglm_CV$lambda.min
aglm_model <- aglm(x, y, lambda = aglm_lambda)
aglm_pred1 <- predict(aglm_model, newx = newx, type = "response")
cat("RMSE: ", sqrt(mean((y_true - aglm_pred1)^2)), "\n")
#RMSE: 3.10334: Not recommended?
aglm_pred2 <- predict(aglm_CV, s = aglm_lambda,
                      newx = newx, type = "response")
#Error
cat("RMSE: ", sqrt(mean((y_true - aglm_pred2)^2)), "\n")
#Error

Enable custom bins for OD vars

Add arguments mentioned below to aglm():

  • bins_list: A list of numeric vectors, each element of which indicates breaks of binning of one OD var.
  • bins_names: A list of column names or column index numbers, which specifies how the elements of bins_list correspond to the columns of x. The default value is NULL, in which case the elements of bins_list correspond to the columns of x in order.
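The intended matching between bins_list and the columns of x could be sketched as follows. `resolve_bins` is a hypothetical helper written only to illustrate the proposed semantics; it is not aglm code, and the example data are made up.

```r
# Hypothetical helper illustrating the proposed semantics; not part of aglm.
resolve_bins <- function(x, bins_list, bins_names = NULL) {
  if (is.null(bins_names)) {
    # Default: the k-th element of bins_list belongs to the k-th column of x.
    bins_names <- seq_along(bins_list)
  }
  stopifnot(length(bins_names) == length(bins_list))
  # Accept either column names or column index numbers.
  cols <- vapply(bins_names, function(nm) {
    if (is.character(nm)) nm else colnames(x)[nm]
  }, character(1), USE.NAMES = FALSE)
  stopifnot(!anyNA(cols), all(cols %in% colnames(x)))
  setNames(bins_list, cols)
}

x <- data.frame(age = c(23, 41, 67), income = c(300, 520, 410))
bins <- resolve_bins(x,
                     bins_list = list(c(0, 500, 1000), c(20, 40, 60, 80)),
                     bins_names = list("income", 1))
names(bins)
```

Returning a list named by column keeps the later per-column lookup independent of the order in which the caller supplied the breaks.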

Add equal frequency binning and make it default way

Currently, the default and only way for aglm() to make bins is Equal Width Binning, but we should add Equal Frequency Binning and make it the default option.
Furthermore, the default bin count should be increased from the current 20 to 100.
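The difference between the two binning methods shows up most clearly on skewed data. The following base R sketch compares them; `equal_width_breaks` and `equal_freq_breaks` are hypothetical helpers, not aglm's internal binning code, and the sample is simulated.

```r
# Hypothetical helpers, not aglm's internal binning code.
equal_width_breaks <- function(x, nbin) {
  seq(min(x), max(x), length.out = nbin + 1)
}
equal_freq_breaks <- function(x, nbin) {
  # unique() guards against repeated quantiles when x has many ties
  unique(quantile(x, probs = seq(0, 1, length.out = nbin + 1), names = FALSE))
}

set.seed(1)
x <- rexp(1000)  # right-skewed simulated sample

width_counts <- table(cut(x, equal_width_breaks(x, 4), include.lowest = TRUE))
freq_counts  <- table(cut(x, equal_freq_breaks(x, 4), include.lowest = TRUE))
print(width_counts)  # most observations fall into the first bin
print(freq_counts)   # roughly the same number of observations per bin
```

With equal width, the upper bins of a skewed variable are nearly empty, so their dummy columns carry little information; equal frequency spreads the observations evenly.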

Expand width of UD mat

Modify newInput() to use the drop_last=FALSE option when calling getUDummyMatForOneVec().
This is because, when optimizing with regularization terms, the results can differ depending on whether the last column is included.

Defaults of type.measure

The default of type.measure seems to be wrong.

First, try:

install.packages('CASdatasets',
repos = 'http://cas.uqam.ca/pub/R/',
type='source',
dependencies = TRUE
)
library(CASdatasets)
data(ausprivauto0405)
exy <- ausprivauto0405[1:500, c(1:6, 8)]
expo <- exy[,1]
x <- exy[2:6]
y <- exy[,7]
model <- cv.aglm(x = x, y = y, offset = log(expo), #type.measure = "deviance",
add_interaction_columns = FALSE, family = "poisson")
model@name

When I tried, the output was:

mse
"Mean-Squared Error"

But it should have been:

deviance
"Poisson Deviance"

Compare:

model <- cv.glmnet(x = sapply(x, as.numeric), y = y, offset = log(expo),
family = "poisson")
model$name

Its output:

deviance
"Poisson Deviance"

Line 97 in 2904927

This is not an important issue.

labels <- cut(x_vec, breaks=c(-Inf, breaks, Inf), labels=FALSE, right=FALSE)

In breaks=c(-Inf, breaks, Inf), the original breaks may already contain -Inf and/or Inf, which causes an error.
In fact, I once got an error message for this reason when using aglm.
So I think the right-hand side should be unique(c(-Inf, breaks, Inf)).
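The failure and the proposed fix can be reproduced with base R alone. The values of x_vec and breaks below are made up for illustration.

```r
x_vec <- c(0.5, 1.5, 2.5)
breaks <- c(-Inf, 1, 2, Inf)  # breaks that already include the infinite endpoints

# Padding blindly duplicates -Inf and Inf, which cut() rejects.
padded <- c(-Inf, breaks, Inf)
res <- tryCatch(cut(x_vec, breaks = padded, labels = FALSE, right = FALSE),
                error = function(e) conditionMessage(e))
print(res)  # the error message raised by cut()

# De-duplicating first gives valid breaks.
labels <- cut(x_vec, breaks = unique(c(-Inf, breaks, Inf)),
              labels = FALSE, right = FALSE)
print(labels)
```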

Set license

Ref)
https://cran.r-project.org/web/licenses/

  • The “GNU Affero General Public License” version 3 (AGPL-3)
  • The “Artistic License” version 2.0 (Artistic-2.0)
  • The “BSD 2-clause License” (BSD_2_clause)
  • The “BSD 3-clause License” (BSD_3_clause)
  • The “GNU General Public License” version 2 (GPL-2)
  • The “GNU General Public License” version 3 (GPL-3)
  • The “GNU Library General Public License” version 2 (LGPL-2)
  • The “GNU Lesser General Public License” version 2.1 (LGPL-2.1)
  • The “GNU Lesser General Public License” version 3 (LGPL-3)
  • The “MIT License” (MIT)

L dummy option

By using O dummies, the current version implements "fusion" (fused LASSO) not only for ordered factor variables but also for numeric variables. But, for numeric variables, "linear interpolation" may be a good, possibly better, alternative.

My idea is this:

  1. Outline
    For numeric variables, add an option to use new dummies, say L dummies, instead of O dummies and linear terms, where L dummies are dummies for linear interpolation. (Linear terms and linear interpolation should not be used at the same time.)
  2. Details
    (i) L dummies can be implemented by adding a new function, say, getLDummyMatForOneVec.
    The code would be something like the following. (It is the same as getODummyMatForOneVec; the only differences are in the lines marked with # at the right end.)
getLDummyMatForOneVec <- function(x_vec, breaks=NULL, nbin.max=100, only_info=FALSE) {
 # Check arguments. only integer or numerical or ordered vectors are allowed.
 assert_that(is.integer(x_vec) | is.numeric(x_vec) | is.ordered(x_vec))

 # Execute binning
 binned_x <- executeBinning(x_vec, breaks=breaks, nbin.max=nbin.max, method="width") #

 # create dummy matrix for x_vec
 nrow <- length(x_vec)
 ncol <- length(binned_x$breaks)
 dummy_mat <- (binned_x$labels - t(matrix(1:ncol, ncol, nrow))) * (binned_x$labels > t(matrix(1:ncol, ncol, nrow))) #

 if (only_info) return(list(breaks=binned_x$breaks))
 else return(list(breaks=binned_x$breaks, dummy_mat=dummy_mat))
} 

(ii) For plot-aglm.R, the corresponding part for L dummies would be something like the second snippet below. For comparison, the current code is:

aglm/R/plot-aglm.R

Lines 23 to 31 in c6570d7

# Plot for numeric features
slope <- coefs$coef.linear
steps <- coefs$coef.OD
if (is.null(slope)) slope <- 0
if (is.null(steps)) steps <- 0
x <- var_info$OD_info$breaks
y <- slope * x + cumsum(steps)
type <- ifelse(slope == 0, "s", "l")

The proposed version for L dummies:

     # Plot for numeric features
      slopesteps <- coefs$coef.LD
      if (is.null(slopesteps)) slopesteps <- 0

      x <- var_info$LD_info$breaks
      y <- cumsum(cumsum(slopesteps))
      type <- "l"
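The link between the L dummy matrix and the double cumsum in the plotting code can be checked on a toy example. The sketch below assumes that bin labels are consecutive integers starting at 1 and uses made-up coefficients; it reuses the dummy-matrix expression from the proposed getLDummyMatForOneVec.

```r
labels <- 1:5  # bin labels (assumed to be consecutive integers starting at 1)
ncol <- 4
nrow <- length(labels)

# Same expression as in the proposed getLDummyMatForOneVec:
# entry [i, j] equals max(labels[i] - j, 0).
idx <- t(matrix(1:ncol, ncol, nrow))
dummy_mat <- (labels - idx) * (labels > idx)

b <- c(0.5, -0.2, 0.1, 0.3)  # made-up coefficients for the four L dummies
fitted <- as.vector(dummy_mat %*% b)

# Each coefficient changes the slope from its bin onward, so the fitted values
# are piecewise linear in the label and coincide with a double cumulative sum.
print(fitted)
print(c(0, cumsum(cumsum(b))))
```

Each coefficient acts as a change in slope rather than a jump in level, which is why a fused-LASSO-style penalty on L dummies favors piecewise linear shapes instead of step functions.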

Change names?

This problem is not at all crucial. But, anyway, it seems that the term "intersection" is sometimes mistakenly used for "interaction". Especially, the argument "add_intersection_columns" should be renamed to "add_interaction_columns", shouldn't it? Or is that intentional?

predict.AccurateGLM with type = "coefficients" or "nonzero"

In aglm, exactly as in glmnet, you should be able to use
predict(model, type = "coefficients") and predict(model, type = "nonzero")
without specifying any newx. But, at the moment, they are not available in that way.
(This issue is much less crucial than #18, though.)

Tests for more datasets

Planning to test with datasets of various types and families.
By testing, I can find bugs and check performance.
Moreover, verifying that their results are not changed before releasing seems useful.
Surveying 'CASdatasets' for suitable datasets may be a good starting point.

Error with predict function

predict.AccurateGLM <- function(model,

Without setting type = "response", the predict function failed in a binary classification task.
I used "binomial" family to train aglm.

Could you check this and if necessary fix ?

Another additional option of plot.AccurateGLM

In the current version, plot.AccurateGLM always produces the plots for all variables. Could this function produce the plot of only a chosen variable, with something like plot(aglm(x, y,...), var_name = var_name), in which var_name is the name (or the number?) of the variable?

Partial residuals

Could plot.AccurateGLM include partial residuals, as mgcv::plot.gam does?

Keep fit.preval, etc. in cv.aglm()

When keep = TRUE, fit.preval, etc. must be provided. Two more lines are needed in the following, right?

aglm/R/cv-aglm.R

Lines 127 to 137 in abcd0ce

return(new("AccurateGLM", backend_models=list(cv.glmnet=cv.glmnet_result$glmnet.fit),
lambda=cv.glmnet_result$lambda,
cvm=cv.glmnet_result$cvm,
cvsd=cv.glmnet_result$cvsd,
cvup=cv.glmnet_result$cvup,
cvlo=cv.glmnet_result$cvlo,
nzero=cv.glmnet_result$nzero,
name=cv.glmnet_result$name,
lambda.min=cv.glmnet_result$lambda.min,
lambda.1se=cv.glmnet_result$lambda.1se,
vars_info=x@vars_info))

Enhance cv.aglm() for tuning alpha

I believe it's better for us to investigate external libraries such as caret and try to provide interfaces for them that are sufficient to tune the alpha of aglm, because writing a grid search from scratch is slightly troublesome. If that approach is not likely to work well, we would write it ourselves from scratch.

install_github in case of not installing glmnet

have to set dependencies...

devtools::install_github("kkondo1981/aglm", build_vignettes=TRUE)
Downloading GitHub repo kkondo1981/aglm@master
✔ checking for file ‘/private/var/folders/3w/zy8n5vsd1cqcfjd42sq_ynk40000gn/T/RtmpMsenAR/remotes2996349476dc/kkondo1981-aglm-16bcb01/DESCRIPTION’ ...
─ preparing ‘aglm’:
✔ checking DESCRIPTION meta-information ...
Warning in as.POSIXlt.POSIXct(x, tz) :
unknown timezone 'zone/tz/2019a.1.0/zoneinfo/Asia/Tokyo'
─ checking for LF line-endings in source and make files
─ checking for empty or unneeded directories
─ building ‘aglm_0.3.0.tar.gz’

Warning in strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
unknown timezone 'zone/tz/2019a.1.0/zoneinfo/Asia/Tokyo'

  • installing source package ‘aglm’ ...
    ** R
    ** preparing package for lazy loading
    Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
    there is no package called ‘glmnet’
    ERROR: lazy loading failed for package ‘aglm’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/aglm’
    Error in i.p(...) :
    (converted from warning) installation of package ‘/var/folders/3w/zy8n5vsd1cqcfjd42sq_ynk40000gn/T//RtmpMsenAR/file299670b35ce3/aglm_0.3.0.tar.gz’ had non-zero exit status

logical features

cv.aglm <- function(x, y,

One coin.
When passing logical features to the cv.aglm function (and maybe also to aglm?),
it did not seem to work properly.

Could you check the implementation and if necessary, fix this ?
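Until the handling of logical columns is confirmed, one possible workaround is to convert them to two-level factors before fitting. The sketch below is an assumption about what might help, not a confirmed fix; `logical_to_factor` is a hypothetical helper and the data frame is made up.

```r
# Hypothetical pre-processing step; whether aglm needs it depends on the version.
logical_to_factor <- function(x) {
  is_lgl <- vapply(x, is.logical, logical(1))
  if (any(is_lgl)) {
    # Fix the level order so that FALSE is always the reference level.
    x[is_lgl] <- lapply(x[is_lgl], factor, levels = c(FALSE, TRUE))
  }
  x
}

x <- data.frame(smoker = c(TRUE, FALSE, TRUE), age = c(30, 45, 52))
x2 <- logical_to_factor(x)
str(x2)
```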

Heavy vignettes

I wrote a vignette for parallel cross-validation, but it is slightly too heavy to build and check.
Furthermore, it is troublesome when users install the package from GitHub with devtools::install_github(..., build_vignettes=TRUE).
Should I move the part that actually runs to another folder such as demo, and set eval=FALSE in the vignette?
This needs some more thought.

Doesn't "add_linear_columns = FALSE" work?

Try:
library(MASS) # For Boston
library(aglm)
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.
x <- xy[-ncol(xy)]
y <- xy$y
str(aglm(x, y, add_linear_columns = TRUE)) # Works
str(aglm(x, y, add_linear_columns = FALSE)) # Error?
