kkondo1981 / aglm

A handy tool for actuarial modeling, which is designed to achieve both accuracy and accountability.

License: GNU General Public License v2.0


aglm's Introduction

What is this?

Accurate Generalized Linear Model (AGLM) is a regularized GLM that applies feature transformations based on a discretization of numerical features and specific coding methodologies for dummy variables. More details can be found in our paper.

2021/6/6: Our paper won the Charles A. Hachemeister Prize.

Installation

# The simplest way:
install.packages("aglm")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("kkondo1981/aglm", ref="develop")

Usage

After installing aglm, see the package help as follows.

library(aglm)
?"aglm-package"

aglm's Issues

Implement plot() for numerical and ordered data

What to plot (candidates) for one variable:

2-way sample plot (scatterplot)
linear effects for each x (lineplot)
OD effects for each x (lineplot)

(not to plot)

UD effects

Formula input

Could aglm accept a formula input, as glmnetUtils::glmnet does?
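
As a rough illustration of what a formula front end could do, the sketch below turns a formula and a data.frame into the x and y that aglm() currently takes. The helper name formula_to_xy is hypothetical and not part of aglm; it uses only base R's model.frame() and model.response(), and it deliberately keeps x as a data.frame so that factor columns are left for aglm's own coding.

```r
# Hypothetical helper (not part of aglm): split a formula into x and y.
formula_to_xy <- function(formula, data) {
  # model.frame() evaluates the formula and keeps only the variables used.
  mf <- model.frame(formula, data = data)
  # The response is always the first column of the model frame.
  y <- model.response(mf)
  # Keep the predictors as a data.frame (no dummy expansion here).
  x <- mf[, -1, drop = FALSE]
  list(x = x, y = y)
}

# Usage with a built-in dataset.
xy <- formula_to_xy(mpg ~ wt + hp + cyl, data = mtcars)
str(xy$x)
```

The result could then be passed on as aglm(xy$x, xy$y). Interaction terms and transformed terms in the formula would need extra handling that this sketch does not attempt.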

Small changes

aglm/examples/boston.R

Line 27 in 633d9d5

plot(y_pred, y_true)

aglm/examples/boston2.R

Line 37 in 2904927

plot(y_pred, y_true)

Can you change both of them to "plot(y_true, y_pred)"?

An additional option of plot.AccurateGLM

Since we can get the coefficients for a particular lambda with either coef(aglm(x, y,...), s = lambda) or coef(cv.aglm(x, y,...), s = lambda), could plot.AccurateGLM also produce the plots for a particular lambda via plot(aglm(x, y,...), s = lambda) or plot(cv.aglm(x, y,...), s = lambda)?

predict.aglm should accept a cv.aglm object

predict.glmnet accepts a cv.glmnet object, but predict.aglm does not yet seem to accept a cv.aglm object. See the example code below.
This is a problem because the glmnet package recommends against usage like glmnet_pred1 in the example and recommends usage like glmnet_pred2 instead.

# Preamble starts

library(MASS) # For Boston
library(glmnet)
library(aglm)

# Read data

xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.

# Split data into train and test

n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y

# Preamble ends. No new information so far

predict.glmnet accepts a cv.glmnet object:

set.seed(2018)
glmnet_CV <- cv.glmnet(as.matrix(x), y)
glmnet_lambda <- glmnet_CV$lambda.min
glmnet_model <- glmnet(as.matrix(x), y, lambda = glmnet_lambda)
glmnet_pred1 <- predict(glmnet_model, newx = as.matrix(newx), type = "response")
cat("RMSE: ", sqrt(mean((y_true - glmnet_pred1)^2)), "\n")
#RMSE: 3.67907: Not recommended
glmnet_pred2 <- predict(glmnet_CV, s = glmnet_lambda,
newx = as.matrix(newx), type = "response")
cat("RMSE: ", sqrt(mean((y_true - glmnet_pred2)^2)), "\n")
#RMSE: 3.678105: Recommended

predict.aglm doesn't accept a cv.aglm object:

set.seed(2018)
aglm_CV <- cv.aglm(x, y)
aglm_lambda <- aglm_CV$lambda.min
aglm_model <- aglm(x, y, lambda = aglm_lambda)
aglm_pred1 <- predict(aglm_model, newx = newx, type = "response")
cat("RMSE: ", sqrt(mean((y_true - aglm_pred1)^2)), "\n")
#RMSE: 3.10334: Not recommended?
aglm_pred2 <- predict(aglm_CV, s = aglm_lambda,
                      newx = newx, type = "response")
#Error
cat("RMSE: ", sqrt(mean((y_true - aglm_pred2)^2)), "\n")
#Error

Enable custom bins for OD vars

Add arguments mentioned below to aglm():

bins_list: A list of numeric vectors, each element of which indicates breaks of binning of one OD var.
bins_names: A list of column names or column index numbers, which specifies how the elements of bins_list correspond to the columns of x. The default value is NULL, in which case the elements of bins_list correspond to the columns of x in order.
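
To make the proposed correspondence concrete, here is a minimal sketch of how bins_list and bins_names could be matched to the columns of x. The function resolve_bins is hypothetical and is not aglm's implementation; it only illustrates the lookup rule described above, including the NULL default.

```r
# Hypothetical helper: map each element of bins_list to a column of x.
resolve_bins <- function(x, bins_list, bins_names = NULL) {
  # Default rule from the proposal: match columns of x in order.
  if (is.null(bins_names)) bins_names <- as.list(seq_along(bins_list))
  stopifnot(length(bins_list) == length(bins_names))

  out <- vector("list", ncol(x))
  names(out) <- colnames(x)
  for (i in seq_along(bins_list)) {
    key <- bins_names[[i]]
    # A name is looked up in colnames(x); a number is used as an index.
    j <- if (is.character(key)) match(key, colnames(x)) else as.integer(key)
    stopifnot(!is.na(j), j >= 1, j <= ncol(x))
    # Store sorted, de-duplicated breaks for that column.
    out[[j]] <- sort(unique(bins_list[[i]]))
  }
  out  # columns without custom bins stay NULL
}

x <- data.frame(a = 1:3, b = 4:6, c = 7:9)
res <- resolve_bins(x, bins_list = list(c(5, 4.5), c(1, 2)),
                    bins_names = list("b", 1))
```

Columns left as NULL would fall back to the automatic binning.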

Add equal frequency binning and make it default way

Currently, the default and only way for aglm() to make bins is Equal Width Binning. We should add Equal Frequency Binning and make it the default option.
Furthermore, increase the default bin count from the current 20 to 100.
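
The difference between the two binning schemes can be seen in a small base R sketch. The two helper functions are hypothetical names for illustration, not aglm's internal code; equal frequency breaks are taken from quantile(), and unique() guards against tied quantiles.

```r
# Hypothetical helpers illustrating the two binning schemes.
equal_width_breaks <- function(x, nbin) {
  seq(min(x), max(x), length.out = nbin + 1)
}
equal_freq_breaks <- function(x, nbin) {
  probs <- seq(0, 1, length.out = nbin + 1)
  unique(quantile(x, probs = probs, names = FALSE))
}

set.seed(1)
x <- rexp(1000)  # a skewed sample, where the two schemes differ a lot

width_counts <- as.vector(table(cut(x, equal_width_breaks(x, 5),
                                    include.lowest = TRUE)))
freq_counts <- as.vector(table(cut(x, equal_freq_breaks(x, 5),
                                   include.lowest = TRUE)))
print(width_counts)  # most observations fall into the first bins
print(freq_counts)   # bins hold (nearly) the same number of observations
```

On skewed data, equal width binning leaves the upper bins almost empty, which is one reason to prefer equal frequency binning as the default.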

newoffset for predict.glmnet

In the call below, "newoffset=newoffset" should be added, right?

aglm/R/predict-aglm.R

Lines 34 to 39 in 2904927

    
      glmnet_result <- predict(model@backend_models[[1]],
                               x_for_backend,
                               s=s,
                               type=type,
                               exact=exact,
                               ...)

Refactoring Signatures of aglm() and predict()

Because the current interfaces of these functions are early drafts, we should finalize them and modify aglm() and predict() accordingly.

Expand width of UD mat

Modify newInput() to use the drop_last=FALSE option when calling getUDummyMatForOneVec().
This is because, when optimizing with regularization terms, the results can differ depending on whether the last columns are used.

Defaults of type.measure

Defaults of type.measure seem to have some trouble.

First, try:

install.packages('CASdatasets',
repos = 'http://cas.uqam.ca/pub/R/',
type='source',
dependencies = TRUE
)
library(CASdatasets)
data(ausprivauto0405)
exy <- ausprivauto0405[1:500, c(1:6, 8)]
expo <- exy[,1]
x <- exy[2:6]
y <- exy[,7]
model <- cv.aglm(x = x, y = y, offset = log(expo), #type.measure = "deviance",
add_interaction_columns = FALSE, family = "poisson")
model@name

When I tried, the output was:

mse
"Mean-Squared Error"

But it should have been:

deviance
"Poisson Deviance"

Compare:

model <- cv.glmnet(x = sapply(x, as.numeric), y = y, offset = log(expo),
family = "poisson")
model$name

Its output:

deviance
"Poisson Deviance"

Line 97 in 2904927

This is not an important issue.

aglm/R/binning.R

Line 97 in 2904927

labels <- cut(x_vec, breaks=c(-Inf, breaks, Inf), labels=FALSE, right=FALSE)

In breaks=c(-Inf, breaks, Inf), the original breaks may already contain -Inf and/or Inf, which causes an error.
In fact, I once got an error message for this reason when using aglm.
So I think the right-hand side would be better as unique(c(-Inf, breaks, Inf)).
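
The failure and the suggested fix can be reproduced with base R alone. In this sketch, x_vec and breaks are made-up values; cut() stops with an error when its breaks contain duplicates, and unique() removes the repeated -Inf and Inf.

```r
x_vec <- c(-2, 0.5, 3)
breaks <- c(-Inf, 0, 1, Inf)  # breaks that already include -Inf and Inf

# Current form: -Inf and Inf appear twice, so cut() raises an error.
res <- tryCatch(
  cut(x_vec, breaks = c(-Inf, breaks, Inf), labels = FALSE, right = FALSE),
  error = function(e) conditionMessage(e)
)
print(res)  # the error message, returned as a character string

# Suggested form: duplicates are removed before calling cut().
labels <- cut(x_vec, breaks = unique(c(-Inf, breaks, Inf)),
              labels = FALSE, right = FALSE)
print(labels)
```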

Rewrite the start guide

fill later

Set license

Ref)
https://cran.r-project.org/web/licenses/

The “GNU Affero General Public License” version 3 (AGPL-3)
The “Artistic License” version 2.0 (Artistic-2.0)
The “BSD 2-clause License” (BSD_2_clause)
The “BSD 3-clause License” (BSD_3_clause)
The “GNU General Public License” version 2 (GPL-2)
The “GNU General Public License” version 3 (GPL-3)
The “GNU Library General Public License” version 2 (LGPL-2)
The “GNU Lesser General Public License” version 2.1 (LGPL-2.1)
The “GNU Lesser General Public License” version 3 (LGPL-3)
The “MIT License” (MIT)

Add Cross Validation

Write the first version of the cv.aglm() function, with almost the same interface as cv.glmnet().
See https://github.com/cran/glmnet/blob/master/R/cv.glmnet.R for details of cv.glmnet().

L dummy option

By using O dummies, the current version implements "fusion" (fused LASSO) not only for ordered factor variables but also for numeric variables. But, for numeric variables, "linear interpolation" may be a good, possibly better, alternative.

My idea is this:

Outline
For numeric variables, add an option of using new dummies, say, L dummies, instead of using O dummies and linear terms, where L dummies are dummies for linear interpolation. (Don't use linear terms and linear interpolation at the same time.)
Details
(i) L dummies can be implemented by adding a new function, say, getLDummyMatForOneVec.
The code would be something like the following. (It is the same as getODummyMatForOneVec except for the lines marked with a # on the right.)

getLDummyMatForOneVec <- function(x_vec, breaks=NULL, nbin.max=100, only_info=FALSE) {
 # Check arguments. only integer or numerical or ordered vectors are allowed.
 assert_that(is.integer(x_vec) | is.numeric(x_vec) | is.ordered(x_vec))

 # Execute binning
 binned_x <- executeBinning(x_vec, breaks=breaks, nbin.max=nbin.max, method="width") #

 # create dummy matrix for x_vec
 nrow <- length(x_vec)
 ncol <- length(binned_x$breaks)
 dummy_mat <- (binned_x$labels - t(matrix(1:ncol, ncol, nrow))) * (binned_x$labels > t(matrix(1:ncol, ncol, nrow))) #

 if (only_info) return(list(breaks=binned_x$breaks))
 else return(list(breaks=binned_x$breaks, dummy_mat=dummy_mat))
}

(ii) For plot-aglm.R, the part corresponding to L dummies would replace the current code for numeric features, which is:

aglm/R/plot-aglm.R

Lines 23 to 31 in c6570d7

      # Plot for numeric features
      slope <- coefs$coef.linear
      steps <- coefs$coef.OD
      if (is.null(slope)) slope <- 0
      if (is.null(steps)) steps <- 0

      x <- var_info$OD_info$breaks
      y <- slope * x + cumsum(steps)
      type <- ifelse(slope == 0, "s", "l")

With L dummies, it would become something like this:

      # Plot for numeric features
      slopesteps <- coefs$coef.LD
      if (is.null(slopesteps)) slopesteps <- 0

      x <- var_info$LD_info$breaks
      y <- cumsum(cumsum(slopesteps))
      type <- "l"
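
The link between the L dummy matrix in (i) and the double cumulative sum in (ii) can be checked numerically. The sketch below uses made-up bin labels and coefficients, and builds the matrix with the same expression as getLDummyMatForOneVec; it is a consistency check under those assumptions, not a test of aglm itself.

```r
# Made-up bin labels 1..6 and 5 break points, as binning would produce.
labels <- 1:6
ncol <- 5
nrow <- length(labels)

# Same construction as in getLDummyMatForOneVec: entry (i, j) is
# max(labels[i] - j, 0), a piecewise linear basis in the bin label.
idx <- t(matrix(1:ncol, ncol, nrow))
dummy_mat <- (labels - idx) * (labels > idx)

# Made-up coefficients: each one is a change of slope at a break.
slopesteps <- c(0.5, -0.2, 0, 0.3, 0.1)

# Contribution of the variable for each label.
contrib <- as.vector(dummy_mat %*% slopesteps)

# The contribution at label k + 1 equals cumsum(cumsum(slopesteps))[k],
# which is the curve drawn at the k-th break in the plotting code.
print(contrib[-1])
print(cumsum(cumsum(slopesteps)))
```

The inner cumsum turns slope changes into slopes, and the outer cumsum turns slopes into values, which is why the plotted curve is continuous and piecewise linear.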

Change names?

This problem is not at all crucial. But, anyway, it seems that the term "intersection" is sometimes mistakenly used for "interaction". Especially, the argument "add_intersection_columns" should be renamed to "add_interaction_columns", shouldn't it? Or is that intentional?

predict.AccurateGLM with type = "coefficients" or "nonzero"

In aglm, exactly as in glmnet, you should be able to use
predict(model, type = "coefficients") and predict(model, type = "nonzero")
without specifying any newx. But, at the moment, they are not available in that way.
(This issue is much less crucial than #18, though.)

Make tests for aglm() and predict() from the example code

fill later

Tests for more datasets

I am planning to test with datasets of various types and families.
Testing will help me find bugs and check performance.
Moreover, verifying before each release that the results have not changed seems useful.
Surveying 'CASdatasets' may be a good starting point.

Error with predict function

aglm/R/predict-aglm.R

Line 22 in 9ba4281

predict.AccurateGLM <- function(model,

Without setting type = "response", the predict function failed in a binary classification task.
I used the "binomial" family to train aglm.

Could you check this and fix it if necessary?

Another additional option of plot.AccurateGLM

In the current version, plot.AccurateGLM always produces all the plots. Could this function produce the plot for only a chosen variable, with something like plot(aglm(x, y,...), var_name = var_name), where var_name is the name (or the number?) of the variable?

Partial residuals

Could plot.AccurateGLM include partial residuals, as mgcv::plot.gam does?

Keep fit.preval, etc. in cv.aglm()

When keep = TRUE, fit.preval, etc. must be provided. Two more lines are needed in the following, right?

aglm/R/cv-aglm.R

Lines 127 to 137 in abcd0ce

    
      return(new("AccurateGLM", backend_models=list(cv.glmnet=cv.glmnet_result$glmnet.fit),
                 lambda=cv.glmnet_result$lambda,
                 cvm=cv.glmnet_result$cvm,
                 cvsd=cv.glmnet_result$cvsd,
                 cvup=cv.glmnet_result$cvup,
                 cvlo=cv.glmnet_result$cvlo,
                 nzero=cv.glmnet_result$nzero,
                 name=cv.glmnet_result$name,
                 lambda.min=cv.glmnet_result$lambda.min,
                 lambda.1se=cv.glmnet_result$lambda.1se,
                 vars_info=x@vars_info))

Enhance cv.aglm() for tuning alpha

I believe it's better to investigate external libraries such as caret and try to provide interfaces for them that are sufficient to tune the alpha of aglm, because writing a grid search from scratch is slightly troublesome. If that approach is not likely to work well, we would write it ourselves from scratch.
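
For reference, the from-scratch fallback is fairly short when written against glmnet directly. The sketch below runs cv.glmnet() over a small grid of alpha values on simulated data, reusing one foldid so that the cross-validated errors are comparable across alphas. Whether cv.aglm() can be swapped in with the same alpha and foldid arguments is an assumption, not something confirmed here.

```r
library(glmnet)

# Simulated data, only for illustration.
set.seed(2018)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

# Fix the fold assignment so every alpha is scored on the same folds.
foldid <- sample(rep(1:5, length.out = n))

# Grid of candidate alpha values (0 = ridge, 1 = lasso).
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, alpha = a, foldid = foldid)
})

# For each alpha, take the best cross-validated error over lambda.
cv_errors <- sapply(cv_fits, function(fit) min(fit$cvm))

best <- which.min(cv_errors)
best_alpha <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min
cat("alpha:", best_alpha, " lambda:", best_lambda, "\n")
```

A caret-based interface would wrap the same loop, so this also gives a baseline to compare against.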

install_github in case of not installing glmnet

have to set dependencies...

devtools::install_github("kkondo1981/aglm", build_vignettes=TRUE)
Downloading GitHub repo kkondo1981/aglm@master
✔ checking for file ‘/private/var/folders/3w/zy8n5vsd1cqcfjd42sq_ynk40000gn/T/RtmpMsenAR/remotes2996349476dc/kkondo1981-aglm-16bcb01/DESCRIPTION’ ...
─ preparing ‘aglm’:
✔ checking DESCRIPTION meta-information ...
Warning in as.POSIXlt.POSIXct(x, tz) :
unknown timezone 'zone/tz/2019a.1.0/zoneinfo/Asia/Tokyo'
─ checking for LF line-endings in source and make files
─ checking for empty or unneeded directories
─ building ‘aglm_0.3.0.tar.gz’

Warning in strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
unknown timezone 'zone/tz/2019a.1.0/zoneinfo/Asia/Tokyo'

installing source package ‘aglm’ ...
** R
** preparing package for lazy loading
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called ‘glmnet’
ERROR: lazy loading failed for package ‘aglm’
removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/aglm’
Error in i.p(...) :
(converted from warning) installation of package ‘/var/folders/3w/zy8n5vsd1cqcfjd42sq_ynk40000gn/T//RtmpMsenAR/file299670b35ce3/aglm_0.3.0.tar.gz’ had non-zero exit status

logical features

aglm/R/cv-aglm.R

Line 26 in 049aa58

cv.aglm <- function(x, y,

One coin.
When passing logical features to the cv.aglm function (and maybe also aglm?),
it did not seem to work properly.

Could you check the implementation and fix it if necessary?

Heavy vignettes

I wrote a vignette for parallel cross-validation, but it is slightly too heavy to build and check.
Furthermore, it is troublesome when users install the package from GitHub with devtools::install_github(..., build_vignettes=TRUE).
Should I move the part that actually runs to another folder such as demo, and set eval=FALSE for the vignette?
This needs some more thought...

Doesn't "add_linear_columns = FALSE" work?

Try:
library(MASS) # For Boston
library(aglm)
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.
x <- xy[-ncol(xy)]
y <- xy$y
str(aglm(x, y, add_linear_columns = TRUE)) # Works
str(aglm(x, y, add_linear_columns = FALSE)) # Error?
