kkondo1981 / aglm
A handy tool for actuarial modeling, which is designed to achieve both accuracy and accountability.
License: GNU General Public License v2.0
fill later
I believe it would be better to investigate external libraries such as caret and to provide just enough of an interface for them to tune aglm's alpha, because writing a grid search from scratch is somewhat troublesome. If that approach is unlikely to work well, we can write it ourselves from scratch.
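For comparison, a hand-written grid search over alpha is fairly short when built on cv.glmnet(). The sketch below uses glmnet itself rather than aglm or caret, on simulated data; the grid, the fold count, and all variable names are assumptions made for illustration. The same foldid is reused for every alpha so that the cross-validation errors are comparable.

```r
library(glmnet)

set.seed(2018)
n <- 200
p <- 5
x <- matrix(rnorm(n * p), n, p)           # simulated design matrix
y <- x[, 1] - 2 * x[, 2] + rnorm(n)       # simulated response

alphas <- seq(0, 1, by = 0.25)            # candidate alpha values (illustrative grid)
foldid <- sample(rep(1:5, length.out = n))  # fixed folds shared by all alphas

# One cross-validated fit per alpha, all on the same folds.
cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))

# Best (minimum) mean cross-validated error for each alpha.
cv_err <- sapply(cv_fits, function(fit) min(fit$cvm))

best <- which.min(cv_err)
best_alpha <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min
```

A cv.aglm() with a cv.glmnet()-like interface could presumably be dropped into the same loop, which is one argument for keeping the two interfaces close.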
Write the first version of the cv.aglm() function, with almost the same interface as cv.glmnet().
See https://github.com/cran/glmnet/blob/master/R/cv.glmnet.R for details of cv.glmnet().
Modify newInput() to use the drop_last=FALSE option when calling getUDummyMatForOneVec().
This is because, when optimizing with regularization terms, the results can differ depending on whether the last column is used or not.
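The effect can be seen on a toy example. The sketch below is not aglm code: it uses a plain ridge penalty as a stand-in for glmnet's elastic net, and ridge_fitted is a hypothetical helper written only for this illustration. Without a penalty, the full dummy coding and the coding with the last column dropped give the same fitted values; with a penalty, they do not.

```r
# Ridge fit with an unpenalized intercept (handled by centering X and y).
# Hypothetical helper for illustration only.
ridge_fitted <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc))
  drop(mean(y) + Xc %*% beta)
}

f <- factor(c("a", "a", "b", "b", "c", "c"))
y <- c(1, 2, 4, 5, 9, 10)

# Full dummy matrix (one column per level) and the same matrix without its last column.
full <- sapply(levels(f), function(l) as.numeric(f == l))
dropped <- full[, -ncol(full), drop = FALSE]

# Unpenalized least squares: both codings span the same space, so the fits agree.
fit_full_ols <- unname(fitted(lm(y ~ full)))
fit_drop_ols <- unname(fitted(lm(y ~ dropped)))

# Penalized fits: the penalty treats the two codings differently, so the fits differ.
fit_full_ridge <- ridge_fitted(full, y, lambda = 1)
fit_drop_ridge <- ridge_fitted(dropped, y, lambda = 1)
```

With the full coding every level is shrunk toward the overall mean symmetrically, whereas dropping the last column makes that level an unpenalized reference.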
Can you make aglm accept a formula input, as in glmnetUtils::glmnet?
Add the arguments mentioned below to aglm():
bins_list: A list of numeric vectors, each element of which indicates the binning breaks of one OD variable.
bins_names: A list of column names or column index numbers, which specifies how the elements of bins_list correspond to columns of x. The default value is NULL, and in that case, the elements of bins_list correspond to the columns of x in order.
Try:
library(MASS) # For Boston
library(aglm)
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.
x <- xy[-ncol(xy)]
y <- xy$y
str(aglm(x, y, add_linear_columns = TRUE)) # Works
str(aglm(x, y, add_linear_columns = FALSE)) # Error?
I plan to test with datasets of various types and families.
Testing lets me find bugs and check performance.
Moreover, verifying before each release that the results have not changed seems useful.
Surveying 'CASdatasets' might be worthwhile.
In aglm, exactly as in glmnet, you should be able to use predict(model, type = "coefficients") and predict(model, type = "nonzero") without specifying any newx. At the moment, however, they are not available in that way.
(This issue is much less crucial than #18, though.)
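For reference, this is how the two prediction types behave in glmnet itself. This is a minimal sketch on simulated data: the data and the choice s = 0.1 are assumptions for illustration, and it shows glmnet's behavior, not aglm's.

```r
library(glmnet)

set.seed(2018)
n <- 100
p <- 4
x <- matrix(rnorm(n * p), n, p)   # simulated design matrix
y <- 2 * x[, 1] + rnorm(n)        # only the first column carries signal

fit <- glmnet(x, y)

# Neither call needs newx: both depend only on the fitted coefficients.
coefs <- predict(fit, type = "coefficients", s = 0.1)  # intercept plus p coefficients
nz <- predict(fit, type = "nonzero", s = 0.1)          # indices of nonzero coefficients
```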
Currently, the default and only way for aglm() to make bins is Equal Width Binning, but we should add Equal Frequency Binning and make it the default option.
Furthermore, increase default bin count to 100 from current 20.
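The difference between the two binning schemes can be sketched in base R. The helper names equal_width_breaks and equal_freq_breaks are hypothetical and are not taken from aglm; 10 bins are used here only to keep the output small.

```r
# Equal Width Binning: breaks evenly spaced between min and max.
equal_width_breaks <- function(x, nbin) {
  seq(min(x), max(x), length.out = nbin + 1)
}

# Equal Frequency Binning: breaks at evenly spaced quantiles.
# unique() guards against duplicated breaks when x has many ties.
equal_freq_breaks <- function(x, nbin) {
  unique(quantile(x, probs = seq(0, 1, length.out = nbin + 1), names = FALSE))
}

set.seed(1)
x <- rexp(1000)  # skewed data, where the two schemes differ clearly

bw <- equal_width_breaks(x, 10)
bf <- equal_freq_breaks(x, 10)

# Number of observations falling in each bin under each scheme.
counts_w <- table(cut(x, bw, include.lowest = TRUE))
counts_f <- table(cut(x, bf, include.lowest = TRUE))
```

On skewed data such as this, equal-width bins crowd most observations into the first few bins, while equal-frequency bins hold roughly the same number of observations each.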
In below, "newoffset=newoffset" should be added, right?
Lines 34 to 39 in 2904927
Ref)
https://cran.r-project.org/web/licenses/
This problem is not at all crucial. But, anyway, it seems that the term "intersection" is sometimes mistakenly used for "interaction". Especially, the argument "add_intersection_columns" should be renamed to "add_interaction_columns", shouldn't it? Or is that intentional?
The defaults of type.measure seem to have a problem.
First, try:
install.packages('CASdatasets',
repos = 'http://cas.uqam.ca/pub/R/',
type='source',
dependencies = TRUE
)
library(CASdatasets)
data(ausprivauto0405)
exy <- ausprivauto0405[1:500, c(1:6, 8)]
expo <- exy[,1]
x <- exy[2:6]
y <- exy[,7]
model <- cv.aglm(x = x, y = y, offset = log(expo), #type.measure = "deviance",
add_interaction_columns = FALSE, family = "poisson")
model@name
When I tried, the output was:
mse
"Mean-Squared Error"
But it should have been:
deviance
"Poisson Deviance"
Compare:
model <- cv.glmnet(x = sapply(x, as.numeric), y = y, offset = log(expo),
family = "poisson")
model$name
Its output:
deviance
"Poisson Deviance"
What to plot (candidates) for one variable:
(not to plot)
Not an important issue.
Line 97 in 2904927
In breaks=c(-Inf, breaks, Inf), the original breaks may already contain -Inf and/or Inf, which will cause an error. Using unique(c(-Inf, breaks, Inf)) instead would avoid this.
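The failure can be reproduced with base R's cut(), which rejects duplicated breaks. The referenced line of aglm is not shown here, so whether the padded breaks are actually passed to cut() is an assumption; the values of breaks and x are made up for illustration.

```r
breaks <- c(-Inf, 0, 10, 20)   # user-supplied breaks that already start at -Inf
x <- c(-5, 3, 12, 25)

# Padding unconditionally duplicates -Inf, and cut() stops with an error.
padded <- c(-Inf, breaks, Inf)
res <- tryCatch(cut(x, breaks = padded),
                error = function(e) conditionMessage(e))

# Removing duplicates after padding keeps the breaks valid.
safe <- unique(c(-Inf, breaks, Inf))
binned <- cut(x, breaks = safe)
```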
predict.glmnet accepts a cv.glmnet object, but it seems that predict.aglm is not yet ready for a cv.aglm object. See the example code below.
This is a problem because the glmnet package recommends against usage like glmnet_pred1 in the example and recommends usage like glmnet_pred2 instead.
library(MASS) # For Boston
library(glmnet)
library(aglm)
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y
set.seed(2018)
glmnet_CV <- cv.glmnet(as.matrix(x), y)
glmnet_lambda <- glmnet_CV$lambda.min
glmnet_model <- glmnet(as.matrix(x), y, lambda = glmnet_lambda)
glmnet_pred1 <- predict(glmnet_model, newx = as.matrix(newx), type = "response")
cat("RMSE: ", sqrt(mean((y_true - glmnet_pred1)^2)), "\n")
#RMSE: 3.67907: Not recommended
glmnet_pred2 <- predict(glmnet_CV, s = glmnet_lambda,
newx = as.matrix(newx), type = "response")
cat("RMSE: ", sqrt(mean((y_true - glmnet_pred2)^2)), "\n")
#RMSE: 3.678105: Recommended
set.seed(2018)
aglm_CV <- cv.aglm(x, y)
aglm_lambda <- aglm_CV$lambda.min
aglm_model <- aglm(x, y, lambda = aglm_lambda)
aglm_pred1 <- predict(aglm_model, newx = newx, type = "response")
cat("RMSE: ", sqrt(mean((y_true - aglm_pred1)^2)), "\n")
#RMSE: 3.10334: Not recommended?
aglm_pred2 <- predict(aglm_CV, s = aglm_lambda,
newx = newx, type = "response")
#Error
cat("RMSE: ", sqrt(mean((y_true - aglm_pred2)^2)), "\n")
#Error
I wrote a vignette for parallel cross-validation, but it is somewhat too heavy to run when building and checking the package.
Furthermore, it causes trouble when users install the package from GitHub with devtools::install_github(..., build_vignettes=TRUE).
Should I move the part that actually runs to another folder such as demo, and set eval=FALSE for the vignette?
This needs some more thought...
fill later
In the current version, we have no choice but to get all the plots when we use the plot.AccurateGLM function. Could you make it possible for this function to plot only a chosen variable, with something like plot(aglm(x, y,...), var_name = var_name), in which var_name is the name (or the number?) of the variable?
Line 22 in 9ba4281
Without setting type = "response", the predict function failed in a binary classification task.
I used the "binomial" family to train aglm.
Could you check this and fix it if necessary?
As we can get the coefs for a particular lambda by either coef(aglm(x, y,...), s = lambda) or coef(cv.aglm(x, y,...), s = lambda), can you make it possible for plot.AccurateGLM to produce the plots for a particular lambda by plot(aglm(x, y,...), s = lambda) or plot(cv.aglm(x, y,...), s = lambda)?
When keep = TRUE, fit.preval, etc. must be provided. Two more lines are needed in the following, right?
Lines 127 to 137 in abcd0ce
Because the current interfaces of these functions are early drafts, we should fix them and modify aglm() and predict() accordingly.
Can you make it possible for plot.AccurateGLM to include partial residuals, as in mgcv::plot.gam?
By using O dummies, the current version implements "fusion" (fused LASSO) not only for ordered factor variables but also for numeric variables. But, for numeric variables, "linear interpolation" may be a good, possibly better, alternative.
My idea is this:
(i) Add a function getLDummyMatForOneVec. (It is almost the same as getODummyMatForOneVec but with only differences in the lines with #'s on the right.)
getLDummyMatForOneVec <- function(x_vec, breaks=NULL, nbin.max=100, only_info=FALSE) {
# Check arguments. only integer or numerical or ordered vectors are allowed.
assert_that(is.integer(x_vec) | is.numeric(x_vec) | is.ordered(x_vec))
# Execute binning
binned_x <- executeBinning(x_vec, breaks=breaks, nbin.max=nbin.max, method="width") #
# create dummy matrix for x_vec
nrow <- length(x_vec)
ncol <- length(binned_x$breaks)
dummy_mat <- (binned_x$labels - t(matrix(1:ncol, ncol, nrow))) * (binned_x$labels > t(matrix(1:ncol, ncol, nrow))) #
if (only_info) return(list(breaks=binned_x$breaks))
else return(list(breaks=binned_x$breaks, dummy_mat=dummy_mat))
}
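To see what the dummy_mat line produces, the same expression can be evaluated on a few hand-picked bin labels. In this sketch labels stands in for binned_x$labels, and the coefficient vector beta is a made-up value for illustration. Column j holds max(label - j, 0), so a linear combination of the columns is piecewise linear in the bin label.

```r
labels <- c(1L, 2L, 4L)   # bin labels of three observations (stand-in for binned_x$labels)
ncol <- 4                 # number of breaks
nrow <- length(labels)

# Each row of idx is 1:ncol, as in the function above.
idx <- t(matrix(1:ncol, ncol, nrow))

# Entry [i, j] is (labels[i] - j) if labels[i] > j, and 0 otherwise.
dummy_mat <- (labels - idx) * (labels > idx)

# A made-up coefficient vector: slope +1 starting at bin 1, then flat from bin 2 on.
beta <- c(1, -1, 0, 0)
contribution <- drop(dummy_mat %*% beta)
```

Here the rows of dummy_mat are (0, 0, 0, 0), (1, 0, 0, 0) and (3, 2, 1, 0), and the resulting contributions are 0, 1 and 1.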
(ii) In plot-aglm.R, the corresponding part for L dummies would be something like this (see lines 23 to 31 in c6570d7):
# Plot for numeric features
slopesteps <- coefs$coef.LD
if (is.null(slopesteps)) slopesteps <- 0
x <- var_info$LD_info$breaks
y <- cumsum(cumsum(slopesteps))
type <- "l"
Line 26 in 049aa58
One coin.
When passing logical features to the cv.aglm function (and maybe aglm too?), it does not seem to work properly.
Could you check the implementation and fix it if necessary?
have to set dependencies...
devtools::install_github("kkondo1981/aglm", build_vignettes=TRUE)
Downloading GitHub repo kkondo1981/aglm@master
✔ checking for file ‘/private/var/folders/3w/zy8n5vsd1cqcfjd42sq_ynk40000gn/T/RtmpMsenAR/remotes2996349476dc/kkondo1981-aglm-16bcb01/DESCRIPTION’ ...
─ preparing ‘aglm’:
✔ checking DESCRIPTION meta-information ...
Warning in as.POSIXlt.POSIXct(x, tz) :
unknown timezone 'zone/tz/2019a.1.0/zoneinfo/Asia/Tokyo'
─ checking for LF line-endings in source and make files
─ checking for empty or unneeded directories
─ building ‘aglm_0.3.0.tar.gz’
Warning in strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
unknown timezone 'zone/tz/2019a.1.0/zoneinfo/Asia/Tokyo'