
scorecard's Introduction

scorecard


The goal of the scorecard package is to make the development of traditional credit risk scorecard models easier and more efficient by providing functions for the common tasks summarized below. The package can also be used when developing machine learning models for binary classification.

  • data preprocessing (split_df, replace_na, one_hot, var_scale)
  • weight of evidence (woe) binning (woebin, woebin_plot, woebin_adj, woebin_ply)
  • variable selection (var_filter, iv, vif)
  • performance evaluation (perf_eva, perf_cv, perf_psi)
  • scorecard scaling (scorecard, scorecard2, scorecard_ply)
  • scorecard report (gains_table, report)

Installation

  • Install the release version of scorecard from CRAN with:
install.packages("scorecard")
  • Install the latest version of scorecard from GitHub with:
# install.packages("devtools")
devtools::install_github("shichenxie/scorecard")

Example

This is a basic example which shows you how to develop a common credit risk scorecard:

# Traditional Credit Scoring Using Logistic Regression
library(scorecard)

# data preparing ------
# load germancredit data
data("germancredit")
# filter variable via missing rate, iv, identical value rate
dt_f = var_filter(germancredit, y="creditability")
# breaking dt into train and test
dt_list = split_df(dt_f, y="creditability", ratios = c(0.6, 0.4), seed = 30)
label_list = lapply(dt_list, function(x) x$creditability)

# woe binning ------
bins = woebin(dt_f, y="creditability")
# woebin_plot(bins)

# binning adjustment
## adjust breaks interactively
# breaks_adj = woebin_adj(dt_f, "creditability", bins) 
## or specify breaks manually
breaks_adj = list(
  age.in.years=c(26, 35, 40),
  other.debtors.or.guarantors=c("none", "co-applicant%,%guarantor"))
bins_adj = woebin(dt_f, y="creditability", breaks_list=breaks_adj)

# converting train and test into woe values
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins_adj))

# glm / selecting variables ------
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
# vif(m1, merge_coef = TRUE) # summary(m1)
# Select a formula-based model by AIC (or by LASSO for large dataset)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
# vif(m2, merge_coef = TRUE) # summary(m2)

# performance ks & roc ------
## predicted probability
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
## Adjusting for oversampling (support.sas.com/kb/22/601.html)
# card_prob_adj = scorecard2(bins_adj, dt=dt_list$train, y='creditability', 
#                x=sub('_woe$','',names(coef(m2))[-1]), badprob_pop=0.03, return_prob=TRUE)
                
## performance
perf = perf_eva(pred = pred_list, label = label_list)
# perf_adj = perf_eva(pred = card_prob_adj$prob, label = label_list$train)

# score ------
## scorecard
card = scorecard(bins_adj, m2)
## credit score
score_list = lapply(dt_list, function(x) scorecard_ply(x, card))
## psi
perf_psi(score = score_list, label = label_list)

# make cutoff decisions -----
## gains table
gtbl = gains_table(score = unlist(score_list), label = unlist(label_list))
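## As a hedged illustration (not part of the package API), the gains table idea can
## feed a simple cutoff comparison such as the one below; the 'score' column name and
## the "good"/"bad" labels are assumptions based on scorecard_ply() output and germancredit.
cutoffs = seq(500, 650, by = 50)
t(sapply(cutoffs, function(co) {
  approved = score_list$test$score >= co
  c(cutoff = co,
    approval_rate = mean(approved),
    bad_rate = mean(label_list$test[approved] == "bad"))
}))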

scorecard's People

Contributors

jbkunst, mattdowle, miroslavftn, mthomas-ketchbrook, shichenxie


scorecard's Issues

Merge `missing` bin with few counts with another bin

Hi Shichen,

Thanks so much for your package and all your time and work in it!

Is there any way in woebin or another function to merge a missing category (NA values) with very few counts (< count_distr_limit) into a category with a similar badprob?

For example, I would like to merge the missing bin with [0.54,0.8) (the 4th one), because they have similar bad rates.

set.seed(123)

N <- 1000
p <- runif(N)
y <- rbinom(N, 1, p)

p[runif(N) < 0.01] <- NA

scorecard::woebin(data.frame(p, y), y = "y")
#> [INFO] creating woe binning ...
#> $p
#>    variable         bin count count_distr good bad    badprob        woe
#> 1:        p     missing     7       0.007    2   5 0.71428571  0.8362480
#> 2:        p [-Inf,0.14)   139       0.139  133   6 0.04316547 -3.1786324
#> 3:        p [0.14,0.54)   403       0.403  253 150 0.37220844 -0.6027969
#> 4:        p  [0.54,0.8)   254       0.254   75 179 0.70472441  0.7898550
#> 5:        p  [0.8,0.94)   137       0.137   15 122 0.89051095  2.0159281
#> 6:        p [0.94, Inf)    60       0.060    2  58 0.96666667  3.2872531
#>        bin_iv total_iv  breaks is_special_values
#> 1: 0.00455648 1.903872 missing              TRUE
#> 2: 0.84406952 1.903872    0.14             FALSE
#> 3: 0.14384048 1.903872    0.54             FALSE
#> 4: 0.14847755 1.903872     0.8             FALSE
#> 5: 0.40997000 1.903872    0.94             FALSE
#> 6: 0.35295827 1.903872     Inf             FALSE

Created on 2019-12-24 by the reprex package (v0.3.0)

Thanks in advance for your response.

cc @jm448

a bug in woebin ?

Today I used woebin to bin a factor variable, but the result is strange.

pflag<-rep(0,12278)
cus_cus_class<-rep("1",12278)

pflag1<-rep(1,241213)
cus_cus_class1<-rep("1",241213)

pflag2<-rep(0,3646)
cus_cus_class2<-rep("3",3646)

pflag3<-rep(1,1762)
cus_cus_class3<-rep("3",1762)

pflagall<-c(pflag,pflag1,pflag2,pflag3)
cus_cus_classall<-c(cus_cus_class,cus_cus_class1,cus_cus_class2,cus_cus_class3)

cus_cus_classall<-as.factor(cus_cus_classall)

df=data.frame(pflagall,cus_cus_classall)

table(df)


library(scorecard)
library(smbinning)

iv(df,"pflagall","cus_cus_classall")

woebin(df,"pflagall","cus_cus_classall")

smbinning.factor(df,"pflagall","cus_cus_classall")$ivtable

The IV from iv and from smbinning is 0.84, but woebin cannot even produce a binning.

The results recorded by the report function differ from those given by perf_eva and perf_psi

perf_eva gives a train KS of 23, but the report says 17. The score PSI is also different.
Below is a reproducible example, including data and code. Looking forward to your help.

temp.zip

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936 LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] scorecard_0.3.1 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3
[9] ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2 blob_1.2.1 rlang_0.4.7
[8] pillar_1.4.6 withr_2.2.0 glue_1.4.1 DBI_1.1.0 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
[15] foreach_1.5.0 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 zip_2.0.4
[22] codetools_0.2-16 doParallel_1.0.15 parallel_4.0.2 fansi_0.4.1 broom_0.7.0 Rcpp_1.0.5 backports_1.1.8
[29] scales_1.1.1 jsonlite_1.7.0 farver_2.0.3 fs_1.4.2 gridExtra_2.3 digest_0.6.25 hms_0.5.3
[36] packrat_0.5.0 stringi_1.4.6 openxlsx_4.1.5 grid_4.0.2 cli_2.0.2 tools_4.0.2 magrittr_1.5
[43] crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 data.table_1.12.8 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9
[50] assertthat_0.2.1 httr_1.4.1 rstudioapi_0.11 iterators_1.0.12 R6_2.4.1 compiler_4.0.2

Adjusted WoE

First of all, thank you for your useful package. I have a question: I don't quite understand what kind of adjustment the woebin function applies when there is a zero-frequency class in a category. How are the WoE values calculated in that case?

Issue with the counts in the woebin results

Hi @ShichenXie,
I have the following question regarding the counts and the cut points of the variables. In this example, replicating the record counts for variable “x” using the base::cut function does not give the same results as the woebin function.
Also, I have verified that when using the woebin_ply function, the counts do match the base::cut calculation.
Thank you,

library(readr)
library(scorecard)
#> Warning: package 'scorecard' was built under R version 4.0.5
suppressPackageStartupMessages(library(dplyr))

packageVersion("scorecard")
#> [1] '0.3.2'

d <- read_csv("https://gist.githubusercontent.com/jm448/a8edc0f3a89c6797c52aa84f978eca6f/raw/4ca39c576a23ae5b94b19c5829149d6800b75991/data.txt")
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   x = col_double(),
#>   response = col_double()
#> )

glimpse(d)
#> Rows: 253
#> Columns: 2
#> $ x        <dbl> 1.0000000, -999.0000000, 0.3639344, 0.9988413, 0.7696078, ...
#> $ response <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

bin <- woebin(
  d, 
  x = "x",
  y = "response",
  method = "tree",
  count_distr_limit = 0.05 
)
#> [INFO] creating woe binning ...

bin
#> $x
#>    variable        bin count count_distr neg pos    posprob         woe
#> 1:        x   [-Inf,0)    24  0.09486166  15   9 0.37500000  1.16158709
#> 2:        x   [0,0.52)    88  0.34782609  80   8 0.09090909 -0.63017238
#> 3:        x [0.52,0.6)    21  0.08300395  16   5 0.23809524  0.50926190
#> 4:        x [0.6, Inf)   120  0.47430830 102  18 0.15000000 -0.06218834
#>         bin_iv  total_iv breaks is_special_values
#> 1: 0.179555187 0.3174041      0             FALSE
#> 2: 0.110649986 0.3174041   0.52             FALSE
#> 3: 0.025403323 0.3174041    0.6             FALSE
#> 4: 0.001795579 0.3174041    Inf             FALSE

# counts:

bin$x$count
#> [1]  24  88  21 120

brks <- bin$x$breaks
brks <- as.numeric(brks)
brks <- c(-Inf, brks)

brks
#> [1] -Inf 0.00 0.52 0.60  Inf

dc <- d %>% 
  mutate(x_bin = cut(x, brks, right = FALSE)) %>% 
  count(x_bin)

dc
#> # A tibble: 4 x 2
#>   x_bin          n
#>   <fct>      <int>
#> 1 [-Inf,0)      24
#> 2 [0,0.52)      88
#> 3 [0.52,0.6)    20
#> 4 [0.6, Inf)   121

# the counts using cut don't match the woebin results

dc$n
#> [1]  24  88  20 121

# the counts match using woebin_ply

woebin_ply(d, bins = bin, to = "bin") %>%
  as_tibble() %>% 
  mutate(
    x = d$x,
    x_bin2 = cut(x, brks, right = FALSE)
  ) %>% 
  filter(x_bin != x_bin2)
#> [INFO] converting into woe values ...
#> # A tibble: 0 x 4
#> # ... with 4 variables: response <dbl>, x_bin <chr>, x <dbl>, x_bin2 <fct>

# the woe values match

bin$x %>% 
  mutate(
    neg_porc = neg / sum(neg),
    pos_porc = pos / sum(pos),
    woe2 = log(pos_porc / neg_porc)
  ) %>% 
  select(bin, count, pos, neg, woe, woe2) %>% 
  mutate(woe == woe2)
#>           bin count pos neg         woe        woe2 woe == woe2
#> 1:   [-Inf,0)    24   9  15  1.16158709  1.16158709        TRUE
#> 2:   [0,0.52)    88   8  80 -0.63017238 -0.63017238        TRUE
#> 3: [0.52,0.6)    21   5  16  0.50926190  0.50926190        TRUE
#> 4: [0.6, Inf)   120  18 102 -0.06218834 -0.06218834        TRUE

# woe values using base::cut

bin2 <- d %>% 
  mutate(x_bin = cut(x, brks, right = FALSE)) %>% 
  count(x_bin, response) %>% 
  mutate(response = if_else(response == 1, "pos", "neg")) %>% 
  tidyr::pivot_wider(names_from = "response", values_from = "n") %>% 
  mutate(
    neg_porc = neg / sum(neg),
    pos_porc = pos / sum(pos),
    woe2 = log(pos_porc / neg_porc)
  ) %>% 
  select(x_bin, woe2)
  
# the woe values using base::cut don't match the woe values from woebin

bin$x %>% 
  select(bin, woe) %>% 
  left_join(bin2, by = c("bin" = "x_bin")) %>% 
  mutate(woe == woe2)
#>           bin         woe        woe2 woe == woe2
#> 1:   [-Inf,0)  1.16158709  1.16158709        TRUE
#> 2:   [0,0.52) -0.63017238 -0.63017238        TRUE
#> 3: [0.52,0.6)  0.50926190  0.57380042       FALSE
#> 4: [0.6, Inf) -0.06218834 -0.07194452       FALSE

Created on 2021-05-19 by the reprex package (v0.3.0)

function rmcol_datetime_unique1 assumes dt must have a character column

When I use the scorecard package, which is a great tool, I encounter the error
"Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
'data' must be of a vector type, was 'NULL'"
when binning variables with woebin() or woebin_ply().
Digging into it, I found that the internal function rmcol_datetime_unique1() assumes dt must have a character column, which is not the case in my data.

Scaling for "Good" model

Hi,

Can I confirm one thing?

If I want to generate a scorecard that predicts 'good' (i.e. the positive is 'good'), the embedded scaling formula (i.e. score = A - B*ln(odds)) will not be correct.

Any knows a work around? Thanks.


I figured out the work around.

Different IV values

Hi,
How come the value of IV is different between the iv function and woebin?
image

I notice IV is used in var_filter, so if the value is wrong we would select the wrong variables.

I compared the IV result with the smbinning package, and its value is closer to your woebin function's.
image

Can the console outputs when using graphics (such as woebin_plot & perf_eva) be suppressed?

This is particularly an issue when using the scorecard functions in R Markdown chunks and document generation. Is there a way to suppress all console output when using graphics? For example, when using the woebin_plot() function, a ## $variable line is printed in the R Markdown output (or in the console in R) which I would like to remove. Similarly, with perf_eva, there are a few lines of console output I would like to suppress. Is there an option that will output only the graph?
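A hedged workaround sketch (standard knitr and base R, not a package option): assign the return values and capture the printed output around the plotting calls; in R Markdown, chunk options such as results='hide' and message=FALSE can also help.

# hedged sketch: silence printed console output while keeping the plot objects
invisible(capture.output(plist <- woebin_plot(bins)))
invisible(capture.output(perf  <- perf_eva(pred = pred_list, label = label_list)))
# printing plist[[1]] etc. afterwards displays only the graphs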

chimerge error

Running the woebin function with method = "chimerge" produces an error.
In addition, if I use the method = "tree" and set the bin_num_limit = 10, some of the variables still have more than 10 bins.

bins <- scorecard::woebin(dt = TrainingData, y= "StatusbinnenT12",
+                x =xvars,positive = 1, count_distr_limit = 0.01,
+                bin_num_limit = 10, method = "chimerge") 

Result:
[INFO] creating woe binning ...
Error in checkForRemoteErrors(val) :
one node produced an error: Error in match.arg(type) : 'arg' must be NULL or a character vector

The result output of gains_table needs to be added []

Hello Mr. Xie. As a risk control officer, I like your package very much, but I think this may be a small bug: the data.table returned by gains_table has to be output directly with [], otherwise the names of the output result are printed twice. To fix this, you may need to append [] to the data.table returned at the end of the gains_table function.
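For reference, the usual data.table idiom looks like this (a sketch reusing the README's objects): appending [] or calling print() makes the printing behaviour predictable.

gtbl = gains_table(score = unlist(score_list), label = unlist(label_list))
gtbl[]        # force a clean print of the returned data.table
# print(gtbl) # equivalent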

A question about woebin grouping

Hello Mr. Xie, I have been following this package for a long time. There is currently a debatable point, not quite a bug, that I would like to discuss with you. I have a variable in which 70% of the values are 0. Binning it myself with a Python decision tree and computing WoE gives an IV of 1.5, and smbinning gives the same result, so it is quite an important variable. But scorecard cannot split it into groups (it returns [-Inf,Inf]). My guess is that because intervals in R are left-closed and right-open, the value 0 is easily merged away. Do you have a solution or any suggestions?
test.xlsx

Problem with perf_eva density plot

Shichen

Sometimes the density plot in the perf_eva function comes out with the y-axis not properly scaled, resulting in the lines not showing.

See below for sample code and data to recreate the problem and the resulting plot. I have tested using scorecard version 0.2.4 on Windows and version 0.3.0 on Linux, same result both times.

Thanks

Tomas

require(dplyr)
require(scorecard)
x <- read.csv("M7_WOE Binomial_632.csv")
xx <- x %>% filter(modelsample=="Hold out sample")
perf_eva(xx$modelscore,xx$Churn,show_plot = "density")

M7_WOE Binomial_632.csv

Rplot.pdf

A bug in woebin when a column only has special values and missing?

y<-ifelse(runif(10000)>0.99,1,0)
x<-rep(NA,10000)
x[1:2000]<-9999
sdf<-as.data.frame(cbind(y,x))
names(sdf)<-c("y1","x1")
woebin(sdf,y="y1",x="x1") #ok
woebin(sdf,y="y1",x="x1",
           breaks_list = list(x1=c(0)),
           special_values = list(x1=c(9999))) #ok
woebin(sdf,y="y1",x="x1",
       special_values = list(x1=c(9999))) #warning message and not result

The last line shows a warning and no binning result.

`woebin`: some counts are NA, but neg/pos counts are ok

Hi @ShichenXie

If I want a missing/non-missing split for a numeric variable, should I use c(Inf) in the breaks_list argument (3rd example)?

The last example, brk <- c(0, 1, Inf), has the issue mentioned in the title.

library(scorecard)
library(readr)

packageVersion("scorecard")
#> [1] '0.3.2.999'

.Platform$OS.type
#> [1] "unix"

data <- read_csv("https://gist.githubusercontent.com/jbkunst/4e8b58d2ffca1b5ca4496f1443aec032/raw/66bf72435e1c6cb7bda32c41dd7a1d3e4e1690cb/test")
#> Parsed with column specification:
#> cols(
#>   y = col_character(),
#>   variable = col_double()
#> )

str(data)
#> spec_tbl_df [80,000 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ y       : chr [1:80000] "good" "good" "good" "bad" ...
#>  $ variable: num [1:80000] NA NA NA 0 NA NA NA NA 0 NA ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   y = col_character(),
#>   ..   variable = col_double()
#>   .. )


brk <- c(0, Inf)
brk
#> [1]   0 Inf
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#>    count   neg  pos
#> 1: 75280 67543 7737
#> 2:    NA  1762 2958

brk <- c(0)
brk
#> [1] 0
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#>    count   neg  pos
#> 1: 75280 67543 7737
#> 2:    NA  1762 2958

brk <- c(Inf)
brk
#> [1] Inf
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#>    count   neg  pos
#> 1: 75280 67543 7737
#> 2:  4720  1762 2958

brk <- c(0, 1, Inf)
brk
#> [1]   0   1 Inf
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#>    count   neg  pos
#> 1: 75280 67543 7737
#> 2:    NA   860 2520
#> 3:    NA   902  438

Created on 2021-06-02 by the reprex package (v2.0.0.9000)

Bug in `germancredit` data

Hi Shichen,
for the factor personal.status.and.sex, I find that the germancredit data erroneously classifies all cases from the factor level male : divorced/separated as female : divorced/separated/married. The female : single category appears to be indeed empty even in the original data, but the male : divorced/separated category is not.
Best, Ulrike

Error for some outcome levels

If the positive outcome is something other than "bad" or 1 and the positive option is used to define it, you can get an error under some conditions when using woebin. For example, if the levels are "bad" and "not bad" instead of "bad" and "good", the code will fail, as the recoding uses string matching and every record gets coded as 1 in this example.

Also, it would be nice if the WOE tables and plots reflected the levels of the dependent variable, as labeling everything as good or bad does not always make sense. For example, I use scorecard for marketing and the positive outcome is a response, so it would be nice to be able to label the plots and tables accordingly.

Love your package, find it very useful.

Tomas
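A workaround sketch (base R, with hypothetical column and level names) is to recode the outcome to 0/1 yourself before calling woebin, so that no string matching of the levels is involved:

# 'outcome' and the level "response" are hypothetical names for illustration
dt$y01 = as.integer(dt$outcome == "response")
bins   = woebin(dt, y = "y01", positive = 1)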

Already Exporting Variable

Hi and thanks for your useful package. I get the below warning message when running with my own data. The final scorecard seems fine so at the moment I am ignoring the message, but what is the root cause?

bins = woebin(dt_s, y="IsDef", no_cores = 8)
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): dt, xs, y, breaks_list, min_perc_fine_bin, stop_limit, max_num_bin

woe value problem

In recent use we found that the WoE values in woebin's resulting bins are not equal to log(p_bad/p_good).

> bins
$score1
   variable        bin count  count_distr good bad    badprob        woe      bin_iv  total_iv  breaks is_special_values
1:   score1    missing     1 0.0001557632    1   0 0.00000000  1.7041254 0.001408331 0.0725454 missing              TRUE
2:   score1 [-Inf,475)   401 0.0624610592  300 101 0.25187032  0.6255138 0.029977971 0.0725454     475             FALSE
3:   score1  [475,515)  2191 0.3412772586 1800 391 0.17845733  0.1873414 0.012769587 0.0725454     515             FALSE
4:   score1  [515,570)  3486 0.5429906542 3033 453 0.12994836 -0.1872396 0.017822342 0.0725454     570             FALSE
5:   score1 [570, Inf)   341 0.0531152648  307  34 0.09970674 -0.4863114 0.010567168 0.0725454     Inf             FALSE

bins1$score1 %>%
  mutate(pg = good / sum(good),
         pb = bad / sum(bad),
         woe1 = log(pb / pg)) %>%
  select(bin, woe, woe1)
          bin        woe       woe1
1:    missing  1.7041254       -Inf
2: [-Inf,475)  0.6255138  0.6265245
3:  [475,515)  0.1873414  0.1883521
4:  [515,570) -0.1872396 -0.1862289
5: [570, Inf) -0.4863114 -0.4853007

Why does this happen? Is there something wrong?
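For what it is worth, the woe column above can be reproduced if zero good/bad counts are first replaced by 0.99 before forming the distributions, the same substitution that appears in the iv_xy() source quoted in a later issue; this is a sketch, not a confirmed description of woebin's internals.

# sketch: reproduce the woe column above, assuming zero counts are replaced
# by 0.99 (as in iv_xy) before computing the good/bad distributions
good = c(1, 300, 1800, 3033, 307)
bad  = c(0, 101,  391,  453,  34)
good_adj = ifelse(good == 0, 0.99, good)
bad_adj  = ifelse(bad  == 0, 0.99, bad)
log((bad_adj / sum(bad_adj)) / (good_adj / sum(good_adj)))
# approx.  1.7041  0.6255  0.1873 -0.1872 -0.4863, matching the bins above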

Column ID Disappears

In our model data, we have an ID column
(e.g. PERSON_ID), the main distinct ID for which we need the scores.

The package drops the ID column.
How can we identify it in the code? How does the code know which column is the ID?
(It rejects that column at the beginning, assuming it is a feature.)

In the code I could not see anywhere to specify the ID column.
(It drops the ID after the line:
dt_sel = var_filter(germancredit, "creditability")

So it causes the problem that we do not know which score belongs to which PERSON_ID
(it just gives rows and scores...).

I hope my question is clear :)

Maybe the final scorecard code should include the ID column,
or the var_filter call could take an ID column, e.g.:
var_filter(germancredit, "creditability", "person_id")

credit score, only_total_score = FALSE
score_list2 = lapply(dt_list, function(x) scorecard_ply(x,card, only_total_score=FALSE))

Thanks for that great work!
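A hedged workaround sketch (base R, assuming scorecard_ply() returns one row per input row in the original order): keep the ID column aside before var_filter and bind it back to the scores afterwards. The data and column names below are hypothetical.

# keep the ID aside, model on the remaining columns, then bind it back
ids    = mydata$PERSON_ID                                           # hypothetical data/column
dt_sel = var_filter(mydata[, setdiff(names(mydata), "PERSON_ID")], y = "creditability")
# ... binning, glm and scorecard as in the README example ...
scores = scorecard_ply(dt_sel, card)
scored = data.frame(PERSON_ID = ids, scores)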

Formulas

Hi. Hope you are well.
I got stuck while computing various calculations manually to cross-check the results obtained using your library. Therefore, I would like you to share the exact formulas that you've used to perform the following calculations:

  • baseline score
  • score against each bin of a variable

Looking forward to your response. Thanks in advance!
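For reference, a sketch of the widely used scaling convention, assuming the defaults points0 = 600, odds0 = 1/19 and pdo = 50 mentioned in another issue below, and assuming odds0 is the bad:good odds at the reference score; the exact signs used inside scorecard() are not confirmed here.

# sketch of the common scorecard scaling (assumed convention)
points0 = 600; odds0 = 1/19; pdo = 50
b = pdo / log(2)               # points added when the good:bad odds double
a = points0 + b * log(odds0)   # offset so that odds0 maps to points0
# with a logistic model on WoE variables, log(odds_bad) = beta0 + sum_j beta_j * woe_j,
# so a candidate's score is a - b * (beta0 + sum_j beta_j * woe_j), i.e.
# baseline score                : a - b * beta0
# points for bin i of variable j: -b * beta_j * woe_ij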

report function error

I have been using scorecard for some time and it is excellent, but yesterday I ran an R program that I run monthly and it gave the following error:

report(list(train = dt_list$train, test = dt_list$test), y = 'Reclasificado',
                         x = cols_not_remove, breaks_list = breaks_adj, special_values = NULL,
                         seed = seed,  save_report='report1', show_plot = c('ks', 'lift', 'gain',
                         'roc', 'lz', 'pr', 'f1', 'density'),
                         bin_type = 'width')

[INFO] sheet1-dataset information
[INFO] sheet2-model coefficients
[INFO] sheet3-model performance
[INFO] sheet4-variable woe binning
[INFO] sheet5-scorecard
[INFO] sheet6-population stability
Error in setnames(psi_tbl, gains_table_cols) :
Can't assign 12 names to a 13 column data.table

Thank you in advance.

When a regression coefficient is negative, scorecard outputs the wrong sign for the feature points

Hello ShichenXie. My English is not good enough; I hope you can read Chinese.
First of all, thank you very much for developing the scorecard package. Here I report a problem I encountered:
when a non-intercept coefficient of the logit regression is negative, the sign of the points for that scorecard item is wrong.
I am using the R version scorecard_0.19.

Specifically:
the regression weight of TD_CREDITSCORE is -0.5396, with odds0 = 1/19, points = 600 and pdo = 50.

1 variable bin woe points count count_distr good bad badprob bin_iv total_iv breaks is_special_values
33 TD_CREDITSCORE missing 0.250670111 10 509 0.14753623 434 75 0.14734774 1.019171e-02 0.09441740 missing FALSE
34 TD_CREDITSCORE [-Inf,25) -0.292529584 -11 1721 0.49884058 1564 157 0.09122603 3.815798e-02 0.09441740 25 FALSE
35 TD_CREDITSCORE [25,35) 0.001571926 0 716 0.20753623 631 85 0.11871508 5.131192e-07 0.09441740 35 FALSE
36 TD_CREDITSCORE [35,50) 0.465781491 18 357 0.10347826 294 63 0.17647059 2.671513e-02 0.09441740 50 FALSE
37 TD_CREDITSCORE [50, Inf) 0.602837737 23 147 0.04260870 118 29 0.19727891 1.935207e-02 0.09441740 Inf FALSE

The table above is the scorecard output; manual calculation shows that the sign of the points is wrong.

cut argument right= FALSE

Hi @ShichenXie

First, thank so much for your work. This package help me a lot!

What is the reason for using cut(..., right = FALSE) in woebin, given that the default value in base::cut is TRUE?

For example

, bstbin := cut(brkp, c(-Inf, bestbreaks, Inf), right = FALSE, dig.lab = 10, ordered_result = FALSE)

Why my question? Because I'm trying to create a woebin_ctree interface that mixes the scorecard::woebin output with the breaks given by the partykit::ctree function. That tree algorithm makes the splits using <=, so I can't replicate the counts in each node. For example:

The tree:
image

> ctree_breaks
[1] 11 15 33

But (obviously) when I use woebin with those breaks I don't get the same counts.

image

I tried to make similar breaks by adding a small value (0.000001), but this is not quite elegant 😅:
image

Do you think it is possible to add an optional argument woebin(..., right = FALSE) to modify this behaviour if necessary?

Thanks in advance,
Kind regards,

Single-row WoE replacement is slow

Hello, I want to deploy the developed scorecard directly to a production R service, but I found that the WoE replacement for a single row of data is too slow, about 1.5 seconds, which cannot meet the requirement for direct deployment. Below is an example with only 7 variables and one row of data, which takes 1.39 seconds. I can hand-write this replacement step to bring the single-row replacement down to tens of milliseconds, but I still hope this could become a feature of this excellent package. I wonder whether that is feasible.

woebin_ply(input, sc$bins) %>% rename_all(~ str_remove(.x, "_woe"))
[INFO] converting into woe values ...
   no als_m12_cell_nbank_finlea_orgnum als_m3_id_nbank_cf_orgnum r_m01_cell_pdl_0allnumorgnum r_m03_id_caon_0allnumorgnum
1:  1                       0.05253662                -0.2286684                 -0.006342389                  -0.1508286
   r_m12_cell_0sloannbank_allnum r_m12_cell_nbank_0weekall_allnum r_m12_id_0avgmax_monnum
1:                   -0.06246272                        -0.112977              -0.4454012
system.time(woebin_ply(input, sc$bins) %>% rename_all(~ str_remove(.x, "_woe")))
[INFO] converting into woe values ...
   user  system elapsed
   0.00    0.02    1.39

`woebin`: breaks_list doesn't work

Hi @ShichenXie

I'm using scorecard 0.3.3, but the breaks_list argument of the woebin function doesn't work.

# Issue scorecard 0.3.3
library(readr)
library(scorecard)
suppressPackageStartupMessages(library(dplyr))

packageVersion("scorecard")
#> [1] '0.3.3'

path <- "https://gist.githubusercontent.com/ijrossi/b864820a14fd2b51ac21574841faaa3e/raw/5299e08b3bd68f6854a92979c716421d5bb5ba1e/data_issue_woebin.txt"
data <- read.csv(path, sep=";")
head(data)
#>    y x
#> 1 69 0
#> 2 69 0
#> 3 68 0
#> 4 68 0
#> 5 53 0
#> 6 69 0

new_brks <-  list(
  y = c("25", "40", "Inf")
)

scorecard::woebin(dt = data, y = "x",  x = "y")
#> [INFO] creating woe binning ...
#> $y
#>    variable       bin  count count_distr    neg   pos    posprob         woe
#> 1:        y [-Inf,28)  18116  0.05540536  14843  3273 0.18066902  0.59207398
#> 2:        y   [28,32)  33390  0.10211884  28616  4774 0.14297694  0.31311388
#> 3:        y   [32,40)  59948  0.18334292  52451  7497 0.12505838  0.15851890
#> 4:        y   [40,52)  86865  0.26566495  77730  9135 0.10516318 -0.03723274
#> 5:        y [52, Inf) 128653  0.39346794 117784 10869 0.08448307 -0.27904238
#>          bin_iv   total_iv breaks is_special_values
#> 1: 0.0243579405 0.06838722     28             FALSE
#> 2: 0.0113045355 0.06838722     32             FALSE
#> 3: 0.0049008012 0.06838722     40             FALSE
#> 4: 0.0003629555 0.06838722     52             FALSE
#> 5: 0.0274609867 0.06838722    Inf             FALSE

# 'breaks_list' works fine with scorecard 0.3.2
# but it doesn't work with scorecard 0.3.3
scorecard::woebin(dt = data, y = "x",  x = "y", breaks_list = new_brks)
#> [INFO] creating woe binning ...
#> $y
#>    variable         bin  count count_distr    neg   pos   posprob woe bin_iv
#> 1:        y [-Inf, Inf) 326972           1 291424 35548 0.1087188   0      0
#>    total_iv breaks is_special_values
#> 1:        0    Inf             FALSE


sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Chile.1252  LC_CTYPE=Spanish_Chile.1252   
#> [3] LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Chile.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.6     scorecard_0.3.3 readr_1.4.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.2.0         Rcpp_1.0.7        compiler_4.1.0    pillar_1.6.1     
#>  [5] highr_0.9         iterators_1.0.13  tools_4.1.0       digest_0.6.27    
#>  [9] evaluate_0.14     lifecycle_1.0.0   tibble_3.1.2      gtable_0.3.0     
#> [13] pkgconfig_2.0.3   rlang_0.4.11      openxlsx_4.2.3    reprex_2.0.0     
#> [17] foreach_1.5.1     DBI_1.1.1         cli_2.5.0         rstudioapi_0.13  
#> [21] parallel_4.1.0    yaml_2.2.1        xfun_0.23         gridExtra_2.3    
#> [25] withr_2.4.2       stringr_1.4.0     knitr_1.33        generics_0.1.0   
#> [29] fs_1.5.0          vctrs_0.3.8       hms_1.1.0         tidyselect_1.1.1 
#> [33] grid_4.1.0        glue_1.4.2        data.table_1.14.0 R6_2.5.0         
#> [37] fansi_0.5.0       rmarkdown_2.8     purrr_0.3.4       ggplot2_3.3.3    
#> [41] magrittr_2.0.1    scales_1.1.1      ps_1.6.0          codetools_0.2-18 
#> [45] ellipsis_0.3.2    htmltools_0.5.1.1 assertthat_0.2.1  colorspace_2.0-1 
#> [49] utf8_1.2.1        stringi_1.6.1     doParallel_1.0.16 munsell_0.5.0    
#> [53] crayon_1.4.1

Now, the same code using an older version of scorecard; the breaks_list parameter works fine!

# Issue scorecard 0.3.3
library(readr)
library(scorecard)
suppressPackageStartupMessages(library(dplyr))



packageVersion("scorecard")
#> [1] '0.3.2'

path <- "https://gist.githubusercontent.com/ijrossi/b864820a14fd2b51ac21574841faaa3e/raw/5299e08b3bd68f6854a92979c716421d5bb5ba1e/data_issue_woebin.txt"
data <- read.csv(path, sep=";")
head(data)
#>    y x
#> 1 69 0
#> 2 69 0
#> 3 68 0
#> 4 68 0
#> 5 53 0
#> 6 69 0

new_brks <-  list(
  y = c("25", "40", "Inf")
)

scorecard::woebin(dt = data, y = "x",  x = "y")
#> [INFO] creating woe binning ...
#> $y
#>    variable       bin  count count_distr    neg   pos    posprob         woe
#> 1:        y [-Inf,28)  18116  0.05540536  14843  3273 0.18066902  0.59207398
#> 2:        y   [28,32)  33390  0.10211884  28616  4774 0.14297694  0.31311388
#> 3:        y   [32,40)  59948  0.18334292  52451  7497 0.12505838  0.15851890
#> 4:        y   [40,52)  86865  0.26566495  77730  9135 0.10516318 -0.03723274
#> 5:        y [52, Inf) 128653  0.39346794 117784 10869 0.08448307 -0.27904238
#>          bin_iv   total_iv breaks is_special_values
#> 1: 0.0243579405 0.06838722     28             FALSE
#> 2: 0.0113045355 0.06838722     32             FALSE
#> 3: 0.0049008012 0.06838722     40             FALSE
#> 4: 0.0003629555 0.06838722     52             FALSE
#> 5: 0.0274609867 0.06838722    Inf             FALSE

# 'breaks_list' works fine with scorecard 0.3.2
# but it doesn't work with scorecard 0.3.3
scorecard::woebin(dt = data, y = "x",  x = "y", breaks_list = new_brks)
#> [INFO] creating woe binning ...
#> $y
#>    variable       bin  count count_distr    neg   pos    posprob        woe
#> 1:        y [-Inf,25)   7018  0.02146361   5667  1351 0.19250499  0.6700805
#> 2:        y   [25,40) 104436  0.31940350  90243 14193 0.13590141  0.2541382
#> 3:        y [40, Inf) 215518  0.65913289 195514 20004 0.09281823 -0.1758044
#>        bin_iv   total_iv breaks is_special_values
#> 1: 0.01243606 0.05422201     25             FALSE
#> 2: 0.02277098 0.05422201     40             FALSE
#> 3: 0.01901497 0.05422201    Inf  

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Chile.1252  LC_CTYPE=Spanish_Chile.1252   
#> [3] LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Chile.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.6     scorecard_0.3.2 readr_1.4.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.2.0         Rcpp_1.0.7        compiler_4.1.0    pillar_1.6.1     
#>  [5] highr_0.9         iterators_1.0.13  tools_4.1.0       digest_0.6.27    
#>  [9] evaluate_0.14     lifecycle_1.0.0   tibble_3.1.2      gtable_0.3.0     
#> [13] pkgconfig_2.0.3   rlang_0.4.11      openxlsx_4.2.3    reprex_2.0.0     
#> [17] foreach_1.5.1     DBI_1.1.1         cli_2.5.0         rstudioapi_0.13  
#> [21] parallel_4.1.0    yaml_2.2.1        xfun_0.23         gridExtra_2.3    
#> [25] withr_2.4.2       stringr_1.4.0     knitr_1.33        generics_0.1.0   
#> [29] fs_1.5.0          vctrs_0.3.8       hms_1.1.0         tidyselect_1.1.1 
#> [33] grid_4.1.0        glue_1.4.2        data.table_1.14.0 R6_2.5.0         
#> [37] fansi_0.5.0       rmarkdown_2.8     purrr_0.3.4       ggplot2_3.3.3    
#> [41] magrittr_2.0.1    scales_1.1.1      ps_1.6.0          codetools_0.2-18 
#> [45] ellipsis_0.3.2    htmltools_0.5.1.1 assertthat_0.2.1  colorspace_2.0-1 
#> [49] utf8_1.2.1        stringi_1.6.1     doParallel_1.0.16 munsell_0.5.0    
#> [53] crayon_1.4.1

Could an "unknown" bin be added?

Currently, when woebin uses manual binning and the variable is a character or factor, values in the data that are not specified in the manual breaks_list are put into the missing bin on their own, which confuses them with genuinely missing values. For this situation, could a bin called "unknown" be created? Ideally, "unknown" could either be merged with "missing" or kept separate, and could also be used as a keyword, so one could write something like:
c("99%,%missing%,%unknown")

woebinning

I am receiving the following error with version scorecard_0.2.9.
If I run the same code on version scorecard_0.2.5.999, it runs without error.

bins = woebin(dt_f, y="creditability")
[INFO] creating woe binning ...
Error in check_y(dt, y, positive) :
Incorrect inputs; there is no "creditability" column in dt.
In addition: Warning messages:
1: In setDT(copy(dt)) :
Some columns are a multi-column type (such as a matrix column): [23, 24, 26, 29]. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.
2: In setDT(dt) :
Some columns are a multi-column type (such as a matrix column): [23, 24, 26, 29]. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.

Error in Code

The part of the code below does not work; it gives the following error:

m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)

Error in terms.formula(formula, data = data) :
duplicated name 'NA' in data frame using '.'

thank you!

'tree' and 'chimerge' binning issue for binary numeric variable

If the numeric variable only contains two values, it will not output the correct bin when doing 'tree' and 'chimerge' binning. But 'width' and 'freq' binning work well.

library(scorecard)
library(data.table)

tst_dt <- data.table(var = c(rep(10, 20), rep(20, 10)),
                     target = c(sample(c(0, 1), 30, replace = TRUE)))
tree_bins  = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'tree')
chi_bins   = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'chimerge')
width_bins = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'width')
freq_bins  = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'freq')

From the source code, 'tree' and 'chimerge' binning will call the function 'woebin2_init_bin'.
The following code from this function drops the value of this binary variable. So that causes only one bin [-Inf, Inf].
brk = sort(brk[(brk < max(xvalue, na.rm =TRUE)) & (brk > min(xvalue, na.rm =TRUE))])

Please review.
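Until that is fixed, a hedged workaround consistent with the breaks_list usage shown in other issues here is to supply the break manually for such two-valued variables:

# workaround sketch: force the single sensible break for a two-valued numeric variable
manual_bins = woebin(tst_dt, y = 'target', x = 'var', positive = "1",
                     breaks_list = list(var = c(20)))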

Package Not Available in R 3.4.1

I cannot install this package in either RStudio or Visual Studio with R Tools. The package is supposedly "unavailable"? Please advise.

Best,

ML

Difference in woebin using v.2.9 and v.2.8.1

I am trying to use this package to build a scorecard, but when using different versions of the package I get different binning and IV from the woebin function. I can't see any relevant changes in the release notes for the different versions.

perf_eva() is not working in scorecard version 0.2.2

Hi,

Thanks for publishing the package. I recently tried to run your example code from the GitHub page using R 3.3.3. However, when the package executes perf_eva, I get the error shown below:

Error in perf_eva(pred = pred_list, label = label_list) :
could not find function "isFALSE"

Is there any workaround for it?

My env is as follows

R3.3.3
data.table 1.10.4-3
ggplot2 3.0.0
gridExtra 2.3
foreach 1.4.4
doParallel 1.0.14
parallel 3.3.3
openxlsx 4.0.17

Thanks
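If the failure is only the missing base function, note that isFALSE() was added in base R 3.5.0, so this is most likely an R-version issue rather than a package bug; a quick check (base R only) is sketched below.

# isFALSE() exists only from R 3.5.0; scorecard versions that call it need R >= 3.5.0
getRversion() >= "3.5.0"   # FALSE on R 3.3.3, so upgrading R should resolve the error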

weird results of the `woebin_ply()` function

Hi,
I've recently encountered a serious problem using the woebin_ply() function. Here is my code,

model_woe_set <- woebin_ply(select(mod_data, -user, -creation_date), bins =model_woe, print_step = 1)

The output in the Rstudio console is

[INFO] Woe transformating on 88120 rows and 904 columns in 00:05:08

However, when I inspect the data.frame model_woe_set, I get the following results,

model_woe_set %>% dim()
[1] 88120 89023

And further, the column names in the model_woe_set data.frame become the following,

[1] "mon_woe"                          "age_woe"                                        
......
[961] "V962"                                                   "V963"                                                   "V964"                                                  
 [964] "V965"                                                   "V966"                                                   "V967"                                                  
 [967] "V968"                                                   "V969"                                                   "V970"                                                  
 [970] "V971"                                                   "V972"                                                   "V973"                                                  
 [973] "V974"                                                   "V975"                                                   "V976"                                                  
 [976] "V977"                                                   "V978"                                                   "V979"                                                  
 [979] "V980"                                                   "V981"                                                   "V982"                                                  
 [982] "V983"                                                   "V984"                                                   "V985"                                                  
 [985] "V986"                                                   "V987"                                                   "V988"                                                  
 [988] "V989"                                                   "V990"                                                   "V991"                                                  
 [991] "V992"                                                   "V993"                                                   "V994"                                                  
 [994] "V995"                                                   "V996"                                                   "V997"                                                  
 [997] "V998"                                                   "V999"                                                   "V1000"                                                 
[1000] "V1001"                                                 
 [ reached getOption("max.print") -- omitted 88023 entries ]

Materializing model_woe_set leads to a crash of RStudio, which I think is because there is not enough memory.

In all, this problem is very weird. Sorry I cannot provide a minimal reproducible example since the data cannot be shared.

My session info,

sessioninfo::session_info()
- Session info --------------------------------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                                              
 version  R version 3.5.3 (2019-03-11)                       
 os       Windows 7 x64 SP 1                                 
 system   x86_64, mingw32                                    
 ui       RStudio                                            
 language (EN)                                               
 collate  Chinese (Simplified)_People's Republic of China.936
 ctype    Chinese (Simplified)_People's Republic of China.936
 tz       Asia/Taipei                                        
 date     2019-03-31                                         

- Packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 package       * version    date       lib source                               
 assertthat      0.2.1      2019-03-21 [1] CRAN (R 3.5.3)                       
 backports       1.1.3      2018-12-14 [1] CRAN (R 3.5.1)                       
 bit             1.1-14     2018-05-29 [1] CRAN (R 3.5.0)                       
 bit64           0.9-7      2017-05-08 [1] CRAN (R 3.5.0)                       
 blob            1.1.1      2018-03-25 [1] CRAN (R 3.5.1)                       
 broom           0.5.1      2018-12-05 [1] CRAN (R 3.5.1)                       
 cellranger      1.1.0      2016-07-27 [1] CRAN (R 3.5.1)                       
 cli             1.1.0      2019-03-19 [1] CRAN (R 3.5.3)                       
 clipr         * 0.5.0      2019-01-11 [1] CRAN (R 3.5.2)                       
 codetools       0.2-16     2018-12-24 [1] CRAN (R 3.5.3)                       
 colorspace      1.4-1      2019-03-18 [1] CRAN (R 3.5.2)                       
 crayon          1.3.4      2017-09-16 [1] CRAN (R 3.5.1)                       
 data.table      1.12.0     2019-01-13 [1] CRAN (R 3.5.3)                       
 DBI             1.0.0      2018-05-02 [1] CRAN (R 3.5.1)                       
 digest          0.6.18     2018-10-10 [1] CRAN (R 3.5.1)                       
 doParallel      1.0.14     2018-09-24 [1] CRAN (R 3.5.1)                       
 dplyr         * 0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                       
 DT              0.5        2018-11-05 [1] CRAN (R 3.5.1)                       
 forcats       * 0.4.0      2019-02-17 [1] CRAN (R 3.5.2)                       
 foreach         1.4.4      2017-12-12 [1] CRAN (R 3.5.1)                       
 furrr           0.1.0      2018-05-16 [1] CRAN (R 3.5.1)                       
 future          1.12.0     2019-03-08 [1] CRAN (R 3.5.3)                       
 generics        0.0.2      2018-11-29 [1] CRAN (R 3.5.1)                       
 ggplot2       * 3.1.0      2018-10-25 [1] CRAN (R 3.5.1)                       
 globals         0.12.4     2018-10-11 [1] CRAN (R 3.5.1)                       
 glue            1.3.1      2019-03-12 [1] CRAN (R 3.5.3)                       
 gridExtra       2.3        2017-09-09 [1] CRAN (R 3.5.1)                       
 gtable          0.3.0      2019-03-25 [1] CRAN (R 3.5.3)                       
 haven           2.1.0      2019-02-19 [1] CRAN (R 3.5.2)                       
 hms             0.4.2.9001 2018-09-04 [1] Github (tidyverse/hms@979286f)       
 htmltools       0.3.6.9003 2018-12-11 [1] Github (rstudio/htmltools@99a78d0)   
 htmlwidgets     1.3        2018-09-30 [1] CRAN (R 3.5.1)                       
 httr            1.4.0      2018-12-11 [1] CRAN (R 3.5.1)                       
 iterators       1.0.10     2018-07-13 [1] CRAN (R 3.5.1)                       
 janitor       * 1.1.1      2018-07-31 [1] CRAN (R 3.5.1)                       
 jsonlite        1.6        2018-12-07 [1] CRAN (R 3.5.1)                       
 lattice         0.20-38    2018-11-04 [1] CRAN (R 3.5.3)                       
 lazyeval        0.2.2      2019-03-15 [1] CRAN (R 3.5.3)                       
 listenv         0.7.0      2018-01-21 [1] CRAN (R 3.5.1)                       
 lubridate       1.7.4      2018-04-11 [1] CRAN (R 3.5.1)                       
 magrittr        1.5        2014-11-22 [1] CRAN (R 3.5.1)                       
 modelr          0.1.4      2019-02-18 [1] CRAN (R 3.5.2)                       
 munsell         0.5.0      2018-06-12 [1] CRAN (R 3.5.1)                       
 nlme            3.1-137    2018-04-07 [1] CRAN (R 3.5.3)                       
 odbc          * 1.1.6      2018-06-09 [1] CRAN (R 3.5.1)                       
 openxlsx        4.1.0      2018-05-26 [1] CRAN (R 3.5.1)                       
 patchwork     * 0.0.1      2018-09-04 [1] Github (thomasp85/patchwork@7fb35b1) 
 pillar          1.3.1      2018-12-15 [1] CRAN (R 3.5.1)                       
 pkgconfig       2.0.2      2018-08-16 [1] CRAN (R 3.5.1)                       
 plyr            1.8.4      2016-06-08 [1] CRAN (R 3.5.1)                       
 ppdai         * 0.1.2      2018-11-11 [1] local                                
 ppdai.extra   * 0.2.3.9999 2019-03-13 [1] local                                
 purrr         * 0.3.2      2019-03-15 [1] CRAN (R 3.5.3)                       
 qs              0.14.1     2019-03-02 [1] CRAN (R 3.5.3)                       
 R6              2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                       
 RApiSerialize   0.1.0      2014-04-19 [1] CRAN (R 3.5.2)                       
 Rcpp            1.0.1      2019-03-17 [1] CRAN (R 3.5.3)                       
 readr         * 1.3.1      2018-12-21 [1] CRAN (R 3.5.1)                       
 readxl          1.3.1      2019-03-13 [1] CRAN (R 3.5.3)                       
 rlang           0.3.3      2019-03-29 [1] CRAN (R 3.5.3)                       
 rstudioapi      0.10       2019-03-19 [1] CRAN (R 3.5.3)                       
 rvest           0.3.2      2016-06-17 [1] CRAN (R 3.5.1)                       
 scales          1.0.0      2018-08-09 [1] CRAN (R 3.5.1)                       
 scorecard     * 0.2.4      2019-03-29 [1] Github (ShichenXie/scorecard@5b45fb8)
 sessioninfo     1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                       
 stringi         1.4.3      2019-03-12 [1] CRAN (R 3.5.3)                       
 stringr       * 1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                       
 tibble        * 2.1.1      2019-03-16 [1] CRAN (R 3.5.3)                       
 tidyr         * 0.8.3      2019-03-01 [1] CRAN (R 3.5.2)                       
 tidyselect      0.2.5      2018-10-11 [1] CRAN (R 3.5.1)                       
 tidyverse     * 1.2.1      2017-11-14 [1] CRAN (R 3.5.3)                       
 withr           2.1.2      2018-03-15 [1] CRAN (R 3.5.1)                       
 writexl         1.1        2018-12-02 [1] CRAN (R 3.5.1)                       
 xml2            1.2.0      2018-01-24 [1] CRAN (R 3.5.1)                       
 yaml            2.2.0      2018-07-25 [1] CRAN (R 3.5.1)                       
 zip             2.0.1      2019-03-11 [1] CRAN (R 3.5.3)                       

[1] C:/Program Files/R/R-3.5.3/library

Hope a quick fix. Thanks!

`perf_eva()` function is very slow

Hello,

I've found the perf_eva() function very slow with big datasets. Maybe it would be better to extract the show_plot argument from the perf_eva() function and write a generic plot.* function to do the plotting part?

Thanks!

How does it identifies breakpoints in gainstable /perf_psi ?

Hi, my score list contains test and train, and I want to create breakpoints based only on the train data and then apply those breakpoints to the test data. Is that possible?
What I have found is that it creates breakpoints based on both datasets. Isn't this wrong?
Ideally, for PSI, the bins should be created only on the basis of the train data, and then we would want to see how the test data is distributed over those bins and how much it has shifted.
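A hedged sketch of the intended calculation (base R, not the package's perf_psi): derive the bins from the train scores only, then measure how the test scores shift across those bins. The 'score' column name is an assumption about scorecard_ply() output.

# hedged sketch: PSI with train-only breakpoints
brks = unique(quantile(score_list$train$score, probs = seq(0, 1, 0.1), na.rm = TRUE))
brks[1] = -Inf; brks[length(brks)] = Inf
expected = table(cut(score_list$train$score, brks)) / nrow(score_list$train)
actual   = table(cut(score_list$test$score,  brks)) / nrow(score_list$test)
sum((actual - expected) * log(actual / expected))   # PSI; empty bins need extra care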

gains table function

Hi,

I am getting an error in the gains_table function:

Error in .subset2(x, i, exact = exact) :
recursive indexing failed at level 2

thanks!

A suggestion on the `report()` function

Hi,

I've used the newly-added report() function, which is awesome! Thanks for the nice work.

But I'm a little confused about the results of the function in Example II on your pkgdown website.

# Example II
# input dt is a list
# multiple datasets
report(list(dt1=germancredit[sample(1000,500)],
            dt2=germancredit[sample(1000,500)],
            dt3=germancredit[sample(1000,500)]), y, x,
 breaks_list, special_values, seed=NULL, save_report='report5')

The model coefficients sheet in the results only contains one model as the following table shows.

Model coefficients based on dt1 dataset            
variable Estimate Std. Error z value Pr(>|z|) gvif info_value
(Intercept) -0.881272674 0.1215 -7.2541 0    
age.in.years_woe 0.668220936 0.2807 2.3806 0.0173 1.1049 0.22
credit.amount_woe 0.967266281 0.3634 2.6614 0.0078 1.0981 0.114
credit.history_woe 0.757056585 0.246 3.0779 0.0021 1.0486 0.2446
duration.in.month_woe 1.099128399 0.3058 3.5943 0.0003 1.1218 0.1909
housing_woe 0.643615146 0.3308 1.9454 0.0517 1.1837 0.1448
installment.rate.in.percentage.of.disposable.income_woe 3.036753444 1.185 2.5627 0.0104 1.0749 0.0112
other.installment.plans_woe 1.09219108 0.9865 1.1071 0.2682 1.0651 0.0138
personal.status.and.sex_woe 0.891955405 0.3022 2.9514 0.0032 1.1737 0.173
present.employment.since_woe 0.500596126 0.3533 1.4171 0.1565 1.0975 0.116
property_woe -0.026460202 0.5042 -0.0525 0.9581 1.2567 0.066
purpose_woe 1.099060268 0.3181 3.4549 0.0006 1.0602 0.1507
savings.account.and.bonds_woe 0.815971484 0.3136 2.6017 0.0093 1.0375 0.1591
status.of.existing.checking.account_woe 0.835177462 0.1528 5.4651 0 1.0402 0.6804

Based on the above result, we can infer that the model is based on the dt1 data. The other two datasets are evaluated with the glm model fitted on the dt1 data. We can consider the dt1 dataset as the train set, dt2 as the test set and dt3 as the out-of-time sample (although it is not).

Am I right? If so, I would advise you to point it out in the function's documentation. For example,

dt A data frame with both x (predictor/feature) and y (response/label) variables; or a list of dataframes. If a list of dataframes provided, only the first dataframe would be used for training. Other dataframes would be used for testing.

Thanks again for this awesome package!

‘scorecard’ version 0.3.2

describe(data)

Error in setnames(sum_dtnum, c("min", "p25", "p50", "mean", "p75", "max", :
Can't assign 8 names to a 9 column data.table
In addition: Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)

and it causes many other problems for scorecard package functions.

report - the graphics for the train dataset and the test dataset are both from the test dataset

library(scorecard)

data("germancredit")

y = 'creditability'
x = c(
  "status.of.existing.checking.account",
  "duration.in.month",
  "credit.history",
  "purpose",
  "credit.amount",
  "savings.account.and.bonds",
  "present.employment.since",
  "installment.rate.in.percentage.of.disposable.income",
  "personal.status.and.sex",
  "property",
  "age.in.years",
  "other.installment.plans",
  "housing"
)

special_values=NULL
breaks_list=list(
  status.of.existing.checking.account=c("... < 0 DM%,%0 <= ... < 200 DM",
                                        "... >= 200 DM / salary assignments for at least 1 year", "no checking account"),
  duration.in.month=c(8, 16, 34, 44),
  credit.history=c(
    "no credits taken/ all credits paid back duly%,%all credits at this bank paid back duly",
    "existing credits paid back duly till now", "delay in paying off in the past",
    "critical account/ other credits existing (not at this bank)"),
  purpose=c("retraining%,%car (used)", "radio/television",
            "furniture/equipment%,%domestic appliances%,%business%,%repairs",
            "car (new)%,%others%,%education"),
  credit.amount=c(1400, 1800, 4000, 9200),
  savings.account.and.bonds=c("... < 100 DM", "100 <= ... < 500 DM",
                              "500 <= ... < 1000 DM%,%... >= 1000 DM%,%unknown/ no savings account"),
  present.employment.since=c("unemployed%,%... < 1 year", "1 <= ... < 4 years",
                             "4 <= ... < 7 years", "... >= 7 years"),
  installment.rate.in.percentage.of.disposable.income=c(2, 3),
  personal.status.and.sex=c("female : divorced/separated/married", "male : single",
                            "male : married/widowed"),
  property=c("real estate", "building society savings agreement/ life insurance",
             "car or other, not in attribute Savings account/bonds", "unknown / no property"),
  age.in.years=c(26, 28, 35, 37),
  other.installment.plans=c("bank%,%stores", "none"),
  housing=c("rent", "own", "for free")
)

# Example I
# input dt is a data frame
# split input data frame into two

report(germancredit, y, x, breaks_list, special_values, seed=618, save_report='report1',
       show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'))

how to deal with mixed data ?

Great package on this subject, very nice job! I have a problem. For example, I have a variable such as dat<-data.frame(y=c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0),x=c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20)). In this case, I want to treat '888' and '666' as two special classes which, like missing values, have their own WoE, i.e. I want to get two separate WoE values for '888' and '666', while the other values are computed as usual. How can I handle this type of data? Thanks!
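A hedged sketch based on the special_values argument that appears in other issues here, assuming each listed value gets its own special bin with its own WoE:

# sketch: treat 666 and 888 as special values so each gets its own bin
dat  = data.frame(y = c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0),
                  x = c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20))
bins = woebin(dat, y = "y", x = "x", special_values = list(x = c(666, 888)))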

Monotonic-WOE-Binning-Algorithm

Hello,

I just discovered a GitHub repo, jstephenj14/Monotonic-WOE-Binning-Algorithm, which provides a Python implementation of a variable binning algorithm that optimizes information value (IV) monotonicity and representativeness.

I think it would be great to include this algorithm in your fantastic package scorecard. Since the author provides a Python version, I wonder if it could be incorporated into your scorecard R package.

Thanks!

How does the function `iv` calculate the information values?

Thanks for writing such a fantastic package!

I am curious about how the function iv() calculates the information value for a continuous variable. Look at the source code of iv():

ivlist = dt[, sapply(.SD, iv_xy, label), .SDcols = x]
iv_xy = function(x, y) {
  . = DistrBad = DistrGood = bad = good = NULL

  data.table(x=x, y=y)[
    , .(good = sum(y==0), bad = sum(y==1)), keyby="x"
    ][, (c("good", "bad")) := lapply(.SD, function(x) ifelse(x==0, 0.99, x)), .SDcols = c("good", "bad")# replace 0 by 0.99 in good/bad columns
    ][, `:=`(
      DistrGood = good/sum(good), DistrBad = bad/sum(bad)
   )][, sum((DistrBad-DistrGood)*log(DistrBad/DistrGood)) ]

}

I am not very familiar with the data.table package. Based on the above code, it seems that you consider every unique value as a group and count the number of "bad" and "good" records in each group. From this perspective, I used the dplyr package to calculate the information value of the variable age.in.years in the germancredit dataset.
Based on the iv() function in your scorecard package, the IV of the variable age.in.years is

ivs <- iv(germancredit, y = "creditability")
# Warning message:
# In check_y(dt, y, positive) :
# The positive value in "creditability" was replaced by 1 and negative value by 0.
ivs[variable=="age.in.years",]
#       variable info_value
# 1: age.in.years  0.2596514

While the results of my dplyr solution is

library(dplyr)
library(tidyr)
germancredit %>% 
  count(age.in.years, creditability) %>% 
  spread(key = creditability, value = n) %>% 
  # delete groups including only one class: "good" or "bad"
  na.omit() %>% 
  mutate(total_good = germancredit %>% 
           count(creditability) %>% 
           filter(creditability == "good") %>% 
           pull(n),
         total_bad = germancredit %>% 
           count(creditability) %>% 
           filter(creditability == "bad") %>% 
           pull(n)) %>% 
  mutate(bad_distr = bad / total_bad,
         good_distr = good / total_good,
         woe = log(bad_distr / good_distr),
         bin_iv = (bad_distr - good_distr) * woe,
         total_iv = sum(bin_iv)) 
# A tibble: 47 x 10
   age.in.years   bad  good total_good total_bad bad_distr good_distr     woe   bin_iv total_iv
          <dbl> <int> <int>      <int>     <int>     <dbl>      <dbl>   <dbl>    <dbl>    <dbl>
 1          19.     1     1        700       300   0.00333    0.00143  0.847  0.00161     0.257
 2          20.     5     9        700       300   0.0167     0.0129   0.260  0.000989    0.257
 3          21.     5     9        700       300   0.0167     0.0129   0.260  0.000989    0.257
 4          22.    11    16        700       300   0.0367     0.0229   0.473  0.00653     0.257
 5          23.    20    28        700       300   0.0667     0.0400   0.511  0.0136      0.257
 6          24.    19    25        700       300   0.0633     0.0357   0.573  0.0158      0.257
 7          25.    19    22        700       300   0.0633     0.0314   0.701  0.0224      0.257
 8          26.    14    36        700       300   0.0467     0.0514  -0.0972 0.000463    0.257
 9          27.    13    38        700       300   0.0433     0.0543  -0.225  0.00247     0.257
10          28.    15    28        700       300   0.0500     0.0400   0.223  0.00223     0.257
# ... with 37 more rows

It is clear that my results are a little different from yours. Maybe it is due to your use of ifelse(x==0, 0.99, x)? I am quite puzzled by this line of code.

In sum, my questions are:
-- How does the function iv() work on continuous variables?
-- If the iv() function does not use the optimal bins for continuous variables, are its results reliable?

Thanks again for this awesome package. By the way, the slides on your website are super helpful!

including frequency weights

Hi, useful package. Are there any plans to include frequency weights throughout the package? I.e. where the data is a sample taken from a larger population (say 1 in 2 bads and 1 in 10 goods) and as such contains a weight field, containing 2 (for bads) or 10 (for goods). The weights should influence the binning, the WoE calculations and the glm process. However this may cause issues with glm, which can have problems with weights, and therefore the package may need to be adapted to use svyglm. Thanks.
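For reference, a hedged sketch (plain R, not a package feature) of how frequency weights could enter the WoE calculation for a fixed binning; bin, y and w are hypothetical inputs (bin label per row, 0/1 outcome, frequency weight per row).

# hedged sketch: weighted WoE for an already-assigned binning
woe_weighted = function(bin, y, w) {
  bad  = tapply(w * (y == 1), bin, sum)
  good = tapply(w * (y == 0), bin, sum)
  log((bad / sum(bad)) / (good / sum(good)))
}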
