shichenxie / scorecard
Scorecard Development in R, 评分卡
Home Page: http://shichen.name/scorecard
License: Other
Hi. Hope you are well.
I got stuck while computing various calculations manually to cross check the results obtained using your library. Therefore, I would like you to share the exact formulas that you've used to perform following calculations:
Looking forward to your response. Thanks in advance!
Hello,
I've found the perf_eva() function very slow with big datasets. Maybe it would be better to extract the show_plot argument from perf_eva() and write a generic plot.* function to do the plotting part?
Thanks!
--
y<-ifelse(runif(10000)>0.99,1,0)
x<-rep(NA,10000)
x[1:2000]<-9999
sdf<-as.data.frame(cbind(y,x))
names(sdf)<-c("y1","x1")
woebin(sdf,y="y1",x="x1") #ok
woebin(sdf,y="y1",x="x1",
breaks_list = list(x1=c(0)),
special_values = list(x1=c(9999))) #ok
woebin(sdf,y="y1",x="x1",
special_values = list(x1=c(9999))) # warning message and no result
The last call shows a warning and returns no binning result.
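Currently, when woebin is used with manual binning and the variable is a character or factor, any value in the data that is not listed in the manual breaks_list is assigned to the missing bin, where it gets confused with truly missing values. Could such values get a separate bin called unknown? Ideally unknown could either be merged with missing or kept separate, and it could be used as a keyword, for example: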
C(99%,%missing%,%unknown)
I am trying to use this package to build a scorecard. However, different versions of the package give me different binning and IV results from the woebin function, and I can't see any related changes in the release notes for those versions.
Shichen
Sometimes the density plot in the perf_eva function comes out with the y-axis not properly scaled resulting in the lines not showing.
See below for sample code and data to recreate the problem and the resulting plot. I have tested using scorecard version 0.2.4 on Windows and version 0.3.0 on Linux, same result both times.
Thanks
Tomas
require(dplyr)
require(scorecard)
x <- read.csv("M7_WOE Binomial_632.csv")
xx <- x %>% filter(modelsample=="Hold out sample")
perf_eva(xx$modelscore,xx$Churn,show_plot = "density")
Hi @ShichenXie
If I want a missing/non-missing split for a numeric variable, should I use c(Inf) in the breaks_list argument (3rd example)?
The last example, brk <- c(0, 1, Inf), has the issue mentioned in the title.
library(scorecard)
library(readr)
packageVersion("scorecard")
#> [1] '0.3.2.999'
.Platform$OS.type
#> [1] "unix"
data <- read_csv("https://gist.githubusercontent.com/jbkunst/4e8b58d2ffca1b5ca4496f1443aec032/raw/66bf72435e1c6cb7bda32c41dd7a1d3e4e1690cb/test")
#> Parsed with column specification:
#> cols(
#> y = col_character(),
#> variable = col_double()
#> )
str(data)
#> spec_tbl_df [80,000 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ y : chr [1:80000] "good" "good" "good" "bad" ...
#> $ variable: num [1:80000] NA NA NA 0 NA NA NA NA 0 NA ...
#> - attr(*, "spec")=
#> .. cols(
#> .. y = col_character(),
#> .. variable = col_double()
#> .. )
brk <- c(0, Inf)
brk
#> [1] 0 Inf
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#> count neg pos
#> 1: 75280 67543 7737
#> 2: NA 1762 2958
brk <- c(0)
brk
#> [1] 0
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#> count neg pos
#> 1: 75280 67543 7737
#> 2: NA 1762 2958
brk <- c(Inf)
brk
#> [1] Inf
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#> count neg pos
#> 1: 75280 67543 7737
#> 2: 4720 1762 2958
brk <- c(0, 1, Inf)
brk
#> [1] 0 1 Inf
scorecard::woebin(data, y = "y", breaks_list = list(variable = brk))[[1]][, c(3, 5, 6)]
#> [INFO] creating woe binning ...
#> count neg pos
#> 1: 75280 67543 7737
#> 2: NA 860 2520
#> 3: NA 902 438
Created on 2021-06-02 by the reprex package (v2.0.0.9000)
Hi,
Can I confirm one thing?
If I want to generate a scorecard that predicts 'good' (i.e. the positive class is 'good'), the embedded scaling formula (i.e. score = A - B*ln(odds)) will not be correct.
Does anyone know a workaround? Thanks.
I figured out the work around.
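For readers with the same question, here is a minimal sketch of the usual points-to-double-the-odds scaling behind score = A - B*ln(odds). It assumes B = pdo / ln(2) and A = points0 + B * ln(odds0), with odds defined as bad/good; the defaults (points0 = 600, odds0 = 1/19, pdo = 50) and the helper names scale_score and scale_score_good are illustrative assumptions, not scorecard functions.

```r
# Hypothetical helper: score from bad/good odds under the pdo convention.
scale_score <- function(odds_bad, points0 = 600, odds0 = 1 / 19, pdo = 50) {
  b <- pdo / log(2)              # points needed to double the odds
  a <- points0 + b * log(odds0)  # offset so that odds0 maps to points0
  a - b * log(odds_bad)
}

# If the model predicts 'good', the odds are good/bad, so invert them first.
scale_score_good <- function(odds_good, ...) {
  scale_score(1 / odds_good, ...)
}

scale_score(1 / 19)   # target odds give the target points
scale_score(2 / 19)   # doubling the bad odds lowers the score by pdo
scale_score_good(19)  # same applicant, expressed as good/bad odds
```

Inverting the odds is equivalent to flipping the sign of B, which is why a formula written for 'bad' as the positive class gives reversed scores when applied unchanged to a model that predicts 'good'.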
Hi Shichen,
Thanks so much for your package and all your time and work in it!
Is there any way in woebin or another function to merge a missing category (NA values) with very few counts (< count_distr_limit) into a category with a similar badprob?
For example, I would like to merge the missing bin with [0.54,0.8) (the 4th one) because they have similar bad rates.
set.seed(123)
N <- 1000
p <- runif(N)
y <- rbinom(N, 1, p)
p[runif(N) < 0.01] <- NA
scorecard::woebin(data.frame(p, y), y = "y")
#> [INFO] creating woe binning ...
#> $p
#> variable bin count count_distr good bad badprob woe
#> 1: p missing 7 0.007 2 5 0.71428571 0.8362480
#> 2: p [-Inf,0.14) 139 0.139 133 6 0.04316547 -3.1786324
#> 3: p [0.14,0.54) 403 0.403 253 150 0.37220844 -0.6027969
#> 4: p [0.54,0.8) 254 0.254 75 179 0.70472441 0.7898550
#> 5: p [0.8,0.94) 137 0.137 15 122 0.89051095 2.0159281
#> 6: p [0.94, Inf) 60 0.060 2 58 0.96666667 3.2872531
#> bin_iv total_iv breaks is_special_values
#> 1: 0.00455648 1.903872 missing TRUE
#> 2: 0.84406952 1.903872 0.14 FALSE
#> 3: 0.14384048 1.903872 0.54 FALSE
#> 4: 0.14847755 1.903872 0.8 FALSE
#> 5: 0.40997000 1.903872 0.94 FALSE
#> 6: 0.35295827 1.903872 Inf FALSE
Created on 2019-12-24 by the reprex package (v0.3.0)
Thanks in advance for your response.
cc @jm448
Hi and thanks for your useful package. I get the below warning message when running with my own data. The final scorecard seems fine so at the moment I am ignoring the message, but what is the root cause?
bins = woebin(dt_s, y="IsDef", no_cores = 8)
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): dt, xs, y, breaks_list, min_perc_fine_bin, stop_limit, max_num_bin
This is particularly an issue when using the scorecard functions in R Markdown chunks and when generating documents. Is there a way to suppress all output when producing graphics? For example, when using the woebin_plot() function, a ## $variable line is printed in the R Markdown output (or in the R console), which I would like to remove. Similarly, perf_eva prints a few lines of console output that I would like to suppress. Is there an option that outputs only the graph?
In recent use we found that the woe values in the bins returned by woebin are not equal to log(p_bad/p_good).
> bins
$score1
variable bin count count_distr good bad badprob woe bin_iv total_iv breaks
1: score1 missing 1 0.0001557632 1 0 0.00000000 1.7041254 0.001408331 0.0725454 missing
2: score1 [-Inf,475) 401 0.0624610592 300 101 0.25187032 0.6255138 0.029977971 0.0725454 475
3: score1 [475,515) 2191 0.3412772586 1800 391 0.17845733 0.1873414 0.012769587 0.0725454 515
4: score1 [515,570) 3486 0.5429906542 3033 453 0.12994836 -0.1872396 0.017822342 0.0725454 570
5: score1 [570, Inf) 341 0.0531152648 307 34 0.09970674 -0.4863114 0.010567168 0.0725454 Inf
is_special_values
1: TRUE
2: FALSE
3: FALSE
4: FALSE
5: FALSE
bins1$score1%>%
Why does this happen? Is there something wrong?
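One way to cross-check the table by hand is sketched below. It assumes WOE is the log of the ratio of column distributions, ln(DistrBad / DistrGood), rather than the within-bin ratio log(p_bad / p_good), and that zero counts are replaced by 0.99 before the distributions are computed, as in the iv_xy source quoted elsewhere on this page. Whether woebin applies exactly this adjustment is an assumption, although it does reproduce the woe column above.

```r
# Good/bad counts copied from the bins table above.
good <- c(1, 300, 1800, 3033, 307)
bad  <- c(0, 101, 391, 453, 34)

# Assumed adjustment: replace zero counts by 0.99 to avoid log(0).
good_adj <- ifelse(good == 0, 0.99, good)
bad_adj  <- ifelse(bad == 0, 0.99, bad)

# WOE compares each bin's share of all bads with its share of all goods.
distr_good <- good_adj / sum(good_adj)
distr_bad  <- bad_adj / sum(bad_adj)
woe <- log(distr_bad / distr_good)

# The naive within-bin ratio differs from WOE by a constant offset.
naive <- log(bad_adj / good_adj)

data.frame(good, bad, woe = round(woe, 4), naive = round(naive, 4))
```

The two quantities differ by the constant ln(sum(good) / sum(bad)), so the bin ordering is the same but the values are not.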
Running the woebin function with method = "chimerge" produces an error.
In addition, if I use the method = "tree" and set the bin_num_limit = 10, some of the variables still have more than 10 bins.
bins <- scorecard::woebin(dt = TrainingData, y= "StatusbinnenT12",
+ x =xvars,positive = 1, count_distr_limit = 0.01,
+ bin_num_limit = 10, method = "chimerge")
Result:
[INFO] creating woe binning ...
Error in checkForRemoteErrors(val) :
one node produced an error: Error in match.arg(type) : 'arg' must be NULL or a character vector
Hi,
I've used the newly added report() function, which is awesome! Thanks for the nice work.
But I am a little confused about the results of the function in Example II on your pkgdown website.
# Example II
# input dt is a list
# multiple datasets
report(list(dt1=germancredit[sample(1000,500)],
dt2=germancredit[sample(1000,500)],
dt3=germancredit[sample(1000,500)]), y, x,
breaks_list, special_values, seed=NULL, save_report='report5')
The model coefficients sheet in the results contains only one model, as the following table shows.
Model coefficients based on dt1 dataset
variable | Estimate | Std. Error | z value | Pr(>|z|) | gvif | info_value |
---|---|---|---|---|---|---|
(Intercept) | -0.881272674 | 0.1215 | -7.2541 | 0 | ||
age.in.years_woe | 0.668220936 | 0.2807 | 2.3806 | 0.0173 | 1.1049 | 0.22 |
credit.amount_woe | 0.967266281 | 0.3634 | 2.6614 | 0.0078 | 1.0981 | 0.114 |
credit.history_woe | 0.757056585 | 0.246 | 3.0779 | 0.0021 | 1.0486 | 0.2446 |
duration.in.month_woe | 1.099128399 | 0.3058 | 3.5943 | 0.0003 | 1.1218 | 0.1909 |
housing_woe | 0.643615146 | 0.3308 | 1.9454 | 0.0517 | 1.1837 | 0.1448 |
installment.rate.in.percentage.of.disposable.income_woe | 3.036753444 | 1.185 | 2.5627 | 0.0104 | 1.0749 | 0.0112 |
other.installment.plans_woe | 1.09219108 | 0.9865 | 1.1071 | 0.2682 | 1.0651 | 0.0138 |
personal.status.and.sex_woe | 0.891955405 | 0.3022 | 2.9514 | 0.0032 | 1.1737 | 0.173 |
present.employment.since_woe | 0.500596126 | 0.3533 | 1.4171 | 0.1565 | 1.0975 | 0.116 |
property_woe | -0.026460202 | 0.5042 | -0.0525 | 0.9581 | 1.2567 | 0.066 |
purpose_woe | 1.099060268 | 0.3181 | 3.4549 | 0.0006 | 1.0602 | 0.1507 |
savings.account.and.bonds_woe | 0.815971484 | 0.3136 | 2.6017 | 0.0093 | 1.0375 | 0.1591 |
status.of.existing.checking.account_woe | 0.835177462 | 0.1528 | 5.4651 | 0 | 1.0402 | 0.6804 |
Based on the above result, we can infer that the model is fitted on the dt1 data, and the other two datasets are evaluated with the glm model fitted on dt1. We can consider dt1 as the training set, dt2 as the test set and dt3 as the out-of-time sample (although it is not).
Am I right? If so, I would advise you to point this out in the function's documentation. For example,
dt A data frame with both x (predictor/feature) and y (response/label) variables; or a list of dataframes. If a list of dataframes provided, only the first dataframe would be used for training. Other dataframes would be used for testing.
Thanks again for this awesome package!
Hi,
I've recently encountered a serious problem using the woebin_ply() function. Here is my code,
model_woe_set <- woebin_ply(select(mod_data, -user, -creation_date), bins =model_woe, print_step = 1)
The output in the Rstudio console is
[INFO] Woe transformating on 88120 rows and 904 columns in 00:05:08
However, when I inspect the data.frame model_woe_set, I get the following results,
model_woe_set %>% dim()
[1] 88120 89023
Furthermore, the column names in the model_woe_set data.frame become the following,
[1] "mon_woe" "age_woe"
......
[961] "V962" "V963" "V964"
[964] "V965" "V966" "V967"
[967] "V968" "V969" "V970"
[970] "V971" "V972" "V973"
[973] "V974" "V975" "V976"
[976] "V977" "V978" "V979"
[979] "V980" "V981" "V982"
[982] "V983" "V984" "V985"
[985] "V986" "V987" "V988"
[988] "V989" "V990" "V991"
[991] "V992" "V993" "V994"
[994] "V995" "V996" "V997"
[997] "V998" "V999" "V1000"
[1000] "V1001"
[ reached getOption("max.print") -- omitted 88023 entries ]
Materializing model_woe_set then crashes RStudio, which I think is because there is not enough memory.
All in all, this problem is very weird. Sorry that I cannot provide a minimal reproducible example, since the data cannot be shared.
My session info,
sessioninfo::session_info()
- Session info --------------------------------------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.5.3 (2019-03-11)
os Windows 7 x64 SP 1
system x86_64, mingw32
ui RStudio
language (EN)
collate Chinese (Simplified)_People's Republic of China.936
ctype Chinese (Simplified)_People's Republic of China.936
tz Asia/Taipei
date 2019-03-31
- Packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.1)
bit 1.1-14 2018-05-29 [1] CRAN (R 3.5.0)
bit64 0.9-7 2017-05-08 [1] CRAN (R 3.5.0)
blob 1.1.1 2018-03-25 [1] CRAN (R 3.5.1)
broom 0.5.1 2018-12-05 [1] CRAN (R 3.5.1)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.1)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.3)
clipr * 0.5.0 2019-01-11 [1] CRAN (R 3.5.2)
codetools 0.2-16 2018-12-24 [1] CRAN (R 3.5.3)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
data.table 1.12.0 2019-01-13 [1] CRAN (R 3.5.3)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.5.1)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
doParallel 1.0.14 2018-09-24 [1] CRAN (R 3.5.1)
dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
DT 0.5 2018-11-05 [1] CRAN (R 3.5.1)
forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.5.2)
foreach 1.4.4 2017-12-12 [1] CRAN (R 3.5.1)
furrr 0.1.0 2018-05-16 [1] CRAN (R 3.5.1)
future 1.12.0 2019-03-08 [1] CRAN (R 3.5.3)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.1)
ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.1)
globals 0.12.4 2018-10-11 [1] CRAN (R 3.5.1)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.3)
gridExtra 2.3 2017-09-09 [1] CRAN (R 3.5.1)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.5.3)
haven 2.1.0 2019-02-19 [1] CRAN (R 3.5.2)
hms 0.4.2.9001 2018-09-04 [1] Github (tidyverse/hms@979286f)
htmltools 0.3.6.9003 2018-12-11 [1] Github (rstudio/htmltools@99a78d0)
htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.1)
httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.1)
iterators 1.0.10 2018-07-13 [1] CRAN (R 3.5.1)
janitor * 1.1.1 2018-07-31 [1] CRAN (R 3.5.1)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.1)
lattice 0.20-38 2018-11-04 [1] CRAN (R 3.5.3)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.5.3)
listenv 0.7.0 2018-01-21 [1] CRAN (R 3.5.1)
lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
modelr 0.1.4 2019-02-18 [1] CRAN (R 3.5.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.1)
nlme 3.1-137 2018-04-07 [1] CRAN (R 3.5.3)
odbc * 1.1.6 2018-06-09 [1] CRAN (R 3.5.1)
openxlsx 4.1.0 2018-05-26 [1] CRAN (R 3.5.1)
patchwork * 0.0.1 2018-09-04 [1] Github (thomasp85/patchwork@7fb35b1)
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.1)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
ppdai * 0.1.2 2018-11-11 [1] local
ppdai.extra * 0.2.3.9999 2019-03-13 [1] local
purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.5.3)
qs 0.14.1 2019-03-02 [1] CRAN (R 3.5.3)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
RApiSerialize 0.1.0 2014-04-19 [1] CRAN (R 3.5.2)
Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.5.3)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.1)
readxl 1.3.1 2019-03-13 [1] CRAN (R 3.5.3)
rlang 0.3.3 2019-03-29 [1] CRAN (R 3.5.3)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.5.3)
rvest 0.3.2 2016-06-17 [1] CRAN (R 3.5.1)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
scorecard * 0.2.4 2019-03-29 [1] Github (ShichenXie/scorecard@5b45fb8)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.5.3)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
tibble * 2.1.1 2019-03-16 [1] CRAN (R 3.5.3)
tidyr * 0.8.3 2019-03-01 [1] CRAN (R 3.5.2)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.1)
tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.5.3)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
writexl 1.1 2018-12-02 [1] CRAN (R 3.5.1)
xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.1)
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
zip 2.0.1 2019-03-11 [1] CRAN (R 3.5.3)
[1] C:/Program Files/R/R-3.5.3/library
Hoping for a quick fix. Thanks!
Hi,
Thanks for publishing the package. I've recently been trying to run the example code from your GitHub page using R 3.3.3. However, when the code calls perf_eva, I get the error shown below
Error in perf_eva(pred = pred_list, label = label_list) :
could not find function "isFALSE"
Is there any workaround about it?
My env is as follows
R3.3.3
data.table 1.10.4-3
ggplot2 3.0.0
gridExtra 2.3
foreach 1.4.4
doParallel 1.0.14
parallel 3.3.3
openxlsx 4.0.17
Thanks
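For context, isFALSE() was added to base R in version 3.5.0, so it does not exist in R 3.3.3. A possible stopgap, sketched below, is to define an equivalent function before calling the package; is_false_compat is a hypothetical helper name, and whether a definition in the global environment is picked up by the package code in a given setup is an assumption worth verifying. Upgrading R is the more robust fix.

```r
# Backport of isFALSE() for R versions older than 3.5.0.
# is_false_compat is a hypothetical name, not part of scorecard.
is_false_compat <- function(x) {
  is.logical(x) && length(x) == 1L && !is.na(x) && !x
}

# Only define isFALSE when base R does not already provide it.
if (!exists("isFALSE", mode = "function")) {
  isFALSE <- is_false_compat
}

is_false_compat(FALSE)            # a single FALSE
is_false_compat(c(FALSE, FALSE))  # length 2, so not a single FALSE
is_false_compat(NA)               # NA is not FALSE
```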
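Hello ShichenXie. My English is not good enough, so I hope you can read Chinese.
First of all, thank you very much for developing the scorecard package. Here I am reporting a problem I ran into:
When a non-intercept coefficient of the logit regression is negative, the sign of the points in the output scorecard is wrong.
I am using the R version scorecard_0.19.
Details:
The regression weight of TD_CREDITSCORE is -0.5396, with odds0 =1/19, points=600, pdo=50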
1 | variable | bin | woe | points | count | count_distr | good | bad | badprob | bin_iv | total_iv | breaks | is_special_values |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
33 | TD_CREDITSCORE | missing | 0.250670111 | 10 | 509 | 0.14753623 | 434 | 75 | 0.14734774 | 1.019171e-02 | 0.09441740 | missing | FALSE |
34 | TD_CREDITSCORE | [-Inf,25) | -0.292529584 | -11 | 1721 | 0.49884058 | 1564 | 157 | 0.09122603 | 3.815798e-02 | 0.09441740 | 25 | FALSE |
35 | TD_CREDITSCORE | [25,35) | 0.001571926 | 0 | 716 | 0.20753623 | 631 | 85 | 0.11871508 | 5.131192e-07 | 0.09441740 | 35 | FALSE |
36 | TD_CREDITSCORE | [35,50) | 0.465781491 | 18 | 357 | 0.10347826 | 294 | 63 | 0.17647059 | 2.671513e-02 | 0.09441740 | 50 | FALSE |
37 | TD_CREDITSCORE | [50, Inf) | 0.602837737 | 23 | 147 | 0.04260870 | 118 | 29 | 0.19727891 | 1.935207e-02 | 0.09441740 | Inf | FALSE |
The table above is the scorecard output; a manual calculation shows that the sign of the points is wrong.
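A hand calculation of the points can be sketched as follows, assuming the convention points = round(-B * coefficient * woe) with B = pdo / ln(2). This convention is an assumption and has not been checked against the source of version 0.19; the inputs are the coefficient and woe values reported above.

```r
pdo <- 50
b <- pdo / log(2)   # points to double the odds, about 72.13

coef_td <- -0.5396  # reported regression weight of TD_CREDITSCORE
woe <- c(0.250670111, -0.292529584, 0.001571926, 0.465781491, 0.602837737)

# Assumed convention: higher WOE (more bads) lowers the score when coef > 0.
points <- round(-b * coef_td * woe)
points
```

Under this convention the negative coefficient reverses the usual direction, so bins with a higher bad rate receive more points. That reproduces the points column in the table, which suggests the sign follows from the negative coefficient itself rather than from the scaling step.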
Hi,
I have a question on the default value and use of parameter odds0.
So if I want to set a target score of 600 where the odds of good/bad = 50; does it mean I should set odds0=1/50 in the function? (I am trying to replicate this scorecard building method with your package https://towardsdatascience.com/intro-to-credit-scorecard-9afeaaa3725f).
Thanks in advance for the help.
data("germancredit")
y = 'creditability'
x = c(
"status.of.existing.checking.account",
"duration.in.month",
"credit.history",
"purpose",
"credit.amount",
"savings.account.and.bonds",
"present.employment.since",
"installment.rate.in.percentage.of.disposable.income",
"personal.status.and.sex",
"property",
"age.in.years",
"other.installment.plans",
"housing"
)
special_values=NULL
breaks_list=list(
status.of.existing.checking.account=c("... < 0 DM%,%0 <= ... < 200 DM",
"... >= 200 DM / salary assignments for at least 1 year", "no checking account"),
duration.in.month=c(8, 16, 34, 44),
credit.history=c(
"no credits taken/ all credits paid back duly%,%all credits at this bank paid back duly",
"existing credits paid back duly till now", "delay in paying off in the past",
"critical account/ other credits existing (not at this bank)"),
purpose=c("retraining%,%car (used)", "radio/television",
"furniture/equipment%,%domestic appliances%,%business%,%repairs",
"car (new)%,%others%,%education"),
credit.amount=c(1400, 1800, 4000, 9200),
savings.account.and.bonds=c("... < 100 DM", "100 <= ... < 500 DM",
"500 <= ... < 1000 DM%,%... >= 1000 DM%,%unknown/ no savings account"),
present.employment.since=c("unemployed%,%... < 1 year", "1 <= ... < 4 years",
"4 <= ... < 7 years", "... >= 7 years"),
installment.rate.in.percentage.of.disposable.income=c(2, 3),
personal.status.and.sex=c("female : divorced/separated/married", "male : single",
"male : married/widowed"),
property=c("real estate", "building society savings agreement/ life insurance",
"car or other, not in attribute Savings account/bonds", "unknown / no property"),
age.in.years=c(26, 28, 35, 37),
other.installment.plans=c("bank%,%stores", "none"),
housing=c("rent", "own", "for free")
)
# Example I
# input dt is a data frame
# split input data frame into two
report(germancredit, y, x, breaks_list, special_values, seed=618, save_report='report1',
show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'))
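Hello, I would like to deploy the scorecard I developed directly to a production R service, but I found that the woe replacement for a single row is too slow: it takes about 1.5 seconds, which does not meet the requirements for going live. Below is an example with only 7 variables and one row of data; it takes 1.39 seconds. I can hand-write this replacement step and bring the single-row time down to tens of milliseconds, but I still hope this can become a feature of this excellent package. I don't know whether that is feasible.
woebin_ply(input, sc$bins) %>% rename_all(~ .x %>% str_remove("_woe"))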
[INFO] converting into woe values ...
no als_m12_cell_nbank_finlea_orgnum als_m3_id_nbank_cf_orgnum r_m01_cell_pdl_0allnumorgnum r_m03_id_caon_0allnumorgnum
1: 1 0.05253662 -0.2286684 -0.006342389 -0.1508286
r_m12_cell_0sloannbank_allnum r_m12_cell_nbank_0weekall_allnum r_m12_id_0avgmax_monnum
1: -0.06246272 -0.112977 -0.4454012
system.time( woebin_ply(input,sc$bins)%>%rename_all(
[INFO] converting into woe values ...
user system elapsed
0.00 0.02 1.39
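The hand-written replacement mentioned above could look roughly like the following sketch, which assumes numeric variables, left-closed bins of the form [a, b), and a precomputed vector of break points with one woe value per bin. woe_lookup is a hypothetical helper, and the break and woe values are illustrative only.

```r
# Hypothetical helper: map numeric values to woe with a vectorised lookup.
# breaks: sorted interior break points; woe: one value per bin,
# so length(woe) must equal length(breaks) + 1.
woe_lookup <- function(x, breaks, woe) {
  stopifnot(length(woe) == length(breaks) + 1L)
  # findInterval() returns 0 below the first break, 1 in [b1, b2), and so on.
  woe[findInterval(x, breaks) + 1L]
}

# Illustrative bins: [-Inf,0), [0,0.52), [0.52,0.6), [0.6,Inf)
breaks <- c(0, 0.52, 0.6)
woe    <- c(1.16, -0.63, 0.51, -0.06)

woe_lookup(c(-999, 0.3, 0.55, 0.6), breaks, woe)
```

Missing values come back as NA here, so a production version would need a separate rule for the missing bin and for special values.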
If the positive outcome is something other than "bad" or 1 and the positive option is used to define it, woebin can fail under some conditions. For example, if the levels are "bad" and "not bad" instead of "bad" and "good", the code fails because the recoding uses string matching, so every record gets coded as 1.
It would also be nice if the WOE tables and plots reflected the levels of the dependent variable, since labeling everything as good or bad does not always make sense. For example, I use scorecard for marketing, where the positive outcome is a response, so it would be nice to be able to label the plots and tables accordingly.
Love your package, find it very useful.
Tomas
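The recoding problem described in this issue can be illustrated with a small sketch. It assumes the recoding relies on pattern matching such as grepl(), as the report describes; this is not the package's actual code.

```r
y <- c("bad", "not bad", "bad", "not bad")
positive <- "bad"

# Pattern matching: "bad" also matches inside "not bad".
y_pattern <- as.integer(grepl(positive, y))

# Exact comparison: only the literal level "bad" is coded as 1.
y_exact <- as.integer(y == positive)

y_pattern  # every record is coded as 1
y_exact    # only the true positives are coded as 1
```

Until the matching is exact, one workaround is to recode the response to 0/1 yourself before passing it to woebin.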
describe(data)
Error in setnames(sum_dtnum, c("min", "p25", "p50", "mean", "p75", "max", :
Can't assign 8 names to a 9 column data.table
In addition: Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)
and it causes many other problems in the scorecard package functions
Our model data has an ID column (e.g. PERSON_ID), the main distinct ID for which we need scores.
The package drops the ID column. How can we declare it in the code? How does the code know which column is the ID? (It rejects that column at the beginning, assuming it is a feature.)
I could not see anywhere in the code to specify the ID column. The ID disappears after this call:
dt_sel = var_filter(germancredit, "creditability")
As a result, we do not know which score belongs to which PERSON_ID (the output only gives rows and scores).
I hope my question is clear :)
Maybe the final scorecard code should include the ID column, or the var_filter call could accept an ID column, like:
var_filter(germancredit, "creditability","person_id")
# credit score, only_total_score = FALSE
score_list2 = lapply(dt_list, function(x) scorecard_ply(x,card, only_total_score=FALSE))
Thanks for that great work!
Hi @ShichenXie,
I have the following question regarding the counts and the cut points of the variables. In this example, replicating the record count for variable "x" using the base::cut function does not give the same results as the woebin function.
However, I have verified that when using the woebin_ply function, the counts match the base::cut calculation.
Thank you,
library(readr)
library(scorecard)
#> Warning: package 'scorecard' was built under R version 4.0.5
suppressPackageStartupMessages(library(dplyr))
packageVersion("scorecard")
#> [1] '0.3.2'
d <- read_csv("https://gist.githubusercontent.com/jm448/a8edc0f3a89c6797c52aa84f978eca6f/raw/4ca39c576a23ae5b94b19c5829149d6800b75991/data.txt")
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> x = col_double(),
#> response = col_double()
#> )
glimpse(d)
#> Rows: 253
#> Columns: 2
#> $ x <dbl> 1.0000000, -999.0000000, 0.3639344, 0.9988413, 0.7696078, ...
#> $ response <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
bin <- woebin(
d,
x = "x",
y = "response",
method = "tree",
count_distr_limit = 0.05
)
#> [INFO] creating woe binning ...
bin
#> $x
#> variable bin count count_distr neg pos posprob woe
#> 1: x [-Inf,0) 24 0.09486166 15 9 0.37500000 1.16158709
#> 2: x [0,0.52) 88 0.34782609 80 8 0.09090909 -0.63017238
#> 3: x [0.52,0.6) 21 0.08300395 16 5 0.23809524 0.50926190
#> 4: x [0.6, Inf) 120 0.47430830 102 18 0.15000000 -0.06218834
#> bin_iv total_iv breaks is_special_values
#> 1: 0.179555187 0.3174041 0 FALSE
#> 2: 0.110649986 0.3174041 0.52 FALSE
#> 3: 0.025403323 0.3174041 0.6 FALSE
#> 4: 0.001795579 0.3174041 Inf FALSE
# counts:
bin$x$count
#> [1] 24 88 21 120
brks <- bin$x$breaks
brks <- as.numeric(brks)
brks <- c(-Inf, brks)
brks
#> [1] -Inf 0.00 0.52 0.60 Inf
dc <- d %>%
mutate(x_bin = cut(x, brks, right = FALSE)) %>%
count(x_bin)
dc
#> # A tibble: 4 x 2
#> x_bin n
#> <fct> <int>
#> 1 [-Inf,0) 24
#> 2 [0,0.52) 88
#> 3 [0.52,0.6) 20
#> 4 [0.6, Inf) 121
# the counts using cut doesn't match with the woebin results
dc$n
#> [1] 24 88 20 121
# the counts match using woebin_ply
woebin_ply(d, bins = bin, to = "bin") %>%
as_tibble() %>%
mutate(
x = d$x,
x_bin2 = cut(x, brks, right = FALSE)
) %>%
filter(x_bin != x_bin2)
#> [INFO] converting into woe values ...
#> # A tibble: 0 x 4
#> # ... with 4 variables: response <dbl>, x_bin <chr>, x <dbl>, x_bin2 <fct>
# the woe values match
bin$x %>%
mutate(
neg_porc = neg / sum(neg),
pos_porc = pos / sum(pos),
woe2 = log(pos_porc / neg_porc)
) %>%
select(bin, count, pos, neg, woe, woe2) %>%
mutate(woe == woe2)
#> bin count pos neg woe woe2 woe == woe2
#> 1: [-Inf,0) 24 9 15 1.16158709 1.16158709 TRUE
#> 2: [0,0.52) 88 8 80 -0.63017238 -0.63017238 TRUE
#> 3: [0.52,0.6) 21 5 16 0.50926190 0.50926190 TRUE
#> 4: [0.6, Inf) 120 18 102 -0.06218834 -0.06218834 TRUE
# woe values using base::cut
bin2 <- d %>%
mutate(x_bin = cut(x, brks, right = FALSE)) %>%
count(x_bin, response) %>%
mutate(response = if_else(response == 1, "pos", "neg")) %>%
tidyr::pivot_wider(names_from = "response", values_from = "n") %>%
mutate(
neg_porc = neg / sum(neg),
pos_porc = pos / sum(pos),
woe2 = log(pos_porc / neg_porc)
) %>%
select(x_bin, woe2)
# the woe values using base::cut doesn't match with woe values from woebin
bin$x %>%
select(bin, woe) %>%
left_join(bin2, by = c("bin" = "x_bin")) %>%
mutate(woe == woe2)
#> bin woe woe2 woe == woe2
#> 1: [-Inf,0) 1.16158709 1.16158709 TRUE
#> 2: [0,0.52) -0.63017238 -0.63017238 TRUE
#> 3: [0.52,0.6) 0.50926190 0.57380042 FALSE
#> 4: [0.6, Inf) -0.06218834 -0.07194452 FALSE
Created on 2021-05-19 by the reprex package (v0.3.0)
Hi, my score list contains test and train, and I want to create breakpoints based only on the train data and apply those breakpoints to the test data. Is that possible?
What I have found is that it creates breakpoints based on both datasets. Isn't this wrong?
Ideally, for PSI, the bins should be created only from the train data, and then we would want to see how the test data is distributed over those bins and how much it has shifted.
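The approach described above can be sketched in base R as follows. It assumes quantile-based bins taken from the train scores only and the usual PSI formula sum((actual - expected) * ln(actual / expected)). psi_train_bins is a hypothetical helper, not a scorecard function, and the small floor for empty bins is an arbitrary choice.

```r
# Hypothetical helper: PSI of test against train, with bins from train only.
psi_train_bins <- function(train, test, n_bins = 10) {
  # Break points come from the train distribution alone.
  brk <- unique(quantile(train, probs = seq(0, 1, length.out = n_bins + 1),
                         names = FALSE))
  brk[1] <- -Inf
  brk[length(brk)] <- Inf

  # Share of records per bin, using left-closed intervals [a, b).
  expected <- as.numeric(table(cut(train, brk, right = FALSE))) / length(train)
  actual   <- as.numeric(table(cut(test,  brk, right = FALSE))) / length(test)

  # Floor empty bins to avoid log(0); 1e-6 is an arbitrary small constant.
  expected <- pmax(expected, 1e-6)
  actual   <- pmax(actual, 1e-6)

  sum((actual - expected) * log(actual / expected))
}

set.seed(1)
train <- rnorm(1000, mean = 600, sd = 50)
test  <- rnorm(1000, mean = 620, sd = 50)

psi_train_bins(train, train)  # identical distributions give zero
psi_train_bins(train, test)   # a shifted test sample gives a positive PSI
```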
VIF has been included as a function in the scorecard package, so why not use it in var_filter to filter out highly correlated variables (assuming all the variables are numeric)? Another way might be to use caret::findCorrelation to do the filtering.
Hello,
I just discovered a GitHub repo, jstephenj14/Monotonic-WOE-Binning-Algorithm, which provides a Python implementation of a variable binning algorithm that optimizes information value (IV) monotonicity and representativeness.
I think it would be great to include this algorithm in your fantastic scorecard package. Since the author provides a Python version, I wonder if it could be ported to your scorecard R package.
Thanks!
Today I used woebin to bin a factor variable, but the result is strange.
pflag<-rep(0,12278)
cus_cus_class<-rep("1",12278)
pflag1<-rep(1,241213)
cus_cus_class1<-rep("1",241213)
pflag2<-rep(0,3646)
cus_cus_class2<-rep("3",3646)
pflag3<-rep(1,1762)
cus_cus_class3<-rep("3",1762)
pflagall<-c(pflag,pflag1,pflag2,pflag3)
cus_cus_classall<-c(cus_cus_class,cus_cus_class1,cus_cus_class2,cus_cus_class3)
cus_cus_classall<-as.factor(cus_cus_classall)
df=data.frame(pflagall,cus_cus_classall)
table(df)
library(scorecard)
library(smbinning)
iv(df,"pflagall","cus_cus_classall")
woebin(df,"pflagall","cus_cus_classall")
smbinning.factor(df,"pflagall","cus_cus_classall")$ivtable
The IV from both iv and smbinning is 0.84, but woebin cannot bin the variable at all.
Great package on this subject, very nice job! I have a problem. For example, I have a variable such as dat<-data.frame(y=c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0),x=c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20)). In this case, I want to treat '888' and '666' as two special classes, each with its own woe, just as missing values have their own woe; the other values should be binned as usual. How should this type of data be handled? Thanks!
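The train KS from perf_eva is 23, but the report shows 17. The PSI of the score also differs.
Below is a reproducible example, including data and code. Looking forward to your help.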
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936 LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C LC_TIME=Chinese (Simplified)_China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scorecard_0.3.1 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3
[9] ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2 blob_1.2.1 rlang_0.4.7
[8] pillar_1.4.6 withr_2.2.0 glue_1.4.1 DBI_1.1.0 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
[15] foreach_1.5.0 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 zip_2.0.4
[22] codetools_0.2-16 doParallel_1.0.15 parallel_4.0.2 fansi_0.4.1 broom_0.7.0 Rcpp_1.0.5 backports_1.1.8
[29] scales_1.1.1 jsonlite_1.7.0 farver_2.0.3 fs_1.4.2 gridExtra_2.3 digest_0.6.25 hms_0.5.3
[36] packrat_0.5.0 stringi_1.4.6 openxlsx_4.1.5 grid_4.0.2 cli_2.0.2 tools_4.0.2 magrittr_1.5
[43] crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 data.table_1.12.8 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9
[50] assertthat_0.2.1 httr_1.4.1 rstudioapi_0.11 iterators_1.0.12 R6_2.4.1 compiler_4.0.2
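Hello Mr. Xie, I have been following this package for a long time. There is a point of contention, not really a bug, that I would like to discuss with you. I have a variable in which 70% of the values are 0. When I bin it myself with a Python decision tree and compute woe, its IV reaches 1.5, and smbinning gives the same result, so it is quite an important variable. With scorecard, however, no split is found (that is, the only bin is [-Inf, Inf]). My guess is that this is because intervals in R are closed on the left and open on the right, so the value 0 easily gets merged away. Do you have any solution or good suggestion for this?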
test.xlsx
If the numeric variable only contains two values, it will not output the correct bin when doing 'tree' and 'chimerge' binning. But 'width' and 'freq' binning work well.
tst_dt <- data.table(var = c(rep(10, 20), rep(20, 10)), target = c(sample(c(0,1), 30, replace = TRUE)))
tree_bins = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'tree')
chi_bins = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'chimerge')
width_bins = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'width')
freq_bins = woebin(tst_dt, y = 'target', x = 'var', positive = "1", method = 'freq')
From the source code, 'tree' and 'chimerge' binning will call the function 'woebin2_init_bin'.
The following code from this function drops the value of this binary variable. So that causes only one bin [-Inf, Inf].
brk = sort(brk[(brk < max(xvalue, na.rm =TRUE)) & (brk > min(xvalue, na.rm =TRUE))])
Please review.
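The effect of the quoted filter line can be reproduced in isolation. The sketch below assumes the candidate breaks for a two-valued variable are its unique values (the real candidates inside woebin2_init_bin may differ), and the relaxed variant is only an illustration, not the package's fix.

```r
xvalue <- c(rep(10, 20), rep(20, 10))

# Assumption: candidate break points are the unique values of the variable.
brk <- sort(unique(xvalue))

# Filter as quoted from the source: strictly inside (min, max).
kept <- sort(brk[(brk < max(xvalue, na.rm = TRUE)) &
                 (brk > min(xvalue, na.rm = TRUE))])
length(kept)   # no break survives, hence the single bin [-Inf, Inf]

# Illustrative relaxed variant: allow a break at the maximum, which
# with left-closed bins would give [-Inf, 20) and [20, Inf).
kept_relaxed <- sort(brk[(brk <= max(xvalue, na.rm = TRUE)) &
                         (brk > min(xvalue, na.rm = TRUE))])
kept_relaxed
```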
Thanks for writing such a fantastic package!
I am curious about how the function iv() calculates the information value for a continuous variable. Look at the source code of iv():
ivlist = dt[, sapply(.SD, iv_xy, label), .SDcols = x]
iv_xy = function(x, y) {
. = DistrBad = DistrGood = bad = good = NULL
data.table(x=x, y=y)[
, .(good = sum(y==0), bad = sum(y==1)), keyby="x"
][, (c("good", "bad")) := lapply(.SD, function(x) ifelse(x==0, 0.99, x)), .SDcols = c("good", "bad")# replace 0 by 0.99 in good/bad columns
][, `:=`(
DistrGood = good/sum(good), DistrBad = bad/sum(bad)
)][, sum((DistrBad-DistrGood)*log(DistrBad/DistrGood)) ]
}
I am not very familiar with the data.table package. Based on the above code, it seems that you treat every unique value as a group and count the number of "bad" and "good" in each group. From this perspective, I use the dplyr package to calculate the information value of the variable age.in.years in the germancredit dataset.
Based on the iv() function in your scorecard package, the IV of the variable age.in.years is
ivs <- iv(germancredit, y = "creditability")
# Warning message:
# In check_y(dt, y, positive) :
# The positive value in "creditability" was replaced by 1 and negative value by 0.
ivs[variable=="age.in.years",]
# variable info_value
# 1: age.in.years 0.2596514
While the result of my dplyr solution is
library(dplyr)
library(tidyr)
germancredit %>%
count(age.in.years, creditability) %>%
spread(key = creditability, value = n) %>%
# delete groups including only one class: "good" or "bad"
na.omit() %>%
mutate(total_good = germancredit %>%
count(creditability) %>%
filter(creditability == "good") %>%
pull(n),
total_bad = germancredit %>%
count(creditability) %>%
filter(creditability == "bad") %>%
pull(n)) %>%
mutate(bad_distr = bad / total_bad,
good_distr = good / total_good,
woe = log(bad_distr / good_distr),
bin_iv = (bad_distr - good_distr) * woe,
total_iv = sum(bin_iv))
# A tibble: 47 x 10
age.in.years bad good total_good total_bad bad_distr good_distr woe bin_iv total_iv
<dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 19. 1 1 700 300 0.00333 0.00143 0.847 0.00161 0.257
2 20. 5 9 700 300 0.0167 0.0129 0.260 0.000989 0.257
3 21. 5 9 700 300 0.0167 0.0129 0.260 0.000989 0.257
4 22. 11 16 700 300 0.0367 0.0229 0.473 0.00653 0.257
5 23. 20 28 700 300 0.0667 0.0400 0.511 0.0136 0.257
6 24. 19 25 700 300 0.0633 0.0357 0.573 0.0158 0.257
7 25. 19 22 700 300 0.0633 0.0314 0.701 0.0224 0.257
8 26. 14 36 700 300 0.0467 0.0514 -0.0972 0.000463 0.257
9 27. 13 38 700 300 0.0433 0.0543 -0.225 0.00247 0.257
10 28. 15 28 700 300 0.0500 0.0400 0.223 0.00223 0.257
# ... with 37 more rows
It is clear that my results are a little different from yours. Maybe this is due to your use of ifelse(x==0, 0.99, x)? I am perplexed by this line of code.
In sum, my questions are
-- How does the function iv() work for continuous variables?
-- If the iv() function does not use the optimal bins for continuous variables, are the results of the iv() function reliable?
Thanks again for this awesome package. By the way, the slides on your website are super helpful!
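One way to see where the two numbers can diverge is to compare the two treatments of one-class groups on toy data. The sketch below is base R only and is not the package's implementation: `iv_by_value` is a hypothetical helper that mirrors the quoted iv_xy logic (zero counts replaced by 0.99, so that the log ratio stays finite), while the second calculation drops one-class groups but keeps the overall totals, as the dplyr version with na.omit() does. The data are made up.

```r
# Hypothetical helper mirroring the quoted iv_xy logic in base R
iv_by_value <- function(x, y) {
  good <- tapply(y == 0, x, sum)  # goods per unique value of x
  bad  <- tapply(y == 1, x, sum)  # bads per unique value of x
  good[good == 0] <- 0.99         # same replacement as in the quoted code
  bad[bad == 0]   <- 0.99
  distr_good <- good / sum(good)
  distr_bad  <- bad / sum(bad)
  sum((distr_bad - distr_good) * log(distr_bad / distr_good))
}

# Toy data: the value 3 contains goods only
x <- c(1, 1, 1, 2, 2, 2, 3, 3)
y <- c(0, 1, 0, 1, 1, 0, 0, 0)
iv_with_adjustment <- iv_by_value(x, y)

# Alternative: drop one-class groups, but divide by the overall totals
good <- tapply(y == 0, x, sum)
bad  <- tapply(y == 1, x, sum)
both <- good > 0 & bad > 0
distr_good <- good[both] / sum(y == 0)
distr_bad  <- bad[both] / sum(y == 1)
iv_dropping_group <- sum((distr_bad - distr_good) * log(distr_bad / distr_good))

print(c(adjusted = iv_with_adjustment, dropped = iv_dropping_group))
```

The two values differ because the adjusted version still counts the one-class group and slightly changes the column totals, whereas the other version ignores that group entirely.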
When I use the scorecard package, which is a great tool, I encounter the error
"Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
'data' must be of a vector type, was 'NULL'"
when binning variables with woebin() or woebin_ply().
Digging into it, I found that the internal function rmcol_datetime_unique1() assumes that dt has a character column, which is not the case in my data.
Hi @ShichenXie
First, thank you so much for your work. This package helps me a lot!
What is the reason for using cut(..., right = FALSE) in woebin, given that the default value in base::cut is TRUE?
For example
Line 344 in 65b9ca1
Why do I ask? Because I'm trying to create an interface woebin_ctree that combines the scorecard::woebin output with the breaks given by the partykit::ctree function. This tree algorithm makes its splits using <=, so I can't replicate the counts in each node. For example:
> ctree_breaks
[1] 11 15 33
But (obviously) when I use woebin with those breaks I don't get the same counts.
I tried to make similar breaks by adding a small value, 0.000001, but this is not very elegant 😅:
Do you think it is possible to add an optional argument, woebin(..., right = FALSE), to modify this behaviour when necessary?
Thanks in advance,
Kind regards,
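The difference between the two interval conventions can be shown with base::cut alone. This is a small sketch on made-up values of `x`; only `ctree_breaks` comes from the post above. It also shows that, for integer-valued data only, shifting each break by 1 reproduces the <= counts with left-closed bins, which may be a tidier workaround than adding 0.000001.

```r
# Made-up integer data; the breaks are the ones quoted above
x <- c(5, 11, 12, 15, 20, 33, 40)
ctree_breaks <- c(11, 15, 33)
brks <- c(-Inf, ctree_breaks, Inf)

# [a, b): the convention the post attributes to woebin
left_closed <- table(cut(x, brks, right = FALSE))

# (a, b]: matches splits of the form x <= break
right_closed <- table(cut(x, brks, right = TRUE))

# Integer data only: x <= 11 is the same as x < 12
shifted <- table(cut(x, c(-Inf, ctree_breaks + 1, Inf), right = FALSE))

print(left_closed)
print(right_closed)
print(shifted)
```

Values that sit exactly on a break (11, 15 and 33 here) move to the neighbouring bin when the closed side changes, which is why the node counts cannot match.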
I receive the following error with version scorecard_0.2.9.
If I run the same code with version scorecard_0.2.5.999, it runs without error.
bins = woebin(dt_f, y="creditability")
[INFO] creating woe binning ...
Error in check_y(dt, y, positive) :
Incorrect inputs; there is no "creditability" column in dt.
In addition: Warning messages:
1: In setDT(copy(dt)) :
Some columns are a multi-column type (such as a matrix column): [23, 24, 26, 29]. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.
2: In setDT(dt) :
Some columns are a multi-column type (such as a matrix column): [23, 24, 26, 29]. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.
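The warnings point at matrix columns inside the data frame. Whether they are the cause of the error is not established here, but such columns can be located and flattened with base R before calling woebin. The sketch below uses a made-up data frame; `df`, `scaled` and `flat` are hypothetical names.

```r
# Made-up data frame with one embedded matrix column
df <- data.frame(id = 1:3, creditability = c("good", "bad", "good"))
df$scaled <- matrix(c(0.1, 0.2, 0.3, 1, 2, 3), nrow = 3)

# Locate columns that are themselves matrices
is_matrix_col <- vapply(df, is.matrix, logical(1))
print(which(is_matrix_col))

# Rebuild the data frame so each embedded column becomes its own column
flat <- do.call(data.frame, df)
print(names(flat))
```

Columns produced by scale() are a common source of this pattern, since scale() returns a matrix.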
Hello, Mr. Xie. As a risk control officer, I like your package very much, but I think this may be a small bug. The data.table returned by gains_table has to be followed by [] to print; otherwise the name of the result has to be entered twice before anything is printed. You may need to add [] to the data.table returned at the end of the gains_table function.
Hi Shichen,
for the factor personal.status.and.sex, I find that the germancredit data erroneously classifies all cases from the factor level male : divorced/separated as female : divorced/separated/married. The female : single category does indeed appear to be empty even in the original data, but the male : divorced/separated category is not.
Best, Ulrike
Hi, useful package. Are there any plans to include frequency weights throughout the package? That is, where the data is a sample taken from a large population (say 1 in 2 bads and 1 in 10 goods) and therefore contains a weight field holding 2 (for bads) or 10 (for goods). The weights should influence the binning, the WoE calculations and the glm processes. However, this may cause problems with glm, which can have issues with weights, so the package may need to be adapted to use svyglm. Thanks.
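To make the request concrete, a frequency-weighted WoE can be computed by summing weights instead of counting rows. This is only a base R sketch on made-up data, not a feature of the package; the column names `bin`, `y` and `w` are hypothetical, and WoE is taken as log(bad share / good share).

```r
# Made-up sample: bads carry weight 2, goods carry weight 10
dat <- data.frame(
  bin = c("a", "a", "a", "b", "b", "b"),
  y   = c(1, 0, 0, 1, 1, 0),
  w   = c(2, 10, 10, 2, 2, 10)
)

# Weighted bads and goods per bin: sum of weights rather than row counts
bad  <- tapply(dat$w * (dat$y == 1), dat$bin, sum)
good <- tapply(dat$w * (dat$y == 0), dat$bin, sum)

# Weighted WoE per bin
woe <- log((bad / sum(bad)) / (good / sum(good)))
print(woe)
```

Since the weight is constant within each class in this example, it cancels out of each class share, so the WoE for a fixed set of bins equals the unweighted one. The weights would matter for count-based steps such as minimum bin size and for weights that vary within a class.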
The code below does not work; it gives the following error:
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
Error in terms.formula(formula, data = data) :
duplicated name 'NA' in data frame using '.'
thank you!
Hi,
I am getting an error in the gains_table function:
Error in .subset2(x, i, exact = exact) :
recursive indexing failed at level 2
thanks!
I have been using scorecard for some time and it is excellent, but yesterday I ran an R program that I run monthly and it gave the following error:
report(list(train = dt_list$train, test = dt_list$test), y = 'Reclasificado',
x = cols_not_remove, breaks_list = breaks_adj, special_values = NULL,
seed = seed, save_report='report1', show_plot = c('ks', 'lift', 'gain',
'roc', 'lz', 'pr', 'f1', 'density'),
bin_type = 'width')
[INFO] sheet1-dataset information
[INFO] sheet2-model coefficients
[INFO] sheet3-model performance
[INFO] sheet4-variable woe binning
[INFO] sheet5-scorecard
[INFO] sheet6-population stability
Error in setnames(psi_tbl, gains_table_cols) :
Can't assign 12 names to a 13 column data.table
Thank you in advance.
Hi @ShichenXie
I'm using scorecard 0.3.3, but the breaks_list argument of the woebin function doesn't work.
# Issue scorecard 0.3.3
library(readr)
library(scorecard)
suppressPackageStartupMessages(library(dplyr))
packageVersion("scorecard")
#> [1] '0.3.3'
path <- "https://gist.githubusercontent.com/ijrossi/b864820a14fd2b51ac21574841faaa3e/raw/5299e08b3bd68f6854a92979c716421d5bb5ba1e/data_issue_woebin.txt"
data <- read.csv(path, sep=";")
head(data)
#> y x
#> 1 69 0
#> 2 69 0
#> 3 68 0
#> 4 68 0
#> 5 53 0
#> 6 69 0
new_brks <- list(
y = c("25", "40", "Inf")
)
scorecard::woebin(dt = data, y = "x", x = "y")
#> [INFO] creating woe binning ...
#> $y
#> variable bin count count_distr neg pos posprob woe
#> 1: y [-Inf,28) 18116 0.05540536 14843 3273 0.18066902 0.59207398
#> 2: y [28,32) 33390 0.10211884 28616 4774 0.14297694 0.31311388
#> 3: y [32,40) 59948 0.18334292 52451 7497 0.12505838 0.15851890
#> 4: y [40,52) 86865 0.26566495 77730 9135 0.10516318 -0.03723274
#> 5: y [52, Inf) 128653 0.39346794 117784 10869 0.08448307 -0.27904238
#> bin_iv total_iv breaks is_special_values
#> 1: 0.0243579405 0.06838722 28 FALSE
#> 2: 0.0113045355 0.06838722 32 FALSE
#> 3: 0.0049008012 0.06838722 40 FALSE
#> 4: 0.0003629555 0.06838722 52 FALSE
#> 5: 0.0274609867 0.06838722 Inf FALSE
# 'breaks_list' works fine with scorecard 0.3.2
# but it doesn't work with scorecard 0.3.3
scorecard::woebin(dt = data, y = "x", x = "y", breaks_list = new_brks)
#> [INFO] creating woe binning ...
#> $y
#> variable bin count count_distr neg pos posprob woe bin_iv
#> 1: y [-Inf, Inf) 326972 1 291424 35548 0.1087188 0 0
#> total_iv breaks is_special_values
#> 1: 0 Inf FALSE
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Chile.1252 LC_CTYPE=Spanish_Chile.1252
#> [3] LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Chile.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.6 scorecard_0.3.3 readr_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] zip_2.2.0 Rcpp_1.0.7 compiler_4.1.0 pillar_1.6.1
#> [5] highr_0.9 iterators_1.0.13 tools_4.1.0 digest_0.6.27
#> [9] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2 gtable_0.3.0
#> [13] pkgconfig_2.0.3 rlang_0.4.11 openxlsx_4.2.3 reprex_2.0.0
#> [17] foreach_1.5.1 DBI_1.1.1 cli_2.5.0 rstudioapi_0.13
#> [21] parallel_4.1.0 yaml_2.2.1 xfun_0.23 gridExtra_2.3
#> [25] withr_2.4.2 stringr_1.4.0 knitr_1.33 generics_0.1.0
#> [29] fs_1.5.0 vctrs_0.3.8 hms_1.1.0 tidyselect_1.1.1
#> [33] grid_4.1.0 glue_1.4.2 data.table_1.14.0 R6_2.5.0
#> [37] fansi_0.5.0 rmarkdown_2.8 purrr_0.3.4 ggplot2_3.3.3
#> [41] magrittr_2.0.1 scales_1.1.1 ps_1.6.0 codetools_0.2-18
#> [45] ellipsis_0.3.2 htmltools_0.5.1.1 assertthat_0.2.1 colorspace_2.0-1
#> [49] utf8_1.2.1 stringi_1.6.1 doParallel_1.0.16 munsell_0.5.0
#> [53] crayon_1.4.1
Now, the same code using an older version of scorecard. The breaks_list parameter works fine!
# Issue scorecard 0.3.3
library(readr)
library(scorecard)
suppressPackageStartupMessages(library(dplyr))
packageVersion("scorecard")
#> [1] '0.3.2'
path <- "https://gist.githubusercontent.com/ijrossi/b864820a14fd2b51ac21574841faaa3e/raw/5299e08b3bd68f6854a92979c716421d5bb5ba1e/data_issue_woebin.txt"
data <- read.csv(path, sep=";")
head(data)
#> y x
#> 1 69 0
#> 2 69 0
#> 3 68 0
#> 4 68 0
#> 5 53 0
#> 6 69 0
new_brks <- list(
y = c("25", "40", "Inf")
)
scorecard::woebin(dt = data, y = "x", x = "y")
#> [INFO] creating woe binning ...
#> $y
#> variable bin count count_distr neg pos posprob woe
#> 1: y [-Inf,28) 18116 0.05540536 14843 3273 0.18066902 0.59207398
#> 2: y [28,32) 33390 0.10211884 28616 4774 0.14297694 0.31311388
#> 3: y [32,40) 59948 0.18334292 52451 7497 0.12505838 0.15851890
#> 4: y [40,52) 86865 0.26566495 77730 9135 0.10516318 -0.03723274
#> 5: y [52, Inf) 128653 0.39346794 117784 10869 0.08448307 -0.27904238
#> bin_iv total_iv breaks is_special_values
#> 1: 0.0243579405 0.06838722 28 FALSE
#> 2: 0.0113045355 0.06838722 32 FALSE
#> 3: 0.0049008012 0.06838722 40 FALSE
#> 4: 0.0003629555 0.06838722 52 FALSE
#> 5: 0.0274609867 0.06838722 Inf FALSE
# 'breaks_list' works fine with scorecard 0.3.2
# but it doesn't work with scorecard 0.3.3
scorecard::woebin(dt = data, y = "x", x = "y", breaks_list = new_brks)
#> [INFO] creating woe binning ...
#> $y
#> variable bin count count_distr neg pos posprob woe
#> 1: y [-Inf,25) 7018 0.02146361 5667 1351 0.19250499 0.6700805
#> 2: y [25,40) 104436 0.31940350 90243 14193 0.13590141 0.2541382
#> 3: y [40, Inf) 215518 0.65913289 195514 20004 0.09281823 -0.1758044
#> bin_iv total_iv breaks is_special_values
#> 1: 0.01243606 0.05422201 25 FALSE
#> 2: 0.02277098 0.05422201 40 FALSE
#> 3: 0.01901497 0.05422201 Inf
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Chile.1252 LC_CTYPE=Spanish_Chile.1252
#> [3] LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Chile.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.6 scorecard_0.3.2 readr_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] zip_2.2.0 Rcpp_1.0.7 compiler_4.1.0 pillar_1.6.1
#> [5] highr_0.9 iterators_1.0.13 tools_4.1.0 digest_0.6.27
#> [9] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2 gtable_0.3.0
#> [13] pkgconfig_2.0.3 rlang_0.4.11 openxlsx_4.2.3 reprex_2.0.0
#> [17] foreach_1.5.1 DBI_1.1.1 cli_2.5.0 rstudioapi_0.13
#> [21] parallel_4.1.0 yaml_2.2.1 xfun_0.23 gridExtra_2.3
#> [25] withr_2.4.2 stringr_1.4.0 knitr_1.33 generics_0.1.0
#> [29] fs_1.5.0 vctrs_0.3.8 hms_1.1.0 tidyselect_1.1.1
#> [33] grid_4.1.0 glue_1.4.2 data.table_1.14.0 R6_2.5.0
#> [37] fansi_0.5.0 rmarkdown_2.8 purrr_0.3.4 ggplot2_3.3.3
#> [41] magrittr_2.0.1 scales_1.1.1 ps_1.6.0 codetools_0.2-18
#> [45] ellipsis_0.3.2 htmltools_0.5.1.1 assertthat_0.2.1 colorspace_2.0-1
#> [49] utf8_1.2.1 stringi_1.6.1 doParallel_1.0.16 munsell_0.5.0
#> [53] crayon_1.4.1
I am using version 3.4.4 of scorecard.
First of all, thank you for your useful package. I have a question: I don't quite understand what kind of adjustment the function woebin applies when a category contains a class with zero frequency. How are the WoE values calculated in that case?
I cannot install this package in either RStudio or Visual Studio with R Tools. The package is supposedly "unavailable". Please advise.
Best,
ML