modeloriented / drifter

Concept Drift and Concept Shift Detection for Predictive Models

Home Page: https://modeloriented.github.io/drifter/

R 100.00%

concept-drift concept-shift drwhy predictive-modeling

drifter's Issues

calculate_distance() possible improvement...

Hi Przemek,

For data.frames with more than 200k rows, there is a significant opportunity to speed up this function, which is the core of calculate_covariate_drift().

Instead of using rank() here:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }

It would be much faster to use frank() from the data.table package:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(data.table::frank(c(variable_old, variable_new)), bins)
  }

After that step there is another calculation based on table() that can also be sped up significantly with data.table. If you are willing to add a data.table dependency to drifter, I will open a PR with these changes. With them, the calculation on my comparison data.frame went from several minutes to barely 30 seconds.

Thanks,
Carlos.
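
The suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions, not drifter's actual code: fast_rank(), bin_proportions() and tv_distance() are hypothetical helper names, data.table is treated as an optional dependency with a fallback to base rank(), the counting step uses base tabulate() as a dependency-free stand-in for the data.table calculation mentioned in the issue, and total variation distance is used only as a placeholder measure.

```r
# Sketch only: these helpers are hypothetical and are not exported by drifter.

# Use data.table::frank() when it is installed, otherwise fall back to rank().
# Both default to ties.method = "average", so for NA-free input the ranks agree.
fast_rank <- function(x) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    data.table::frank(x)
  } else {
    rank(x)
  }
}

bin_proportions <- function(variable_old, variable_new, bins = 20) {
  n_old <- length(variable_old)
  n_new <- length(variable_new)
  # Rank the pooled sample and cut the ranks into equal-width bins.
  after_cuts <- cut(fast_rank(c(variable_old, variable_new)), bins)
  codes <- as.integer(after_cuts)
  # tabulate() counts integer codes directly, which is cheaper than table().
  old_counts <- tabulate(codes[seq_len(n_old)], nbins = bins)
  new_counts <- tabulate(codes[n_old + seq_len(n_new)], nbins = bins)
  cbind(old = old_counts / n_old, new = new_counts / n_new)
}

# Total variation distance, used here only as a stand-in measure.
tv_distance <- function(prop) {
  0.5 * sum(abs(prop[, "old"] - prop[, "new"]))
}

set.seed(1)
x_old <- rnorm(2e5)
x_new <- rnorm(2e5, mean = 0.3)

prop <- bin_proportions(x_old, x_new)
colSums(prop)      # each column sums to 1
tv_distance(prop)  # positive, because the two samples differ
```

Keeping data.table behind requireNamespace() is one way to get the speed-up without making it a hard dependency; whether that trade-off suits the package is a maintainer decision.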

question about calculate_distance()

I was wondering about calculate_distance() and the after_cuts object.

How will it behave if one calls calculate_distance() on factor vectors with different numbers of levels?

For example, suppose the train and test samples come from outside R, and one of the factor variables in the test sample (variable_new) is missing some values, so it does not have the same number of levels as the variable in the train sample (variable_old). Then, if I'm not mistaken, the c() call that creates after_cuts will encode the values differently than it should.

Example:

variable_old <- apartments[, 6]
variable_new <- filter(apartments_test, district != "Praga")[, 6]
variable_new2 <- droplevels(variable_new)

length(levels(variable_new))
[1] 10
length(levels(variable_new2))
[1] 9

calculate_distance(variable_old,variable_new)
[1] 0.092
calculate_distance(variable_old,variable_new2)
[1] 0.097

If that's indeed a problem, then maybe the change proposed below would solve it?

after_cuts <- as.factor(c(as.character(variable_old),as.character(variable_new)))
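
A small base-R sketch of why the integer codes can disagree, and of the character-based fix proposed above. The toy vectors are made up for illustration and stand in for the district columns. The explicit levels argument is an addition of this sketch, not part of the proposal. Note also that, to my understanding, c() on factors only drops to integer codes in R releases before 4.1.0; newer releases combine the levels, so the mismatch is demonstrated here through the codes directly.

```r
# Hypothetical toy data: the new sample has no "Praga" observations.
variable_old <- factor(c("Praga", "Wola", "Mokotow", "Wola"))
variable_new <- factor(c("Wola", "Mokotow", "Mokotow"))

levels(variable_old)  # "Mokotow" "Praga" "Wola"
levels(variable_new)  # "Mokotow" "Wola"

# The same label gets a different integer code in each factor:
as.integer(variable_old)  # 2 3 1 3
as.integer(variable_new)  # 2 1 1

# Reading the new codes against the old levels mislabels "Wola" as "Praga".
levels(variable_old)[as.integer(variable_new)]

# Combining via character values keeps every label intact. Passing the union
# of levels explicitly also keeps levels that one sample never observed.
all_levels <- union(levels(variable_old), levels(variable_new))
after_cuts <- factor(
  c(as.character(variable_old), as.character(variable_new)),
  levels = all_levels
)

sample_id <- rep(c("old", "new"),
                 times = c(length(variable_old), length(variable_new)))
counts <- table(after_cuts, sample_id)
counts
```

With this encoding, "Praga" shows up as a zero count in the new sample instead of shifting the codes of the remaining levels.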

#bins recommendation for large dataset...

Hello,

First of all, thanks a lot for your package.

I think it is the only one that lets you calculate concept drift between two datasets.
I plan to use it on a large dataset (several million rows) and would like to know what you recommend for the bins parameter in either calculate_covariate_drift() or calculate_distance().

Thanks in anticipation,
Carlos
