modeloriented / drifter
Concept Drift and Concept Shift Detection for Predictive Models
Home Page: https://modeloriented.github.io/drifter/
Hi Przemek,
For data.frames with more than 200k rows, there is a significant opportunity to speed up the function below, which is the core of the calculate_covariate_fit() function.
Instead of using rank() here:
calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }
It would be much faster with frank() from the data.table package:
calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(frank(c(variable_old, variable_new)), bins)
  }
After that step there is another calculation, based on table(), that can also be sped up significantly with a data.table calculation. If you are willing to add a data.table dependency to drifter, I will open a PR with these changes. With them, the calculation on my comparison data.frame went from several minutes to barely 30 seconds.
Thanks,
Carlos.
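A minimal sketch of the kind of check that could back this proposal, assuming data.table is installed. The synthetic vectors and the names cuts_rank, cuts_frank, counts_base, and counts_dt are illustrative and not part of drifter; the sketch only confirms that frank() reproduces the rank() binning and shows how both steps could be timed.

```r
# Illustrative only: synthetic data, not the drifter implementation.
library(data.table)

set.seed(1)
variable_old <- rnorm(1e5)
variable_new <- rnorm(1e5, mean = 0.2)
bins <- 20

combined <- c(variable_old, variable_new)

# Binning step as in the issue: once with base rank(), once with frank().
# Both default to ties.method = "average".
time_rank  <- system.time(cuts_rank  <- cut(rank(combined), bins))
time_frank <- system.time(cuts_frank <- cut(frank(combined), bins))

# The two binnings should assign every observation to the same bin.
same_bins <- all(as.integer(cuts_rank) == as.integer(cuts_frank))

# Counting step: base table() versus a grouped count in data.table.
counts_base <- table(cuts_rank)
counts_dt   <- data.table(bin = cuts_frank)[, .N, by = bin]

print(rbind(rank = time_rank, frank = time_frank))
print(same_bins)
```

Timings depend on the machine and the data, so the printed numbers are only indicative; the equality check is the part that matters before swapping one function for the other.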
I was wondering about calculate_distance() and the after_cuts object.
How will it behave if one calls calculate_distance() on factor vectors with different numbers of levels?
For example, suppose the train and test samples are provided from outside R, and one of the factor variables in the test sample (variable_new) is missing some values, so it has fewer levels than the same variable in the train sample (variable_old). Then, if I'm not mistaken, the c() call that creates after_cuts will encode them differently than it should.
Example:
variable_old <- apartments[, 6]
variable_new <- filter(apartments_test, district != "Praga")[, 6]
variable_new2 <- droplevels(variable_new)
length(levels(variable_new))
[1] 10
length(levels(variable_new2))
[1] 9
calculate_distance(variable_old,variable_new)
[1] 0.092
calculate_distance(variable_old,variable_new2)
[1] 0.097
If that is indeed a problem, then maybe the change proposed below would solve it?
after_cuts <- as.factor(c(as.character(variable_old),as.character(variable_new)))
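The concern can be reproduced with a small base-R sketch. The toy vectors and the names codes_naive and combined are made up for illustration and are not taken from drifter or the apartments data. It shows why concatenating the underlying integer codes mixes up categories when the level sets differ, and that going through as.character() keeps the labels aligned. Note that c() on factors changed in R 4.1.0 to return a factor with combined levels, so the original symptom may depend on the R version in use.

```r
# Toy data: variable_new lacks the level "b" after droplevels().
variable_old <- factor(c("a", "b", "c", "a"))
variable_new <- droplevels(factor(c("a", "c", "c"), levels = c("a", "b", "c")))

# Integer codes are relative to each factor's own levels:
# "c" is code 3 in variable_old but code 2 in variable_new.
codes_naive <- c(as.integer(variable_old), as.integer(variable_new))

# Proposed fix from the issue: combine the labels, then re-encode once.
combined <- as.factor(c(as.character(variable_old), as.character(variable_new)))

print(table(codes_naive))  # counts per integer code, categories mixed up
print(table(combined))     # counts per label, categories aligned
```

With the naive codes, the two "c" values from variable_new are counted under code 2, which means "b" in variable_old; the character-based version counts them under "c".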
Hello,
First of all, thanks a lot for your package.
I think it is the only one that allows calculating concept drift between two datasets.
I plan to use it on a large dataset (several million rows), and I would like to know what value you recommend for the bins parameter in either the calculate_covariate_drift() or the calculate_distance() function.
Thanks in advance,
Carlos
In the calculate_residuals_drift() documentation, the first example is for calculate_model_drift() instead of for residuals drift.
calculate_model_drift() creates two explainers, old and new.
explainer_new is based on model_old; shouldn't it be model_new?