drifter's People

Contributors

kasiapekala, pbiecek

drifter's Issues

calculate_distance() possible improvement...

Hi Przemek,

For data.frames with more than 200k rows, there is a significant opportunity to speed up this function, which is the core of the calculate_covariate_fit() function.

Instead of using rank() here:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }

It would be much faster if you used frank() from the data.table package:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(frank(c(variable_old, variable_new)), bins)
  }

After that step there is another calculation based on table() that can also be sped up significantly with a data.table equivalent. If you agree to add a data.table dependency to drifter, I will open a PR with these changes. With them, the calculation on my comparison data.frame went from several minutes to barely 30 seconds.

Thanks,
Carlos.
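
Before swapping rank() for frank(), it may be worth checking that both give the same bins. The snippet below is only a sketch on simulated data (the vectors, the seed and the rank_fun helper are assumptions, not part of drifter). It uses data.table::frank() when data.table is installed and falls back to base rank() otherwise, so it runs either way.

```r
# Simulated stand-ins for an old and a new numeric variable (hypothetical data).
set.seed(1)
variable_old <- rnorm(1000)
variable_new <- rnorm(1000, mean = 0.5)
pooled <- c(variable_old, variable_new)
bins <- 20

# Pick the ranking function: frank() if data.table is available, else rank().
# Both default to ties.method = "average".
rank_fun <- if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::frank
} else {
  rank
}

# Bin the pooled ranks the same way the quoted calculate_distance() excerpt does.
cuts_base <- cut(rank(pooled), bins)
cuts_fast <- cut(rank_fun(pooled), bins)

# TRUE when every observation lands in the same bin under both rankings.
same_bins <- all(as.integer(cuts_base) == as.integer(cuts_fast))
```

Wrapping each cut() call in system.time() on a vector of realistic size would show whether the speed-up reported above also holds for your data.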

question about calculate_distance()

I was wondering about calculate_distance() and the after_cuts object.

How will it behave if one calls calculate_distance() on factor vectors with different numbers of levels?

For example, suppose the train and test samples come from outside of R. If one of the factor variables in the test sample (variable_new) is missing some values, it won't have the same number of levels as the variable in the train sample (variable_old). Then, if I'm not mistaken, the c() call that creates after_cuts will encode the two vectors inconsistently.

Example:

variable_old <- apartments[, 6]
variable_new <- filter(apartments_test, district != "Praga")[, 6]
variable_new2 <- droplevels(variable_new)

length(levels(variable_new))
[1] 10
length(levels(variable_new2))
[1] 9

calculate_distance(variable_old,variable_new)
[1] 0.092
calculate_distance(variable_old,variable_new2)
[1] 0.097

If that's indeed a problem, then maybe the change proposed below would solve it?

after_cuts <- as.factor(c(as.character(variable_old),as.character(variable_new)))
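
To illustrate why the proposed change helps, here is a small sketch on toy factors (the district values below are made up for the example and are not taken from the apartments data). It looks at the integer codes directly, because what c() does with two factors depends on the R version: as far as I know, since R 4.1.0 it returns a factor over the union of levels, while older versions returned the bare integer codes.

```r
# Toy stand-ins for the district columns (hypothetical values).
old <- factor(c("Bemowo", "Praga", "Wola", "Praga"))
new <- droplevels(factor(c("Bemowo", "Wola", "Wola"), levels = levels(old)))

# Integer codes are relative to each factor's own level set, so the same
# label can get different codes: "Wola" is level 3 in `old`, level 2 in `new`.
codes_old <- as.integer(old)
codes_new <- as.integer(new)

# Going through the character labels keeps the categories aligned,
# whatever levels each input factor happens to carry.
combined <- as.factor(c(as.character(old), as.character(new)))
counts <- table(combined)
```

Here table(combined) counts "Praga" and "Wola" separately, whereas pooling the raw codes would merge the "Wola" rows of new with the "Praga" rows of old.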

#bins recommendation for large dataset...

Hello,

First of all, thanks a lot for your package.

I think it is the only one that allows calculating concept drift between two datasets.
I plan to use it on a large dataset (a few million rows) and would like to know what value you recommend for the bins parameter in the calculate_covariate_drift() and calculate_distance() functions.

Thanks in anticipation,
Carlos
