micefast's Introduction

💫 About Me:

Developer, Statistician, Data Scientist and Agile Team Player.

🌐 Socials:

LinkedIn Stack Overflow

💻 Tech Stack:

R Python C++ Shell Script Git Rust Apache Airflow Docker

🕸️ Tech Stack:

HTML5 CSS3 TailwindCSS Bootstrap JavaScript React Django Hugo


micefast's People

Contributors

ol-oxy, polkas

micefast's Issues

Plot for NA values

Plot all observations with NA values at once; for big data sets, plot a sampled fraction.
Plot the whole data set.

Incomplete predictor fields cause NAs in prediction

Hi there,

I'm not sure if this is intended behaviour, but when I'm trying to predict a field I only get predictions when all the other fields for that record are complete (when using fill_NA or fill_NA_N). Here's my reproducible example. I'm expecting air_miss$Solar.R_imp[5] not to be NA. This gets filled when using naive_fill_NA() but your documentation suggests not to use that function:

library(miceFast)
library(data.table)
library(dplyr)

data(air_miss)

air_miss <- air_miss %>% 
  select(Ozone:Temp) %>% 
  head(10)


air_miss[, Solar.R_imp := fill_NA(.SD,
                                  model = "lm_bayes",
                                  posit_y = "Solar.R",
                                  posit_x = c("Ozone", "Wind", "Temp"))]

print(air_miss)

>     Ozone Solar.R Wind Temp Solar.R_imp
>  1:    41     190  7.4   67      190.00
>  2:    36     118  8.0   72      118.00
>  3:    12     149 12.6   74      149.00
>  4:    18     313 11.5   62      313.00
>  5:    NA      NA 14.3   56          NA
>  6:    28      NA 14.9   66    -1187.08
>  7:    23     299  8.6   65      299.00
>  8:    19      99 13.8   59       99.00
>  9:     8      19 20.1   61       19.00
> 10:    NA     194  8.6   69      194.00

naive_fill_NA(air_miss)

>        Ozone  Solar.R Wind Temp Solar.R_imp
>  1: 41.00000 190.0000  7.4   67    190.0000
>  2: 36.00000 118.0000  8.0   72    118.0000
>  3: 12.00000 149.0000 12.6   74    149.0000
>  4: 18.00000 313.0000 11.5   62    313.0000
>  5: 15.28918 144.9681 14.3   56    312.1653
>  6: 28.00000 501.6784 14.9   66  -1187.0801
>  7: 23.00000 299.0000  8.6   65    299.0000
>  8: 19.00000  99.0000 13.8   59     99.0000
>  9:  8.00000  19.0000 20.1   61     19.0000
> 10: 21.29695 194.0000  8.6   69    194.0000

Here's my session info:

- Session info -------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/London               
 date     2020-07-09                  

- Packages -----------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
 codetools     0.2-16  2018-12-24 [1] CRAN (R 4.0.2)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
 data.table  * 1.12.8  2019-12-09 [1] CRAN (R 4.0.0)
 dplyr       * 1.0.0   2020-05-29 [1] CRAN (R 4.0.0)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 generics      0.0.2   2018-11-29 [1] CRAN (R 4.0.0)
 glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.0)
 lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
 miceFast    * 0.6.1   2020-07-06 [1] CRAN (R 4.0.2)
 pillar        1.4.4   2020-05-05 [1] CRAN (R 4.0.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
 Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
 rlang         0.4.6   2020-05-02 [1] CRAN (R 4.0.0)
 rstudioapi    0.11    2020-02-07 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
 tibble        3.0.1   2020-04-20 [1] CRAN (R 4.0.0)
 tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
 vctrs         0.3.1   2020-06-05 [1] CRAN (R 4.0.0)
 withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.0)

Any help would be great.
Thank you

Faster implementation of PMM (predictive mean matching), potentially more than 1000x faster than the mice solution

The vector is sorted once, and the k closest values to each target are then located with a binary search.


#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// Returns the index of the last element of the sorted vector arr that is
// <= x (the crossover point), searching within arr[low..high].
// Returns low when every element is greater than x.
int findCrossOver(const NumericVector& arr, int low, int high, double x) 
{ 
  if (arr[high] <= x) // x is greater than or equal to all 
    return high; 
  if (arr[low] > x) // x is smaller than all 
    return low; 
  
  // Find the middle point (written this way to avoid integer overflow)
  int mid = low + (high - low) / 2;
  
  // If x lies between arr[mid] and arr[mid+1], mid is the crossover point
  if (arr[mid] <= x && arr[mid + 1] > x) 
    return mid; 
  
  // If arr[mid] is not greater than x, the crossover point lies in
  // arr[mid+1...high]; using <= keeps this correct when values repeat
  if (arr[mid] <= x) 
    return findCrossOver(arr, mid + 1, high, x); 
  
  return findCrossOver(arr, low, mid - 1, x); 
} 


// Returns one value drawn at random from the k elements of the sorted
// vector arr that are closest to x.
double Kclosestrand(const NumericVector& arr, double x, int k) 
{ 
  int n = arr.size();
  // Find the crossover point 
  int l = findCrossOver(arr, 0, n - 1, x); // Left index to search 
  int r = l + 1; // Right index to search; l + 1 so no element is taken twice 
  int count = 0; // Number of neighbours collected so far 
  NumericVector resus(k);

  // Compare elements on left and right of crossover 
  // point to find the k closest elements 
  while (l >= 0 && r < n && count < k) 
  { 
    if (x - arr[l] < arr[r] - x) 
      resus[count] = arr[l--]; 
    else
      resus[count] = arr[r++]; 
    count++; 
  } 
  
  // If there are no more elements on right side, then 
  // take the remaining neighbours from the left 
  while (count < k && l >= 0) 
    resus[count++] = arr[l--]; 
  
  // If there are no more elements on left side, then 
  // take the remaining neighbours from the right 
  while (count < k && r < n) 
    resus[count++] = arr[r++]; 
  
  // Draw the donor with R's RNG (instead of rand()) so set.seed() is respected
  int goal = static_cast<int>(R::runif(0.0, 1.0) * k);
  
  return resus[goal];
}


// [[Rcpp::export]]
NumericVector neibo(NumericVector y, NumericVector miss, int k) {
  int n_y = y.size();
  if (n_y == 0) stop("y must contain at least one value");
  
  // Clamp k to the range [1, n_y]
  k = (k <= n_y) ? k : n_y;
  k = (k >= 1) ? k : 1;
  
  // Sort a copy so the caller's vector is left untouched
  NumericVector y_new = clone(y);
  std::sort(y_new.begin(), y_new.end());
  
  int n_miss = miss.size();
  NumericVector resus(n_miss);
  
  for (int i = 0; i < n_miss; i++) {
    resus[i] = Kclosestrand(y_new, miss[i], k);
  } 
    
  return resus;
}

/* Driver program to check above functions */

/*** R

vals = rnorm(100)

ss = rnorm(100)

neibo(vals,ss,2)[1:10]

vals[mice:::matcher(vals,ss,2)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,2),
                               
                               mice:::matcher(vals,ss,2)
)

vals = rnorm(10000)

ss = rnorm(1000)

neibo(vals,ss,2)[1:10]

vals[mice:::matcher(vals,ss,2)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,2),
                               
                               mice:::matcher(vals,ss,2)
)


vals = rnorm(10000)

ss = rnorm(1000)

neibo(vals,ss,200)[1:10]

vals[mice:::matcher(vals,ss,200)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,200),
                               
                               mice:::matcher(vals,ss,200)
)


vals = 1:10000

ss = 1:100

neibo(vals,ss,2)[1:10]

vals[mice:::matcher(vals,ss,2)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,2),
                               
                               mice:::matcher(vals,ss,2)
)
*/
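
The pre-sort and binary-search idea above can also be sketched without Rcpp. The version below is only an illustration under stated assumptions: it uses plain std::vector and std::upper_bound from the standard library, and the names k_closest and match_donors are hypothetical helpers, not functions from miceFast or mice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: return the k values of `sorted` closest to x.
// Assumes `sorted` is in ascending order (checked in debug builds).
std::vector<double> k_closest(const std::vector<double>& sorted, double x,
                              std::size_t k) {
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  std::vector<double> out;
  if (sorted.empty() || k == 0) return out;
  k = std::min(k, sorted.size());
  out.reserve(k);

  // Binary search: `right` is the first element strictly greater than x,
  // so every element before `left` is <= x.
  auto right = std::upper_bound(sorted.begin(), sorted.end(), x);
  auto left = right;

  // Expand outwards from the crossover point, always taking the nearer side.
  while (out.size() < k) {
    bool has_left = left != sorted.begin();
    bool has_right = right != sorted.end();
    if (has_left && (!has_right || x - *(left - 1) <= *right - x)) {
      out.push_back(*--left);
    } else {
      out.push_back(*right++);
    }
  }
  return out;
}

// Hypothetical helper: for every target, draw one donor at random
// among its k nearest donor values.
std::vector<double> match_donors(std::vector<double> donors,
                                 const std::vector<double>& targets,
                                 std::size_t k, std::mt19937& rng) {
  std::vector<double> result;
  if (donors.empty() || k == 0) return result;
  std::sort(donors.begin(), donors.end());  // sort once, search many times
  result.reserve(targets.size());
  for (double t : targets) {
    std::vector<double> near = k_closest(donors, t, k);
    std::uniform_int_distribution<std::size_t> pick(0, near.size() - 1);
    result.push_back(near[pick(rng)]);
  }
  return result;
}
```

With the sorted donors {1, 2, 4, 7, 11}, k_closest(donors, 5.0, 2) collects 4 and 7. Sorting costs O(n log n) once, and each lookup then costs O(log n + k), which is where the speed-up over scanning all donors for every target would come from.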

Lack of an 'auto' function

Many users expect a solution that is easier to use, even if it is less accurate. An auto_fill_NA function could be a well-suited proposition.
