Developer, Statistician, Data Scientist and Agile Team Player.
polkas / miceFast
R environment - fast imputations :dragon:
Home Page: https://polkas.github.io/miceFast/
For each observation NA at once, for big sets sample frac
For all set
Hi there,
I'm not sure if this is intended behaviour, but when I'm trying to predict a field I only get predictions when all the other fields for that record are complete (when using fill_NA or fill_NA_N). Here's my reproducible example. I'm expecting air_miss$Solar.R_imp[5] not to be NA. This gets filled when using naive_fill_NA(), but your documentation suggests not to use that function:
library(miceFast)
library(data.table)
library(dplyr)
data(air_miss)
air_miss <- air_miss %>%
  select(Ozone:Temp) %>%
  head(10)
air_miss[, Solar.R_imp := fill_NA(.SD,
                                  model = "lm_bayes",
                                  posit_y = "Solar.R",
                                  posit_x = c("Ozone", "Wind", "Temp"))]
print(air_miss)
> Ozone Solar.R Wind Temp Solar.R_imp
> 1: 41 190 7.4 67 190.00
> 2: 36 118 8.0 72 118.00
> 3: 12 149 12.6 74 149.00
> 4: 18 313 11.5 62 313.00
> 5: NA NA 14.3 56 NA
> 6: 28 NA 14.9 66 -1187.08
> 7: 23 299 8.6 65 299.00
> 8: 19 99 13.8 59 99.00
> 9: 8 19 20.1 61 19.00
> 10: NA 194 8.6 69 194.00
naive_fill_NA(air_miss)
> Ozone Solar.R Wind Temp Solar.R_imp
> 1: 41.00000 190.0000 7.4 67 190.0000
> 2: 36.00000 118.0000 8.0 72 118.0000
> 3: 12.00000 149.0000 12.6 74 149.0000
> 4: 18.00000 313.0000 11.5 62 313.0000
> 5: 15.28918 144.9681 14.3 56 312.1653
> 6: 28.00000 501.6784 14.9 66 -1187.0801
> 7: 23.00000 299.0000 8.6 65 299.0000
> 8: 19.00000 99.0000 13.8 59 99.0000
> 9: 8.00000 19.0000 20.1 61 19.0000
> 10: 21.29695 194.0000 8.6 69 194.0000
Here's my session info:
- Session info -------------------------------------------------------------------------------
setting value
version R version 4.0.2 (2020-06-22)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United Kingdom.1252
ctype English_United Kingdom.1252
tz Europe/London
date 2020-07-09
- Packages -----------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
codetools 0.2-16 2018-12-24 [1] CRAN (R 4.0.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
data.table * 1.12.8 2019-12-09 [1] CRAN (R 4.0.0)
dplyr * 1.0.0 2020-05-29 [1] CRAN (R 4.0.0)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
miceFast * 0.6.1 2020-07-06 [1] CRAN (R 4.0.2)
pillar 1.4.4 2020-05-05 [1] CRAN (R 4.0.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0)
rstudioapi 0.11 2020-02-07 [1] CRAN (R 4.0.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
vctrs 0.3.1 2020-06-05 [1] CRAN (R 4.0.0)
withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
Any help would be great.
Thank you
The vector is pre-sorted and then searched with a binary search
#include <Rcpp.h>
#include <algorithm>
using namespace std;
using namespace Rcpp;
// Return the largest index i in [low, high] with arr[i] <= x,
// or low when every element is greater than x.
// arr must be sorted in ascending order.
int findCrossOver(NumericVector arr, int low, int high, double x)
{
  if (arr[high] <= x) // x is greater than or equal to all elements
    return high;
  if (arr[low] > x) // x is smaller than all elements
    return low;
  // Invariant from here on: arr[low] <= x < arr[high]
  while (high - low > 1) {
    int mid = low + (high - low) / 2; // avoids overflow of low + high
    if (arr[mid] <= x)
      low = mid;
    else
      high = mid;
  }
  return low;
}
// Draw one value at random from the k elements of the sorted vector arr
// that are closest to x. Assumes 1 <= k <= arr.size(), as ensured by neibo().
double Kclosestrand(NumericVector arr, double x, int k)
{
  int n = arr.size();
  // Find the crossover point: the last element that is <= x
  int l = findCrossOver(arr, 0, n-1, x);
  int r = l + 1; // Right index to search; starting at l would collect arr[l] twice
  int count = 0; // Number of neighbours collected so far
  NumericVector resus(k);
  // Compare elements on the left and right of the crossover
  // point to collect the k closest elements
  while (l >= 0 && r < n && count < k)
  {
    if (x - arr[l] < arr[r] - x)
      resus[count] = arr[l--];
    else
      resus[count] = arr[r++];
    count++;
  }
  // If there are no more elements on the right side,
  // take the remaining elements from the left
  while (count < k && l >= 0)
    resus[count++] = arr[l--];
  // If there are no more elements on the left side,
  // take the remaining elements from the right
  while (count < k && r < n)
    resus[count++] = arr[r++];
  // Use R's RNG instead of rand() so that set.seed() is respected
  int goal = static_cast<int>(R::runif(0, k));
  return resus[goal];
}
// [[Rcpp::export]]
NumericVector neibo(NumericVector y, NumericVector miss, int k) {
  int n_y = y.size();
  if (n_y == 0) stop("y must contain at least one value");
  // Clamp k to the range [1, n_y]
  k = (k <= n_y) ? k : n_y;
  k = (k >= 1) ? k : 1;
  // Sort a copy so the caller's vector is left untouched
  NumericVector y_new = clone(y);
  std::sort(y_new.begin(), y_new.end());
  int n_miss = miss.size();
  NumericVector resus(n_miss);
  for (int i = 0; i < n_miss; i++) {
    double mm = miss[i];
    resus[i] = Kclosestrand(y_new, mm, k);
  }
  return resus;
}
/* Driver program to check above functions */
/*** R
vals = rnorm(100)
ss = rnorm(100)
neibo(vals,ss,2)[1:10]
vals[mice:::matcher(vals,ss,2)][1:10]
microbenchmark::microbenchmark(neibo(vals,ss,2),
mice:::matcher(vals,ss,2)
)
vals = rnorm(10000)
ss = rnorm(1000)
neibo(vals,ss,2)[1:10]
vals[mice:::matcher(vals,ss,2)][1:10]
microbenchmark::microbenchmark(neibo(vals,ss,2),
mice:::matcher(vals,ss,2)
)
vals = rnorm(10000)
ss = rnorm(1000)
neibo(vals,ss,200)[1:10]
vals[mice:::matcher(vals,ss,200)][1:10]
microbenchmark::microbenchmark(neibo(vals,ss,200),
mice:::matcher(vals,ss,200)
)
vals = 1:10000
ss = 1:100
neibo(vals,ss,2)[1:10]
vals[mice:::matcher(vals,ss,2)][1:10]
microbenchmark::microbenchmark(neibo(vals,ss,2),
mice:::matcher(vals,ss,2)
)
*/
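The same idea (sort once, then binary-search each query) can also be sketched in plain standard C++ with std::upper_bound. This is only a minimal illustration and not the package's implementation; the name k_closest and its signature are hypothetical, and the input is assumed to be sorted in ascending order already.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of miceFast): return the k values of the
// ascending-sorted vector `sorted` that are closest to x, nearest first.
std::vector<double> k_closest(const std::vector<double>& sorted, double x, std::size_t k) {
    k = std::min(k, sorted.size());
    // First element strictly greater than x; everything before it is <= x.
    auto right = std::upper_bound(sorted.begin(), sorted.end(), x);
    auto left = right; // one past the next candidate on the left side
    std::vector<double> out;
    out.reserve(k);
    while (out.size() < k) {
        bool has_left = left != sorted.begin();
        bool has_right = right != sorted.end();
        // Take from the left when it is strictly closer or the right is exhausted.
        if (has_left && (!has_right || x - *(left - 1) < *right - x)) {
            out.push_back(*--left);
        } else {
            out.push_back(*right++);
        }
    }
    return out;
}
```

A random draw from the returned vector would then play the role of the matched donor value. Because the two cursors start on different elements, no element can be collected twice.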
adding a diag() disturbance to X'X
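One possible reading of this item, assuming the disturbance is a small ridge term added to the diagonal of X'X before the normal equations are solved (the symbol \lambda and the notation below are assumptions, not taken from the package):

```latex
\hat{\beta}_{\lambda} = \left( X^{\top} X + \lambda I_{p} \right)^{-1} X^{\top} y,
\qquad \lambda > 0
```

For any \lambda > 0 the matrix X^{\top} X + \lambda I_{p} is positive definite, so the system stays solvable even when collinear predictors make X^{\top} X singular.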
Although fill_NA is not recommended as the most efficient solution inside miceFast, it should be improved to work faster with data.frames.
apply family more stable and faster
Step by step instruction
Installation, auto, OOP, DT and dplyr functions
Should be more concise.
Support other simple methods like previous or next non missing.
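As a rough sketch of what "previous or next non missing" could mean, the two helpers below carry the last observed value forward or the next observed value backward. The names fill_previous and fill_next are hypothetical, missing values are assumed to be encoded as NaN, and nothing here reflects an existing miceFast function.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: last observation carried forward.
// Leading missing values have no predecessor and stay missing.
std::vector<double> fill_previous(std::vector<double> x) {
    for (std::size_t i = 1; i < x.size(); ++i) {
        if (std::isnan(x[i])) x[i] = x[i - 1];
    }
    return x;
}

// Hypothetical helper: next observation carried backward.
// Trailing missing values have no successor and stay missing.
std::vector<double> fill_next(std::vector<double> x) {
    for (std::size_t i = x.size(); i-- > 1;) {
        if (std::isnan(x[i - 1])) x[i - 1] = x[i];
    }
    return x;
}
```

Both functions take the vector by value, so the caller's data is left unchanged and the filled copy is returned.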
Lines 92 to 96 in a2a6bb1
Like ggplot from vignette
Many users expect less accurate but easier-to-implement solutions. auto_fill_NA could be a well-suited proposition.
Line 316 in 5131d65
how to check a type