mattmills49 / modeler
R Package for Exploratory Data Analysis
not sure what the output should be
More inclusive of general data handling, not just modeling focused. Also, Hadley just released modelr lol
The problem this function is trying to solve: when I replace missing or NA values in a training set, I also need to apply the same replacements to a test set. So there should be a way to store that information and apply it to new data in one fell swoop, right?
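One possible shape for this, as a rough base R sketch: learn the replacement values once from the training data, keep them in a small object, and reuse that object on any new data. The names `fit_imputer` and `apply_imputer` are hypothetical and not part of the package, and only numeric columns are handled here.

```r
# Hypothetical helper: learn one replacement value per numeric column.
# `f` defaults to the mean of the non-missing values (an assumption).
fit_imputer <- function(data, f = function(x) mean(x, na.rm = TRUE)) {
  is_num <- vapply(data, is.numeric, logical(1))
  list(values = lapply(data[is_num], f))
}

# Hypothetical helper: fill NAs in `new_data` with the stored values.
apply_imputer <- function(imputer, new_data) {
  for (col in intersect(names(imputer$values), names(new_data))) {
    missing <- is.na(new_data[[col]])
    new_data[[col]][missing] <- imputer$values[[col]]
  }
  new_data
}

train <- data.frame(x = c(1, NA, 3))
test <- data.frame(x = c(NA, 10))
imputer <- fit_imputer(train)
train_filled <- apply_imputer(imputer, train)
test_filled <- apply_imputer(imputer, test)
```

The test set is filled with the training mean rather than its own, which is the point of storing the values.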
Impact Coding is the traditional categorical variable treatment described here: http://helios.mm.di.uoa.gr/~rouvas/ssi/sigkdd/sigkdd.vol3.1/barreca.pdf
The Impact Coding needs to know:
It will need to return a data frame with the levels of the category and the new estimate.
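A minimal base R sketch of what such a return value could look like. The function name `impact_code`, its arguments, and the shrinkage weight `n / (n + k)` are all assumptions made for illustration; the weighting is a simple stand-in and not necessarily the one used in the linked paper.

```r
# Hypothetical sketch: shrink each level's mean outcome toward the grand mean.
# `k` controls how strongly small levels are pulled toward the grand mean.
impact_code <- function(category, outcome, k = 10) {
  grand_mean <- mean(outcome)
  level_n <- tapply(outcome, category, length)
  level_mean <- tapply(outcome, category, mean)
  weight <- level_n / (level_n + k)
  estimate <- weight * level_mean + (1 - weight) * grand_mean
  # One row per level: the level, its count, and the new estimate
  data.frame(level = names(estimate),
             n = as.integer(level_n),
             estimate = as.numeric(estimate),
             row.names = NULL)
}

coding <- impact_code(c("a", "a", "b"), c(1, 3, 10), k = 0)
```

With `k = 0` there is no shrinkage, so the estimates reduce to the plain per-level means. The returned data frame can then be joined onto new data by level.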
This function should wrap around model.matrix but preserve all levels of any character or factor variables.
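One way this could work, sketched in base R: pass full (non-reduced) contrasts for every categorical column through the `contrasts.arg` argument of `model.matrix`. The wrapper name `full_model_matrix` is hypothetical.

```r
# Hypothetical wrapper: keep one indicator column per level instead of
# dropping a reference level for each categorical variable.
full_model_matrix <- function(data) {
  is_cat <- vapply(data, function(x) is.character(x) || is.factor(x), logical(1))
  if (!any(is_cat)) return(stats::model.matrix(~ ., data = data))
  # Characters must be factors before contrasts can be built
  data[is_cat] <- lapply(data[is_cat], as.factor)
  # contrasts = FALSE yields an identity matrix covering every level
  full_contrasts <- lapply(data[is_cat], stats::contrasts, contrasts = FALSE)
  stats::model.matrix(~ ., data = data, contrasts.arg = full_contrasts)
}

example_df <- data.frame(x = c(1, 2, 3), g = c("a", "b", "c"))
mm <- full_model_matrix(example_df)
```

Note that the resulting matrix is rank deficient when an intercept is present, which is fine for tree-based or regularized models but not for plain least squares.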
diff_time
in_table
roll_func
safe_ifelse
acf_by_group
diff_time: avoids having difftime return some lame difftime object that can't be manipulated like a plain number
diff_time <- function(...) as.numeric(difftime(...))
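in_table <- function(..., f = mean){
  vectors <- list(...)
  # For every pair of vectors (x, y), summarise how much of x appears in y
  shared_counts <- vapply(vectors, function(x, others = vectors){
    vapply(others, function(y, orig = x){
      f(orig %in% y)
    }, numeric(1))
  }, numeric(length(vectors)))
  # Label the columns with the argument expressions, converted to strings
  l <- as.list(substitute(list(...)))[-1L]
  colnames(shared_counts) <- vapply(l, function(e) paste(deparse(e), collapse = ""), character(1))
  # Transpose so that each row corresponds to the vector being looked up
  return(t(shared_counts))
}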
For example here is a quick rolling mean function
roll_mean <- function(n, vec) seq(from = 0, to = n) %>% purrr::map(~ dplyr::lag(vec, n = .x)) %>% do.call(cbind, .) %>% rowMeans(na.rm = TRUE)
however I'd like to have the ability to customize handling of NAs and creating any rolling function, possibly by using purrr's by_row function?
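As a starting point, here is a hedged base R sketch of a general rolling function that does not rely on purrr. The signature is an assumption: `f` is any summary function, extra arguments such as `na.rm` are forwarded through `...`, and positions without a full window get `fill`. Unlike the roll_mean above, the window here is exactly `n` values wide.

```r
# Hypothetical sketch of roll_func: apply `f` over a trailing window of
# width `n`. NA handling is delegated to `f` via `...`.
roll_func <- function(vec, n, f = mean, ..., fill = NA_real_) {
  stopifnot(n >= 1)
  out <- rep(fill, length(vec))
  if (length(vec) < n) return(out)
  for (i in seq(n, length(vec))) {
    out[i] <- f(vec[(i - n + 1):i], ...)
  }
  out
}

rolling_mean <- roll_func(1:5, n = 3)
rolling_max <- roll_func(c(4, 1, 3, 2), n = 2, f = max)
rolling_na <- roll_func(c(1, NA, 3), n = 2, f = mean, na.rm = TRUE)
```

A plain loop keeps the sketch easy to follow; a vectorized or compiled version would be preferable for long series.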
There should be an option to overlay a rug plot in partial_plot. This will allow the user to be wary of where the smooth curves may be extrapolating too much.
basically instead of doing this:
ezknitr::ezspin("Magic Pass Attendance/munge/pass_acceptance_model.R", out_dir = "Magic Pass Attendance/doc", fig_dir = "../graphs/pass_acceptance_model", keep_html = F, chunk_opts = list(cache.path = "Magic Pass Attendance/cache/pass_acceptance_model/"))
I could just do
spin_project("Magic Pass Attendance", "pass_acceptance_model")
and it would know where to put everything in the project_template directories
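A rough sketch of how the wrapper could derive every path from the project and script names. The directory layout (munge, doc, graphs, cache) is taken from the long call above; the helper `spin_project_args` is hypothetical and exists only so the path logic can be checked without ezknitr installed.

```r
# Hypothetical helper: build the ezspin arguments from two names.
spin_project_args <- function(project, script) {
  list(
    file = file.path(project, "munge", paste0(script, ".R")),
    out_dir = file.path(project, "doc"),
    fig_dir = file.path("..", "graphs", script),
    keep_html = FALSE,
    chunk_opts = list(cache.path = paste0(file.path(project, "cache", script), "/"))
  )
}

# Sketch of spin_project: forwards the derived arguments to ezknitr::ezspin.
spin_project <- function(project, script) {
  if (!requireNamespace("ezknitr", quietly = TRUE)) {
    stop("the ezknitr package is required", call. = FALSE)
  }
  do.call(ezknitr::ezspin, spin_project_args(project, script))
}

spin_args <- spin_project_args("Magic Pass Attendance", "pass_acceptance_model")
```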
I'd like the partial_plot function to use non-standard evaluation so the function call doesn't have to use character values. Both for the code to be cleaner and because I want to learn how NSE works.
For example here is how the function currently works:
test_model <- mgcv::gam(mpg ~ s(hp), data = mtcars)
partial_plot(test_model, "hp")
But I'd like it to drop the quotes
partial_plot(test_model, hp)
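A small base R sketch of the capturing step, using `substitute()` so that both the quoted and unquoted forms keep working. The helper name `capture_var_name` is hypothetical; how partial_plot would use the resulting string is left open.

```r
# Hypothetical helper: turn a bare name or a string into a column name.
capture_var_name <- function(var) {
  expr <- substitute(var)          # the unevaluated argument
  if (is.character(expr)) expr else deparse(expr)
}

bare <- capture_var_name(hp)       # unquoted, as in partial_plot(test_model, hp)
quoted <- capture_var_name("hp")   # quoted, as in partial_plot(test_model, "hp")
```

One caveat with this approach: a variable that holds the column name (for example `v <- "hp"` followed by `capture_var_name(v)`) yields "v", so a standard-evaluation escape hatch is usually worth keeping.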
Some important tests need to be added including:
the different smooth types (s, te, ti, t2)
a variable named balance when there are also other variables with names like balance_higher_than_500 or whatever. Does it ignore the others?
Currently the standard errors take the overall prediction standard errors (from predict(..., se = T)). I assume the standard errors shown in plot.gam are different but need to verify.
Since dplyr changed the way arrange handles group sorting, implement a function that will both group by variables and arrange them in the order listed.
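A minimal sketch of such a function, assuming dplyr is installed. The name `group_arrange` is hypothetical. It sorts the ungrouped data by the listed variables first and then groups by the same variables, so the result does not depend on how arrange treats existing groups.

```r
# Hypothetical sketch: sort by the listed variables, then group by them.
group_arrange <- function(.data, ...) {
  sorted <- dplyr::arrange(dplyr::ungroup(.data), ...)
  dplyr::group_by(sorted, ...)
}

# Usage (only runs when dplyr is available)
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_cyl_am <- group_arrange(mtcars, cyl, am)
}
```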
When you have variables with a value you want to create a dummy variable for (like missing values), automatically create the dummy variable and add it to your data frame.
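A possible base R sketch, with a hypothetical name and signature: `add_dummy` flags rows where a column equals a given value (NA by default) and appends the flag as a new 0/1 column.

```r
# Hypothetical sketch: append an indicator column for `value` in `column`.
add_dummy <- function(data, column, value = NA, suffix = "_flag") {
  x <- data[[column]]
  # NA needs is.na(); any other value uses an NA-safe equality check
  flag <- if (is.na(value)) is.na(x) else !is.na(x) & x == value
  data[[paste0(column, suffix)]] <- as.integer(flag)
  data
}

with_missing_flag <- add_dummy(data.frame(a = c(1, NA, 3)), "a")
with_value_flag <- add_dummy(data.frame(a = c(0, 5, NA)), "a", value = 0, suffix = "_is_zero")
```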
dplyr includes n_distinct to find the number of unique elements in a vector, but I sometimes want to find the number of rows among distinct combos in a data frame:
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
nrow(mtcars)
#> [1] 32
n_distinct(mtcars$cyl)
#> [1] 3
n_distinct(mtcars$am)
#> [1] 2
nrow(dplyr::distinct(mtcars, cyl, am))
#> [1] 6
# should be a simpler function call: n_distinct
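For illustration, a one-line base R helper could cover this case. The name `n_distinct_rows` and the choice to pass column names as a character vector are assumptions, not the package's API.

```r
# Hypothetical helper: count distinct combinations of the given columns.
n_distinct_rows <- function(data, columns) {
  nrow(unique(data[columns]))
}

combos_cyl_am <- n_distinct_rows(mtcars, c("cyl", "am"))
combos_cyl <- n_distinct_rows(mtcars, "cyl")
```

The first call reproduces the nrow(dplyr::distinct(mtcars, cyl, am)) result shown above.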
I don't even know how this would work. Maybe just the ability to return a multiplot object. Is there even a statistical term for partial regression plots with multiple variables? I'll have to play around some.
I'd like to have partial plots for gam, glm, lm, gbm, and randomForest models. But instead of using multiple if statements analyzing the class of the model just make S3 functions for each.
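The dispatch structure could look roughly like the sketch below. The method bodies are placeholders that only report which method ran; the real plotting code is not shown and the `variable` argument is an assumption. One thing to keep in mind is that glm objects have class c("glm", "lm") and mgcv gam objects also inherit from glm and lm, so a model without its own method falls through to the next class in line.

```r
# Generic: dispatches on the class of `model`
partial_plot <- function(model, variable, ...) UseMethod("partial_plot")

# Placeholder methods (real versions would compute and draw the partial effect)
partial_plot.lm <- function(model, variable, ...) {
  paste("lm method for", variable)
}

partial_plot.glm <- function(model, variable, ...) {
  paste("glm method for", variable)
}

# Fallback for unsupported model classes
partial_plot.default <- function(model, variable, ...) {
  stop("no partial_plot method for class ", class(model)[1], call. = FALSE)
}

lm_result <- partial_plot(lm(mpg ~ hp, data = mtcars), "hp")
glm_result <- partial_plot(glm(am ~ hp, data = mtcars, family = binomial), "hp")
```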
I'd like to be able to calculate time series info (autocorrelations, partial autocorrelations, differences, stationarity, etc.) from grouped data. Here is a quick function for grouped acf:
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::data_frame(group = sample(c("a", "b", "c"), size = 100, replace = T), value = sample.int(30, size = 100, replace = T))
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
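The same idea can be wrapped into a reusable function. Below is a hedged base R sketch of acf_by_group that avoids the tidyverse dependencies; the signature (column names passed as strings, an optional `lag.max`) is an assumption.

```r
# Sketch of acf_by_group: one row per group and lag, in long format.
acf_by_group <- function(data, group_col, value_col, lag.max = NULL) {
  pieces <- split(data[[value_col]], data[[group_col]])
  results <- lapply(names(pieces), function(g) {
    acf_values <- drop(stats::acf(pieces[[g]], lag.max = lag.max, plot = FALSE)$acf)
    data.frame(group = g,
               lag = seq_along(acf_values) - 1L,
               acf = acf_values)
  })
  do.call(rbind, results)
}

set.seed(1)
example_series <- data.frame(group = rep(c("a", "b"), each = 20),
                             value = rnorm(40))
acf_long <- acf_by_group(example_series, "group", "value", lag.max = 5)
```

Swapping `stats::acf` for `stats::pacf` would give grouped partial autocorrelations, with the caveat that pacf starts at lag 1 rather than lag 0.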