The modeler from mattmills49

Think of a better name

The name should cover general data handling, not just modeling. Also, Hadley just released modelr, lol.

Create a profile/prepare/track function

The problem this function is trying to solve: when I replace missing or NA values in a training set, I need to apply the same replacements to a test set. So there should be a way to store that information and apply it to new data in one step, right?
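A minimal sketch of the store-then-apply idea, in base R. The names `fit_imputer` and `apply_imputer`, and the rule "median for numeric columns, most frequent value otherwise", are assumptions for illustration and not part of the package.

```r
# Hypothetical helpers: learn fill values on the training set, reuse them later.
fit_imputer <- function(train) {
  lapply(train, function(col) {
    if (is.numeric(col)) {
      median(col, na.rm = TRUE)
    } else {
      # most frequent non-missing value (table() drops NA by default)
      names(which.max(table(col)))
    }
  })
}

apply_imputer <- function(data, fills) {
  for (nm in intersect(names(fills), names(data))) {
    missing <- is.na(data[[nm]])
    data[[nm]][missing] <- fills[[nm]]
  }
  data
}

train <- data.frame(x = c(1, NA, 3), g = c("a", "a", NA), stringsAsFactors = FALSE)
test  <- data.frame(x = c(NA, 10), g = c(NA, "b"), stringsAsFactors = FALSE)

fills <- fit_imputer(train)               # stored once, from training data only
test_filled <- apply_imputer(test, fills) # same replacements applied to new data
```

The stored `fills` list is the piece that would be saved alongside a model, so the test set never influences the replacement values.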

Impact Code Function Structure

Impact Coding is the traditional categorical variable treatment described here: http://helios.mm.di.uoa.gr/~rouvas/ssi/sigkdd/sigkdd.vol3.1/barreca.pdf

The Impact Coding needs to know:

The Dependent variable you are using and the Independent Variable to use for transformation
The type of transformation to use
If there are any hierarchical relationships (e.g. 3 digit zip code - 4 digit zip code, city - state)

It will need to return a data frame with the levels of the category and the new estimate.
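One possible shape for that return value, as a sketch. The function name `impact_code`, its arguments, and the simple `n / (n + k)` shrinkage weight are assumptions; the linked paper uses a more elaborate weighting, and the hierarchical case is left out here.

```r
# Hypothetical sketch: shrink each level's mean toward the grand mean.
impact_code <- function(data, dep_var, ind_var, k = 10) {
  y <- data[[dep_var]]
  x <- as.character(data[[ind_var]])
  grand_mean <- mean(y, na.rm = TRUE)

  level_n    <- tapply(y, x, function(v) sum(!is.na(v)))
  level_mean <- tapply(y, x, mean, na.rm = TRUE)

  # levels with few observations get pulled harder toward the grand mean
  weight   <- level_n / (level_n + k)
  estimate <- weight * level_mean + (1 - weight) * grand_mean

  data.frame(level = names(estimate),
             n = as.vector(level_n),
             estimate = as.vector(estimate),
             stringsAsFactors = FALSE)
}

codes <- impact_code(mtcars, dep_var = "mpg", ind_var = "cyl")
codes
```

Setting `k = 0` turns the shrinkage off and returns the plain per-level means.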

Write a model_matrix function

This function should wrap around model.matrix but preserve all levels of any character or factor variables.
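A rough sketch of such a wrapper, assuming the goal is one indicator column per level. The name `model_matrix_full` is hypothetical; the approach passes full (non-reduced) contrasts to `model.matrix` through `contrasts.arg`.

```r
# Hypothetical wrapper: keep a column for every level of each categorical variable.
model_matrix_full <- function(formula, data) {
  # characters become factors; existing factors keep their levels as-is
  for (nm in names(data)) {
    if (is.character(data[[nm]])) data[[nm]] <- factor(data[[nm]])
  }
  is_cat <- vapply(data, is.factor, logical(1))
  cat_vars <- intersect(names(data)[is_cat], all.vars(formula))

  # contrasts(f, contrasts = FALSE) is an identity matrix over all levels
  contrast_list <- lapply(data[cat_vars], contrasts, contrasts = FALSE)
  if (length(contrast_list) == 0) contrast_list <- NULL

  stats::model.matrix(formula, data = data, contrasts.arg = contrast_list)
}

df <- data.frame(y = c(1, 2, 3, 4), g = c("a", "b", "c", "a"))
mm <- model_matrix_full(y ~ g, df)
colnames(mm)
```

Note that with an intercept the resulting matrix is rank-deficient, which is fine for regularized or tree-based models but not for plain `lm`.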

New Helper Functions to Write

diff_time
in_table
roll_func
safe_ifelse
acf_by_group
diff_time: wraps difftime so it returns a plain numeric instead of a difftime object that is awkward to manipulate

diff_time <- function(...) as.numeric(difftime(...))

in_table: for every pair of vectors, summarizes how many elements of one are found in the other (the default, mean, gives the proportion)
- could actually use any function of two vectors that returns a scalar

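in_table <- function(..., f = mean) {
  vectors <- list(...)
  # column j holds f(x_j %in% y_i) for every vector y_i
  shared_counts <- vapply(vectors, function(x) {
    vapply(vectors, function(y) f(x %in% y), numeric(1))
  }, numeric(length(vectors)))

  # label rows and columns with the expressions that were passed in
  labels <- vapply(as.list(substitute(list(...)))[-1L], deparse, character(1))
  dimnames(shared_counts) <- list(labels, labels)
  # transpose so rows are the vector being looked up, columns the vector searched
  t(shared_counts)
}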

Create rolling functions

For example here is a quick rolling mean function

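# needs magrittr's %>% (loaded with dplyr); averages the current value and the n before it
roll_mean <- function(n, vec) {
  seq(from = 0, to = n) %>%
    purrr::map(~ dplyr::lag(vec, n = .x)) %>%
    do.call(cbind, .) %>%
    rowMeans(na.rm = TRUE)
}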

However, I'd like to be able to customize how NAs are handled and to build any rolling function, possibly by using purrr's by_row function?
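A base R sketch of a generic version. The name `roll_func` comes from the helper list above, but the signature is an assumption: a trailing window of exactly `n` values, any summary function `f`, and extra arguments (such as `na.rm`) passed through `...`. Unlike the `roll_mean` above, incomplete windows at the start are filled with `fill` rather than averaged.

```r
# Hypothetical generic rolling function over a trailing window of n values.
roll_func <- function(vec, n, f = mean, ..., fill = NA_real_) {
  out <- rep(fill, length(vec))
  if (n < 1 || n > length(vec)) return(out)
  for (i in seq(n, length(vec))) {
    out[i] <- f(vec[(i - n + 1):i], ...)
  }
  out
}

x <- c(1, 2, NA, 4, 5)
roll_func(x, 2)                         # NAs propagate through each window
roll_func(x, 2, na.rm = TRUE)           # NAs dropped inside each window
roll_func(x, 3, f = max, na.rm = TRUE)  # any scalar-returning function works
```

Leaving NA handling to `f` keeps the helper small: the caller decides per call whether missing values propagate.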

assigning training/test with option for groups
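A sketch of what this could look like, assuming "option for groups" means that all rows sharing a group value land in the same split. The function name `assign_split`, the `prop` argument, and the added `split` column are hypothetical.

```r
# Hypothetical helper: label rows as train/test, optionally keeping groups intact.
assign_split <- function(data, prop = 0.8, group = NULL) {
  units <- if (is.null(group)) seq_len(nrow(data)) else data[[group]]
  ids <- unique(units)
  n_train <- round(prop * length(ids))
  # sample positions rather than values to avoid sample()'s single-number behavior
  train_ids <- ids[sample.int(length(ids), n_train)]
  data$split <- ifelse(units %in% train_ids, "train", "test")
  data
}

set.seed(1)
split_rows <- assign_split(mtcars, prop = 0.75)                 # row-level split
split_cars <- assign_split(mtcars, prop = 0.75, group = "cyl")  # whole groups together
table(split_cars$cyl, split_cars$split)
```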

Incorporate a rug plot into partial_plot

There should be an option to overlay a rug plot in partial_plot. This will allow the user to be weary of when the smooth curves may be extrapolating too much.

create wrapper for project template spin calls

basically instead of doing this:

ezknitr::ezspin("Magic Pass Attendance/munge/pass_acceptance_model.R", out_dir = "Magic Pass Attendance/doc", fig_dir = "../graphs/pass_acceptance_model", keep_html = F, chunk_opts = list(cache.path = "Magic Pass Attendance/cache/pass_acceptance_model/"))

I could just do

spin_project("Magic Pass Attendance", "pass_acceptance_model")

and it would know where to put everything in the project_template directories
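A sketch of the wrapper, assuming every script lives in the project's `munge` folder and the other paths follow the pattern in the call above. `spin_project_args` is a hypothetical helper that only builds the argument list, which makes the path logic easy to check without running ezknitr.

```r
# Hypothetical: derive the ezspin arguments from the project and script names.
spin_project_args <- function(project, script) {
  list(
    file.path(project, "munge", paste0(script, ".R")),
    out_dir = file.path(project, "doc"),
    fig_dir = file.path("..", "graphs", script),
    keep_html = FALSE,
    chunk_opts = list(
      cache.path = paste0(file.path(project, "cache", script), "/")
    )
  )
}

# ezknitr is only needed when this function is actually called
spin_project <- function(project, script) {
  do.call(ezknitr::ezspin, spin_project_args(project, script))
}

args <- spin_project_args("Magic Pass Attendance", "pass_acceptance_model")
str(args)
```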

Make partial_plot use NSE

I'd like the partial_plot function to use non-standard evaluation so the function call doesn't have to use character values. Both for the code to be cleaner and because I want to learn how NSE works.

For example here is how the function currently works:

test_model <- mgcv::gam(mpg ~ s(hp), data = mtcars)
partial_plot(test_model, "hp")

But I'd like it to drop the quotes

partial_plot(test_model, hp)
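One simple way to get there in base R is `substitute()`. The sketch below only shows the name capture; `partial_plot_nse` is a hypothetical stand-in, and the hand-off to the existing character-based code is left as a comment.

```r
# Hypothetical front end: accept either a bare name or a string.
partial_plot_nse <- function(model, var) {
  expr <- substitute(var)
  var_name <- if (is.character(expr)) expr else deparse(expr)
  # the existing implementation could be called here, e.g.
  # partial_plot(model, var_name)
  var_name
}

partial_plot_nse(NULL, hp)    # bare name
partial_plot_nse(NULL, "hp")  # quoted name still works
```

Accepting both forms keeps old calls working while the unquoted style is adopted.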

Add options for plotting in linear or response terms and SEs

Add tests to partial_plot

Some important tests need to be added including:

Make sure it can handle all types of smoothing terms (s, te, ti, t2)
Make sure it can handle cases when variable names are included in other variables. An example would be where you want the partial plot for balance but there are also other variables with names like balance_higher_than_500 or whatever. Does it ignore the others?
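For the second test, the behaviour to check might look like the sketch below. How partial_plot actually matches terms is not shown here, so `smooth_uses_var` is purely a hypothetical helper: it matches a variable against smooth labels of the form `s(x)` or `te(x,z)` by exact name rather than by substring, and does not handle labels with `by` variables.

```r
# Hypothetical: does each smooth label use exactly this variable?
smooth_uses_var <- function(labels, var) {
  # "te(balance,age)" -> "balance,age"
  inner <- sub("^[A-Za-z0-9_.]+\\((.*)\\)$", "\\1", labels)
  vapply(strsplit(inner, ",", fixed = TRUE), function(parts) {
    var %in% trimws(parts)
  }, logical(1))
}

labels <- c("s(balance)", "s(balance_higher_than_500)", "te(balance,age)", "ti(age)")
matches <- smooth_uses_var(labels, "balance")
matches
```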

The current standard errors are on the overall prediction, not the variable fit itself

Currently the standard errors are the overall prediction standard errors (from predict(..., se = T)). I assume the standard errors shown in plot.gam are different but need to verify.
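For comparison, mgcv can return standard errors per term through `predict(..., type = "terms", se.fit = TRUE)`. The sketch below extends the `test_model` example with a second smooth (an addition for illustration) so the two kinds of standard error differ; whether the per-term values match what plot.gam draws still needs checking.

```r
library(mgcv)

test_model <- gam(mpg ~ s(hp) + s(wt), data = mtcars)

# standard errors of the overall prediction (what is used now)
overall <- predict(test_model, se.fit = TRUE)

# standard errors of each term's contribution, one column per smooth
by_term <- predict(test_model, type = "terms", se.fit = TRUE)
colnames(by_term$se.fit)
head(by_term$se.fit[, "s(hp)"])
```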

Write function to sort within groups

Since dplyr changed the way arrange handles sorting on grouped data, implement a function that both groups by the given variables and arranges the rows in the order those variables are listed.
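A short sketch using dplyr's `.by_group` argument to `arrange`. The function name `group_arrange` is hypothetical, and this version sorts only by the grouping variables themselves.

```r
library(dplyr)

# Hypothetical helper: group by the listed variables and sort rows by them, in order.
group_arrange <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    arrange(.by_group = TRUE)
}

sorted <- group_arrange(mtcars, cyl, am)
head(sorted[, c("cyl", "am", "mpg")])
```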

write a one-hot function

When a variable has a particular value you want a dummy variable for (such as missing values), automatically create the dummy variable and add it to your data frame.
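A base R sketch of the idea. The name `add_indicator`, the `value` argument, and the `<col>_<value>` naming scheme are assumptions.

```r
# Hypothetical helper: add a 0/1 column flagging one value (NA by default).
add_indicator <- function(data, col, value = NA) {
  if (is.na(value)) {
    flag <- is.na(data[[col]])
    label <- "missing"
  } else {
    flag <- !is.na(data[[col]]) & data[[col]] == value
    label <- as.character(value)
  }
  data[[paste(col, label, sep = "_")]] <- as.integer(flag)
  data
}

df <- data.frame(x = c(1, NA, 3, NA))
with_flag <- add_indicator(df, "x")         # adds x_missing
six_cyl   <- add_indicator(mtcars, "cyl", 6) # adds cyl_6
```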

check_drift changes

change default num_bins to NULL
include an option for other ggplot2 theme objects
change to percentage scales
Fix scale name
Look into changes for manipulating the variables

Create n_distinct for data frames

dplyr includes n_distinct to find the number of unique elements in a vector, but I sometimes want the number of distinct combinations of several columns in a data frame:

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
nrow(mtcars)
#> [1] 32
n_distinct(mtcars$cyl)
#> [1] 3
n_distinct(mtcars$am)
#> [1] 2
nrow(dplyr::distinct(mtcars, cyl, am))
#> [1] 6
# should be a simpler function call: n_distinct
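A base R sketch of the simpler call, under a hypothetical name `n_distinct_df` with columns given as a character vector:

```r
# Hypothetical helper: number of distinct row combinations over the chosen columns.
n_distinct_df <- function(data, cols = names(data)) {
  nrow(unique(data[, cols, drop = FALSE]))
}

n_combos <- n_distinct_df(mtcars, c("cyl", "am"))
n_cyl    <- n_distinct_df(mtcars, "cyl")
```

With the `mtcars` example above, `n_combos` matches the `nrow(dplyr::distinct(mtcars, cyl, am))` result.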

library(dplyr)
library(tidyr)
library(purrr)

sample_data <- dplyr::data_frame(group = sample(c("a", "b", "c"), size = 100, replace = T), value = sample.int(30, size = 100, replace = T)) 
head(sample_data)
#> # A tibble: 6 × 2
#>   group value
#>   <chr> <int>
#> 1     c    28
#> 2     c     9
#> 3     c    13
#> 4     c    11
#> 5     a     9
#> 6     c     9

grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
         acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))

head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#> 
#>   group  acf_values   lag
#>   <chr>       <dbl> <int>
#> 1     c  1.00000000     0
#> 2     c -0.20192774     1
#> 3     c  0.07191805     2
#> 4     c -0.18440489     3
#> 5     c -0.31817935     4
#> 6     c  0.06368096     5
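The pipeline above could be wrapped as the `acf_by_group` helper from the list of functions to write. The sketch below is a base R version under an assumed signature (column names passed as strings); it is not the package's implementation.

```r
# Hypothetical helper: autocorrelation values per group, in long format.
acf_by_group <- function(data, group, value) {
  values <- split(data[[value]], data[[group]])
  pieces <- lapply(names(values), function(g) {
    acf_values <- drop(stats::acf(values[[g]], plot = FALSE)$acf)
    data.frame(group = g,
               lag = seq_along(acf_values) - 1L,  # acf output starts at lag 0
               acf_values = acf_values,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

set.seed(42)
example_data <- data.frame(
  group = sample(c("a", "b", "c"), size = 100, replace = TRUE),
  value = sample.int(30, size = 100, replace = TRUE)
)
acf_long <- acf_by_group(example_data, "group", "value")
head(acf_long)
```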
