modeler's People

Contributors

mattmills49

modeler's Issues

Think of a better name

The name should be more inclusive of general data handling, not just modeling-focused. Also, Hadley just released modelr, lol.

Create a profile/prepare/track function

The problem this function is trying to solve: when I replace missing or NA values in a training set, I also need to apply the same replacements to a test set. So there should be a way to store that information and apply it to new data in one step.
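One possible shape for this, as a minimal base R sketch: learn the replacement values once from the training data, keep them in an object, and reuse that object on any new data. The names `fit_imputer` and `apply_imputer` are hypothetical, and the sketch assumes numeric columns imputed with the median.

```r
# Learn one replacement value per numeric column from the training data.
# `fit_imputer` and `apply_imputer` are hypothetical names, not package functions.
fit_imputer <- function(data, f = function(x) median(x, na.rm = TRUE)) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  values <- lapply(data[numeric_cols], f)
  structure(list(values = values), class = "imputer")
}

# Fill NAs in new data using the values stored at fit time.
apply_imputer <- function(imputer, new_data) {
  for (col in intersect(names(imputer$values), names(new_data))) {
    missing <- is.na(new_data[[col]])
    new_data[[col]][missing] <- imputer$values[[col]]
  }
  new_data
}

train <- data.frame(x = c(1, NA, 3, 10), y = c(NA, 2, 4, 6))
test  <- data.frame(x = c(NA, 5), y = c(1, NA))

imputer <- fit_imputer(train)
filled  <- apply_imputer(imputer, test)
filled
```

The test set is filled with the training medians (3 for x, 4 for y) rather than values computed from the test set itself.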

Impact Code Function Structure

Impact Coding is the traditional categorical variable treatment described here: http://helios.mm.di.uoa.gr/~rouvas/ssi/sigkdd/sigkdd.vol3.1/barreca.pdf

The Impact Coding needs to know:

  1. The Dependent variable you are using and the Independent Variable to use for transformation
  2. The type of transformation to use
  3. If there are any hierarchical relationships (e.g. 3 digit zip code - 4 digit zip code, city - state)

It will need to return a data frame with the levels of the category and the new estimate.
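A rough base R sketch of that interface, under simplifying assumptions: the only "transformation" offered is a level mean shrunk toward the global mean with a hypothetical `smoothing` weight, which is a simplified stand-in for the weighting in the linked paper, and hierarchical relationships are not handled. The function name `impact_code` and its arguments are placeholders.

```r
# Simplified impact coding: shrink each level's mean toward the global mean.
# `smoothing` acts like a number of pseudo-observations at the global mean.
impact_code <- function(data, dependent, independent, smoothing = 10) {
  y <- data[[dependent]]
  x <- as.character(data[[independent]])
  global_mean <- mean(y, na.rm = TRUE)
  level_n     <- tapply(y, x, function(v) sum(!is.na(v)))
  level_mean  <- tapply(y, x, mean, na.rm = TRUE)
  estimate <- (level_n * level_mean + smoothing * global_mean) /
    (level_n + smoothing)
  # One row per level of the category, with its new estimate
  data.frame(
    level = names(level_mean),
    n = as.integer(level_n),
    estimate = as.numeric(estimate),
    stringsAsFactors = FALSE
  )
}

codes <- impact_code(mtcars, dependent = "mpg", independent = "cyl")
codes
```

With `smoothing = 0` the estimates reduce to the raw level means; larger values pull small levels harder toward the global mean.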

New Helper Functions to Write

  • diff_time

  • in_table

  • roll_func

  • safe_ifelse

  • acf_by_group

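  • diff_time: wraps difftime so that it returns a plain numeric value instead of a difftime object that is awkward to manipulate

diff_time <- function(...) as.numeric(difftime(...))
  • in_table: summarizes how many elements of each vector are found in each of the other vectors (as a proportion by default)
    • could actually use any function that returns a scalar between two vectors
in_table <- function(..., f = mean) {
  vectors <- list(...)
  # Use the argument expressions (e.g. x, df$id) as row and column labels
  labels <- vapply(as.list(substitute(list(...)))[-1L],
                   function(e) paste(deparse(e), collapse = ""),
                   character(1), USE.NAMES = FALSE)
  # Column j holds f(x_j %in% y) for every vector y
  shared_counts <- vapply(vectors, function(x) {
    vapply(vectors, function(y) f(x %in% y), numeric(1))
  }, numeric(length(vectors)))
  shared_counts <- matrix(shared_counts, nrow = length(vectors),
                          dimnames = list(labels, labels))
  # Transpose so that row i, column j is f(x_i %in% x_j)
  t(shared_counts)
}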

Create rolling functions

For example here is a quick rolling mean function

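# Mean of the current value and the n previous values (a window of n + 1).
# Needs the pipe plus purrr::map and dplyr::lag (stats::lag has no n argument).
roll_mean <- function(n, vec) {
  seq(from = 0, to = n) %>%
    purrr::map(~ dplyr::lag(vec, n = .x)) %>%
    do.call(cbind, .) %>%
    rowMeans(na.rm = TRUE)
}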

However, I'd like to be able to customize the handling of NAs and to create any rolling function, possibly by using purrr's by_row function.
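One way this could look without purrr is a small base R helper that takes the summary function as an argument and forwards extra arguments such as `na.rm`. This is only a sketch: `roll_apply` and its `partial` argument are hypothetical, and here `n` counts the values in the window including the current one.

```r
# Apply `f` over a trailing window of `n` values (current value included).
# Extra arguments in `...` are passed on to `f`, e.g. na.rm = TRUE.
# partial = TRUE evaluates the incomplete windows at the start instead of
# returning NA for them.
roll_apply <- function(vec, n, f = mean, ..., partial = FALSE) {
  vapply(seq_along(vec), function(i) {
    start <- i - n + 1
    if (start < 1 && !partial) return(NA_real_)
    as.numeric(f(vec[max(start, 1):i], ...))
  }, numeric(1))
}

x <- c(1, 2, NA, 4, 5)
rolled_mean <- roll_apply(x, n = 2, f = mean, na.rm = TRUE)
rolled_max  <- roll_apply(x, n = 2, f = max, na.rm = TRUE, partial = TRUE)
rolled_mean
rolled_max
```

Leaving `na.rm` out makes any window that contains an NA return NA, so the NA policy is decided by the caller rather than hard-coded.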

Incorporate a rug plot into partial_plot

There should be an option to overlay a rug plot in partial_plot. This will make the user wary of regions where the smooth curves may be extrapolating too much.

create wrapper for project template spin calls

basically instead of doing this:

ezknitr::ezspin("Magic Pass Attendance/munge/pass_acceptance_model.R", out_dir = "Magic Pass Attendance/doc", fig_dir = "../graphs/pass_acceptance_model", keep_html = F, chunk_opts = list(cache.path = "Magic Pass Attendance/cache/pass_acceptance_model/"))

I could just do

spin_project("Magic Pass Attendance", "pass_acceptance_model")

and it would know where to put everything in the project_template directories
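A sketch of how such a wrapper might be put together, assuming the directory layout implied by the call above (munge/, doc/, graphs/, cache/). `spin_project_args` is a hypothetical helper that only builds the arguments; `spin_project` then hands them to ezknitr::ezspin using the same arguments as the manual call.

```r
# Build the ezspin arguments from a project folder and a script name.
# The folder names are taken from the example call and are an assumption
# about how project_template is laid out.
spin_project_args <- function(project, script, keep_html = FALSE) {
  list(
    file.path(project, "munge", paste0(script, ".R")),
    out_dir = file.path(project, "doc"),
    fig_dir = file.path("..", "graphs", script),
    keep_html = keep_html,
    chunk_opts = list(
      cache.path = paste0(file.path(project, "cache", script), "/")
    )
  )
}

# Hypothetical wrapper: requires the ezknitr package when it is called.
spin_project <- function(project, script, ...) {
  do.call(ezknitr::ezspin, spin_project_args(project, script, ...))
}

spin_args <- spin_project_args("Magic Pass Attendance", "pass_acceptance_model")
str(spin_args)
```

Keeping the path logic in its own function makes it possible to check the generated paths without actually knitting anything.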

Make partial_plot use NSE

I'd like the partial_plot function to use non-standard evaluation so the function call doesn't have to use character values. Both for the code to be cleaner and because I want to learn how NSE works.

For example here is how the function currently works:

test_model <- mgcv::gam(mpg ~ s(hp), data = mtcars)
partial_plot(test_model, "hp")

But I'd like it to drop the quotes

partial_plot(test_model, hp)
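A minimal base R sketch of the idea, using substitute() to capture the unevaluated argument and deparse() to turn it into the string the current function expects. `partial_plot_nse` is a hypothetical wrapper, and `echo_plot` is only a stand-in for the real plotting function so the example runs on its own.

```r
# Capture `var` without evaluating it and pass its name on as a string.
# Quoted input ("hp") still works, so existing calls would not break.
partial_plot_nse <- function(model, var, plot_fun = partial_plot) {
  var_expr <- substitute(var)
  var_name <- if (is.character(var_expr)) var_expr else deparse(var_expr)
  plot_fun(model, var_name)
}

# Stand-in for partial_plot that just returns the name it received
echo_plot <- function(model, var_name) var_name

unquoted <- partial_plot_nse(NULL, hp, plot_fun = echo_plot)
quoted   <- partial_plot_nse(NULL, "hp", plot_fun = echo_plot)
unquoted
quoted
```

Note that substitute() has to be called in the function that receives the user's argument; calling it from a nested helper would capture the symbol `var` instead.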

Add tests to partial_plot

Some important tests need to be added including:

  • Make sure it can handle all types of smoothing terms (s, te, ti, t2)
  • Make sure it can handle cases when variable names are included in other variables. An example would be where you want the partial plot for balance but there are also other variables with names like balance_higher_than_500 or whatever. Does it ignore the others?
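For the second case, the sketch below shows why substring matching is the thing to test against. It assumes the smooth terms are available as label strings such as "s(balance)"; `terms_with_var` is a hypothetical helper, not part of partial_plot.

```r
# Return the term labels that actually use `var` as a variable.
# Parsing each label and calling all.vars() matches whole variable names,
# so "balance" does not match "balance_higher_than_500".
terms_with_var <- function(term_labels, var) {
  uses_var <- vapply(term_labels, function(label) {
    var %in% all.vars(parse(text = label)[[1]])
  }, logical(1), USE.NAMES = FALSE)
  term_labels[uses_var]
}

labels <- c("s(balance)", "s(balance_higher_than_500)", "te(balance, income)")

exact_matches     <- terms_with_var(labels, "balance")
substring_matches <- labels[grepl("balance", labels, fixed = TRUE)]
exact_matches
substring_matches
```

The substring version returns all three labels, which is the failure a test on partial_plot should catch.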

Write function to sort within groups

Since dplyr changed the way arrange handles group sorting, implement a function that both groups by the given variables and arranges by them in the order listed.
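A short sketch of what that could look like, assuming a dplyr version that forwards `...` to both verbs. The name `group_arrange` is hypothetical.

```r
# Group by the given variables, then sort by the same variables in the
# order they were listed. The result stays grouped.
group_arrange <- function(data, ...) {
  grouped <- dplyr::group_by(data, ...)
  dplyr::arrange(grouped, ...)
}

sorted_cars <- group_arrange(mtcars, cyl, gear)
head(sorted_cars)
```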

write a one-hot function

When a variable has a particular value you want a dummy variable for (such as missing values), automatically create the dummy variable and add it to your data frame.
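A base R sketch of the single-value case. The function name `add_dummy`, its arguments, and the naming scheme for the new column are all assumptions made for illustration.

```r
# Add a 0/1 column flagging where `var` equals `value` (NA by default).
add_dummy <- function(data, var, value = NA, name = NULL) {
  if (is.null(name)) {
    suffix <- if (is.na(value)) "_missing" else paste0("_is_", value)
    name <- paste0(var, suffix)
  }
  x <- data[[var]]
  # NA needs is.na(); any other value is compared directly, with NAs as 0
  flag <- if (is.na(value)) is.na(x) else !is.na(x) & x == value
  data[[name]] <- as.integer(flag)
  data
}

df <- data.frame(income = c(50, NA, 70),
                 state = c("FL", "GA", NA),
                 stringsAsFactors = FALSE)

with_missing <- add_dummy(df, "income")
with_state   <- add_dummy(df, "state", value = "FL")
with_missing
with_state
```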

check_drift changes

  • change default num_bins to NULL
  • include an option for other ggplot2 theme objects
  • change to percentage scales
  • Fix scale name
  • Look into changes for manipulating the variables

Create n_distinct for data frames

dplyr includes n_distinct to find the number of unique elements in a vector, but I sometimes want to find the number of distinct combinations of columns in a data frame:

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
nrow(mtcars)
#> [1] 32
n_distinct(mtcars$cyl)
#> [1] 3
n_distinct(mtcars$am)
#> [1] 2
nrow(dplyr::distinct(mtcars, cyl, am))
#> [1] 6
# should be a simpler function call: n_distinct
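A base R sketch of such a helper; the name `n_distinct_rows` and the choice to pass column names as a character vector are assumptions.

```r
# Count the distinct combinations of the given columns (all columns by default).
n_distinct_rows <- function(data, cols = names(data)) {
  nrow(unique(data[, cols, drop = FALSE]))
}

n_distinct_rows(mtcars, c("cyl", "am"))
#> [1] 6
```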

Function for working with grouped time series

I'd like to be able to calculate time series info (autocorrelations, partial autocorrelations, differences, stationarity, etc.) from grouped data. Here is a quick function for grouped acf:

library(dplyr)
library(tidyr)
library(purrr)

sample_data <- dplyr::tibble(group = sample(c("a", "b", "c"), size = 100, replace = TRUE), value = sample.int(30, size = 100, replace = TRUE))
head(sample_data)
#> # A tibble: 6 × 2
#>   group value
#>   <chr> <int>
#> 1     c    28
#> 2     c     9
#> 3     c    13
#> 4     c    11
#> 5     a     9
#> 6     c     9

grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
         acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))

head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#> 
#>   group  acf_values   lag
#>   <chr>       <dbl> <int>
#> 1     c  1.00000000     0
#> 2     c -0.20192774     1
#> 3     c  0.07191805     2
#> 4     c -0.18440489     3
#> 5     c -0.31817935     4
#> 6     c  0.06368096     5
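The same idea can be wrapped into a reusable helper. Below is a base R sketch; the name `acf_by_group` comes from the helper list above, but its arguments are assumptions, and it assumes the rows within each group are already in time order.

```r
# Compute the autocorrelation function separately for each group and
# return one long data frame with columns group, lag, and acf.
acf_by_group <- function(data, group, value, lag.max = NULL) {
  groups <- split(data[[value]], data[[group]])
  pieces <- Map(function(name, x) {
    acf_values <- drop(stats::acf(x, lag.max = lag.max, plot = FALSE)$acf)
    data.frame(group = name,
               lag = seq_along(acf_values) - 1L,
               acf = acf_values,
               stringsAsFactors = FALSE)
  }, names(groups), groups)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(1)
ts_data <- data.frame(group = rep(c("a", "b"), each = 20),
                      value = rnorm(40))
acf_result <- acf_by_group(ts_data, "group", "value", lag.max = 5)
head(acf_result)
```

Swapping stats::acf for stats::pacf or diff inside the helper would give the other grouped summaries mentioned above, though pacf starts at lag 1 rather than lag 0.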
