row-oriented-workflows's Introduction

Row-oriented workflows in R with the tidyverse

Materials for the RStudio webinar (recording available at this link!):

Thinking inside the box: you can do that inside a data frame?!
Jenny Bryan
Wednesday, April 11 at 1:00pm ET / 10:00am PT
rstd.io/row-work <-- shortlink to this repo
Slides available on SpeakerDeck

Abstract

The data frame is a crucial data structure in R and, especially, in the tidyverse. Working on a column or a variable is a very natural operation, which is great. But what about row-oriented work? That also comes up frequently and is more awkward. In this webinar I’ll work through concrete code examples, exploring patterns that arise in data analysis. We’ll discuss the general notion of "split-apply-combine", row-wise work in a data frame, splitting vs. nesting, and list-columns.

Code examples

Beginner --> intermediate --> advanced
Not all are used in webinar

  • Leave your data in that big, beautiful data frame. ex01_leave-it-in-the-data-frame Show the evil of creating copies of certain rows of certain variables, using Magic Numbers and cryptic names, just to save some typing.
  • Adding or modifying variables. ex02_create-or-mutate-in-place df$var <- ... versus dplyr::mutate(). Recycling/safety, df's as data mask, aesthetics.
  • Are you SURE you need to iterate over rows? ex03_row-wise-iteration-are-you-sure Don't fixate on most obvious generalization of your pilot example and risk overlooking a vectorized solution. Features a paste() example, then goes out with some glue glory.
  • Working with non-vectorized functions. ex04_map-example Small example using purrr::map() to apply nrow() to list of data frames.
  • Row-wise thinking vs. column-wise thinking. ex05_attack-via-rows-or-columns Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding.
  • Iterate over rows of a data frame. iterate-over-rows Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different number of rows or columns.
  • Generate data from different distributions via purrr::pmap(). ex06_runif-via-pmap Use purrr::pmap() to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
  • Are you SURE you need to iterate over groups? ex07_group-by-summarise Use dplyr::group_by() and dplyr::summarise() to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use list() to package multivariate summaries into something summarise() can handle, creating a list-column.
  • Group-and-nest. ex08_nesting-is-good How to explicitly work on groups of rows via nesting (our recommendation) vs splitting.
  • Row-wise mean or sum. ex09_row-summaries How to do rowSums()-y and rowMeans()-y work inside a data frame.
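As a rough illustration of the group-wise summary pattern described for ex07, the sketch below uses a small made-up data frame (the names `df`, `g`, `x`, and `res` are hypothetical and not taken from the repo). It wraps a multi-value summary in list() so that summarise() can store it as a list-column.

```r
library(dplyr)

# Hypothetical toy data: two groups with a few values each
df <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 3, 10, 20)
)

res <- df %>%
  group_by(g) %>%
  summarise(
    n   = n(),               # one number per group
    avg = mean(x),           # one number per group
    rng = list(range(x))     # two numbers per group, packaged as a list-column
  )

res
res$rng[[2]]  # the (min, max) pair for group "b"
```

No explicit splitting or re-combining is needed; the grouping does that work.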

More tips and links

Big thanks to everyone who weighed in on the related twitter thread. This was very helpful for planning content.

45 minutes is not enough! A few notes about more special functions and patterns for row-driven work. Maybe we need to do a follow up ...

tibble::enframe() and deframe() are handy for getting into and out of the data frame state.
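A minimal sketch of the round trip, using a made-up named vector `x` (hypothetical, for illustration only):

```r
library(tibble)

# Hypothetical named vector
x <- c(a = 1, b = 2, c = 3)

x_df <- enframe(x)      # two-column tibble: name, value
x_df

x_back <- deframe(x_df) # back to a named vector
x_back
```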

map() and map2() are useful for working with list-columns inside mutate().
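For instance, a small sketch with two made-up list-columns, `x` and `w` (all names here are hypothetical): the typed variants map_int() and map2_dbl() return ordinary atomic columns.

```r
library(dplyr)
library(purrr)

# Hypothetical data frame with two list-columns
df <- tibble(
  g = c("a", "b"),
  x = list(1:3, 4:6),
  w = list(c(1, 1, 1), c(0, 1, 2))
)

res <- df %>%
  mutate(
    n     = map_int(x, length),                        # one list-column in
    wmean = map2_dbl(x, w, ~ weighted.mean(.x, .y))    # two list-columns in
  )

res
```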

tibble::add_row() handy for adding a single row at an arbitrary position in data frame.
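A short sketch, assuming a made-up tibble `df` (hypothetical); the `.before` argument places the new row ahead of the given position.

```r
library(tibble)

# Hypothetical three-row tibble
df <- tibble(x = 1:3, y = c("a", "b", "c"))

# Insert a new row so that it becomes row 2
df2 <- add_row(df, x = 99L, y = "z", .before = 2)
df2
```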

imap() handy for iterating over something and its names or integer indices at the same time.
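A small sketch with made-up inputs (hypothetical names): inside the formula, `.x` is the element and `.y` is its name, or its integer position when the input has no names.

```r
library(purrr)

# Named input: .y is the name
scores <- list(a = 1:3, b = 4:6)
by_name <- imap_chr(scores, ~ paste0(.y, ": ", sum(.x)))
by_name

# Unnamed input: .y is the integer index
letters_in <- c("x", "y")
by_index <- imap_chr(letters_in, ~ paste(.y, .x))
by_index
```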

dplyr::case_when() helps you get rid of hairy, nested if () {...} else {...} statements.
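A minimal sketch on a made-up numeric vector `x` (hypothetical). Conditions are checked in order and the first one that is true wins, so the missing-value check comes first.

```r
library(dplyr)

# Hypothetical input
x <- c(-2, 0, 3, NA)

label <- case_when(
  is.na(x) ~ "missing",
  x < 0    ~ "negative",
  x == 0   ~ "zero",
  TRUE     ~ "positive"   # fallback for everything else
)
label
```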

Great resource on the "why?" of functional programming approaches (such as map()): https://github.com/getify/Functional-Light-JS/blob/master/manuscript/ch1.md/

Contributors: jennybc, taekyunk

row-oriented-workflows's Issues

row summaries using nest

Outdated. See comment below.

I'm opening this issue to follow up on the nest example I added to #7. (I really appreciate this repo by the way - great resource to have!) After posting, I noticed there is an existing nest example in [ex09_row-summaries](https://github.com/jennybc/row-oriented-workflows/blob/master/ex09_row-summaries.md) which adds the sum and mean.

(s <- df %>%
   gather("key", "value", -name) %>%
   nest(-name) %>%
   mutate(
     sum = map(data, "value") %>% map_dbl(sum),
     mean = map(data, "value") %>% map_dbl(mean)
   ) %>%
   select(-data))

df %>%
  left_join(s)

I thought I'd add an additional approach based on my previous example in case it's useful...

One summary variable using nest (previous example in #7)

library(tidyverse)
df <- tribble(
  ~ name, ~ t1, ~t2, ~t3, ~x,
  "Abby",    1,   2,   3, "a",
  "Bess",    4,   5,   6, "a",
  "Carl",    7,   8,   9, "b"
)

df %>% 
  nest(t1, t2, t3) %>% 
  mutate(t_avg = map_dbl(data, ~ mean(unlist(.)))) %>%
  unnest()

#> # A tibble: 3 x 6
#>   name  x     t_avg    t1    t2    t3
#>   <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby  a         2     1     2     3
#> 2 Bess  a         5     4     5     6
#> 3 Carl  b         8     7     8     9

Multiple summary variables (alternative to existing nest example in ex09_row-summaries)

In this context, I like to think of the nested list column as containing a "bunch of values" on each row we want to do something with (in this case t1, t2, and t3).

df %>% 
  nest(t1, t2, t3) %>% 
  mutate(result = map(data, ~ tibble(
    t_sum = sum(unlist(.)),
    t_avg = mean(unlist(.))))) %>%
  unnest()

#> # A tibble: 3 x 7
#>   name  x        t1    t2    t3 t_sum t_avg
#>   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby  a         1     2     3     6     2
#> 2 Bess  a         4     5     6    15     5
#> 3 Carl  b         7     8     9    24     8

This seems to be OK if the number of rows in each element of data matches the number of rows in each element of result. But it falls down if there is a mismatch between the list columns. For example, "Abby" might appear more than once in the example data, and we'd still like her overall sum and mean included on each row where her name appears. In this situation, explicit unnesting is needed:

# add an extra row for "Abby" to illustrate
df <- tribble(
  ~ name, ~ t1, ~t2, ~t3, ~x,
  "Abby",    1,   2,   3, "a",
  "Abby",   10,  20,  30, "a",
  "Bess",    4,   5,   6, "a",
  "Carl",    7,   8,   9, "b"
)

df %>% 
  nest(t1, t2, t3) %>% 
  mutate(result = map(data, ~ tibble(
    t_sum = sum(unlist(.)),
    t_avg = mean(unlist(.))))) %>%
  unnest(data, .preserve = result) %>% unnest(result)

#> # A tibble: 4 x 7
#>   name  x        t1    t2    t3 t_sum t_avg
#>   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Abby  a         1     2     3    66    11
#> 2 Abby  a        10    20    30    66    11
#> 3 Bess  a         4     5     6    15     5
#> 4 Carl  b         7     8     9    24     8
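The calls above use the nest()/unnest() interface that tidyr had when this issue was written. As a hedged sketch, assuming tidyr 1.0.0 or later, the same result can probably be reached by naming the nested column explicitly and unnesting one list-column at a time; `.preserve` should no longer be needed there, because other list-columns are kept when one is unnested. The object name `res` is hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tribble(
  ~name, ~t1, ~t2, ~t3, ~x,
  "Abby",  1,   2,   3, "a",
  "Abby", 10,  20,  30, "a",
  "Bess",  4,   5,   6, "a",
  "Carl",  7,   8,   9, "b"
)

res <- df %>%
  nest(data = c(t1, t2, t3)) %>%      # one row per (name, x) combination
  mutate(result = map(data, ~ tibble(
    t_sum = sum(unlist(.x)),
    t_avg = mean(unlist(.x))
  ))) %>%
  unnest(data) %>%                    # back to one row per original row
  unnest(result)                      # spread the one-row summaries

res
```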

How to develop a map call

Part of the appeal of a for() loop is the development workflow. Sketch:

  • Pick i.
  • Develop code that works.
  • Drop into a loop over i.
  • Done.

What is so nice about this? It's clear how to experiment with your logic via top-level code.

With very little modification, something similar works with the functional approaches to iteration.

Pick i. Set .x <- thingy[[i]] (for a map workflow) or .x <- thingy[i, , drop = FALSE] (for a pmap workflow) or what have you. Now develop your logic as in a for loop situation.
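A small sketch of the map flavour of this workflow, using a made-up list called `thingy` and an arbitrary calculation (both hypothetical). The expression developed at top level is pasted unchanged into the map call.

```r
library(purrr)

# Hypothetical input
thingy <- list(a = 1:3, b = 4:6, c = 7:9)

# 1. Pick i and pull out a single element, named .x on purpose
i <- 2
.x <- thingy[[i]]

# 2. Develop the logic interactively on that one element
one_result <- mean(.x) / max(.x)
one_result

# 3. Paste the same expression into a map call
all_results <- map_dbl(thingy, ~ mean(.x) / max(.x))
all_results
```

Because the top-level object is already called `.x`, nothing needs renaming when the expression moves inside the formula.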

I touch on this in ex06_runif-via-pmap.R and in my purrr tutorial workflow advice.

Related to #3, how to debug map() calls.

Are you creating your own (de-)vectorization problems?

Sometimes we are the ones who miss opportunities to vectorize our own functions. Or sometimes we de-vectorize things that started out as vectorized. Red flags:

@colearendt:

Thinking too much in control structures like if () {} else {} often lands me in a non-vectorized state.

@hadley:

use of unlist() is a good warning sign that you're de-vectorising a function
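A toy illustration of the first red flag, with hypothetical function names `label_one` and `label_all`. The if/else version only handles one value at a time and so needs explicit iteration, whereas the ifelse() version accepts a whole vector.

```r
# Scalar thinking: if () {} else {} inspects a single value
label_one <- function(x) {
  if (x < 0) "negative" else "non-negative"
}

# Vectorized thinking: ifelse() works element-wise on the whole vector
label_all <- function(x) {
  ifelse(x < 0, "negative", "non-negative")
}

x <- c(-1, 0, 2)

looped     <- vapply(x, label_one, character(1))  # iteration required
vectorized <- label_all(x)                        # no iteration required
vectorized
```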

slow row operation to capture the first match of a pattern

The goal is to capture, for each row, the first value that matches the pattern "a" among selected columns (1 to 3). The result should be a vector whose length equals the number of rows of the data frame. Unfortunately, the function is slow when applied to a large data frame.

This is the data frame:

df <- data.frame(
  x = c("ab", "ay", "cd", "ae", "ef"),
  y = c("bx", "ax", NA, "cx","ax"),
  z = c("cy", "dy", "ey", "ay", "by")
)

This is the function:

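library(tidyverse)

# For each row, return the first value (scanning the selected columns from
# left to right) that matches `pattern`, or NA if nothing matches.
first_valid_code <- function(data, colvec, pattern){
  colvec <- enquo(colvec)
  # f0: does a value match the pattern? (NA never matches)
  f0 <- function(x) grepl(pattern = pattern, x, ignore.case = TRUE, perl = TRUE)
  # f1: first matching element of a row, or NULL if there is none
  f1 <- function(x) detect(x, f0)
  data %>%
    select(!!colvec) %>%
    map_dfr(as.character) %>%           # coerce every column to character
    transpose() %>%                     # one list element per row
    map(f1) %>%
    map_if(is.null, ~NA_character_) %>% # NULL (no match) becomes NA
    unlist()
}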

This is the result when running the function on df:

first_valid_code(df, colvec = c(1:3), pattern = "a")

[1] "ab" "ay" NA "ae" "ax"
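One possible column-wise alternative, offered as a sketch rather than a benchmarked fix: blank out the non-matching values in each selected column, then take the first non-missing value across columns with dplyr::coalesce(). The function name `first_match_by_column` is hypothetical. It iterates over columns instead of rows, which may help when there are many rows and few columns, though no timing was done here.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  x = c("ab", "ay", "cd", "ae", "ef"),
  y = c("bx", "ax", NA, "cx", "ax"),
  z = c("cy", "dy", "ey", "ay", "by")
)

# Hypothetical helper: same inputs and output as first_valid_code()
first_match_by_column <- function(data, cols, pattern) {
  candidates <- data %>%
    select({{ cols }}) %>%
    map(as.character) %>%
    # keep matching values, turn everything else (including NA) into NA
    map(~ if_else(
      grepl(pattern, .x, ignore.case = TRUE, perl = TRUE),
      .x,
      NA_character_
    ))
  # first non-missing candidate per row, scanning columns left to right
  do.call(coalesce, unname(candidates))
}

res <- first_match_by_column(df, 1:3, "a")
res
```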

"split then recombine" vs "nest and extract"

Productive recent conversation with @mpettis

https://gist.github.com/mpettis/1afd9a7f42fff34ba9a2d5c240356acc

Interesting to really think through pros and cons of "split then recombine" vs "nest and extract/unnest".
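To make the contrast concrete, here is a small sketch on made-up data (the names `df`, `by_split`, and `by_nest` are hypothetical, and tidyr 1.0.0 or later is assumed for the nest() syntax). Both routes compute the same per-group summaries; the difference is whether the groups live in a separate list or stay inside the data frame as a list-column.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data: a few values per name
df <- tribble(
  ~name, ~value,
  "Abby", 1,
  "Abby", 3,
  "Bess", 5,
  "Bess", 7,
  "Bess", 9
)

# "Split then recombine": groups become elements of a named list
by_split <- df %>%
  split(.$name) %>%
  map(~ tibble(n = nrow(.x), avg = mean(.x$value))) %>%
  bind_rows(.id = "name")

# "Nest and extract": groups stay in the data frame as a list-column
by_nest <- df %>%
  nest(data = c(value)) %>%
  mutate(
    n   = map_int(data, nrow),
    avg = map_dbl(data, ~ mean(.x$value))
  ) %>%
  select(-data)

by_split
by_nest
```

In the nested version the grouping variable never leaves the data frame, so there is no need to recover it from list names afterwards.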

There's also a connection to the need for "map within map". Or, rather, avoiding this need. Which seems good because that's awfully hard to reason about.

Also connected to the conversation re: a tidy-er split() function: tidyverse/tidyr#434

Calculating row summaries

This is related to ex09_row-summaries.

Data from a repeated measures design is usually recorded like this:

library(tidyverse)

df <- tribble(
  ~ name, ~ t1, ~t2, ~t3, ~x,
  "Abby",    1,   2,   3, "a",
  "Bess",    4,   5,   6, "a",
  "Carl",    7,   8,   9, "b"
)

Note that I have an additional variable, x, that I'd like to not lose while calculating the average of t* variables for each row.

The end goal is a data frame with (at a minimum) the variables name, x, and t_avg (this is the average of the t* variables). And this data frame should not be grouped.

(1) My original approach was the following:

df %>%
  rowwise() %>%
  mutate(t_avg = mean(c(t1, t2, t3))) %>%
  ungroup()
#> # A tibble: 3 x 6
#>   name     t1    t2    t3 x     t_avg
#>   <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 Abby      1     2     3 a         2
#> 2 Bess      4     5     6 a         5
#> 3 Carl      7     8     9 b         8

But, it uses rowwise(), so 🙅.

(2) Summarising in a separate data frame is suggested at ex09_row-summaries, but this requires introducing a new data frame.

(3) An alternative is gathering the data first, so something like:

df %>%
  gather(t, value, -c(name, x)) %>%
  group_by(name) %>%
  mutate(t_avg = mean(value)) %>%
  select(-c(t, value)) %>%
  distinct() %>%
  ungroup()
#> # A tibble: 3 x 3
#>   name  x     t_avg
#>   <chr> <chr> <dbl>
#> 1 Abby  a         2
#> 2 Bess  a         5
#> 3 Carl  b         8

It's a lot more lines of code and some new functions, but the upside is explicitly addressing the problem that df is not a tidy data frame to begin with.

Among other reasons for not using (1), I like the group_by() -> ungroup() workflow better than rowwise() -> ungroup().

From a teaching (to an intro audience) perspective I'm leaning towards (3) since it seems to me like the most explicit approach, but I'd love to hear thoughts on this.
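For comparison, a further option in the spirit of the rowMeans()-y work mentioned for ex09 is sketched below. It assumes dplyr 1.0.0 or later for across() and is not one of the three approaches discussed above. It keeps x, needs no reshaping, and never groups the data frame.

```r
library(dplyr)

df <- tribble(
  ~name, ~t1, ~t2, ~t3, ~x,
  "Abby",  1,   2,   3, "a",
  "Bess",  4,   5,   6, "a",
  "Carl",  7,   8,   9, "b"
)

res <- df %>%
  # across() returns the selected columns as a data frame,
  # which rowMeans() then summarises row by row
  mutate(t_avg = rowMeans(across(t1:t3)))

res
```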

Turn a row-oriented problem into a column-oriented problem

Inspired by a post from @vllorens in R-Ladies slack.

There are different patients and, for each patient, several samples. However, only one sample per patient was analyzed. This is indicated in column “sampleUsed”. How to turn the input tibble into the output here? You could work this row-wise: for each patient, identify the sample used, then determine which column to consult, to get the correct sample size. But it's actually easier to reshape the data and solve an easier problem, column-wise.

library(tidyverse)

df <- tibble::tribble(
  ~patient, ~sampleUsed, ~sample1_size, ~sample2_size, ~sample3_size, ~sample4_size,
  1L, "sample1", 12L, 13L, 17L,  9L,
  2L, "sample4", 15L, 13L, 21L, 11L,
  3L, "sample2", 14L, 15L, 13L, 15L,
  4L, "sample1", 20L, 14L, 15L, 13L
)

df %>% 
  gather(key = "key", value = "sample_size", ends_with("size")) %>% 
  filter(str_detect(key, sampleUsed)) %>% 
  select(-key)
#> # A tibble: 4 x 3
#>   patient sampleUsed sample_size
#>     <int> <chr>            <int>
#> 1       1 sample1             12
#> 2       4 sample1             20
#> 3       3 sample2             15
#> 4       2 sample4             11
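The same reshaping idea can be sketched with tidyr::pivot_longer(), assuming tidyr 1.0.0 or later. This variant also swaps the str_detect() filter for an exact comparison against the reconstructed column name, which is a design choice made here (not in the original post) so that, say, "sample1" could never match a hypothetical "sample10_size" column.

```r
library(dplyr)
library(tidyr)

df <- tribble(
  ~patient, ~sampleUsed, ~sample1_size, ~sample2_size, ~sample3_size, ~sample4_size,
  1L, "sample1", 12L, 13L, 17L,  9L,
  2L, "sample4", 15L, 13L, 21L, 11L,
  3L, "sample2", 14L, 15L, 13L, 15L,
  4L, "sample1", 20L, 14L, 15L, 13L
)

res <- df %>%
  pivot_longer(
    ends_with("size"),
    names_to = "key",
    values_to = "sample_size"
  ) %>%
  # keep only the row whose column name corresponds to the sample used
  filter(key == paste0(sampleUsed, "_size")) %>%
  select(-key)

res
```

With pivot_longer() the rows come out ordered by patient rather than by sample.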

How to debug map() calls that aren't working

Question raised in connection to this example:

What if data frame includes variables that should not be passed to .f()?

df_oops <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000),
  oops = c("please", "ignore", "me")
)
df_oops
#> # A tibble: 3 x 4
#>       n   min   max oops  
#>   <int> <dbl> <dbl> <chr> 
#> 1     1    0.    1. please
#> 2     2   10.  100. ignore
#> 3     3  100. 1000. me

This will not work!

set.seed(123)
pmap(df_oops, runif)
#> Error in .f(n = .l[[c(1L, i)]], min = .l[[c(2L, i)]], max = .l[[c(3L, : unused argument (oops = .l[[c(4, i)]])

Use that to motivate some development and debugging strategies.
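Two possible ways around the error, sketched under the assumption that only n, min, and max should reach runif(): either select just the needed columns before calling pmap(), or wrap runif() in a function whose `...` absorbs anything extra. The wrapper name `my_runif` is used here for illustration.

```r
library(dplyr)
library(purrr)

df_oops <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000),
  oops = c("please", "ignore", "me")
)

# Strategy 1: pass only the columns that .f() understands
set.seed(123)
via_select <- pmap(select(df_oops, n, min, max), runif)

# Strategy 2: a wrapper whose ... swallows the unwanted columns
my_runif <- function(n, min, max, ...) runif(n, min, max)

set.seed(123)
via_wrapper <- pmap(df_oops, my_runif)

identical(via_select, via_wrapper)
```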

"Vectorized function": demystify at first use

This term can sound much more intimidating than it is, especially if you are new to R. @edgararuiz likened it to "quantum entanglement".

How to demystify it quickly when introduced?

@garrett suggestions: "a function that works with vectors", a diagram, clear code demo

@jimhester points to this quote from @andrie's book:

Vectorized functions are a very useful feature of R, but programmers who are used to other languages often have trouble with this concept at first. A vectorized function works not just on a single value, but on a whole vector of values at the same time.
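A clear code demo along those lines might look like the following sketch, which uses sqrt() on a made-up vector and compares it with the element-by-element loop it replaces.

```r
x <- c(1, 4, 9)

# Vectorized: one call handles the whole vector
roots <- sqrt(x)
roots

# The same work done one value at a time
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

identical(roots, roots_loop)
```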

Suggestion for clarification

This is a great read. I offer this up for consideration on adding clarity to one section.

Here you are using a list to wrap the dots argument that has the single-row values of each column of the data frame and feeding it to runif. When I read it, I had to stop and think for a bit, and recall that pmap splices the list values in as function arguments, and so what is fed to my_runif are actual keyword arguments (like what do.call does), and not a list of length 1 lists. And that is why you have to wrap the dots argument in a list.
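The splicing behaviour described here can be checked with a small sketch. The helper `show_args` and the two-row data frame are hypothetical and are not the code from the example under discussion. The helper captures `...` with list() and reports what arrived, and the last lines compare one pmap() step against the equivalent do.call().

```r
library(purrr)

# Hypothetical two-row parameter table
params <- data.frame(n = 1:2, min = c(0, 10), max = c(1, 100))

# Capture whatever pmap() passes in and report the argument names
show_args <- function(...) {
  args <- list(...)
  names(args)
}

arg_names <- pmap(params, show_args)
arg_names[[1]]  # the column names arrive as argument names

# One pmap() step behaves like do.call() on that row's values
set.seed(1)
from_pmap <- pmap(params, runif)[[1]]

set.seed(1)
from_do_call <- do.call(runif, list(n = 1L, min = 0, max = 1))

identical(from_pmap, from_do_call)
```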

It didn't take me long to think through, but perhaps a sentence that addresses what cleared it up for me, per above, would help others.

Thanks again for this stuff, it is very valuable.

Add this to row wise sum/mean examples

From @krlmlr

library(tidyverse)

df <- tribble(
  ~name, ~t1, ~t2, ~t3,
  "Abby", 1, 2, 3,
  "Bess", 4, 5, 6,
  "Carl", 7, 8, 9
)

df %>%
  gather("key", "value", -name) %>%
  nest(-name) %>%
  mutate(
    sum = map(data, "value") %>% map_dbl(sum),
    mean = map(data, "value") %>% map_dbl(mean)
  )
#> # A tibble: 3 x 4
#>   name  data               sum  mean
#>   <chr> <list>           <dbl> <dbl>
#> 1 Abby  <tibble [3 × 2]>     6     2
#> 2 Bess  <tibble [3 × 2]>    15     5
#> 3 Carl  <tibble [3 × 2]>    24     8
