tidyverse / purrr Goto Github PK

View Code? Open in Web Editor NEW

1.2K 1.2K 258.0 9.22 MB

A functional programming toolkit for R

Home Page: https://purrr.tidyverse.org/

License: Other

R 87.85% C 12.13% Shell 0.02%

functional-programming r

purrr's Introduction

tidyverse

Overview

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science (2e).

Installation

# Install from CRAN
install.packages("tidyverse")

# Install the development version from GitHub
# install.packages("pak")
pak::pak("tidyverse/tidyverse")

If you’re compiling from source, you can run pak::pkg_system_requirements("tidyverse"), to see the complete set of system packages needed on your machine.

Usage

library(tidyverse) will load the core tidyverse packages:

ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.
stringr, for strings.
forcats, for factors.
lubridate, for date/times.

You also get a condensed summary of conflicts with other packages you have loaded:

library(tidyverse)
#> ── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0.9000 ──
#> ✔ dplyr     1.1.3     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You can see conflicts created later with tidyverse_conflicts():

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
tidyverse_conflicts()
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ✖ MASS::select()  masks dplyr::select()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

And you can check that all tidyverse packages are up-to-date with tidyverse_update():

tidyverse_update()
#> The following packages are out of date:
#>  * broom (0.4.0 -> 0.4.1)
#>  * DBI   (0.4.1 -> 0.5)
#>  * Rcpp  (0.12.6 -> 0.12.7)
#>  
#> Start a clean R session then run:
#> install.packages(c("broom", "DBI", "Rcpp"))

Packages

As well as the core tidyverse, installing this package also installs a selection of other packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for:

Working with specific types of vectors:
- hms, for times.
Importing other types of data:
- feather, for sharing with Python and other languages.
- haven, for SPSS, SAS and Stata files.
- httr, for web apis.
- jsonlite for JSON.
- readxl, for .xls and .xlsx files.
- rvest, for web scraping.
- xml2, for XML.
Modelling
- modelr, for modelling within a pipeline
- broom, for turning models into tidy data

Code of Conduct

Please note that the tidyverse project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

purrr's People

Contributors

Stargazers

Watchers

Forkers

msonnabaum snazz2001 lionel- smbache dgromer xtmgah kirillseva arturochian steromano fayecoga sumprain earino dshen1 parthasen paulhendricks piccolbo nacnudus gshotwell fxcebx wibeasley alexandrudaia constantin345 michaelquinn32 junjiemao davharris sjackman jrnold benmarwick nabilabd jimhester nathania jayhesselberth poldham vspinu tjmahr kendonb xpingli jeremystan mplourde xulukai kghub duytran16 stopcontrol egnha nanaakwasiabayieboateng ravirajadrangi orenov 1danjordan huangrh rlugojr zerostack yutannihilation ijlyttle dgrtwo wojciechniemczyk sks95 gergness shivendra90 nsethi007 amaybaum dataguy-anil xmur phebbar85 sunilkumar87 kevinykuo aravalaraju valbat markdly gregce richierocks amarchin wypd mherradora nicholasjhorton pachevalier tuqiang2014 radovankavicky gapdata 3dan3 ai-axiom edwinth geauxdojang xinai57 aaronwolen rlesur g-bruce guimarthe fomenkosmart t-kalinowski zedseayou pmfall dataxujing anhqle cderv malcolmbarrett ashesitr huftis ktaranov davidcorneto edublancas

purrr's Issues

Turning arrays into lists

To allow purrr's functions to talk to arrays, they need be transformed into lists.

plyr::alply() is a flexible tool that makes it is easy to turn arrays into lists, but it can be slow and hard to read. I think it makes sense to offer faster and more expressive functions in purrr for the most common coercions.

I think the most common tasks are flattening dimensions into a list, and enlisting dimensions hierarchically:

flatten_margin <- function(array, margin = NULL) {
  if (is.null(dim(array))) {
    dim(array) <- length(array)
  }
  if (is.null(margin)) {
    margin <- seq_along(dim(array))
  }
  apply(array, margin, list) %>% flatten()
}

enlist_margin <- function(array, margin = NULL) {
  if (is.null(margin)) {
    margin <- seq_along(dim(array))
  }
  if (length(margin) > 1) {
    new_margin <- ifelse(margin[-1] > margin[[1]], margin[-1] - 1, margin[-1])
    apply(array, margin[[1]], . %>% enlist_margin(new_margin))
  } else {
    flatten_margin(array, margin)
  }
}

These functions are faster than the corresponding alply() operations.

x <- array(seq_len(1e6), c(1000, 100, 10))
microbenchmark::microbenchmark(
  plyr = x %>% plyr::alply(2, .fun = . %>% plyr::alply(2)),
  purrr = x %>% enlist_margin(2:3)
)

neval: 100
   expr median         unit
1  plyr    322 milliseconds
2 purrr     66 milliseconds

And scale much better:

x2 <- array(seq_len(1e6), c(10, 100, 1000))
microbenchmark::microbenchmark(
  plyr = x2 %>% plyr::alply(2, .fun = . %>% plyr::alply(2)),
  purrr = x2 %>% enlist_margin(2:3)
)

neval: 100
   expr median         unit
1  plyr  11479 milliseconds
2 purrr    241 milliseconds

They could be rewritten in Rcpp in the future for an additional performance boost.

Prepare for release

@lionel- of the existing issues and PRs, which are the most important? It doesn't matter too much if purrr is incomplete, but changing interfaces is harder once we've released it. If there's anything you think is useful but still need work, we should not export for this release.

@andrie, @piccolbo is there anything in particular that you need in this release that's not already in the dev version?

power_set_split() ?

Synopsis

power_set_split() works just like split.data.frame() but generates a list which represents the power set. Compare to the example from the purrr documentation:

library(dplyr)
library(purrr)
Rcpp::sourceCpp("pwr_set.cpp")

mtcars %>%
  power_set_split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_v("r.squared")

#>        4         6       4∪6         8       4∪8       6∪8     4∪6∪8 
#> 0.5086326 0.4645102 0.6924591 0.4229655 0.7668954 0.6012214 0.7528328 

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_v("r.squared")

#>        4         6         8 
#> 0.5086326 0.4645102 0.4229655

here 4∪6 is the subset of row indices which have cyl == 4 | cyl == 6 (their union/catenate).

my (likely poor) basic implementation

power_set_split <- function(x, f, drop = FALSE, naive = TRUE,no_empty_set = TRUE,...) {


  var_splits_ls <- split(x = seq_len(nrow(x)), f = f, drop = drop,...)
  var_splits_names <- names(var_splits_ls)

  if(length(var_splits_names) >= 20 && naive){
    stop("that's more than a million split groupings. If you're sure set naive = FALSE")
  }

  # generate power set from names
  var_power_set_els_ls <- pwr_set_cpp(var_splits_names) 
  # to kepp empty set? 
  if(no_empty_set) {
    var_power_set_els_ls <- var_power_set_els_ls[-1]
  }
  # concatenate reduction with sorting
  var_power_splits_ls <- lapply(var_power_set_els_ls,function(ind) sort(reduce(var_splits_ls[ind],c)))
  names(var_power_splits_ls) <- sapply(var_power_set_els_ls,paste0,collapse='∪')

  lapply(var_power_splits_ls, function(ind) x[ind, , drop = FALSE])
}

In pwr_set.cpp:

#include <Rcpp.h>
#include <Math.h>
using namespace Rcpp;

// [[Rcpp::export]]
List pwr_set_cpp(CharacterVector els) {

  int n_els = els.size();         
  int pwrset_card = pow(2,n_els); 
  List out(pwrset_card);          
  out[0] = StringVector::create(); 
  CharacterVector tmp;            
  int counter = 0;
  for (int i=0; i < n_els; ++i) {
    int cnt2 = counter;            // capture counter state
      for (int j =0; j <= counter; ++j) {
        cnt2++;                   // capture counter + j steps
        tmp = as<StringVector>(out[j]);
        tmp.push_back(as<std::string>(els[i]));
        out[cnt2] = tmp;

    }
      counter = cnt2;             // update counter state
  }
  return out;
}

group_by, sort_by, order_by

partition_by? (Needed in strongly typed language to distinguish between predicate function vs. function that returns string). Maybe call it split_by()?

max_by() and min_by()?

Should flatten(list()) return list() instead of NULL?

I expected that flatten() applied on an empty list would return an empty list rather than NULL. This behavior is caused by unlist() but I don't find it very intuitive.

I can send a PR to make it return a list instead.

zip_n() and missing names

Currently, zip_n() allows the components names to be different so that:

zip2(list(x = 1), list(y = 2))
## Gives list(x = list(1, NULL))

zip2(list(x = 1), list(y = 2), .fields = c("x", "y"))
## Gives list(x = list(1, NULL), y = list(NULL, 2))

This currently fails with atomic vectors, which could be fixed. But I wonder if it would be more generally useful to only use the fields when all components names are equal.

For example, walk_n() uses zip_n() internally and the current fields handling prevents the following snippet from working correctly:

# Using as.list() to circumvent the bug with vectors
walk2(mtcars, as.list(names(mtcars)), function(col, name) {
  print(paste(name, sum(col)))
})

So either we don't change zip_n() and fix walk_n() not to use it, or we change the fields handling.

Maybe we could have the current behaviour when .fields is supplied, and otherwise only use names when they are equal across all components?

Making lowliner better understand data frames: map_rows()

This is a version of map_n() that returns a data frame. If the option .trace is set (it is by default), the output is returned col-binded to the original data frame. This way we address @baptiste's concerns in tidyverse/dplyr#441.

If .f does not return a data frame but a bare atomic vector of length n, it is coerced to a data frame of n columns. In other cases, the ouput is coerced to a one-row list-column.

In action:

p <- expand.grid(mean = 1:5, sd = seq(0, 1, length = 10))
# Next line is equivalent to: p %>% plyr::mdply(rnorm, n = 5) %>% tbl_df()
p %>% map_rows(rnorm, n = 5)

## Source: local data frame [50 x 7]
## 
##    mean      sd       1      2       3       4      5
## 1     1 0.00000 1.00000 1.0000 1.00000 1.00000 1.0000
## 2     2 0.00000 2.00000 2.0000 2.00000 2.00000 2.0000
## 3     3 0.00000 3.00000 3.0000 3.00000 3.00000 3.0000
## 4     4 0.00000 4.00000 4.0000 4.00000 4.00000 4.0000
## 5     5 0.00000 5.00000 5.0000 5.00000 5.00000 5.0000
## 6     1 0.11111 0.98537 1.0664 0.95703 0.80309 0.9408
## 7     2 0.11111 1.91233 2.0431 1.93546 1.76423 1.7954
## 8     3 0.11111 2.89180 3.1357 2.97548 3.04977 3.0097
## 9     4 0.11111 4.09118 4.0426 3.95294 4.02712 4.0887
## 10    5 0.11111 5.02453 4.8226 5.02239 5.07338 5.0010
## ..  ...     ...     ...    ...     ...     ...    ...

p <- data_frame(
  a = rerun(5, rnorm(10, 5)),
  b = rerun(5, rnorm(10))
)
p %>% map_rows(function(a, b) lm(a ~ b))

## Source: local data frame [5 x 3]
## 
##           a         b     out
## 1 <dbl[10]> <dbl[10]> <S3:lm>
## 2 <dbl[10]> <dbl[10]> <S3:lm>
## 3 <dbl[10]> <dbl[10]> <S3:lm>
## 4 <dbl[10]> <dbl[10]> <S3:lm>
## 5 <dbl[10]> <dbl[10]> <S3:lm>

The code is in https://gist.github.com/lionel-/0b41c6b9d3554725807a

Mapping to vectors

Just linking to this conversation so we don't forget: #57 (comment)

Rename probe() to map_chr() and write map_lgl() and map_num().
Write as_vector() for all other cases.
Should we deprecate map_v() then?

This is also linked to the modifications to can_simplify() proposed in #42, which need more work.

zip() doesn't play well with factors and named vectors

zip() has some limitations with list of vectors.

It coerces factors to numeric
It forgets the names of the vector elements

A solution to the first problem is to use [[ instead of .subset2(). This could be decided by testing for factors, so we still get the performance benefit of .subset2.

One solution to the second problem may be to offer a .soft_subset argument which would cause zip to defer the subsetting to [ or .subset(). This is possibly related to the ideas in https://gist.github.com/hadley/8667699

Here is a possible implementation of zip() that would play better with vectors:

zip2 <- function(.x, soft_subset = FALSE) {
  n <- unique(map_v(.x, length, .type = integer(1)))
  if (length(n) != 1) {
    stop("All elements must be same length", call. = FALSE)
  }

  # .subset and .subset2 do not handle factors
  # `[[` does not handle named vectors
  subset <- 
    if (soft_subset) {
      .subset
      ## base::`[`
    } else {
      base::`[[`
    }

  lapply(seq_len(n), function(i) {
    lapply(.x, subset, i)
  })
}


factors <- list(
  a = factor(1:2),
  b = factor(3:4)
)
named <- list(
  a = setNames(1:2, c("aa", "ab")),
  b = setNames(3:4, c("ba", "bb"))
)

zip(factors)
zip2(factors)

zip(named)
zip2(named, soft_subset = TRUE)

Other forms of function

Formulas, similar to pipeR, ~ mean(.) > 5
Character vector, which would always refer to list component (and hence make pluck redundant). e.g. keep(x, "y") would be equivalent to filter(x, function(.) isTRUE(.$y), or possibly filter(x, function(.) !is.null(.$y)

cross_n() and data frames

While writing the example in #44, I was disappointed that I couldn't find an easy way to reproduce

expand.grid(mean = 1:5, sd = seq(0, 1, length = 10))

with cross_n(). This is because the elements to cross have unequal lengths so we can't make a data frame to give as input to cross_n().

So maybe we do need cross_d() after all. This way cross_d(list(a = 1:3, b = letters[1:5])) would work as expected. We could also just resort to expand.grid() for this kind of inputs, but I think it's good to have alternatives to any df function that coerces character vectors to factors (and also to any function with dots in its name ;).

`split(.$cyl)` failure on first example in README

Trying to follow the first example in the README

library(lowliner)

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_v("r.squared")

leads to the following error (using RStudio 0.98.1081 on Mac OS X 10.10.1):

Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : 
  object '.' not found

However, if you change split(.$cyl) to split(mtcars$cyl), things work ok:

mtcars %>%
    split(mtcars$cyl) %>%
    map(~ lm(mpg ~ wt, data = .)) %>%
    map(summary) %>%
    map_v("r.squared")
        4         6         8 
0.5086326 0.4645102 0.4229655

unzip may cause scoping problems when lowliner is loaded

e.g.:

install_github("klutometis/roxygen")
## Downloading github repo klutometis/roxygen@master
## Error in unzip(src, list = TRUE) : unused argument (list = TRUE)

I will PR a fix for devtools, but I assume this issue will popup in many other places.

Ultimately, it's the packages' responsibility to make sure they use unzip from the utils namespace, but I just thought you should be aware of this.

Flatten should get recursive argument

defaulting to FALSE

Scoping for rerun isn't quite right

f <- function(n) {
  10 %>%
    rerun(x = rnorm(n), y = rnorm(n)) %>%
    map_dbl(~ cor(.x$x, .x$y))  
}

f(10)

think it probably needs to use lazyeval

Making lowliner better understand data frames: by_group()

This implements a functional that takes a grouped table and feed each group to .f. It returns a data frame and handles both mutating and summarising operations, the only constraint being that the number of returned rows is the same inside each group. The grouping variables are recycled to match the output size. If .f does not return a data frame, by_group() returns a list-column.

I also wrote set_groups() to set the grouping attributes because it looks a bit funny to have

  df %>%
    group_by(col) %>%
    by_group(fun)

and also this way we have a non-NSE version of group_by() that handles vectors of names and column positions.

In action:

# Action operation
mtcars %>%
  set_groups("cyl") %>%
  by_group(partial(lm, mpg ~ disp))

## + + Source: local data frame [3 x 2]
## 
##   cyl     out
## 1   4 <S3:lm>
## 2   6 <S3:lm>
## 3   8 <S3:#lm>

# Summarizing operation
mtcars %>%
  set_groups(c("cyl", "am")) %>%
  by_group(map, mean)

## + + Source: local data frame [6 x 11]
## 
##   cyl am    mpg    disp      hp   drat     wt   qsec    vs   gear   carb
## 1   4  0 22.900 135.867  84.667 3.7700 2.9350 20.970 1.000 3.6667 1.6667
## 2   4  1 28.075  93.612  81.875 4.1837 2.0423 18.450 0.875 4.2500 1.5000
## 3   6  0 19.125 204.550 115.250 3.4200 3.3887 19.215 1.000 3.5000 2.5000
## 4   6  1 20.567 155.000 131.667 3.8067 2.7550 16.327 0.000 4.3333 4.6667
## 5   8  0 15.050 357.617 194.167 3.1208 4.1041 17.143 0.000 3.0000 3.0833
## 6   8  1 15.400 326.000 299.500 3.8800 3.3700 14.550 0.000 5.0000 6.0000

# Muutating operation
mtcars %>%
  set_groups(c("cyl", "am")) %>%
  by_group(map, ~ . / sd(.))

## + + Source: local data frame [32 x 11]
## 
##    cyl am     mpg    disp     hp   drat     wt   qsec     vs    gear
## 1    4  0 16.7977 10.5015 3.1544 28.385 7.8278 11.966    Inf  6.9282
## 2    4  0 15.6962 10.0792 4.8333 30.154 7.7296 13.701    Inf  6.9282
## 3    4  0 14.8012  8.5974 4.9350 28.462 6.0487 11.972    Inf  5.1962
## 4    4  1  5.0849  5.2743 4.1050 10.572 5.6675 16.538 2.8284  8.6410
## 5    4  1  7.2259  3.8434 2.9132 11.203 5.3744 17.303 2.8284  8.6410
## 6    4  1  6.7799  3.6969 2.2953 13.537 3.9453 16.458 2.8284  8.6410
## 7    4  1  7.5605  3.4722 2.8691 11.588 4.4827 17.685 2.8284  8.6410
## 8    4  1  6.0885  3.8580 2.9132 11.203 4.7270 16.796 2.8284  8.6410
## 9    4  1  5.7986  5.8749 4.0167 12.164 5.2278 14.841 0.0000 10.8012
## 10   4  1  6.7799  4.6443 4.9878 10.352 3.6961 15.019 2.8284 10.8012
.. ... ..     ...     ...    ...    ...    ...    ...    ...     ...

`zip_n` problems with named lists

These work:

zip_n(list(list(1, 2), c(1, 2)))
zip_n(list(list(1, 2), list(1, 2)))

This throws an error:
zip_n(list(list(a = 1, b = 2), c(1, 2)))
This gives the wrong result:
zip_n(list(list(a = 1, b = 2), list(1, 2)))

As a result, for example this fails
list(a = 'a', b = 'b') %>% walk2(names(.), function(x, y) paste(x, y) %>% print)
and this gives the wrong result
list(a = 'a', b = 'b') %>% walk2(as.list(names(.)), function(x, y) paste(x, y) %>% print)
while both work fine with map2 (which doesn't use zip_n internally).

Add @lionel- to authors

@lionel- can you please add your details to the authors field in the description?

Compilation fails on os x when building against dplyr@master

OK for me to revert to CRAN dplyr, but I thought I'd give you a heads up before dplyr gets a new CRAN release

clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DCOMPILING_DPLYR -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/dplyr/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/BH/include"   -fPIC  -Wall -mtune=core2 -g -O2  -c rows.cpp -o rows.o
rows.cpp:22:23: error: no member named 'subset' in 'dplyr::DataFrameVisitors'
    out[i] = visitors.subset(indices[i], classes);
             ~~~~~~~~ ^
rows.cpp:237:25: error: no member named 'subset' in 'dplyr::DataFrameVisitors'
    SEXP row = visitors.subset(IntegerVector::create(i), classes);
               ~~~~~~~~ ^
2 errors generated.
make: *** [rows.o] Error 1

Change package name

It's no longer so directly a port of underscore, and lowliner isn't very evocative

Look into implementing Haskell's scanl

(Reminder for @hadley following a twitter conversation)

From scanl's documentation: It takes the second argument and the first item of the list and applies the function to them, then feeds the function with this result and the second argument and so on. It returns the list of intermediate and final results.

It is like fold except it also returns the intermediate results.

Should map understand data frames?

So you could do:

df %>% map(type.convert)
# OR 
df %>% map_if(is.factor, as.numeric)

idea: trigger / elicit / when

I've though about a way to have some sort of "pattern matching" / advanced "if-else" construct for use in pipelines. In F# you have brilliant pattern matching, but the idea is not really portable, I think.

I did however make a function that I find quite useful; not sure where it belongs though. Perhaps here, but maybe it is too NSE for lowliner.

So far I've named it trigger, as it triggers an action associated with the first valid match / condition. Another name could be elicit, or even just when. Each condition-action pair is specified as a formula: condition ~ action. If isTRUE(condition) the corresponding action is evaluated and returned. Names can be assigned using named arguments (and will be available to both conditions and actions) and the value being matched can be referred to as the dot.

Here's a gist: https://gist.github.com/smbache/afe0e1e105a8f56eb83f

and here's how it can be used (simple examples):

1:10 %>% 
  trigger(
    sum(.) <=  50 ~ sum(.),
    sum(.) <= 100 ~ sum(.)/2,
    sum(.)/3   # With no condition it will always be true, acts as default.
  )

1:10 %>% 
  trigger(
    sum(.) <=   x ~ sum(.),
    sum(.) <= 2*x ~ sum(.)/2,
    0,
    x = 60
  )            

iris %>% 
  subset(Sepal.Length > 10) %>%
  trigger(
    nrow(.) > 0 ~ .,
    iris %>% head(10)
  )

Complete unary boolean algebra operators

E.g. to include tools to convert missing values to logical

convert NA to true: x | is.na(x)
convert NA to false: x & !is.na(x)

I guess there are 3 ^ 3 possible operators in total (3 possible inputs and 3 possible outputs for each input), but I suspect most aren't that interesting.

as_function matches variables instead of functions

When a function and a variable have the same name, all lowliner tools using as_function() will match the variable instead of the function.

To reproduce:

l <- list(a = 1:5, b = 6:10)
map(l, mean)

mean <- "  "
map(l, mean)

## Error in .subset2(g, f) (from utils.R#59) : subscript out of bounds

One way around this would be to do something like match.fun() to match the function in the calling environment using get(fun_name, envir = calling_env, mode = "function").

But then there is the issue that the calling environment is not at a fixed number of steps in the calling hierarchy because as_function() is sometimes called from the mapping function itself, and sometimes from find_selection().

The cleanest way to capture the calling environment is probably to use lazyeval in this way:

as_function <- function(f) {
  # Capture the calling environment and try to match a function (as opposed to a variable)
  lazy <- lazyeval::lazy(f)
  if (is.name(lazy$expr)) {
    f_name <- as.character(lazy$expr)
    maybe_f <- try(get(f_name, envir = lazy$env, mode = "function"),
      silent = TRUE)

    if (is.function(maybe_f)) {
      return(maybe_f)
    }
  }

  if (is.character(f) || is.numeric(f)) {
    function(g) .subset2(g, f)
  } else if (inherits(f, "formula")) {
    if (length(f) != 2) {
      stop("Formula must be one sided", call. = FALSE)
    }
    make_function(alist(. = ), f[[2]], environment(f))
  } else {
    stop("Don't know how to convert ", paste0(class(f), collapse = "/"),
      " into a function", call. = FALSE)
  }
}

But this currently does not work reliably because of the lazyload DB problem (see hadley/lazyeval#18).

Write list

write list to a file that convenient to read back into R

a simple list

[[1]]
[1] 1

[[2]]
[1] 2 3

can be written to a file as

1    2
1    3

1     2
NA    3

Variadic mapping

map2() and map3() are nice shortcuts for specific cases but it would be good to have a way of mapping functions with a variable number of arguments.

We could have for example

map_dots(..., .f)
map_list(.l, .f, ...)

The former would be translated to Map(.f, ...) while the latter would be equivalent to splat(map_dots)(c(.l, list(..., .f = .f))).

Write vignette

Inspirations:

New adverbs `at()` and `where()`

One fundamental limitation of replacement functions such as [<- is that they cannot replace an element with a larger one. i.e., it's illegal to do this:

mtcars[3] <- data.frame(a = mtcars[[3]], b = mtcars[[3]])

This is because it would be impossible to know what goes where in case the replacement region is not contiguous. But it makes it hard to work with functions that take a data frame column and may return a larger data frame, like disjoin() (see #28 for some background on this tool). We really need a way to apply this kind of function on subsets of columns.

So I wrote these two adverbs inspired by fapply():

at <- function(.x, .where, ..f, ...) {
  if (is.character(.where)) {
    if (is.null(names(.x))) {
      stop(".x is not named", call. = FALSE)
    }
    .where <- match(.where, names(.x))
  }
  if (anyNA(.where)) {
    stop("some indexes are missing", call. =FALSE)
  }

  out <- vector("list", length(.x))
  for (i in seq_along(.x)) {
    res <-
      if (i %in% .where) {
        ..f(.x[i], ...)
      } else {
        .x[i]
      }
    stopifnot(is.list(res))
    out[[i]] <- res
  }

  flatten(out) %>% purrr:::output_hook(.x)
}

where <- function(.x, .p, ..f, ...) {
  sel <- probe(.x, .p) %>% which()
  .x %>% at(sel, ..f, ...)
}

For example,

diamonds %>% at("cut", disjoin, sep = " / ") %>% str()
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    53940 obs. of  14 variables:
##  $ carat          : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut / Fair     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ cut / Good     : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ cut / Very Good: num  0 0 0 0 0 1 1 1 0 1 ...
##  $ cut / Premium  : num  0 1 0 1 0 0 0 0 0 0 ...
##  $ cut / Ideal    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ color          : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity        : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth          : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table          : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price          : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x              : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y              : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z              : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

transforms "cut" into a disjunctive multi-columns form and

diamonds %>% where(is.factor, disjoin)

would transform all factors.

Then, map_if() could be implemented in terms of where(), and we could also have a map_at() version:

map_if <- function(.x, .p, .f, ...) {
  .f <- purrr:::as_function(.f)
  .x %>% where(.p, map, .f, ...)
}

map_at <- function(.x, .where, .f, ...) {
  .f <- purrr:::as_function(.f)
  .x %>% at(.where, map, .f, ...)
}

This would not change the functionality of map_if() in any way, but make the internal code a bit dryer.

I can document these tools and send a PR.

foreach()

It doesn't look like there's a specific function which applies a function to each element in a sequence for its side-effect.

Would you consider including something like:

foreach <- function(.x, .f, ...) {
  lapply(.x, .f, ...)
  invisible(NULL)
}

[FR] unzip(x, .simplify = TRUE) handling of NULL behaviour

Feature Request:

Would it be possible to allow unzip() it to coerce NULL? (when there are only two types in an object NULL + one of character, integer, double, ...) ?

Consider list `x` below:

x <- list(list(foo = "less", bar = "is", baz = "more"), list(foo = "many", bar = "are"))
str(x)
#> List of 2
#> $ :List of 3
#> ..$ foo: chr "less"
#> ..$ bar: chr "is"
#> ..$ baz: chr "more"
#> $ :List of 2                  # no baz here !!
#> ..$ foo: chr "many"
#> ..$ bar: chr "are"

When I `unzip()` it:

str(x %>% unzip(.simplify =TRUE) 
#> List of 3
#> $ foo: chr [1:2] "less" "many"
#> $ bar: chr [1:2] "is" "are"
#> $ baz:List of 2              # as expected since typeof(NULL) != typeof("more") 
#> ..$ : chr "more"
#> ..$ : NULL

I would love this behaviour

intended_output <- x %>% unzip(.simplify=TRUE, .coerce_null = TRUE)
str(intended_output)
#> List of 3
#> $ foo: chr [1:2] "less" "many"
#> $ bar: chr [1:2] "is" "are"
#> $ baz: chr [1:2] "more" "NULL"    # or maybe better:  $ baz: chr [1:2] "more" ""

intended_output <- x %>% unzip(.simplify=TRUE, .coerce_null = character(1))
str(intended_output)
#> List of 3
#> $ foo: chr [1:2] "less" "many"
#> $ bar: chr [1:2] "is" "are"
#> $ baz: chr [1:2] "more" ""

If this worked automagically it would be awesome, otherwise/also if you can specify how to coerce, if required, for each .field that would be very helpful.

Cumulative folds

Special evaluation functions for list creation and modification?

I know the main purpose of purrr is to provide tools for purer functional programming, but maybe a small set of verbs defining a limited DSL for list creation and modification would not hurt.

I'm thinking of having the equivalent of data_frame() for lists:

#' Store objects in a list
#'
#' Analog to dplyr's \code{data_frame()} but creates a list instead of
#' a data frame. Thus, no checks are performed on the contents of the
#' list.
#' @export
store <- function(...) {
  store_(lazyeval::lazy_dots(...))
}

#' @rdname store
#' @export
store_ <- function(dots) {
  n <- length(dots)
  dots <- lazyeval::auto_name(dots)

  out <- vector("list", n)
  names(out) <- character(n)
  for (i in seq_len(n)) {
    out[[i]] <- lazyeval::lazy_eval(dots[[i]], out)
    names(out)[i] <- names(dots)[i]
  }
  out
}

And we could also have a mutate() method to alter lists:

mutate_.list <- function(.data, ..., .dots) {
  dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)

  for (i in seq_along(dots)) {
    .data[[names(dots)[i]]] <- lazyeval::lazy_eval(dots[[i]], .data)
  }

  .data
}

Predicate functions should use argument name .p

So they can be better documented

Issues in Examples

In the examples, the lm() function doesn't use the training sets, but uses mtcars all the time.

boot <- boot %>% mutate(

Fit the models

models = map(training, ~ lm(mpg ~ wt, data = mtcars)),

Make predictions on test data

preds = map2(models, test, predict),
diffs = map2(preds, test %>% map("mpg"), msd)
)

Should the model fitting line be models = map(training, ~ lm(mpg ~ wt, data = .)),?

map_if fails silently

map_if depends on as_function which depends on make_function from pryr. Pryr is not in dependencies by default.

If pryr is not loaded, and map_if is called map_if returns the original list.
Example (from the help):

x <- rerun(10, y = if (rbinom(1, 1, prob = 0.5) == 1) NULL else sample(100, 5))
z <- x %>% map_if(~ !is.null(x$y), ~ update_list(x, y = ~ y * 100))
all.equal(x,z)
[1] TRUE

Flatmap equivalent?

e.g. http://martinfowler.com/articles/collection-pipeline/flat-map.html

head_while and tail_while

head and tail with predicates

Depth of application

In case of complicated list structures, it could be nice to be able to specify the depth of application of map() and friends.

I've been trying this:

recurse <- function(.x) {
  call <- sys.call(-1)
  call <- match.call(match.fun(call[[1]]), call)
  call$.x <- as.name(".x")
  call$.depth <- call$.depth - 1

  lapply(.x, function(.x, cl) eval(cl), call)
}

Then, at the start of each mapping function, add a call to recurse

map <- function(.x, .f, ..., .depth = 1) {
  if (.depth > 1) {
    return(recurse(.x))
  }

  .f <- as_function(.f)
  lapply(.x, .f, ...) %>% output_hook(.x)
}

So far I tried it with map() and keep() and this works well.

deep_list <- list(
  a = list(
    aa = list(1:2, 3:4),
    ab = list(5:6, 7:8)
  ),
  b = list(
    ba = list(9:10, 11:12),
    bb = list(13:14, 15:16)
  )
)

deep_list %>%
  map(sum, .depth = 3) %>%
  map(splat(mean), .depth = 2)

If you're interested in this functionality, I can test it with other lowliner functions in proper unit tests and submit a PR.

Infix attr accessor

Maybe purrr would be a good place for:

`%@%` <- function(x, name) attr(x, name)

Need to think about if this should be S3 generic or not, and need to check if this works as is for S4.

Inconsistent behavior with map ~ lapply, and map_n ~ Map

In the package, map seems to wrap lapply, while map_n wraps Map, which leads to different behavior.

For me, the thing that I noticed is that if your input to map_n is a character vector, your output is a named list whose names are that character vector. Whereas with plain map, it's an unnamed list.

input <- letters[1:5]
map(input, function(x) 1) %>% names #=> NULL
map_n(list(input), function(x) 1) %>% names #=> "a", "b", "c", "d", "e"

Before I noticed this, I would have to insert the names using magrittr::set_names or names<-.

For my own purposes, I actually prefer having the names automatically inserted, so I would just use map_n by default from now on.

update_list cannot update element if the name is "x"

I know everything is ok if the element name is not "x", which is the same name of the formal argument...

update_list(list(x = 1), x = 2)
#> Error: is.list(x) is not TRUE

Is it possible to use a list or named character vector instead?

update_list(list(x = 1), list(x = 2))
update_list(list(x = 1), c(x = 2))

map_if

map_if <- function(.x, .f, .p) {
  sel <- vapply(.x, .p, logical(1), ...)
  .x[sel] <- lapply(.x[sel], .f, ...)
  .x
}

would be useful for working with stringr output (e.g. str_locate_all())

map2 and map3 fail with vector as additional argument

E.g.

f <- function(x, y, zs) x + y + mean(zs)
map2(1:2, 1:2, f, zs = 1) # works
map2(1:2, 1:2, f, zs = 1:2) # wrong result: vectorizes over zs too
map2(1:2, 1:2, f, zs = 1:3) # Error: all(lengths %in% c(1, n)) is not TRUE

Preserving map

It would be great if lowliner had a mapping function that uses preserving subsetting as in this gist: https://gist.github.com/hadley/8667699

One difficult aspect is to create an interface that makes it easy to work with both the list component x[[i]] and the (I'll call it like that for lack of a better idea) list frame x[i]. In your gist, you supply the list frame to the mapped function. I experimented with an alternative approach in which I supplied the list component instead, as is usual in mapping tools, along with a new placeholder referring to the list frame.

This new placeholder is called ... Since the double dot evokes the action of going one level up in a file hierarchy, I figured it would work well to represent the list frame. It is implemented as an active binding so that any modification to the list component is reflected in ...

Example:

test <- list(a = list(1, 2), b = list(3, 4))

out <- remap(test, function(x) {
  if (names(..) == "a") {
    rep(.., 2) %>%
      set_names(c("a1", "a2"))
  } else {
    x <- list(5)
    names(..) <- "B"
    ..
  }
})

str(out)

## List of 3
## $ a1:List of 2
##  ..$ : num 1
##  ..$ : num 2
## $ a2:List of 2
##  ..$ : num 1
##  ..$ : num 2
## $ B :List of 1
##  ..$ : num 5

The code is in https://gist.github.com/lionel-/12350ba7e583e9c10163

install.packages("purrr")

Installing package into ‘/Users/guy.dawson/Library/R/3.1/library’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘purrr’ is not available (for R version 3.2.0)

Here's my session info:

devtools::session_info()

Session info ---------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, darwin13.4.0        
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  en_GB.UTF-8                 
 tz       Europe/London               

Packages -------------------------------------------------------------------------------------
 package    * version date       source        
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.1.2)
 rstudioapi * 0.2     2014-12-31 CRAN (R 3.1.2)

Create lambdas of several arguments with formula

Hi,

I think I found an elegant way of allowing multiple arguments in lambdas created with formulas. It relies on ..n being a shortcut for the nth element in ...

as_function <- function(f) {
  stopifnot(inherits(f, "formula") && length(f) == 2)
  f <- lazyeval::interp(f, .values = list(. = as.name("..1")))
  make_function(alist(... = ), f[[2]], environment(f))
}

as_function(~ ..1 + ..2)(10, 2)
## [1] 12

as_function(~ . + ..2)(10, 2)
## [1] 12

Should I document and PR this?

Should flatmap be map_flat_dbl() etc?

Similarly, do we need map2_dbl() etc?