nathaneastwood / poorman Goto Github PK

View Code? Open in Web Editor NEW

329.0 9.0 15.0 6.75 MB

A poor man's dependency free grammar of data manipulation

Home Page: https://nathaneastwood.github.io/poorman/

License: Other

R 99.96% Dockerfile 0.04%

r base-r data-manipulation grammar

poorman's Introduction

{poorman}

I’d seen my father. He was a poor man, and I watched him do astonishing things. - Sidney Poitier

Overview

{poorman} is a grammar of data manipulation, providing dependency free versions of {dplyr} verbs that help you solve the most common data manipulation challenges:

select() picks variables based on their names.
mutate() adds new variables that are functions of existing variables.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.

{poorman} attempts to replicate the {dplyr} API exactly such that your {dplyr} code will still run even if you use {poorman} in its place. In addition to replicating {dplyr} functionality, {poorman} implements other functionality from the wider {tidyverse} such as select helpers and the pipe, %>%.

For more details on the functionality available within {poorman}, check out the {poorman} series of blog posts here.

(back to top)

Installation

You can install:

the development version from GitHub with

# install.packages("remotes")
remotes::install_github("nathaneastwood/poorman")

the latest release from CRAN with

install.packages("poorman")

(back to top)

Docker

If you’d like to try out the latest version of the package on CRAN using Docker, you can run the latest image with:

docker run --rm -it nathaneastwood/poorman

(back to top)

Usage

library(poorman, warn.conflicts = FALSE)
# 
#   I'd seen my father. He was a poor man, and I watched him do astonishing things.
#     - Sidney Poitier

mtcars %>%
  select(mpg, wt, starts_with("c")) %>%
  mutate(kpl = (1.609 * mpg) / 3.785, wt_kg = wt * 453.5924) %>%
  filter(mpg > 28)
#                 mpg    wt cyl carb      kpl    wt_kg
# Fiat 128       32.4 2.200   4    1 13.77321 997.9033
# Honda Civic    30.4 1.615   4    2 12.92301 732.5517
# Toyota Corolla 33.9 1.835   4    1 14.41086 832.3421
# Lotus Europa   30.4 1.513   4    2 12.92301 686.2853

mtcars %>%
  group_by(am, cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) %>%
  ungroup()
#   am cyl mean_mpg    sd_mpg
# 1  0   4 22.90000 1.4525839
# 2  0   6 19.12500 1.6317169
# 3  0   8 15.05000 2.7743959
# 4  1   4 28.07500 4.4838599
# 5  1   6 20.56667 0.7505553
# 6  1   8 15.40000 0.5656854

(back to top)

Related Work

{dplyr}
{bplyr} - imports {magrittr} and {rlang}; it prepends functions with b_*(), e.g. b_select().
{tbltools} - imports {magrittr} and appends *_data() to each of its functions, e.g. select_data().

(back to top)

poorman's People

Contributors

Stargazers

Watchers

Forkers

federman coolbutuseless emberwhirl msberends eyayaw sbalci humanfactors markfairbanks bwiernik eutwt dimbage etiennebacher riccoflores magojam barryrowlingson

poorman's Issues

[FEAT] Add `group_split()` and `group_keys()`

group_split(): Split a data.frame by group.
group_keys(): Return the combination of groups.

Summarise columns are not immediately available

Take the following code as an example

d <- data.frame(x = 1:4)
res <- d %>%
  summarise(y = sum(x), z = y + 2)
# Error in eval(substitute(expr), data, enclos = parent.frame()) : 
#  object 'y' not found

In dplyr this works fine and returns z as expected.

[FEAT] Select using predicates

Is your feature request related to a problem? Please describe.
Currently columns cannot be selected using predicates, e.g. is.numeric

Describe the solution you'd like
The ability to use predicates using a where() function.

[FEAT] Add `glimpse()`

I tried rerunning the code of my recent blog post using {poorman} upon a request in the comments. Everything worked, except glimpse(). Do you plan on adding this function?

[FEAT] Add `na_if()`

This is a translation of the SQL command NULLIF. It is useful if you want to convert an annoying value to NA.

Groups do not persist in selections

For example:

mtcars %>% group_by(cyl, am) %>% select(mpg)                                                                                                         
#                      mpg
# Mazda RX4           21.0
# Mazda RX4 Wag       21.0
# Datsun 710          22.8
# Hornet 4 Drive      21.4
# Hornet Sportabout   18.7
# Valiant             18.1
# Duster 360          14.3
# Merc 240D           24.4
# Merc 230            22.8
...

dplyr solution:

mtcars %>% dplyr::group_by(cyl, am) %>% dplyr::select(mpg)                                                                                           
# Adding missing grouping variables: `cyl`, `am`
# A tibble: 32 x 3
# Groups:   cyl, am [6]
#      cyl    am   mpg
#  * <dbl> <dbl> <dbl>
#  1     6     1  21  
#  2     6     1  21  
#  3     4     1  22.8
#  4     6     0  21.4
#  5     8     0  18.7
#  6     6     0  18.1
#  7     8     0  14.3
#  8     4     0  24.4
#  9     4     0  22.8
# 10     6     0  19.2
# … with 22 more rows

[FEAT] a more robust pipe

Here's a suggestion, it should be 99% consistent with {magrittr}.

`%>%` <- function(lhs, rhs) {
rhs_call <- insert_dot(substitute(rhs))
eval(rhs_call, envir = list(`.` = lhs), enclos = parent.frame())
}

insert_dot <- function(expr) {
  if(is.symbol(expr) || expr[[1]] == quote(`(`)) {
    # if a symbol or an expression inside parentheses, make it a call with dot arg
    expr <- as.call(c(expr, quote(`.`)))
  } else if(length(expr) ==1) {
    # if a call without arg, give it a dot arg
    expr <- as.call(c(expr[[1]], quote(`.`)))
  } else if (expr[[1]] != quote(`{`) &&
             all(vapply(expr[-1], `!=`, quote(`.`), FUN.VALUE = logical(1))) &&
             all(vapply(expr[-1], `!=`, quote(`!!!.`), FUN.VALUE = logical(1)))) {
    # if a call with args but no dot in arg, insert one first
    expr <- as.call(c(expr[[1]], quote(`.`), as.list(expr[-1])))
  }
  expr
}

cars %>% head
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
cars %>% head(.,2)
#>   speed dist
#> 1     4    2
#> 2     4   10
2 %>% head(cars, .)
#>   speed dist
#> 1     4    2
#> 2     4   10
cars %>% {1}
#> [1] 1

^{Created on 2020-09-23 by the reprex package (v0.3.0)}

Implement distinct()

Suggestion:

distinct <- function(.data, ..., .keep_all = FALSE) {
  check_is_dataframe(.data)
  UseMethod("distinct")
}

distinct.default <- function(.data, ..., .keep_all = FALSE) {
  names <- rownames(.data)
  rownames(.data) <- NULL
  if (length(deparse_dots(...)) == 0) {
    selected <- .data
  } else {
    selected <- select(.data, ...)
  }
  rows <- as.integer(rownames(unique(selected)))
  if (isTRUE(.keep_all)) {
    res <- .data[rows, , drop = FALSE]
  } else {
    res <- selected[rows, , drop = FALSE]
  }
  rownames(res) <- names[rows]
  res
}

distinct.grouped_data <- function(.data, ..., .keep_all = FALSE) {
  apply_grouped_function(.data, "distinct", ..., .keep_all = .keep_all)
}

I cannot get the grouped version to work to also include the group variables. It now only returns the distinct variable if set...

Another idea - You should mention on your README that this package is a great, great idea for package developers that do not want to be dependent on dplyr (as it changes too often for sustainable pkg development), but do want to code using dplyr methods. For those users you could also create an extra raw syntax file with all your functions without roxygen parts (remove all lines starting with #') and your name on it, so they can copy it to their package.

`select()` cannot use variables

This is context dependent. var needs to be evaluated in correct environment.

var <- "mpg"
mtcars %>% select(var)
# Error in select_char(as.character(expr)) : Column `var` does not exist

[FEAT] Add `tidyr::replace_na()`

Rich man is coming to visit

I bump into this package today and admire all the spirit and contribution you made into this work. I have made two packages myself using data.table (I am not alone of course), and I find these experience extremly interesting and valuable. FYI: https://github.com/hope-data-science/tidyfst and https://github.com/hope-data-science/tidyft.

Here comes two questions: (1) Are these APIs 100% the same as dplyr? I tried but failed (well no failed sometimes, I think other ways might be more efficient so I decided to use my own style in some functions, what is it like in poorman); (2) Any test on performance? Base R is considered to be very efficient, is poorman acts even better than dplyr in some operations?

Thanks.

[FEAT] Add nest(), nest_by() and unnest()

nest()
nest_by()
unnest()

Error with filter() when using lapply()

The basic use of filter() works as expected:

sp1 = 'virginica'
filter(iris, Species == sp1)

But when I want to create a list based on the factor level Species, it fails

sp2 = levels(iris$Species)

lapply(sp2, function(x) {
  filter(iris, Species == x)
})

Same if factor is used as character variable:

sp3 = unique(as.character(iris$Species))

lapply(sp3, function(x) {
  filter(iris, Species == x)
})

When using with dplyr, it's working correctly:

lapply(sp2, function(x) {
  dplyr::filter(iris, Species == x)
})

lapply(sp3, function(x) {
  dplyr::filter(iris, Species == x)
})

[FEAT] Add `recode()`

This is a vectorised version of switch(): you can replace numeric values based on their position or their name, and character or factor values only by their name. This is an S3 generic: {dplyr} provides methods for numeric, character, and factors.

Add vignettes

[BUG] relocate .before and .after cannot accept numeric

mtcars %>% relocate(ends_with("p"), .before = 1)
# Error in seq.default(1L, where - 1L) : 'to' must be a finite number

`slice()` does not give duplicate rows

slice(mtcars, 2, 3, 3, 3)                                                                                                                            
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

dplyr::slice(mtcars, 2, 3, 3, 3)                                                                                                                     
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 2 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# 4 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

[FEAT] Add `coalesce()`

Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs.

Add `slice_*()` family of functions

Implement

Implement n_distinct()

n_distinct() not only supports vectors, it also supports data.frames and then returns the number of unique rows.

Suggestion for your pkg:

n_distinct <- function(..., na.rm = FALSE) {
  out <- c(...)
  if (is.list(out)) {
    return(NROW(unique(as.data.frame(out, stringsAsFactors = FALSE))))
  }
  if (isTRUE(na.rm)) {
    out <- out[!is.na(out)]
  }
  length(unique(out))
}

Testing:

dplyr::n_distinct(mtcars)
#> [1] 32
n_distinct(mtcars)
#> [1] 32

test_vector <- c("A", "B", "A", "B")
dplyr::n_distinct(test_vector)
#> [1] 2
n_distinct(test_vector)
#> [1] 2

test_data.frame1 <- data.frame(test = c("A", "A"), test2 = c("B", "B"))
dplyr::n_distinct(test_data.frame1)
#> [1] 1
n_distinct(test_data.frame1)
#> [1] 1

test_data.frame2 <- data.frame(test = c("A", "A"), test2 = c("B", "C"))
dplyr::n_distinct(test_data.frame2)
#> [1] 2
n_distinct(test_data.frame2)
#> [1] 2

Add a website using {pkgdown}

[] Build the site
[] Have a CICD pipeline autodeploy it

Add pivot_wider()

[FEAT] Add group_cols()

Somehow got missed in #41.

This selection helpers matches grouping variables. It can be used in select() or vars() selections.

gdf <- iris %>% group_by(Species)
gdf %>% select(group_cols())

Dissolve sf polygons based on group_by()

Not sure whether this goes way beyond the scope of the poorman, but would basic dissolve functionality for sf objects - as offered by dplyr - be something you might want to think about for a future release? A quick example of what I mean can be found here.

Here's the condensed dplyr approach from above link:

library(raster)
library(sf)
library(dplyr)

## regions
regions = getData(
  country = "GBR"
  , level = 2
  , path = tmpDir()
) %>% 
  st_as_sf() %>% 
  filter(NAME_1 == "England")

plot(regions['NAME_2'])

## dissolve
england = regions %>% 
  group_by(NAME_1) %>% 
  summarise()

plot(england)

The poorman currently fails to carry out the group_by() operation with an

"Error in order(y) : unimplemented type 'list' in 'orderVector1'"

[BUG] Error of %>% operation

A simple example as following:

iris[,-5] %>% correlation::correlation() %>% summary

the %>% would work for iris[,-5] %>% correlation::correlation() , but not sucess for the next %>% summary. While dplyr do not have the problem.

Error message as following:
Error in rhs[[1L]] : object of type 'symbol' is not subsettable

select() does not work with integers

Currently select() must be used with variable names.

mtcars %>% select(mpg, cyl)

This works. The following doesn't.

mtcars %>% select(1, 2)

Renaming of multiple columns in `select()` when using `select_positions()`

For example

mtcars %>% select(tmp = mpg:cyl)
#                      tmp tmp
# Mazda RX4           21.0   6
# Mazda RX4 Wag       21.0   6

Should have

mtcars %>% select(tmp = mpg:cyl)
#                      tmp1 tmp2
# Mazda RX4            21.0    6
# Mazda RX4 Wag        21.0    6

Implement row_number()

row_number <- function(x) {
  if (missing(x)) {
    seq_len(n())
  } else {
    rank(x, ties.method = "first", na.last = "keep")
  }
}

It's the original dplyr syntax.

Logo

May I suggest a logo for the package. Made with hexmake

Implement if_else

Function if_else() misses from your impressive list of rewritten functions 😄

To get started (still misses the missing = NULL argument of dplyr::if_else()):

if_else <- function(condition, true, false, ...) {
  if (!is.logical(condition)) {
    stop("`condition` must be a logical vector")
  }

  if (!identical(class(true), class(false))) {
    stop("Classes of `true` and `false` differ")
  }

  ifelse(condition, true, false)
}

Thank you so much, this is an awesome package to not rely on dplyr anymore, GREAT for package developers.

group_by() + mutate() does not preserve row order

For example

r$> data <- data.frame(a = c(0, 1, 0, 1, 0, 1), b = rnorm(6))                                                                                            

r$> data                                                                                                                                                 
  a          b
1 0  0.1975196
2 1 -0.9084617
3 0 -1.1766728
4 1 -2.3127030
5 0 -0.7029450
6 1 -2.2935241

r$> data %>% group_by(a) %>% mutate(b2 = b * 2)                                                                                                          
  a          b         b2
1 0  0.1975196  0.3950392
3 0 -1.1766728 -2.3533457
5 0 -0.7029450 -1.4058900
2 1 -0.9084617 -1.8169233
4 1 -2.3127030 -4.6254060
6 1 -2.2935241 -4.5870483

Groups:  a

Implement between()

I think it pretty much comes down to:

between <- function(x, left, right) {
  # add this to also support factors - dplyr::between() does not do this
  if (is.factor(x)) {
    left <- factor(left, levels = levels(x), ordered = is.ordered(x))
    right <- factor(right, levels = levels(x), ordered = is.ordered(x))
  }
  
  x <- as.numeric(x)
  left <- as.numeric(left)
  right <- as.numeric(right)
  x %in% c(left:right)
}

Check:

54 %>% between(1, 500)
#> [1] TRUE
54 %>% between(100, 500)
#> [1] FALSE

Sys.Date() %>% between(Sys.Date() - 10, Sys.Date() + 10)
#> [1] TRUE

If you supports factors too:

factor("b", levels = letters[1:26]) %>% between("a", "c")
#> [1] TRUE

Add GitHub Topic: base-r

Thanks for publishing this! In case you'd like to make your repo more findable, you could add that topic ;-) Thank you!

Implement `case_when()`

[FEAT] Add `bind_cols()`

Standard evaluation functions

{dplyr} used to offer verbs with standard evaluation, e.g. mutate_(). Unfortunately, these have been deprecated. I think they are super useful, though, and would love to see them in {poorman}.

On a related note, how would one turn a {poorman} chain into a function? With {dplyr} and {rlang} I'd do something like this:

avg <- function(data, what, by) {
  by <- rlang::enquo(by)
  what <- rlang::enquo(what)
  data %>%
    dplyr::group_by(!!by) %>%
    dplyr::summarise(avg = mean(!!what))
}

I guess this would do the job with {poorman}?

avg <- function(data, what, by) {
  data %>%
    poorman::group_by(substitute(by)) %>%
    poorman::summarise(avg = mean(substitute(what)))
}

Rebrand?

I really like the design, readability, and implementation of this project. I think the adherence to minimal dependencies is great. But have you considered rebranding the package? I realize that the name is tongue in cheek but, at this point a considerable amount of work has been put in and it's beginning to look like a viable alternative to others. It looks to me more like a baseplyr package than a poor man's alternative.

Implement lag() and lead()

Getting more and more enthusiastic about this package 😃

For lag() and lead(), also used often in data wrangling:

lag <- function (x, n = 1L, default = NA, ...) {
  if (length(n) != 1 || !is.numeric(n) || n < 0) {
    stop("n must be a nonnegative integer scalar")
  }
  if (n == 0) {
    return(x)
  }
  xlen <- length(x)
  n <- pmin(n, xlen)
  out <- c(rep(default, n), x[seq_len(xlen - n)])
  attributes(out) <- attributes(x)
  out
}

lead <- function (x, n = 1L, default = NA, ...) {
  if (length(n) != 1 || !is.numeric(n) || n < 0) {
    stop("n must be a nonnegative integer scalar")
  }
  if (n == 0) {
    return(x)
  }
  xlen <- length(x)
  n <- pmin(n, xlen)
  out <- c(x[-seq_len(n)], rep(default, n))
  attributes(out) <- attributes(x)
  out
}

Reprex:

test <- factor(letters[1:10])

lag(test)
#> [1] <NA> a    b    c    d    e    f    g    h    i   
#> Levels: a b c d e f g h i j

lead(test)
#> [1] b    c    d    e    f    g    h    i    j    <NA>
#> Levels: a b c d e f g h i j

identical(dplyr::lag(test), lag(test))
#> [1] TRUE

identical(dplyr::lead(test), lead(test))
#> [1] TRUE

[FEAT] Add `across()`

[FEAT] Add `near()`

This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.

summarise() fails in the presence of empty groups

Thanks for the amazing work! I noticed that summarise() fails with

Error in as.character(cut)[[1]] : subscript out of bounds

for groups larger than 2. Here's a minimum reproducible example:

library(poorman) 

ggplot2::diamonds %>% 
  group_by(cut, color, clarity) %>% 
  summarise(mean(carat))

For groups up to 2 (eg. cut, color only), I obtain the desired output. Also, the above snippet works flawlessly with dplyr. Here's my sessionInfo():

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] poorman_0.1.7

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       clisymbols_1.2.0 assertthat_0.2.1 dplyr_0.8.5     
 [5] prompt_1.0.0     crayon_1.3.4     grid_3.6.3       R6_2.4.1        
 [9] lifecycle_0.2.0  gtable_0.3.0     magrittr_1.5     scales_1.1.0    
[13] ggplot2_3.3.0    pillar_1.4.3     rlang_0.4.5      tools_3.6.3     
[17] glue_1.3.2       purrr_0.3.3      munsell_0.5.0    compiler_3.6.3  
[21] pkgconfig_2.0.3  colorspace_1.4-1 tidyselect_1.0.0 tibble_2.1.3

[FEAT] Add `bind_rows()`

Add count() function

First of all, thanks for your amazing work! I love the dependecy free philosophy. I'm going to use your package a lot, sure.
Have you considered to translate dplyr::count() function by one or more vaiables?
After dplyr v0.7.0 it has the ability to return an ungrouped table, so it is very useful

[FEAT] Add grouping metadata functionality

Add rowwise capabilities

This will be helpful with nest_by() (see #44) to be able to apply models across rows, for example.

Some functions to include:

rowwise()
c_across()

join_worker does not maintain original sort

Hey, great package!

Your join_worker() does not keep the original sorting. It is easily solved though:

join_worker <- function(x, y, by = NULL, suffix = c(".x", ".y"), ...) {
  # remember original sorting
  x$`.join_id` <- 1:nrow(x)
  if (is.null(by)) {
    by <- intersect(names(x), names(y))
    join_message(by)
    merged <- merge(x = x, y = y, by = by, suffixes = suffix, ...)[, union(names(x), names(y))]
  } else if (is.null(names(by))) {
    merged <- merge(x = x, y = y, by = by, suffixes = suffix, ...)
  } else {
    merged <- merge(x = x, y = y, by.x = names(by), by.y = by, suffixes = suffix, ...)
  }
  # return sorted according to original sorting and without the newly created column
  merged[order(merged$`.join_id`), colnames(merged) != ".join_id"]
}