Code Monkey home page Code Monkey logo

poorman's Introduction

{poorman}

CRAN status Dependencies CRAN downloads R-CMD-check codecov

I’d seen my father. He was a poor man, and I watched him do astonishing things. - Sidney Poitier

Overview

{poorman} is a grammar of data manipulation, providing dependency free versions of {dplyr} verbs that help you solve the most common data manipulation challenges:

  • select() picks variables based on their names.
  • mutate() adds new variables that are functions of existing variables.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

{poorman} attempts to replicate the {dplyr} API exactly such that your {dplyr} code will still run even if you use {poorman} in its place. In addition to replicating {dplyr} functionality, {poorman} implements other functionality from the wider {tidyverse} such as select helpers and the pipe, %>%.

For more details on the functionality available within {poorman}, check out the {poorman} series of blog posts here.

(back to top)

Installation

You can install:

  • the development version from GitHub with
# install.packages("remotes")
remotes::install_github("nathaneastwood/poorman")
  • the latest release from CRAN with
install.packages("poorman")

(back to top)

Docker

If you’d like to try out the latest version of the package on CRAN using Docker, you can run the latest image with:

docker run --rm -it nathaneastwood/poorman

(back to top)

Usage

library(poorman, warn.conflicts = FALSE)
# 
#   I'd seen my father. He was a poor man, and I watched him do astonishing things.
#     - Sidney Poitier

mtcars %>%
  select(mpg, wt, starts_with("c")) %>%
  mutate(kpl = (1.609 * mpg) / 3.785, wt_kg = wt * 453.5924) %>%
  filter(mpg > 28)
#                 mpg    wt cyl carb      kpl    wt_kg
# Fiat 128       32.4 2.200   4    1 13.77321 997.9033
# Honda Civic    30.4 1.615   4    2 12.92301 732.5517
# Toyota Corolla 33.9 1.835   4    1 14.41086 832.3421
# Lotus Europa   30.4 1.513   4    2 12.92301 686.2853

mtcars %>%
  group_by(am, cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) %>%
  ungroup()
#   am cyl mean_mpg    sd_mpg
# 1  0   4 22.90000 1.4525839
# 2  0   6 19.12500 1.6317169
# 3  0   8 15.05000 2.7743959
# 4  1   4 28.07500 4.4838599
# 5  1   6 20.56667 0.7505553
# 6  1   8 15.40000 0.5656854

(back to top)

Related Work

  • {dplyr}
  • {bplyr} - imports {magrittr} and {rlang}; it prepends functions with b_*(), e.g. b_select().
  • {tbltools} - imports {magrittr} and appends *_data() to each of its functions, e.g. select_data().

(back to top)

poorman's People

Contributors

etiennebacher avatar indrajeetpatil avatar markfairbanks avatar msberends avatar nathaneastwood avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

poorman's Issues

Summarise columns are not immediately available

Take the following code as an example

d <- data.frame(x = 1:4)
res <- d %>%
  summarise(y = sum(x), z = y + 2)
# Error in eval(substitute(expr), data, enclos = parent.frame()) : 
#  object 'y' not found

In dplyr this works fine and returns z as expected.

[FEAT] Select using predicates

Is your feature request related to a problem? Please describe.
Currently columns cannot be selected using predicates, e.g. is.numeric

Describe the solution you'd like
The ability to use predicates using a where() function.

[FEAT] Add `glimpse()`

I tried rerunning the code of my recent blog post using {poorman} upon a request in the comments. Everything worked, except glimpse(). Do you plan on adding this function?

[FEAT] Add `na_if()`

This is a translation of the SQL command NULLIF. It is useful if you want to convert an annoying value to NA.

Groups do not persist in selections

For example:

mtcars %>% group_by(cyl, am) %>% select(mpg)                                                                                                         
#                      mpg
# Mazda RX4           21.0
# Mazda RX4 Wag       21.0
# Datsun 710          22.8
# Hornet 4 Drive      21.4
# Hornet Sportabout   18.7
# Valiant             18.1
# Duster 360          14.3
# Merc 240D           24.4
# Merc 230            22.8
...

dplyr solution:

mtcars %>% dplyr::group_by(cyl, am) %>% dplyr::select(mpg)                                                                                           
# Adding missing grouping variables: `cyl`, `am`
# A tibble: 32 x 3
# Groups:   cyl, am [6]
#      cyl    am   mpg
#  * <dbl> <dbl> <dbl>
#  1     6     1  21  
#  2     6     1  21  
#  3     4     1  22.8
#  4     6     0  21.4
#  5     8     0  18.7
#  6     6     0  18.1
#  7     8     0  14.3
#  8     4     0  24.4
#  9     4     0  22.8
# 10     6     0  19.2
# … with 22 more rows

[FEAT] a more robust pipe

Here's a suggestion, it should be 99% consistent with {magrittr}.

`%>%` <- function(lhs, rhs) {
rhs_call <- insert_dot(substitute(rhs))
eval(rhs_call, envir = list(`.` = lhs), enclos = parent.frame())
}

insert_dot <- function(expr) {
  if(is.symbol(expr) || expr[[1]] == quote(`(`)) {
    # if a symbol or an expression inside parentheses, make it a call with dot arg
    expr <- as.call(c(expr, quote(`.`)))
  } else if(length(expr) ==1) {
    # if a call without arg, give it a dot arg
    expr <- as.call(c(expr[[1]], quote(`.`)))
  } else if (expr[[1]] != quote(`{`) &&
             all(vapply(expr[-1], `!=`, quote(`.`), FUN.VALUE = logical(1))) &&
             all(vapply(expr[-1], `!=`, quote(`!!!.`), FUN.VALUE = logical(1)))) {
    # if a call with args but no dot in arg, insert one first
    expr <- as.call(c(expr[[1]], quote(`.`), as.list(expr[-1])))
  }
  expr
}

cars %>% head
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
cars %>% head(.,2)
#>   speed dist
#> 1     4    2
#> 2     4   10
2 %>% head(cars, .)
#>   speed dist
#> 1     4    2
#> 2     4   10
cars %>% {1}
#> [1] 1

Created on 2020-09-23 by the reprex package (v0.3.0)

Implement distinct()

Suggestion:

distinct <- function(.data, ..., .keep_all = FALSE) {
  check_is_dataframe(.data)
  UseMethod("distinct")
}

distinct.default <- function(.data, ..., .keep_all = FALSE) {
  names <- rownames(.data)
  rownames(.data) <- NULL
  if (length(deparse_dots(...)) == 0) {
    selected <- .data
  } else {
    selected <- select(.data, ...)
  }
  rows <- as.integer(rownames(unique(selected)))
  if (isTRUE(.keep_all)) {
    res <- .data[rows, , drop = FALSE]
  } else {
    res <- selected[rows, , drop = FALSE]
  }
  rownames(res) <- names[rows]
  res
}

distinct.grouped_data <- function(.data, ..., .keep_all = FALSE) {
  apply_grouped_function(.data, "distinct", ..., .keep_all = .keep_all)
}

I cannot get the grouped version to work to also include the group variables. It now only returns the distinct variable if set...


Another idea - You should mention on your README that this package is a great, great idea for package developers that do not want to be dependent on dplyr (as it changes too often for sustainable pkg development), but do want to code using dplyr methods. For those users you could also create an extra raw syntax file with all your functions without roxygen parts (remove all lines starting with #') and your name on it, so they can copy it to their package.

`select()` cannot use variables

This is context dependent. var needs to be evaluated in correct environment.

var <- "mpg"
mtcars %>% select(var)
# Error in select_char(as.character(expr)) : Column `var` does not exist

Rich man is coming to visit

I bump into this package today and admire all the spirit and contribution you made into this work. I have made two packages myself using data.table (I am not alone of course), and I find these experience extremly interesting and valuable. FYI: https://github.com/hope-data-science/tidyfst and https://github.com/hope-data-science/tidyft.

Here comes two questions: (1) Are these APIs 100% the same as dplyr? I tried but failed (well no failed sometimes, I think other ways might be more efficient so I decided to use my own style in some functions, what is it like in poorman); (2) Any test on performance? Base R is considered to be very efficient, is poorman acts even better than dplyr in some operations?

Thanks.

Error with filter() when using lapply()

The basic use of filter() works as expected:

sp1 = 'virginica'
filter(iris, Species == sp1)

But when I want to create a list based on the factor level Species, it fails

sp2 = levels(iris$Species)

lapply(sp2, function(x) {
  filter(iris, Species == x)
})

Same if factor is used as character variable:

sp3 = unique(as.character(iris$Species))

lapply(sp3, function(x) {
  filter(iris, Species == x)
})

When using with dplyr, it's working correctly:

lapply(sp2, function(x) {
  dplyr::filter(iris, Species == x)
})

lapply(sp3, function(x) {
  dplyr::filter(iris, Species == x)
})

[FEAT] Add `recode()`

This is a vectorised version of switch(): you can replace numeric values based on their position or their name, and character or factor values only by their name. This is an S3 generic: {dplyr} provides methods for numeric, character, and factors.

`slice()` does not give duplicate rows

slice(mtcars, 2, 3, 3, 3)                                                                                                                            
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

dplyr::slice(mtcars, 2, 3, 3, 3)                                                                                                                     
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 2 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# 4 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

[FEAT] Add `coalesce()`

Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs.

Implement n_distinct()

n_distinct() not only supports vectors, it also supports data.frames and then returns the number of unique rows.

Suggestion for your pkg:

n_distinct <- function(..., na.rm = FALSE) {
  out <- c(...)
  if (is.list(out)) {
    return(NROW(unique(as.data.frame(out, stringsAsFactors = FALSE))))
  }
  if (isTRUE(na.rm)) {
    out <- out[!is.na(out)]
  }
  length(unique(out))
}

Testing:

dplyr::n_distinct(mtcars)
#> [1] 32
n_distinct(mtcars)
#> [1] 32

test_vector <- c("A", "B", "A", "B")
dplyr::n_distinct(test_vector)
#> [1] 2
n_distinct(test_vector)
#> [1] 2

test_data.frame1 <- data.frame(test = c("A", "A"), test2 = c("B", "B"))
dplyr::n_distinct(test_data.frame1)
#> [1] 1
n_distinct(test_data.frame1)
#> [1] 1

test_data.frame2 <- data.frame(test = c("A", "A"), test2 = c("B", "C"))
dplyr::n_distinct(test_data.frame2)
#> [1] 2
n_distinct(test_data.frame2)
#> [1] 2

[FEAT] Add group_cols()

Somehow got missed in #41.

This selection helpers matches grouping variables. It can be used in select() or vars() selections.

gdf <- iris %>% group_by(Species)
gdf %>% select(group_cols())

Dissolve sf polygons based on group_by()

Not sure whether this goes way beyond the scope of the poorman, but would basic dissolve functionality for sf objects - as offered by dplyr - be something you might want to think about for a future release? A quick example of what I mean can be found here.

Here's the condensed dplyr approach from above link:

library(raster)
library(sf)
library(dplyr)

## regions
regions = getData(
  country = "GBR"
  , level = 2
  , path = tmpDir()
) %>% 
  st_as_sf() %>% 
  filter(NAME_1 == "England")

plot(regions['NAME_2'])

## dissolve
england = regions %>% 
  group_by(NAME_1) %>% 
  summarise()

plot(england)

The poorman currently fails to carry out the group_by() operation with an

"Error in order(y) : unimplemented type 'list' in 'orderVector1'"

[BUG] Error of %>% operation

A simple example as following:

iris[,-5] %>% correlation::correlation() %>% summary

the %>% would work for iris[,-5] %>% correlation::correlation() , but not sucess for the next %>% summary. While dplyr do not have the problem.

Error message as following:
Error in rhs[[1L]] : object of type 'symbol' is not subsettable

Implement row_number()

row_number <- function(x) {
  if (missing(x)) {
    seq_len(n())
  } else {
    rank(x, ties.method = "first", na.last = "keep")
  }
}

It's the original dplyr syntax.

Logo

May I suggest a logo for the package. Made with hexmake
hex-poorman

Implement if_else

Function if_else() misses from your impressive list of rewritten functions 😄

To get started (still misses the missing = NULL argument of dplyr::if_else()):

if_else <- function(condition, true, false, ...) {
  if (!is.logical(condition)) {
    stop("`condition` must be a logical vector")
  }

  if (!identical(class(true), class(false))) {
    stop("Classes of `true` and `false` differ")
  }

  ifelse(condition, true, false)
}

Thank you so much, this is an awesome package to not rely on dplyr anymore, GREAT for package developers.

group_by() + mutate() does not preserve row order

For example

r$> data <- data.frame(a = c(0, 1, 0, 1, 0, 1), b = rnorm(6))                                                                                            

r$> data                                                                                                                                                 
  a          b
1 0  0.1975196
2 1 -0.9084617
3 0 -1.1766728
4 1 -2.3127030
5 0 -0.7029450
6 1 -2.2935241

r$> data %>% group_by(a) %>% mutate(b2 = b * 2)                                                                                                          
  a          b         b2
1 0  0.1975196  0.3950392
3 0 -1.1766728 -2.3533457
5 0 -0.7029450 -1.4058900
2 1 -0.9084617 -1.8169233
4 1 -2.3127030 -4.6254060
6 1 -2.2935241 -4.5870483

Groups:  a 

Implement between()

I think it pretty much comes down to:

between <- function(x, left, right) {
  # add this to also support factors - dplyr::between() does not do this
  if (is.factor(x)) {
    left <- factor(left, levels = levels(x), ordered = is.ordered(x))
    right <- factor(right, levels = levels(x), ordered = is.ordered(x))
  }
  
  x <- as.numeric(x)
  left <- as.numeric(left)
  right <- as.numeric(right)
  x %in% c(left:right)
}

Check:

54 %>% between(1, 500)
#> [1] TRUE
54 %>% between(100, 500)
#> [1] FALSE

Sys.Date() %>% between(Sys.Date() - 10, Sys.Date() + 10)
#> [1] TRUE

If you supports factors too:

factor("b", levels = letters[1:26]) %>% between("a", "c")
#> [1] TRUE

Standard evaluation functions

{dplyr} used to offer verbs with standard evaluation, e.g. mutate_(). Unfortunately, these have been deprecated. I think they are super useful, though, and would love to see them in {poorman}.

On a related note, how would one turn a {poorman} chain into a function? With {dplyr} and {rlang} I'd do something like this:

avg <- function(data, what, by) {
  by <- rlang::enquo(by)
  what <- rlang::enquo(what)
  data %>%
    dplyr::group_by(!!by) %>%
    dplyr::summarise(avg = mean(!!what))
}

I guess this would do the job with {poorman}?

avg <- function(data, what, by) {
  data %>%
    poorman::group_by(substitute(by)) %>%
    poorman::summarise(avg = mean(substitute(what)))
}

Rebrand?

I really like the design, readability, and implementation of this project. I think the adherence to minimal dependencies is great. But have you considered rebranding the package? I realize that the name is tongue in cheek but, at this point a considerable amount of work has been put in and it's beginning to look like a viable alternative to others. It looks to me more like a baseplyr package than a poor man's alternative.

Implement lag() and lead()

Getting more and more enthusiastic about this package 😃

For lag() and lead(), also used often in data wrangling:

lag <- function (x, n = 1L, default = NA, ...) {
  if (length(n) != 1 || !is.numeric(n) || n < 0) {
    stop("n must be a nonnegative integer scalar")
  }
  if (n == 0) {
    return(x)
  }
  xlen <- length(x)
  n <- pmin(n, xlen)
  out <- c(rep(default, n), x[seq_len(xlen - n)])
  attributes(out) <- attributes(x)
  out
}

lead <- function (x, n = 1L, default = NA, ...) {
  if (length(n) != 1 || !is.numeric(n) || n < 0) {
    stop("n must be a nonnegative integer scalar")
  }
  if (n == 0) {
    return(x)
  }
  xlen <- length(x)
  n <- pmin(n, xlen)
  out <- c(x[-seq_len(n)], rep(default, n))
  attributes(out) <- attributes(x)
  out
}

Reprex:

test <- factor(letters[1:10])

lag(test)
#> [1] <NA> a    b    c    d    e    f    g    h    i   
#> Levels: a b c d e f g h i j

lead(test)
#> [1] b    c    d    e    f    g    h    i    j    <NA>
#> Levels: a b c d e f g h i j

identical(dplyr::lag(test), lag(test))
#> [1] TRUE

identical(dplyr::lead(test), lead(test))
#> [1] TRUE

[FEAT] Add `near()`

This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.

summarise() fails in the presence of empty groups

Thanks for the amazing work! I noticed that summarise() fails with

Error in as.character(cut)[[1]] : subscript out of bounds

for groups larger than 2. Here's a minimum reproducible example:

library(poorman) 

ggplot2::diamonds %>% 
  group_by(cut, color, clarity) %>% 
  summarise(mean(carat))

For groups up to 2 (eg. cut, color only), I obtain the desired output. Also, the above snippet works flawlessly with dplyr. Here's my sessionInfo():

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] poorman_0.1.7

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       clisymbols_1.2.0 assertthat_0.2.1 dplyr_0.8.5     
 [5] prompt_1.0.0     crayon_1.3.4     grid_3.6.3       R6_2.4.1        
 [9] lifecycle_0.2.0  gtable_0.3.0     magrittr_1.5     scales_1.1.0    
[13] ggplot2_3.3.0    pillar_1.4.3     rlang_0.4.5      tools_3.6.3     
[17] glue_1.3.2       purrr_0.3.3      munsell_0.5.0    compiler_3.6.3  
[21] pkgconfig_2.0.3  colorspace_1.4-1 tidyselect_1.0.0 tibble_2.1.3 

Add count() function

First of all, thanks for your amazing work! I love the dependecy free philosophy. I'm going to use your package a lot, sure.
Have you considered to translate dplyr::count() function by one or more vaiables?
After dplyr v0.7.0 it has the ability to return an ungrouped table, so it is very useful

Add rowwise capabilities

This will be helpful with nest_by() (see #44) to be able to apply models across rows, for example.

Some functions to include:

  • rowwise()
  • c_across()

join_worker does not maintain original sort

Hey, great package!

Your join_worker() does not keep the original sorting. It is easily solved though:

join_worker <- function(x, y, by = NULL, suffix = c(".x", ".y"), ...) {
  # remember original sorting
  x$`.join_id` <- 1:nrow(x)
  if (is.null(by)) {
    by <- intersect(names(x), names(y))
    join_message(by)
    merged <- merge(x = x, y = y, by = by, suffixes = suffix, ...)[, union(names(x), names(y))]
  } else if (is.null(names(by))) {
    merged <- merge(x = x, y = y, by = by, suffixes = suffix, ...)
  } else {
    merged <- merge(x = x, y = y, by.x = names(by), by.y = by, suffixes = suffix, ...)
  }
  # return sorted according to original sorting and without the newly created column
  merged[order(merged$`.join_id`), colnames(merged) != ".join_id"]
}

Add custom print method for data.frames

Something similar to how {data.table} or {tibble} works, probably with options too. Some option ideas:

  • Turn off custom print method and just use base::print.data.frame()
  • Number of columns/rows
  • Print row names or just print 1:n
  • Column types
  • Total dims
  • Max rows, min rows

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.