
vctrs's Introduction

vctrs

Codecov test coverage Lifecycle: maturing R-CMD-check

There are three main goals to the vctrs package, each described in a vignette:

  • To propose vec_size() and vec_ptype() as alternatives to length() and class(); vignette("type-size"). These definitions are paired with a framework for size-recycling and type-coercion. ptype should evoke the notion of a prototype, i.e. the original or typical form of something.

  • To define size- and type-stability as desirable function properties, use them to analyse existing base functions, and to propose better alternatives; vignette("stability"). This work has been particularly motivated by thinking about the ideal properties of c(), ifelse(), and rbind().

  • To provide a new vctr base class that makes it easy to create new S3 vectors; vignette("s3-vector"). vctrs provides methods for many base generics in terms of a few new vctrs generics, making implementation considerably simpler and more robust.
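
As a small illustration of the first goal, the sketch below contrasts length() with vec_size() and shows what a prototype looks like. It assumes vctrs is installed and uses the built-in mtcars data set purely as an example input.

```r
library(vctrs)

# length() counts the columns of a data frame,
# while vec_size() counts its rows (observations).
length(mtcars)
#> [1] 11
vec_size(mtcars)
#> [1] 32

# vec_ptype() returns a zero-length "prototype" that captures the type.
vec_ptype(1:3)
#> integer(0)
```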

vctrs is a developer-focussed package. Understanding and extending vctrs requires some effort from developers, but should be invisible to most users. It’s our hope that having an underlying theory will mean that users can build up an accurate mental model without explicitly learning the theory. vctrs will typically be used by other packages, making it easy for them to provide new classes of S3 vectors that are supported throughout the tidyverse (and beyond). For that reason, vctrs has few dependencies.

Installation

Install vctrs from CRAN with:

install.packages("vctrs")

Alternatively, if you need the development version, install it with:

# install.packages("pak")
pak::pak("r-lib/vctrs")

Usage

library(vctrs)

# Sizes
str(vec_size_common(1, 1:10))
#>  int 10
str(vec_recycle_common(1, 1:10))
#> List of 2
#>  $ : num [1:10] 1 1 1 1 1 1 1 1 1 1
#>  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10

# Prototypes
str(vec_ptype_common(FALSE, 1L, 2.5))
#>  num(0)
str(vec_cast_common(FALSE, 1L, 2.5))
#> List of 3
#>  $ : num 0
#>  $ : num 1
#>  $ : num 2.5

Motivation

The original motivation for vctrs comes from two separate but related problems. The first problem is that base::c() has rather undesirable behaviour when you mix different S3 vectors:

# combining factors makes integers
c(factor("a"), factor("b"))
#> [1] 1 1

# combining dates and date-times gives incorrect values; also, order matters
dt <- as.Date("2020-01-01")
dttm <- as.POSIXct(dt)

c(dt, dttm)
#> [1] "2020-01-01"    "4321940-06-07"
c(dttm, dt)
#> [1] "2019-12-31 19:00:00 EST" "1970-01-01 00:04:22 EST"

This behaviour arises because c() has dual purposes: as well as its primary duty of combining vectors, it has a secondary duty of stripping attributes. For example, ?POSIXct suggests that you should use c() if you want to reset the timezone.
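
For comparison, here is a minimal sketch of the same combinations using vec_c(). It assumes a released version of vctrs is attached; the UTC time zone is chosen only to keep the example independent of the local time zone.

```r
library(vctrs)

# Combining factors keeps a factor and takes the union of the levels.
fct <- vec_c(factor("a"), factor("b"))
levels(fct)
#> [1] "a" "b"

# Combining a date and a date-time yields a date-time in either order.
dt <- as.Date("2020-01-01")
dttm <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
class(vec_c(dt, dttm))
class(vec_c(dttm, dt))
```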

The second problem is that dplyr::bind_rows() is not extensible by others. Currently, it handles arbitrary S3 classes using heuristics, but these often fail, and it feels like we really need to think through the problem in order to build a principled solution. This intersects with the need to cleanly support more types of data frame columns, including lists of data frames, data frames, and matrices.

vctrs's People

Contributors

808sandbr, akirathan, batpigandme, chsafouane, coreyyanofsky-zz, davisvaughan, earowang, echasnovski, etiennebacher, georgestagg, gergness, hadley, ijlyttle, indrajeetpatil, jamescuster, jameslairdsmith, jennybc, jessesadler, jimhester, juangomezduaso, krlmlr, lionel-, maxheld83, mdsumner, mgirlich, michaelchirico, romainfrancois, salim-b, yutannihilation, zachary-foster

vctrs's Issues

Coercion rules

Rules:

  • integer + double -> double
  • logical NA + anything -> anything
  • factor + factor (same levels) -> factor
  • factor + factor (diff levels) -> character (WARN)
  • factor + character -> character (WARN)
  • dates + datetime -> date time
  • datetime + datetime (different tz) -> date time (first tz)
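
The first two rules can be checked with vec_c() as a quick sketch, assuming a released version of vctrs is attached. The remaining rules in this list are proposals from this issue, and the behaviour of released versions may differ from them, so they are not exercised here.

```r
library(vctrs)

# integer + double -> double
typeof(vec_c(1L, 2.5))
#> [1] "double"

# logical NA + anything -> anything
typeof(vec_c(NA, 1L))
#> [1] "integer"
typeof(vec_c(NA, "a"))
#> [1] "character"
```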

Implement in C, and provide replacement to dplyr::combine()

# Integer coerced to double
df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  summarise(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  mutate(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df1 <- data_frame(x = 1L)
df2 <- data_frame(x = 1)
df <- bind_rows(df1, df2)
expect_type(df$x, "double")
expect_type(combine(df1$x, df2$x), "double")

df <- inner_join(df1, df2, by = "x")
expect_type(df$x, "double")

Figure out how to provide efficient extension mechanism.

Empty-default operator

@egnha commented on Aug 25, 2017, 6:42 AM UTC:

Would there be a place for an empty-default operator in rlang?

`%|||%` <- function(x, y) {
  if (is_empty(x)) y else x
}

This is handy in contexts where the notion of "emptiness" you want to check might not be type-consistent (e.g., "empty" names() is NULL, whereas "empty" paste() is character(0)).
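
A short usage sketch of the proposed operator follows. To keep it self-contained it replaces rlang's is_empty() with a plain length check, which is an assumption about what "empty" should mean here.

```r
# Stand-in definition: "empty" is taken to mean zero length.
`%|||%` <- function(x, y) {
  if (length(x) == 0) y else x
}

# Both kinds of "empty" fall back to the default.
names(1:3) %|||% "<unnamed>"
#> [1] "<unnamed>"
paste(character()) %|||% "<nothing>"
#> [1] "<nothing>"

# A non-empty left-hand side is returned unchanged.
"value" %|||% "<default>"
#> [1] "value"
```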

This issue was moved by lionel- from r-lib/rlang#244.

rescaling helpers

cf scale() which always returns a matrix. Look back to reshape(1) for other useful rescalers.
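
As a sketch of what such a helper might look like, the hypothetical rescale01() below maps a numeric vector onto [0, 1] and returns a plain vector rather than a matrix. The name and signature are illustrative only and are not part of vctrs.

```r
# Hypothetical helper: linearly rescale x to the range [0, 1].
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(2, 4, 6)

# scale() returns a one-column matrix, even for a plain vector.
is.matrix(scale(x))
#> [1] TRUE

# The helper keeps the vector shape.
rescale01(x)
#> [1] 0.0 0.5 1.0
```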

`are_numeric`, `all_numeric`, and friends

@njtierney commented on Oct 3, 2017, 1:34 AM UTC:

Not sure if this should go in vctrs, please feel free to let me know if this should move.

It can be handy to test whether all or any of the elements are numeric, and just like there is a rlang::are_na, I'm wondering if there should be this for other types?

For example

are_numeric <- function(x, ...){
  any(as.logical(lapply(x, is.numeric)))
}

all_numeric <- function(x, ...){
  all(as.logical(lapply(x, is.numeric)))
}

are_numeric(iris)
#> [1] TRUE
are_numeric(letters)
#> [1] FALSE
are_numeric(1:10)
#> [1] TRUE
all_numeric(iris)
#> [1] FALSE
all_numeric(letters)
#> [1] FALSE
all_numeric(1:10)
#> [1] TRUE

This issue was moved by lionel- from r-lib/rlang#274.

Better error messages

vec_c(), vec_rbind(), and vec_cbind() need to create error messages that make it easier to determine the source of the error, even when nested.

e.g. how can we do better here?

library(vctrs)

vec_c(1, 2, vec_c(3, vec_c(4, "x")))
#> Error: No common type for double and character

A paste with NA handling

This might sound a bit out of scope but I think there's a place for this function somewhere.

I'm trying to do something very simple, which is concatenate a bunch of character columns in a data frame. The go-to is tidyr::unite but my data set has a lot of NA values. unite concatenates vectors with paste, which means the result will be full of "blah, NA, blah, ...". There's a discussion of this in tidyverse/tidyr#203.

The simple solution is just to sub in a paste function that handles NAs. It seemed weird to me that this didn't already exist, but googling just brought up more discussion with some hacky and slow pure R solutions.

So I think there's a need for a low level paste(..., sep = " ", collapse = NULL, na.rm = FALSE) and maybe vctrs is that place?
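
To make the request concrete, here is a pure R sketch of the behaviour being asked for. The function name paste_na() is hypothetical, and this version always drops NAs and has no collapse argument. It is one of the slow element-by-element solutions mentioned above, not a proposal for the low-level implementation.

```r
# Hypothetical helper: paste element-wise, skipping NA values.
paste_na <- function(..., sep = " ") {
  cols <- lapply(list(...), as.character)
  if (length(cols) == 0) {
    return(character())
  }
  # Recycle every input to the longest length.
  n <- max(lengths(cols))
  cols <- lapply(cols, rep_len, length.out = n)
  vapply(seq_len(n), function(i) {
    vals <- vapply(cols, `[`, character(1), i)
    paste(vals[!is.na(vals)], collapse = sep)
  }, character(1))
}

paste_na(c("a", NA), c("b", "c"), sep = ", ")
#> [1] "a, b" "c"
```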

pattern-matching / 'map_when' helper?

@kevinushey commented on Aug 26, 2017, 4:02 AM UTC:

This would be something similar to case_when(), but for mapping conditions to operations -- something like a generalized switch.

As an example, the following code:

mtcars %>% map_when(
  starts_with("d") ~ prod(.),
  contains("a")    ~ sum(.),
  c("mpg", "cyl")  ~ cor(.)
)

would evaluate to something like:

> mtcars %>% map_when(
+   starts_with("d") ~ prod(.),
+   contains("a")    ~ sum(.),
+   c("mpg", "cyl")  ~ cor(.)
+ )
[[1]]
[1] 1.218225e+91

[[2]]
[1] 336.09

[[3]]
          mpg       cyl
mpg  1.000000 -0.852162
cyl -0.852162  1.000000

Would this be worth adding to rlang (or somewhere similar)? Or does this already exist as some function I'm not aware of?

This issue was moved by lionel- from r-lib/rlang#246.

Unfinished doc sentence in vec_cast documentation

This sentence looks unfinished to me, maybe an idea was lost while writing it:

vctrs/R/cast.R

Lines 18 to 22 in 6c42434

#' can only cast a subset of doubles back to integers. If a cast is lossy
#' for
#'
#' The rules for coercing from a list a fairly strict: each component of the
#' list must be of length 1, and must be coercible to type `to`.

There is an unfinished "If a cast is lossy for". Also, later on there is a minor issue: "The rules for coercing from a list a fairly strict:" should be "are fairly strict:".

It's nice to be reading a vctrs implementation!

A filter function for vectors

Filtering a vector by a condition, only returning the values for which condition is TRUE. Use x to indicate the vector in the condition so it is generic.

Rough sketch:

filter_vctr <- function(x, ..., na.rm = TRUE) {
  fun_c <- as.list(substitute(list(...)))[[2]]
  ret <- x[eval(fun_c)]
  if (isTRUE(na.rm)) {
    ret[!is.na(ret)]
  } else {
    ret
  }
}

Example:

rnorm(10) %>% round(1) %>% filter_vctr(abs(x) > 1)

Improvements to arithmetic generics

  • Binary generics should handle recycling for you
  • Add vec_grp_math(). Can we collapse with vec_grp_summary()?
  • vec_grp_unary() -> vec_grp_numeric1(), vec_grp_numeric() -> vec_grp_numeric2() ?
  • Need double dispatch for arithmetic to support (e.g.) date - date and date + 1

No code in Repo?

Is this repo deprecated? I don't see any code when I install with devtools.

Some notes

Since comments are welcome, some thoughts.

I think that having a list type that forces all objects to be of the same class can be really useful. I do have questions about introducing a new set of atomic types.

R has a type system that can already be confusing, there is

  • typeof, determining the R internal type
  • mode is a type designation that's closer to user experience (see below)
  • storage mode giving something closer to physical storage type (e.g. double for a value of type numeric)
  • class and inherits for basic types and (S3, S4, RC,...) class extensions and best for users.

Some examples

> types <- function(x) c(class=class(x), typeof=typeof(x), mode=mode(x), storage.mode=storage.mode(x))
> dat <- list(integer=1L, real=pi, complex=2+3i,
+   string=character(), categorical=factor(), binary=as.raw(0),
+   'function'=function(){})
> t(sapply(dat, types))
            class       typeof      mode        storage.mode
integer     "integer"   "integer"   "numeric"   "integer"   
real        "numeric"   "double"    "numeric"   "double"    
complex     "complex"   "complex"   "complex"   "complex"   
string      "character" "character" "character" "character" 
categorical "factor"    "integer"   "numeric"   "integer"   
binary      "raw"       "raw"       "raw"       "raw"       
function    "function"  "closure"   "function"  "function"  

So the question is whether another 'atomic' type designation, just for the benefit of stricter coercion rules, will help users. In my experience, it is already uncomfortable having to explain to novice users the difference between data frames and tibbles. This is necessary because they will run into both very quickly once they start doing actual work with R. When you are a new user, this is another thing to put on your mental stack that has nothing to do with getting things done. This is even more true for vector types. For example, until now, I can explain R's coercion rules fairly easily, by letting people experiment with a few lines of code like:

c(10, "hello")

Then, when I ask them to explain what happened and why, this usually clears things up pretty quickly.

With these new vector types new users would have to grok a second type system, especially if you automatically translate everything that is (implicitly) translated to a tibble. [By the way, beyond R's usual behavior, I think that type casts should be asked for by the user explicitly (Explicit is better than implicit)].

So in conclusion, I do not feel that the benefits of stricter coercion rules outweigh the burden of having to cope with an extra system for atomic types for end-users. I see there may be benefits for (package) developers. So it should be developer-facing rather than user-facing.

"Ropes" for vectors

e.g. https://github.com/google/xi-editor/tree/master/doc/rope_science

This is probably out of scope for vctrs, but it would be interesting to think more about a tree like structure for vectors which would allow more efficient modification without having to copy the complete vector.

A finger-tree like structure would also make it very efficient to recompute algebraic statistics (i.e. counts, sums, means, sd) as the data changes.

Implement vec_proxy_compare

  • xtfrm() -> order(), sort()
  • <, <=, >, >=
  • min(), max()
  • median(), quantile() (change default type)

Replaces vec_proxy_order()

Promote integer and double to factor and character

Consider the use case of two CSV files, a.csv and b.csv, each with identical column names. One column is id. In a.csv, all id values are 10-digit codes. In b.csv, some id values contain letters. Concatenating the contents fails:

a <- readr::read_csv("id\n12345678901")
b <- readr::read_csv("id\nX100000")
vec_c(a$id, b$id)
#> Error: No common type for double and character
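
Until such a promotion exists, one workaround is to convert the numeric column explicitly before combining. The sketch below uses literal vectors in place of the two CSV files, so the values are stand-ins for a$id and b$id.

```r
library(vctrs)

# Stand-ins for the id columns read from a.csv and b.csv.
a_id <- 12345678901
b_id <- "X100000"

# An explicit conversion makes the common type unambiguous.
ids <- vec_c(as.character(a_id), b_id)
ids
#> [1] "12345678901" "X100000"
```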

A better summary

Data frame, and grouped_df methods

select() semantics

Return tibble. One row per var, then break into groups. Col of types. List col of summaries.

Logical gives number of T/F/NA.

Integer/numeric min-[Q1 [median] Q3]-max NA: Inf?: []. Extract out common multiple 10³ x 1-[5 [10] 11]-20

For date/time, just display range. Need to special case if min & max on same day (do it by year/month/day/hour/minute ?).

Factors to give compact freq table. In one line. How?? (print method based on width?)

Characters give summary of length (1-20 chars). Encoding?. Number of empty & missing.

For unknown type, use obj_sum

https://github.com/holman/spark

Print method for vctrs

I recently inspected flights$arr_delay and hit getOption("max.print"). Does anyone ever want that? If I execute the below, depending on whether I'm in R Console, use RStudio's "Knit" button or use rmarkdown::render() I get anywhere from the first 1K to 10K elements. 😢

library(nycflights13)
flights$arr_delay

argument name antipattern

Having special names for some arguments when the other arguments are absorbed into everything else is an antipattern because the function can't tell between the argument as a special argument or as an unspecial named argument. Example:

vec_c(1, 2, .type = integer())
#> [1] 1 2

vec_c supports named vectors, so vec_c(a=1, b=2, .type=integer()) is valid, but vec_c(.print=1, .type=2, .screen=3) fails because the .type argument is treated specially.

The arguments for this antipattern are:

  • The probability of having an argument with that name is very low - but that probability is non-zero and increases linearly with the amount of usage. When it hits it generates massive surprise.
  • Names starting with a dot are well-known as "special" and are documented - but this relies on people being introduced to the documentation at first usage.

The existence of specially-named arguments when there is a possibility of misinterpretation is an irregularity which seems incongruous with everything else being tidy.

You could write vec_c to return a function, and then pass the type argument as a further argument:

vec_c(a=1, .type=99, type=23)()  # returns c(a=1,.type=99, type=23)
vec_c(a=1, .type=99, type=23)(type=integer()) # as above but integer type.

This approach has the advantage that the argument no longer needs to be "dotted".
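
A rough base R sketch of that two-step design is shown below. The name combine_later() is hypothetical, and the type handling is deliberately naive (it only switches the storage mode), so this illustrates the calling convention rather than vctrs coercion.

```r
# Hypothetical: the first call captures the values, the second takes options.
combine_later <- function(...) {
  xs <- list(...)
  function(type = NULL) {
    out <- unlist(xs)
    if (!is.null(type)) {
      # Naive conversion: reuse the storage mode of the prototype.
      storage.mode(out) <- typeof(type)
    }
    out
  }
}

# Every name, dotted or not, is treated as data in the first call.
combine_later(a = 1, .type = 99, type = 23)()
#>     a .type  type
#>     1    99    23

typeof(combine_later(a = 1, .type = 99, type = 23)(type = integer()))
#> [1] "integer"
```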

Complete lubridate support

#' @examples
#' w1 <- lubridate::years(1)
#' w2 <- lubridate::ddays(7)
#' w3 <- lubridate::interval("2020-01-01", "2020-01-08")
#'
#' vec_ptype(w1)
#' vec_ptype(w2)
#' vec_ptype(w3)
#'
#' vec_c(w1, w3, .ptype = w3)
#' vec_c(w1, w2)
#' vec_c(w2, w3)
#'
#' library(lubridate)
#' vec_c(years(1), months(1), weeks(1), days(1))
#' vec_c(dyears(1), years(1))
NULL

# https://github.com/tidyverse/lubridate/issues/707
new_Period <- function() lubridate::seconds(integer())
new_Duration <- function() lubridate::dseconds(integer())
new_Interval <- function() lubridate::interval(character(), character())

# vec_type2 ---------------------------------------------------------------

vec_type2.Period <- function(x, y) UseMethod("vec_type2.Period")
vec_type2.Duration <- function(x, y) UseMethod("vec_type2.Duration")
vec_type2.Interval <- function(x, y) UseMethod("vec_type2.Interval")

vec_type2.Period.NULL <- function(x, y) x[0L]
vec_type2.Duration.NULL <- function(x, y) x[0L]
vec_type2.Interval.NULL <- function(x, y) x[0L]

vec_type2.Period.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Duration.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Interval.default <- function(x, y) stop_incompatible_type(x, y)

vec_type2.Period.Period     <- function(x, y) new_Period()

vec_type2.Period.Duration   <- function(x, y) new_Period()
vec_type2.Duration.Period   <- function(x, y) new_Period()

vec_type2.difftime.Period   <- function(x, y) new_Period()
vec_type2.Period.difftime   <- function(x, y) new_Period()

vec_type2.Period.Interval   <- function(x, y) new_Period()
vec_type2.Interval.Period   <- function(x, y) new_Period()

vec_type2.Duration.Duration <- function(x, y) new_Duration()

vec_type2.difftime.Duration <- function(x, y) new_Duration()
vec_type2.Duration.difftime <- function(x, y) new_Duration()

vec_type2.Duration.Interval <- function(x, y) new_Duration()
vec_type2.Interval.Duration <- function(x, y) new_Duration()

# vec_type2.difftime.difftime <- function(x, y) new_Duration()
vec_type2.difftime.Interval <- function(x, y) new_difftime()
vec_type2.Interval.difftime <- function(x, y) new_difftime()

vec_type2.Interval.Interval <- function(x, y) new_Interval()


# vec_cast ----------------------------------------------------------------

vec_cast.Period <- function(x, to) UseMethod("vec_cast.Period")
vec_cast.Duration <- function(x, to) UseMethod("vec_cast.Duration")
vec_cast.Interval <- function(x, to) UseMethod("vec_cast.Interval")

vec_cast.Period.NULL <- function(x, to) x
vec_cast.Duration.NULL <- function(x, to) x
vec_cast.Interval.NULL <- function(x, to) x

vec_cast.Period.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Duration.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Interval.default <- function(x, to) stop_incompatible_cast(x, to)

vec_cast.Period.Period     <- function(x, to) x
vec_cast.Period.Duration   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.difftime   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.Interval   <- function(x, to) lubridate::as.period(x)

vec_cast.Duration.Duration <- function(x, to) x
vec_cast.Duration.Period   <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.difftime <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.Interval <- function(x, to) lubridate::as.duration(x)

vec_cast.Interval.Interval <- function(x, to) x

# vec_cast.difftime.difftime <- function(x, to) new_Duration()

Create an empty Date vector

Sometimes it is useful to create empty vectors using commands like integer(), character(), etc.

Very often I use Date vectors created with the ymd function from lubridate, but I am not able to create empty Date vectors (maybe there is a way to do that, but I don't know the trick).

Could this package be the right place to expose this feature?
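
For reference, two ways to get a zero-length Date vector are sketched below: one in base R, and one with the new_date() constructor that vctrs exports (assuming a version of vctrs that includes it).

```r
# Base R: converting an empty character vector gives an empty Date vector.
d1 <- as.Date(character())
class(d1)
#> [1] "Date"
length(d1)
#> [1] 0

# vctrs: the low-level constructor defaults to an empty vector.
d2 <- vctrs::new_date()
class(d2)
#> [1] "Date"
```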

Vectorized isTRUE() and friends

Vectorized isTRUE() would be really helpful. And lots of people define isFALSE() in package utils. And then it's a slippery slope ...

x <- c(TRUE, NA, FALSE)
is_true <- Vectorize(isTRUE)
is_false <- Vectorize(function(x) identical(x, FALSE))
is_not_true <- function(x) !is_true(x)
is_not_false <- function(x) !is_false(x)
is_true(x)
#> [1]  TRUE FALSE FALSE
is_false(x)
#> [1] FALSE FALSE  TRUE
is_not_true(x)
#> [1] FALSE  TRUE  TRUE
is_not_false(x)
#> [1]  TRUE  TRUE FALSE
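
The same results can be had without Vectorize() by using vectorised logical operators directly. This is only a sketch: restricting the input to logical vectors is an assumption, since isTRUE() itself accepts any object.

```r
# Element-wise versions built from vectorised operators.
is_true <- function(x) {
  stopifnot(is.logical(x))
  !is.na(x) & x
}
is_false <- function(x) {
  stopifnot(is.logical(x))
  !is.na(x) & !x
}

x <- c(TRUE, NA, FALSE)
is_true(x)
#> [1]  TRUE FALSE FALSE
is_false(x)
#> [1] FALSE FALSE  TRUE
```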

Partial types

It would be useful to supply the type of some columns in a data frame, or to specify that you wanted a factor without specifying the levels.

This requires some sort of "partial" type, which find_type() would handle specially

Hashing and equality functions

Related to:

  • split()
  • duplicated()
  • table() / count() (and need for vector version of count())
  • joins / lookup
  • match()
  • intersect(), setdiff() etc
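
For orientation, released versions of vctrs export helpers covering several of these operations. The sketch below assumes such a version is attached and shows a few of them on a character vector.

```r
library(vctrs)

x <- c("a", "b", "a", "c")

# Unique values, in order of first appearance.
vec_unique(x)
#> [1] "a" "b" "c"

# Unlike duplicated(), every member of a duplicated group is flagged.
vec_duplicate_detect(x)
#> [1]  TRUE FALSE  TRUE FALSE

# Positions of the first match, and membership tests.
vec_match(c("b", "z"), x)
#> [1]  2 NA
vec_in(c("b", "z"), x)
#> [1]  TRUE FALSE
```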

Should vctr support names?

  • Check handled appropriately in print method
  • $ error message should match 1$a
  • Should names be unique? If so, what will rep() and [ do to preserve uniqueness?

vctr base class for vectors

Could provide method implementations for:

  • as.data.frame()
  • print() in terms of format()
  • [, [[ in terms of reconstruct()
  • rep() in terms of [
  • as.list() in terms of [[
  • [<-, [[<- in terms of vec_cast() ?
  • names<- and dim<-
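
As a minimal sketch of what the base class buys you, the example below builds a class on new_vctr() and relies on the inherited subsetting methods. The class name demo_percent and its constructor are made up for illustration.

```r
library(vctrs)

# Hypothetical class: a double vector tagged as "demo_percent".
new_percent <- function(x = double()) {
  stopifnot(is.double(x))
  new_vctr(x, class = "demo_percent")
}

p <- new_percent(c(0.1, 0.25, 0.5))

# `[` and `[[` come from the vctr base class and keep the class.
inherits(p[2:3], "demo_percent")
#> [1] TRUE
inherits(p[[1]], "demo_percent")
#> [1] TRUE

# The underlying data is still available without the class.
vec_data(p)
#> [1] 0.10 0.25 0.50
```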
