r-lib / vctrs

Generic programming with typed R vectors

Home Page: https://vctrs.r-lib.org

License: Other

R 46.77% C 53.19% C++ 0.04%

r s3-vectors

vctrs's Introduction

vctrs

There are three main goals to the vctrs package, each described in a vignette:

To propose vec_size() and vec_ptype() as alternatives to length() and class(); vignette("type-size"). These definitions are paired with a framework for size-recycling and type-coercion. ptype should evoke the notion of a prototype, i.e. the original or typical form of something.
To define size- and type-stability as desirable function properties, use them to analyse existing base functions, and to propose better alternatives; vignette("stability"). This work has been particularly motivated by thinking about the ideal properties of c(), ifelse(), and rbind().
To provide a new vctr base class that makes it easy to create new S3 vectors; vignette("s3-vector"). vctrs provides methods for many base generics in terms of a few new vctrs generics, making implementation considerably simpler and more robust.

vctrs is a developer-focussed package. Understanding and extending vctrs requires some effort from developers, but should be invisible to most users. It’s our hope that having an underlying theory will mean that users can build up an accurate mental model without explicitly learning the theory. vctrs will typically be used by other packages, making it easy for them to provide new classes of S3 vectors that are supported throughout the tidyverse (and beyond). For that reason, vctrs has few dependencies.

Installation

Install vctrs from CRAN with:

install.packages("vctrs")

Alternatively, if you need the development version, install it with:

# install.packages("pak")
pak::pak("r-lib/vctrs")

Usage

library(vctrs)

# Sizes
str(vec_size_common(1, 1:10))
#>  int 10
str(vec_recycle_common(1, 1:10))
#> List of 2
#>  $ : num [1:10] 1 1 1 1 1 1 1 1 1 1
#>  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10

# Prototypes
str(vec_ptype_common(FALSE, 1L, 2.5))
#>  num(0)
str(vec_cast_common(FALSE, 1L, 2.5))
#> List of 3
#>  $ : num 0
#>  $ : num 1
#>  $ : num 2.5

Motivation

The original motivation for vctrs comes from two separate but related problems. The first problem is that base::c() has rather undesirable behaviour when you mix different S3 vectors:

# combining factors makes integers
c(factor("a"), factor("b"))
#> [1] 1 1

# combining dates and date-times gives incorrect values; also, order matters
dt <- as.Date("2020-01-01")
dttm <- as.POSIXct(dt)

c(dt, dttm)
#> [1] "2020-01-01"    "4321940-06-07"
c(dttm, dt)
#> [1] "2019-12-31 19:00:00 EST" "1970-01-01 00:04:22 EST"

This behaviour arises because c() has dual purposes: as well as its primary duty of combining vectors, it has a secondary duty of stripping attributes. For example, ?POSIXct suggests that you should use c() if you want to reset the timezone.
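The attribute-stripping side of c() can be seen in a small base R sketch. The `units` attribute below is an arbitrary illustration and does not come from the examples above:

```r
# A vector carrying a custom attribute
x <- structure(1:3, units = "cm")
attributes(x)

# c() keeps the values but drops the custom attribute
attributes(c(x))

# Names are the exception: they survive, the other attribute does not
y <- structure(c(a = 1, b = 2), units = "cm")
attributes(c(y))
```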

The second problem is that dplyr::bind_rows() is not extensible by others. Currently, it handles arbitrary S3 classes using heuristics, but these often fail, and it feels like we really need to think through the problem in order to build a principled solution. This intersects with the need to cleanly support more types of data frame columns, including lists of data frames, data frames, and matrices.

vctrs's People

Contributors

Stargazers

Watchers

Forkers

lionel- krlmlr gergness jimhester thiyangt njtierney kevinykuo yutannihilation davisvaughan echasnovski sgnajar batpigandme juangomezduaso ijlyttle jeffreypullin earowang shobhit0511 jessesadler zachary-foster qulogic nanaakwasiabayieboateng mabafaba mikmart dpprdan rcodo gshotwell coreyyanofsky-zz 808sandbr jonocarroll michaelquinn32 mgirlich cstepper paleolimbot trinker markromanmiller maurolepore davidchall pantheracorp tubbz-alt maxheld83 jimsforks reedacartwright akirathan isabella232 romainfrancois rnaimehaom dragosmg halhen stefvanbuuren michaelchirico mdsumner fenguoerbian averissimo r-wasm josiahparry tjmahr joshwlambert timtaylor shikokuchuo olivroy mjskay gast1111 khusmann emilhvitfeldt

vctrs's Issues

Coercion rules

Rules:

integer + double -> double
logical NA + anything -> anything
factor + factor (same levels) -> factor
factor + factor (diff levels) -> character (WARN)
factor + character -> character (WARN)
dates + datetime -> date time
datetime + datetime (different tz) -> date time (first tz)

Implement in C, and provide replacement to dplyr::combine()
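As a rough illustration only, a subset of the rules above could be sketched in plain R. `combine_sketch()` is a hypothetical name, it handles just two inputs, and it leaves out the date and date-time rules. It is not the proposed C implementation:

```r
# Hypothetical two-argument combine following some of the rules above.
combine_sketch <- function(x, y) {
  if (is.factor(x) && is.factor(y)) {
    if (identical(levels(x), levels(y))) {
      # factor + factor (same levels) -> factor
      return(factor(c(as.character(x), as.character(y)), levels = levels(x)))
    }
    # factor + factor (diff levels) -> character (WARN)
    warning("factors have different levels; coercing to character")
    return(c(as.character(x), as.character(y)))
  }
  if (is.factor(x) || is.factor(y)) {
    other <- if (is.factor(x)) y else x
    if (!is.character(other)) {
      stop("no rule for combining a factor with ", class(other)[1])
    }
    # factor + character -> character (WARN)
    warning("combining factor and character; coercing to character")
    return(c(as.character(x), as.character(y)))
  }
  # For bare atomic vectors, base c() already gives
  # integer + double -> double and logical NA + anything -> anything
  c(x, y)
}

typeof(combine_sketch(1L, 2.5))
typeof(combine_sketch(NA, "a"))
```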

# Integer coerced to double
df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  summarise(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  mutate(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df1 <- data_frame(x = 1L)
df2 <- data_frame(x = 1)
df <- bind_rows(df1, df2)
expect_type(df$x, "double")
expect_type(combine(df1$x, df2$x), "double")

df <- inner_join(df1, df2, by = "x")
expect_type(df$x, "double")

Figure out how to provide efficient extension mechanism.

Empty-default operator

@egnha commented on Aug 25, 2017, 6:42 AM UTC:

Would there be a place for an empty-default operator in rlang?

`%|||%` <- function(x, y) {
  if (is_empty(x)) y else x
}

This is handy in contexts where the notion of "emptiness" you want to check might not be type-consistent (e.g., "empty" names() is NULL, whereas "empty" paste() is character(0)).
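A short usage sketch of the two cases mentioned above. To stay dependency-free it uses `length(x) == 0` as a stand-in for rlang's is_empty(), which is an assumption made for this example:

```r
# Stand-in definition using base R only
`%|||%` <- function(x, y) {
  if (length(x) == 0) y else x
}

# "Empty" names() is NULL
names(1:3) %|||% "no names"

# "Empty" paste() is character(0)
paste(character(0)) %|||% "nothing pasted"

# A non-empty left-hand side is returned unchanged
"value" %|||% "default"
```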

This issue was moved by lionel- from r-lib/rlang#244.

rescaling helpers

cf scale() which always returns a matrix. Look back to reshape(1) for other useful rescalers.
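To make the contrast concrete, here is a minimal sketch of a helper that returns a plain vector. `rescale01()` is a hypothetical name and only one of many possible rescalers:

```r
# Hypothetical helper: rescale to the [0, 1] range, returning a vector
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(2, 4, 6)

# scale() returns a one-column matrix even for vector input
is.matrix(scale(x))

# The helper keeps the input's vector shape
rescale01(x)
```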

`are_numeric`, `all_numeric`, and friends

@njtierney commented on Oct 3, 2017, 1:34 AM UTC:

Not sure if this should go in vctrs, please feel free to let me know if this should move.

It can be handy to test whether any or all things are numeric, and just as there is rlang::are_na, I'm wondering if there should be equivalents for other types?

For example

are_numeric <- function(x, ...){
  any(as.logical(lapply(x, is.numeric)))
}

all_numeric <- function(x, ...){
  all(as.logical(lapply(x, is.numeric)))
}

are_numeric(iris)
#> [1] TRUE
are_numeric(letters)
#> [1] FALSE
are_numeric(1:10)
#> [1] TRUE
all_numeric(iris)
#> [1] FALSE
all_numeric(letters)
#> [1] FALSE
all_numeric(1:10)
#> [1] TRUE

This issue was moved by lionel- from r-lib/rlang#274.

Better error messages

vec_c(), vec_rbind(), and vec_cbind() need to create error messages that make it easier to determine the source of the error, even when nested.

e.g. how can we do better here?

library(vctrs)

vec_c(1, 2, vec_c(3, vec_c(4, "x")))
#> Error: No common type for double and character

Improve class names

Rename record to rcrd
Add vctrs_ prefix to all class names

A paste with NA handling

This might sound a bit out of scope but I think there's a place for this function somewhere.

I'm trying to do something very simple: concatenate a bunch of character columns in a data frame. The go-to is tidyr::unite, but my data set has a lot of NA values. unite concatenates vectors with paste, which means the result will be full of "blah, NA, blah, ...". There's a discussion of this in tidyverse/tidyr#203.

The simple solution is to substitute a paste function that handles NAs. It seemed weird to me that this didn't already exist, but googling just brought up more discussion and some hacky, slow pure R solutions.

So I think there's a need for a low level paste(..., sep = " ", collapse = NULL, na.rm = FALSE) and maybe vctrs is that place?
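For illustration, a slow pure R sketch of what such a function might do. `paste_na()` is a hypothetical name, `collapse` is left out for brevity, and at least one input vector is assumed:

```r
# Hypothetical paste() variant that can drop NA pieces element-wise
paste_na <- function(..., sep = " ", na.rm = FALSE) {
  args <- list(...)
  if (!na.rm) {
    # Same behaviour as paste(): NA becomes the string "NA"
    return(do.call(paste, c(args, sep = sep)))
  }
  # Recycle every input to the longest length
  n <- max(lengths(args))
  args <- lapply(args, function(a) rep_len(as.character(a), n))
  vapply(seq_len(n), function(i) {
    # Collect the i-th piece of each input and drop the missing ones
    parts <- vapply(args, `[`, character(1), i)
    paste(parts[!is.na(parts)], collapse = sep)
  }, character(1))
}

paste_na(c("a", NA), c("b", "c"))
paste_na(c("a", NA), c("b", "c"), na.rm = TRUE)
```

A low-level implementation would presumably avoid the per-element loop; the sketch only pins down the intended semantics.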

Option to vec_rbind() to take intersection of data frame columns

Rather than the union.

As suggested by @gmbecker
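A base R sketch of the intersection behaviour, to show what the option would mean. `rbind_intersect()` is a hypothetical helper, not an existing vctrs function:

```r
# Hypothetical helper: row-bind data frames on their shared columns only
rbind_intersect <- function(...) {
  dfs <- list(...)
  # Column names present in every input
  common <- Reduce(intersect, lapply(dfs, names))
  do.call(rbind, lapply(dfs, function(df) df[, common, drop = FALSE]))
}

df1 <- data.frame(x = 1:2, y = c("a", "b"))
df2 <- data.frame(x = 3L, z = TRUE)

# Only `x` is shared, so `y` and `z` are dropped
rbind_intersect(df1, df2)
```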

Explore SIMD and parallel optimisations

Via RcppParallel and/or RcppArmadillo

Eliminate default of using as.list

And instead define as.list() using vec_cast()

Rename unknown to anything

And finish consideration of casting

pattern-matching / 'map_when' helper?

@kevinushey commented on Aug 26, 2017, 4:02 AM UTC:

This would be something similar to case_when(), but for mapping conditions to operations -- something like a generalized switch.

As an example, the following code:

mtcars %>% map_when(
  starts_with("d") ~ prod(.),
  contains("a")    ~ sum(.),
  c("mpg", "cyl")  ~ cor(.)
)

would evaluate to something like:

> mtcars %>% map_when(
+   starts_with("d") ~ prod(.),
+   contains("a")    ~ sum(.),
+   c("mpg", "cyl")  ~ cor(.)
+ )
[[1]]
[1] 1.218225e+91

[[2]]
[1] 336.09

[[3]]
          mpg       cyl
mpg  1.000000 -0.852162
cyl -0.852162  1.000000

Would this be worth adding to rlang (or somewhere similar)? Or does this already exist as some function I'm not aware of?

This issue was moved by lionel- from r-lib/rlang#246.

Unfinished doc sentence in vec_cast documentation

This sentence looks unfinished to me, maybe an idea was lost while writing it:

vctrs/R/cast.R

Lines 18 to 22 in 6c42434

    
#' can only cast a subset of doubles back to integers. If a cast is lossy
#' for
#'
#' The rules for coercing from a list a fairly strict: each component of the
#' list must be of length 1, and must be coercible to type `to`.

There is an unfinished "If a cast is lossy for". Also later on there is a minor issue: The rules for coercing from a list a fairly strict: should be are fairly strict:.

It's nice to be reading a vctrs implementation!

A filter function for vectors

Filtering a vector by a condition, only returning the values for which condition is TRUE. Use x to indicate the vector in the condition so it is generic.

Rough sketch:

filter_vctr <- function(x, ..., na.rm = TRUE) {
  fun_c <- as.list(substitute(list(...)))[[2]]
  ret <- x[eval(fun_c)]
  if (isTRUE(na.rm)) {
    ret[!is.na(ret)]
  } else {
    ret
  }
}

Example:

rnorm(10) %>% round(1) %>% filter_vctr(abs(x) > 1)

Improvements to arithmetic generics

Binary generics should handle recycling for you
Add vec_grp_math(). Can we collapse with vec_grp_summary()?
vec_grp_unary() -> vec_grp_numeric1(), vec_grp_numeric() -> vec_grp_numeric2() ?
Need double dispatch for arithmetic to support (e.g.) date - date and date + 1
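The date cases in the last point can be illustrated with base R, where the class of the result depends on the classes of both operands. This is only a demonstration of the behaviour to be supported, not of how vctrs would implement it:

```r
d1 <- as.Date("2020-01-01")
d2 <- as.Date("2020-01-08")

# date - date gives a time difference, not a date
class(d2 - d1)

# date + number gives a date
class(d1 + 1)
```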

No code in Repo?

Is this repo deprecated? I don't see any code when I install with devtools.

Some notes

Since comments are welcome, some thoughts.

I think that having a list type that forces all objects to be of the same class can be really useful. I do have questions about introducing a new set of atomic types.

R has a type system that can already be confusing. There is:

typeof, determining the R internal type
mode is a type designation that's closer to user experience (see below)
storage mode giving something closer to physical storage type (e.g. double for a value of type numeric)
class and inherits for basic types and (S3, S4, RC,...) class extensions and best for users.

Some examples

> types <- function(x) c(class = class(x), typeof = typeof(x),
+   mode = mode(x), storage.mode = storage.mode(x))
> dat <- list(integer = 1L, real = pi, complex = 2+3i,
+   string = character(), categorical = factor(), binary = as.raw(0),
+   'function' = function(){})
> t(sapply(dat, types))
            class       typeof      mode        storage.mode
integer     "integer"   "integer"   "numeric"   "integer"   
real        "numeric"   "double"    "numeric"   "double"    
complex     "complex"   "complex"   "complex"   "complex"   
string      "character" "character" "character" "character" 
categorical "factor"    "integer"   "numeric"   "integer"   
binary      "raw"       "raw"       "raw"       "raw"       
function    "function"  "closure"   "function"  "function"

So the question is whether another 'atomic' type designation, introduced just for the benefit of stricter coercion rules, will help users. In my experience, it is already uncomfortable having to explain to novice users the difference between data frames and tibbles. This is necessary because they will run into both very quickly once they start doing actual work with R. When you are a new user, this is another thing to put on your mental stack that has nothing to do with getting things done. This is even more true for vector types. For example, until now, I can explain R's coercion rules fairly easily, by letting people experiment with a few lines of code like:

c(10, "hello")

Then, when I ask them to explain what happened and why, this usually clears things up pretty quickly.

With these new vector types new users would have to grok a second type system, especially if you automatically translate everything that is (implicitly) translated to a tibble. [By the way, beyond R's usual behavior, I think that type casts should be asked for by the user explicitly (Explicit is better than implicit)].

So in conclusion, I do not feel that the benefits of stricter coercion rules outweigh the burden of having to cope with an extra system for atomic types for end-users. I see there may be benefits for (package) developers. So it should be developer-facing rather than user-facing.

[FR] Better c() for named vectors

As reported in tidyverse/dplyr#2284, c() behaves in tricky ways for named vectors. We need a better c().

x <- "φ"
names(x) <- "φ"

Encoding(names(x))
#> [1] "unknown"

Encoding(names(c(x)))
#> [1] "UTF-8"

"Ropes" for vectors

e.g. https://github.com/google/xi-editor/tree/master/doc/rope_science

This is probably out of scope for vctrs, but it would be interesting to think more about a tree like structure for vectors which would allow more efficient modification without having to copy the complete vector.

A finger-tree like structure would also make it very efficient to recompute algebraic statistics (i.e. counts, sums, means, sd) as the data changes.
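The idea behind cheap recomputation is that such statistics can be derived from small per-chunk summaries that merge by addition, so a tree only needs to refresh the summaries along one path when data changes. A minimal sketch, with all function names hypothetical and without any actual tree structure:

```r
# Per-chunk summary: count, sum, and sum of squares
summarise_chunk <- function(x) {
  list(n = length(x), sum = sum(x), sumsq = sum(x^2))
}

# Two summaries merge without revisiting the underlying data
merge_summaries <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

summary_mean <- function(s) s$sum / s$n
# Textbook formula; numerically fragile for large or nearly constant data
summary_sd <- function(s) sqrt((s$sumsq - s$sum^2 / s$n) / (s$n - 1))

x <- c(1, 2, 3, 4)
y <- c(10, 20)
s <- merge_summaries(summarise_chunk(x), summarise_chunk(y))

summary_mean(s)
summary_sd(s)
```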

has_dim() predicate

@lionel- commented on Jul 2, 2018, 1:17 PM UTC:

has_dim(x, ndim = 2)

Do we also want

has_dim(x, dim = c(2, 1))

Supplying both ndim and dim would be an error.
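One way the predicate could look, as a sketch only. The NULL defaults and the behaviour when neither argument is supplied are assumptions made for this example:

```r
# Hypothetical implementation of the proposed predicate
has_dim <- function(x, ndim = NULL, dim = NULL) {
  if (!is.null(ndim) && !is.null(dim)) {
    stop("Supply only one of `ndim` and `dim`.")
  }
  d <- base::dim(x)
  if (is.null(d)) {
    return(FALSE)
  }
  if (!is.null(ndim)) {
    # Check only the number of dimensions
    return(length(d) == ndim)
  }
  if (!is.null(dim)) {
    # Check the exact extents
    return(identical(as.integer(d), as.integer(dim)))
  }
  # Neither supplied: any dim attribute counts
  TRUE
}

m <- matrix(1:2, nrow = 2, ncol = 1)
has_dim(m, ndim = 2)
has_dim(m, dim = c(2, 1))
has_dim(1:3)
```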

This issue was moved by lionel- from r-lib/rlang#552.

Implement is.na

Reimplement list_of using vctr

Consider attribute preservation

In light of how people tend to actually use attributes

Provide is_vector generic

With default method for S4 objects, using the approach outlined in tidyverse/tibble#326 (comment)

Benchmark vec_rbind()

Compare to do.call() and bind_rows()

Identify bottlenecks

Support nonstandard representations?

POSIXlt (list of 11, each element has length n)
intervals (n x 2 matrix)
S4 classes
...

Implement vec_proxy_compare

xtfrm() -> order(), sort()
<, <=, >, >=
min(), max()
median(), quantile() (change default type)

Replaces vec_proxy_order()

Promote integer and double to factor and character

Consider the use case of two CSV files, a.csv and b.csv, each with identical column names. One column is id. In a.csv, all id values are 10-digit codes. In b.csv, some id values contain letters. Concatenating the contents fails:

a <- readr::read_csv("id\n12345678901")
b <- readr::read_csv("id\nX100000")
vec_c(a$id, b$id)
#> Error: No common type for double and character

Try ellipsis in generics

Flesh out S3 vectors vignette

cached sum
rational numbers - pair of values
polynomials? integer vector inside list

Extract hybrid internals out of dplyr

e.g. internal implementation of mean, sum, etc

Look at bigvis for other important summary functions

A better summary

Data frame, and grouped_df methods

select() semantics

Return tibble. One row per var, then break into groups. Col of types. List col of summaries.

Logical gives number of T/F/NA.

Integer/numeric min-[Q1 [median] Q3]-max NA: Inf?: []. Extract out common multiple 10³ x 1-[5 [10] 11]-20

For date/time, just display range. Need to special case if min & max on same day (do it by year/month/day/hour/minute ?).

Factors to give compact freq table. In one line. How?? (print method based on width?)

Characters give summary of length (1-20 chars). Encoding?. Number of empty & missing.

For unknown type, use obj_sum

https://github.com/holman/spark

What is the correct way to access the "first" class for AsIs forwarding

df <- tibble::tibble(x = 1:50)
vctrs::vec_ptype_full(I(df))
#> [1] "I<AsIs<x:integer>>"

class(x)[[1]] gives AsIs; .class gives data.frame (because that's the method that gets dispatched upon)

Print method for vctrs

I recently inspected flights$arr_delay and hit getOption("max.print"). Does anyone ever want that? If I execute the below, depending on whether I'm in R Console, use RStudio's "Knit" button or use rmarkdown::render() I get anywhere from the first 1K to 10K elements. 😢

library(nycflights13)
flights$arr_delay

argument name antipattern

Having special names for some arguments when the other arguments are absorbed into everything else is an antipattern because the function can't tell between the argument as a special argument or as an unspecial named argument. Example:

vec_c(1, 2, .type = integer())
#> [1] 1 2

vec_c supports named vectors, so vec_c(a=1, b=2, .type=integer()) is valid, but vec_c(.print=1, .type=2, .screen=3) fails because the .type argument is treated specially.

The arguments for this antipattern are:

The probability of having an argument with that name is very low - but that probability is non-zero and increases linearly with the amount of usage. When it hits it generates massive surprise.
Names starting with a dot are well-known as "special" and are documented - but this relies on people being introduced to the documentation at first usage.

The existence of specially-named arguments when there is a possibility of misinterpretation is an irregularity which seems incongruous with everything else being tidy.

You could write vec_c to return a function, and then pass the type argument as a further argument:

vec_c(a=1, .type=99, type=23)()  # returns c(a=1,.type=99, type=23)
vec_c(a=1, .type=99, type=23)(type=integer()) # as above but integer type.

This approach has the advantage that the argument no longer needs to be "dotted".
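The function-returning design can be sketched with a closure in base R. `combine_later()` is a hypothetical stand-in for the suggested vec_c() variant, and setting the storage mode is only a crude substitute for a real cast:

```r
# Hypothetical: capture the elements first, take options in a second call
combine_later <- function(...) {
  args <- list(...)
  function(type = NULL) {
    out <- do.call(c, args)
    if (!is.null(type)) {
      # Crude cast to the prototype's storage type; names are kept
      storage.mode(out) <- typeof(type)
    }
    out
  }
}

# Every name in the first call is an ordinary element name
combine_later(a = 1, .type = 99, type = 23)()

# The option lives in the second call, so it cannot clash with element names
combine_later(a = 1, .type = 99, type = 23)(type = integer())
```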

Accessing, testing for and rationalizing names

There's a set of name-handling functions that recur in many a utils.R file. They would seem to fit nicely here.

Some specific examples:

names2() from purrr or tibble

has_names() from purrr or httr

named() and unnamed() from httr
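For reference, here is one plausible base R reading of these helpers. The definitions below are sketches written for this example; the actual implementations in purrr, tibble, and httr may differ in details:

```r
# Always return a character vector of names, "" where a name is missing
names2 <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    return(rep("", length(x)))
  }
  nms[is.na(nms)] <- ""
  nms
}

# Element-wise test for a non-empty name
has_names <- function(x) names2(x) != ""

# Split a vector or list into its named and unnamed parts
named <- function(x) x[has_names(x)]
unnamed <- function(x) x[!has_names(x)]

x <- list(a = 1, 2, b = 3)
names2(x)
has_names(x)
named(x)
unnamed(x)
```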

Implement reconstruct generic

It's subtly different from vec_cast(x, vec_ptype(x)) - see rcrd_reconstruct

Complete lubridate support

#' @examples
#' w1 <- lubridate::years(1)
#' w2 <- lubridate::ddays(7)
#' w3 <- lubridate::interval("2020-01-01", "2020-01-08")
#'
#' vec_ptype(w1)
#' vec_ptype(w2)
#' vec_ptype(w3)
#'
#' vec_c(w1, w3, .ptype = w3)
#' vec_c(w1, w2)
#' vec_c(w2, w3)
#'
#' library(lubridate)
#' vec_c(years(1), months(1), weeks(1), days(1))
#' vec_c(dyears(1), years(1))
NULL

# https://github.com/tidyverse/lubridate/issues/707
new_Period <- function() lubridate::seconds(integer())
new_Duration <- function() lubridate::dseconds(integer())
new_Interval <- function() lubridate::interval(character(), character())

# vec_type2 ---------------------------------------------------------------

vec_type2.Period <- function(x, y) UseMethod("vec_type2.Period")
vec_type2.Duration <- function(x, y) UseMethod("vec_type2.Duration")
vec_type2.Interval <- function(x, y) UseMethod("vec_type2.Interval")

vec_type2.Period.NULL <- function(x, y) x[0L]
vec_type2.Duration.NULL <- function(x, y) x[0L]
vec_type2.Interval.NULL <- function(x, y) x[0L]

vec_type2.Period.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Duration.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Interval.default <- function(x, y) stop_incompatible_type(x, y)

vec_type2.Period.Period     <- function(x, y) new_Period()

vec_type2.Period.Duration   <- function(x, y) new_Period()
vec_type2.Duration.Period   <- function(x, y) new_Period()

vec_type2.difftime.Period   <- function(x, y) new_Period()
vec_type2.Period.difftime   <- function(x, y) new_Period()

vec_type2.Period.Interval   <- function(x, y) new_Period()
vec_type2.Interval.Period   <- function(x, y) new_Period()

vec_type2.Duration.Duration <- function(x, y) new_Duration()

vec_type2.difftime.Duration <- function(x, y) new_Duration()
vec_type2.Duration.difftime <- function(x, y) new_Duration()

vec_type2.Duration.Interval <- function(x, y) new_Duration()
vec_type2.Interval.Duration <- function(x, y) new_Duration()

# vec_type2.difftime.difftime <- function(x, y) new_Duration()
vec_type2.difftime.Interval <- function(x, y) new_difftime()
vec_type2.Interval.difftime <- function(x, y) new_difftime()

vec_type2.Interval.Interval <- function(x, y) new_Interval()


# vec_cast ----------------------------------------------------------------

vec_cast.Period <- function(x, to) UseMethod("vec_cast.Period")
vec_cast.Duration <- function(x, to) UseMethod("vec_cast.Duration")
vec_cast.Interval <- function(x, to) UseMethod("vec_cast.Interval")

vec_cast.Period.NULL <- function(x, to) x
vec_cast.Duration.NULL <- function(x, to) x
vec_cast.Interval.NULL <- function(x, to) x

vec_cast.Period.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Duration.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Interval.default <- function(x, to) stop_incompatible_cast(x, to)

vec_cast.Period.Period     <- function(x, to) x
vec_cast.Period.Duration   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.difftime   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.Interval   <- function(x, to) lubridate::as.period(x)

vec_cast.Duration.Duration <- function(x, to) x
vec_cast.Duration.Period   <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.difftime <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.Interval <- function(x, to) lubridate::as.duration(x)

vec_cast.Interval.Interval <- function(x, to) x

# vec_cast.difftime.difftime <- function(x, to) new_Duration()

Efficient n_distinct

See tidyverse/dplyr#977

Export register_s3_method

Create an empty Date vector

Sometimes it is useful to create empty vectors using commands like integer(), character(), etc.

Very often I use Date vectors created with the ymd function from lubridate, but I am not able to create empty Date vectors (maybe there is a way to do that, but I don't know the trick).

Could this package be the right place to expose this feature?
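For what it's worth, base R can already produce a zero-length Date by converting an empty character vector. A minimal sketch:

```r
# A Date vector of length zero
empty_date <- as.Date(character())
class(empty_date)
length(empty_date)

# Combining it with real dates keeps the Date class
c(empty_date, as.Date("2020-01-01"))
```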

Review base R for inconvenient interfaces

Vectorized isTRUE() and friends

Vectorized isTRUE() would be really helpful. And lots of people define isFALSE() in package utils. And then it's a slippery slope ...

x <- c(TRUE, NA, FALSE)
is_true <- Vectorize(isTRUE)
is_false <- Vectorize(function(x) identical(x, FALSE))
is_not_true <- function(x) !is_true(x)
is_not_false <- function(x) !is_false(x)
is_true(x)
#> [1]  TRUE FALSE FALSE
is_false(x)
#> [1] FALSE FALSE  TRUE
is_not_true(x)
#> [1] FALSE  TRUE  TRUE
is_not_false(x)
#> [1]  TRUE  TRUE FALSE
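For logical input, the same results can be obtained without Vectorize() by combining the values with an is.na() mask. This is a sketch that assumes `x` is a logical vector; unlike isTRUE(), it does not check the type of its input:

```r
x <- c(TRUE, NA, FALSE)

# TRUE only where the value is known and TRUE
is_true <- function(x) !is.na(x) & x
# TRUE only where the value is known and FALSE
is_false <- function(x) !is.na(x) & !x

is_true(x)
is_false(x)
!is_true(x)
!is_false(x)
```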

Partial types

It would be useful to supply the type of some columns in a data frame, or to specify that you wanted a factor without specifying the levels.

This requires some sort of "partial" type, which find_type() would handle specially

Hashing and equality functions

Related to:

split()
duplicated()
table() / count() (and need for vector version of count())
joins / lookup
match()
intersect(), setdiff() etc

Should vctr support names?

Check handled appropriately in print method
$ error message should match 1$a
Should names be unique? If so, what will rep() and [ do to preserve uniqueness?

as.data.frame()
print() in terms of format()
[, [[ in terms of reconstruct()
rep() in terms of [
as.list() in terms of [[
[<-, [[<- in terms of vec_cast() ?
names<- and dim<-

tibble::lst()

perhaps should live here.
