r-lib / vctrs

Generic programming with typed R vectors

Home Page: https://vctrs.r-lib.org

License: Other

R 46.77% C 53.19% C++ 0.04%

r s3-vectors

vctrs's Issues

Implement reconstruct generic

It's subtly different from vec_cast(x, vec_ptype(x)) - see rcrd_reconstruct

Empty-default operator

@egnha commented on Aug 25, 2017, 6:42 AM UTC:

Would there be a place for an empty-default operator in rlang?

`%|||%` <- function(x, y) {
  if (is_empty(x)) y else x
}

This is handy in contexts where the notion of "emptiness" you want to check might not be type-consistent (e.g., "empty" names() is NULL, whereas "empty" paste() is character(0)).
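
A minimal base-R sketch of how the operator would behave, assuming "empty" simply means zero length. The `is_empty()` defined here is a stand-in for the rlang helper, not its actual implementation.

```r
# Stand-in for rlang::is_empty(): any zero-length object counts as empty.
is_empty <- function(x) length(x) == 0

`%|||%` <- function(x, y) {
  if (is_empty(x)) y else x
}

x <- c(1, 2, 3)                      # unnamed, so names(x) is NULL
nms  <- names(x) %|||% "<unnamed>"   # NULL falls back to the default
chr  <- character(0) %|||% "<none>"  # character(0) falls back as well
kept <- "a" %|||% "b"                # non-empty left side is kept
```

Both `NULL` and `character(0)` take the default, which is the type-inconsistent case described above.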

This issue was moved by lionel- from r-lib/rlang#244.

Flesh out S3 vectors vignette

cached sum
rational numbers - pair of values
polynomials? integer vector inside list

Provide is_vector generic

With default method for S4 objects, using the approach outlined in
tidyverse/tibble#326 (comment)

Option to vec_rbind() to take intersection of data frame columns

Rather than the union.

As suggested by @gmbecker
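
A rough base-R sketch of the intersection behaviour, for discussion only. `rbind_intersect()` is a hypothetical name, not part of vctrs, and it ignores type coercion entirely.

```r
# Hypothetical helper: row-bind data frames, keeping only the columns
# that appear in every input (intersection rather than union).
rbind_intersect <- function(...) {
  dfs <- list(...)
  common <- Reduce(intersect, lapply(dfs, names))
  do.call(rbind, lapply(dfs, function(df) df[, common, drop = FALSE]))
}

a <- data.frame(x = 1:2, y = c("a", "b"))
b <- data.frame(x = 3L, z = TRUE)
out <- rbind_intersect(a, b)   # only column `x` survives
```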

Consider attribute preservation

In light of how people tend to actually use attributes

Benchmark vec_rbind()

Compare to do.call() and bind_rows()

Identify bottlenecks

Promote integer and double to factor and character

Consider the use case of two CSV files, a.csv and b.csv, each with identical column names. One column is id. In a.csv, all id values are 10-digit codes. In b.csv, some id values contain letters. Concatenating the contents fails:

a <- readr::read_csv("id\n12345678901")
b <- readr::read_csv("id\nX100000")
vec_c(a$id, b$id)
#> Error: No common type for double and character

A better summary

Data frame, and grouped_df methods

select() semantics

Return tibble. One row per var, then break into groups. Col of types. List col of summaries.

Logical gives number of T/F/NA.

Integer/numeric min-[Q1 [median] Q3]-max NA: Inf?: []. Extract out common multiple 10³ x 1-[5 [10] 11]-20

For date/time, just display range. Need to special case if min & max on same day (do it by year/month/day/hour/minute ?).

Factors to give compact freq table. In one line. How?? (print method based on width?)

Characters give summary of length (1-20 chars). Encoding?. Number of empty & missing.

For unknown type, use obj_sum

https://github.com/holman/spark

Create an empty Date vector

Sometimes it is useful to create empty vectors using commands like integer(), character(), etc.

Very often I use Date vectors created with the ymd function from lubridate, but I am not able to create empty Date vectors (maybe there is a way to do that, but I don't know the trick).

Could this package be the right place to expose this feature?
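
For reference, base R can already produce a zero-length Date vector. The sketch below uses only base functions and makes no claim about what vctrs or lubridate should expose.

```r
# Zero-length Date vector via the character method of as.Date()
empty_date <- as.Date(character())

# Low-level equivalent: an empty double vector with class "Date"
empty_date2 <- structure(numeric(), class = "Date")

# Combining with a real date keeps the Date class
combined <- c(empty_date, as.Date("2020-01-01"))
```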

Hashing and equality functions

Related to:

split()
duplicated()
table() / count() (and need for vector version of count())
joins / lookup
match()
intersect(), setdiff() etc

Review base R for inconvenient interfaces

Reimplement list_of using vctr

Support nonstandard representations?

POSIXlt (list of 11, each element has length n)
intervals (n x 2 matrix)
S4 classes
...

Rename unknown to anything

And finish consideration of casting

Efficient n_distinct

See tidyverse/dplyr#977

Implement vec_proxy_compare

xtfrm() -> order(), sort()
<, <=, >, >=
min(), max()
median(), quantile() (change default type)

Replaces vec_proxy_order()

[FR] Better c() for named vectors

As reported in tidyverse/dplyr#2284, c() behaves in tricky ways for named vectors. We need a better c().

x <- "φ"
names(x) <- "φ"

Encoding(names(x))
#> [1] "unknown"

Encoding(names(c(x)))
#> [1] "UTF-8"

What is the correct way to access the "first" class for AsIs forwarding

df <- tibble::tibble(x = 1:50)
vctrs::vec_ptype_full(I(df))
#> [1] "I<AsIs<x:integer>>"

class(x)[[1]] gives AsIs; .class gives data.frame (because that's the method that gets dispatched upon)

argument name antipattern

Having special names for some arguments when all other arguments are absorbed into ... is an antipattern, because the function can't tell whether a given argument is meant as the special argument or as an ordinary named argument. Example:

vec_c(1, 2, .type = integer())
#> [1] 1 2

vec_c supports named vectors, so vec_c(a=1, b=2, .type=integer()) is valid, but vec_c(.print=1, .type=2, .screen=3) fails because the .type argument is treated specially.

The arguments for this antipattern are:

The probability of having an argument with that name is very low - but that probability is non-zero and increases linearly with the amount of usage. When it hits, it generates massive surprise.
Names starting with a dot are well-known as "special" and are documented - but this relies on people being introduced to the documentation at first usage.

The existence of specially-named arguments when there is a possibility of misinterpretation is an irregularity which seems incongruous with everything else being tidy.

You could write vec_c to return a function, and then pass the type argument as a further argument:

vec_c(a=1, .type=99, type=23)()  # returns c(a=1,.type=99, type=23)
vec_c(a=1, .type=99, type=23)(type=integer()) # as above but integer type.

This approach has the advantage that the argument no longer needs to be "dotted".
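
A small base-R sketch of the function-returning design, to make the proposal concrete. `vec_c2()` is a hypothetical name, and it uses base `c()` and `storage.mode<-` rather than vctrs coercion.

```r
# Hypothetical: every argument goes into `...`, so no name is special.
# The returned function takes the optional `type` argument.
vec_c2 <- function(...) {
  values <- c(...)
  function(type = NULL) {
    if (is.null(type)) return(values)
    out <- values
    storage.mode(out) <- typeof(type)  # keeps names, changes storage type
    out
  }
}

res_default <- vec_c2(a = 1, .type = 99, type = 23)()
res_int     <- vec_c2(a = 1, .type = 99, type = 23)(type = integer())
```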

Implement is.na

Use ALTREP for recycling

So we don't have to copy the input data

`are_numeric`, `all_numeric`, and friends

@njtierney commented on Oct 3, 2017, 1:34 AM UTC:

Not sure if this should go in vctrs, please feel free to let me know if this should move.

It can be handy to test whether all or any things are numeric, and just like there is a rlang::are_na, I'm wondering if there should be this for other types?

For example

are_numeric <- function(x, ...){
  any(as.logical(lapply(x, is.numeric)))
}

all_numeric <- function(x, ...){
  all(as.logical(lapply(x, is.numeric)))
}

are_numeric(iris)
#> [1] TRUE
are_numeric(letters)
#> [1] FALSE
are_numeric(1:10)
#> [1] TRUE
all_numeric(iris)
#> [1] FALSE
all_numeric(letters)
#> [1] FALSE
all_numeric(1:10)
#> [1] TRUE

This issue was moved by lionel- from r-lib/rlang#274.

"Ropes" for vectors

e.g. https://github.com/google/xi-editor/tree/master/doc/rope_science

This is probably out of scope for vctrs, but it would be interesting to think more about a tree like structure for vectors which would allow more efficient modification without having to copy the complete vector.

A finger-tree like structure would also make it very efficient to recompute algebraic statistics (i.e. counts, sums, means, sd) as the data changes.
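
To illustrate the "algebraic statistics" point only: if the count and sum are cached next to the data, replacing one element updates the mean in constant time. This sketch uses a plain list with a hypothetical `update_cached()` helper. It is not a rope or finger tree, and R will still copy the vector on modification.

```r
# Hypothetical helper: replace element i and adjust the cached sum
# without re-scanning the whole vector.
update_cached <- function(state, i, value) {
  state$sum <- state$sum - state$x[i] + value
  state$x[i] <- value
  state
}

state <- list(x = c(2, 4, 6), n = 3L, sum = 12)
state <- update_cached(state, 2, 10)
cached_mean <- state$sum / state$n
```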

Improvements to arithmetic generics

Binary generics should handle recycling for you
Add vec_grp_math(). Can we collapse with vec_grp_summary()?
vec_grp_unary() -> vec_grp_numeric1(), vec_grp_numeric() -> vec_grp_numeric2() ?
Need double dispatch for arithmetic to support (e.g.) date - date and date + 1

Coercion rules

Rules:

integer + double -> double
logical NA + anything -> anything
factor + factor (same levels) -> factor
factor + factor (diff levels) -> character (WARN)
factor + character -> character (WARN)
dates + datetime -> date time
datetime + datetime (different tz) -> date time (first tz)
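
The non-date rules above can be sketched as a lookup function in base R. `common_type()` and `type_name()` are hypothetical names, the date and date-time rules are left out, and this is only an illustration of the intended semantics, not the planned C implementation.

```r
# Hypothetical: label a vector as "factor" or by its typeof().
type_name <- function(x) if (is.factor(x)) "factor" else typeof(x)

# Hypothetical: return the name of the common type of x and y.
common_type <- function(x, y) {
  is_unknown <- function(v) is.logical(v) && all(is.na(v))
  if (is_unknown(x)) return(type_name(y))   # logical NA + anything
  if (is_unknown(y)) return(type_name(x))

  if (is.factor(x) && is.factor(y)) {
    if (identical(levels(x), levels(y))) return("factor")
    warning("factors have different levels; using character")
    return("character")
  }
  if ((is.factor(x) && is.character(y)) || (is.character(x) && is.factor(y))) {
    warning("combining factor and character; using character")
    return("character")
  }

  tx <- type_name(x)
  ty <- type_name(y)
  if (identical(tx, ty)) return(tx)
  if (all(c(tx, ty) %in% c("integer", "double"))) return("double")
  stop("No common type for ", tx, " and ", ty)
}

common_type(1L, 2.5)   # integer + double
common_type(NA, "a")   # logical NA + character
```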

Implement in C, and provide replacement to dplyr::combine()

# Integer coerced to double
df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  summarise(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  mutate(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df1 <- data_frame(x = 1L)
df2 <- data_frame(x = 1)
df <- bind_rows(df1, df2)
expect_type(df$x, "double")
expect_type(combine(df1$x, df2$x), "double")

df <- inner_join(df1, df2, by = "x")
expect_type(df$x, "double")

Figure out how to provide efficient extension mechanism.

Explore SIMD and parallel optimisations

Via RcppParallel and/or RcppArmadillo

A filter function for vectors

Filtering a vector by a condition, only returning the values for which condition is TRUE. Use x to indicate the vector in the condition so it is generic.

Rough sketch:

filter_vctr <- function(x, ..., na.rm = TRUE) {
  fun_c <- as.list(substitute(list(...)))[[2]]
  ret <- x[eval(fun_c)]
  if (isTRUE(na.rm)) {
    ret[!is.na(ret)]
  } else {
    ret
  }
}

Example:

rnorm(10) %>% round(1) %>% filter_vctr(abs(x) > 1)

No code in Repo?

Is this repo deprecated? I don't see any code when I install with devtools.

has_dim() predicate

@lionel- commented on Jul 2, 2018, 1:17 PM UTC:

has_dim(x, ndim = 2)

Do we also want

has_dim(x, dim = c(2, 1))

Supplying both ndim and dim would be an error.
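
One possible shape for the predicate, as a base-R sketch. The behaviour when neither argument is supplied (return TRUE for anything with a dim attribute) is an assumption, not something decided in this issue.

```r
has_dim <- function(x, ndim = NULL, dim = NULL) {
  if (!is.null(ndim) && !is.null(dim)) {
    stop("Supply only one of `ndim` and `dim`.")
  }
  d <- base::dim(x)               # explicit namespace: `dim` is also an argument
  if (is.null(d)) return(FALSE)
  if (!is.null(ndim)) return(length(d) == ndim)
  if (!is.null(dim)) return(length(d) == length(dim) && all(d == dim))
  TRUE                            # assumed: any dim attribute qualifies
}

m <- matrix(1:2, nrow = 2, ncol = 1)
has_dim(m, ndim = 2)
has_dim(m, dim = c(2, 1))
has_dim(1:3)
```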

This issue was moved by lionel- from r-lib/rlang#552.

Export register_s3_method

Rolling functions from RcppRoll

Vectorized isTRUE() and friends

Vectorized isTRUE() would be really helpful. And lots of people define isFALSE() in package utils. And then it's a slippery slope ...

x <- c(TRUE, NA, FALSE)
is_true <- Vectorize(isTRUE)
is_false <- Vectorize(function(x) identical(x, FALSE))
is_not_true <- function(x) !is_true(x)
is_not_false <- function(x) !is_false(x)
is_true(x)
#> [1]  TRUE FALSE FALSE
is_false(x)
#> [1] FALSE FALSE  TRUE
is_not_true(x)
#> [1] FALSE  TRUE  TRUE
is_not_false(x)
#> [1]  TRUE  TRUE FALSE
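
For logical input, the same results are available without Vectorize(), using plain vectorised operators. This sketch assumes `x` is a logical vector. Unlike isTRUE(), it does not guard against other types.

```r
# Assumes logical input; NA is treated as neither TRUE nor FALSE.
is_true  <- function(x) !is.na(x) & x
is_false <- function(x) !is.na(x) & !x

x <- c(TRUE, NA, FALSE)
is_true(x)
is_false(x)
```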

Improve class names

Rename record to rcrd
Add vctrs_ prefix to all class names

Better error messages

vec_c(), vec_rbind(), and vec_cbind() need to create error messages that make it easier to determine the source of the error, even when nested.

e.g. how can we do better here?

library(vctrs)

vec_c(1, 2, vec_c(3, vec_c(4, "x")))
#> Error: No common type for double and character

vctr base class for vectors

Could provide method implementations for:

as.data.frame()
print() in terms of format()
[, [[ in terms of reconstruct()
rep() in terms of [
as.list() in terms of [[
[<-, [[<- in terms of vec_cast() ?
names<- and dim<-

Accessing, testing for and rationalizing names

There's a set of name-handling functions that recur in many a utils.R file. They would seem to fit nicely here.

Some specific examples:

names2() from purrr or tibble

has_names() from purrr or httr

named() and unnamed() from httr

A paste with NA handling

This might sound a bit out of scope but I think there's a place for this function somewhere.

I'm trying to do something very simple, which is to concatenate a bunch of character columns in a data frame. The go-to is tidyr::unite, but my data set has a lot of NA values. unite concatenates vectors with paste, which means the result will be full of "blah, NA, blah, ...". There's a discussion of this at tidyverse/tidyr#203.

The simple solution is just to sub in a paste function that handles NAs. It seemed weird to me that this didn't already exist, but googling just brought up more discussion with some hacky and slow pure R solutions.

So I think there's a need for a low level paste(..., sep = " ", collapse = NULL, na.rm = FALSE) and maybe vctrs is that place?
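
To pin down the semantics being asked for, here is a pure R sketch. `paste_na()` is a hypothetical name, `collapse` is omitted, and the row-wise apply() is exactly the kind of slow solution mentioned above, so it only shows the intended result.

```r
# Hypothetical: like paste(), but with na.rm = TRUE the NA entries are
# dropped from each element-wise combination before joining.
paste_na <- function(..., sep = " ", na.rm = FALSE) {
  cols <- list(...)
  if (!na.rm) {
    return(do.call(paste, c(cols, list(sep = sep))))
  }
  mat <- do.call(cbind, lapply(cols, as.character))
  unname(apply(mat, 1, function(row) paste(row[!is.na(row)], collapse = sep)))
}

with_na    <- paste_na(c("a", NA), c("b", "c"), sep = ", ")
without_na <- paste_na(c("a", NA), c("b", "c"), sep = ", ", na.rm = TRUE)
```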

rescaling helpers

cf scale() which always returns a matrix. Look back to reshape(1) for other useful rescalers.
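
Two candidate helpers that return a plain vector rather than a matrix, as a base-R sketch. The names `rescale01()` and `standardise()` are placeholders, not an agreed API.

```r
# Rescale to the [0, 1] range; returns a vector of the same length.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Centre and scale to unit standard deviation, without the matrix
# wrapper that scale() adds.
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

r <- rescale01(c(0, 5, 10))
z <- standardise(c(1, 2, 3, 4, 5))
```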

Complete lubridate support

#' @examples
#' w1 <- lubridate::years(1)
#' w2 <- lubridate::ddays(7)
#' w3 <- lubridate::interval("2020-01-01", "2020-01-08")
#'
#' vec_ptype(w1)
#' vec_ptype(w2)
#' vec_ptype(w3)
#'
#' vec_c(w1, w3, .ptype = w3)
#' vec_c(w1, w2)
#' vec_c(w2, w3)
#'
#' library(lubridate)
#' vec_c(years(1), months(1), weeks(1), days(1))
#' vec_c(dyears(1), years(1))
NULL

# https://github.com/tidyverse/lubridate/issues/707
new_Period <- function() lubridate::seconds(integer())
new_Duration <- function() lubridate::dseconds(integer())
new_Interval <- function() lubridate::interval(character(), character())

# vec_type2 ---------------------------------------------------------------

vec_type2.Period <- function(x, y) UseMethod("vec_type2.Period")
vec_type2.Duration <- function(x, y) UseMethod("vec_type2.Duration")
vec_type2.Interval <- function(x, y) UseMethod("vec_type2.Interval")

vec_type2.Period.NULL <- function(x, y) x[0L]
vec_type2.Duration.NULL <- function(x, y) x[0L]
vec_type2.Interval.NULL <- function(x, y) x[0L]

vec_type2.Period.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Duration.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Interval.default <- function(x, y) stop_incompatible_type(x, y)

vec_type2.Period.Period     <- function(x, y) new_Period()

vec_type2.Period.Duration   <- function(x, y) new_Period()
vec_type2.Duration.Period   <- function(x, y) new_Period()

vec_type2.difftime.Period   <- function(x, y) new_Period()
vec_type2.Period.difftime   <- function(x, y) new_Period()

vec_type2.Period.Interval   <- function(x, y) new_Period()
vec_type2.Interval.Period   <- function(x, y) new_Period()

vec_type2.Duration.Duration <- function(x, y) new_Duration()

vec_type2.difftime.Duration <- function(x, y) new_Duration()
vec_type2.Duration.difftime <- function(x, y) new_Duration()

vec_type2.Duration.Interval <- function(x, y) new_Duration()
vec_type2.Interval.Duration <- function(x, y) new_Duration()

# vec_type2.difftime.difftime <- function(x, y) new_Duration()
vec_type2.difftime.Interval <- function(x, y) new_difftime()
vec_type2.Interval.difftime <- function(x, y) new_difftime()

vec_type2.Interval.Interval <- function(x, y) new_Interval()


# vec_cast ----------------------------------------------------------------

vec_cast.Period <- function(x, to) UseMethod("vec_cast.Period")
vec_cast.Duration <- function(x, to) UseMethod("vec_cast.Duration")
vec_cast.Interval <- function(x, to) UseMethod("vec_cast.Interval")

vec_cast.Period.NULL <- function(x, to) x
vec_cast.Duration.NULL <- function(x, to) x
vec_cast.Interval.NULL <- function(x, to) x

vec_cast.Period.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Duration.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Interval.default <- function(x, to) stop_incompatible_cast(x, to)

vec_cast.Period.Period     <- function(x, to) x
vec_cast.Period.Duration   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.difftime   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.Interval   <- function(x, to) lubridate::as.period(x)

vec_cast.Duration.Duration <- function(x, to) x
vec_cast.Duration.Period   <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.difftime <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.Interval <- function(x, to) lubridate::as.duration(x)

vec_cast.Interval.Interval <- function(x, to) x

# vec_cast.difftime.difftime <- function(x, to) new_Duration()

Eliminate default of using as.list

And instead define as.list() using vec_cast()

Some notes

Since comments are welcome, some thoughts.

I think that having a list type that forces all objects to be of the same class can be really useful. I do have questions about introducing a new set of atomic types.

R has a type system that can already be confusing. There is:

typeof, determining the R internal type
mode is a type designation that's closer to user experience (see below)
storage mode giving something closer to physical storage type (e.g. double for a value of type numeric)
class and inherits for basic types and (S3, S4, RC,...) class extensions and best for users.

Some examples

> types <- function(x) c(class=class(x)
  , typeof=typeof(x), mode=mode(x), storage.mode=storage.mode(x))
> dat <- list(int=1L,real=pi, complex=2+3i, string = character(), cat=factor(), binary=as.raw(0), fun=function(){})

> dat <- list(integer=1L,real=pi, complex=2+3i
  , string = character(), categorical=factor(), binary=as.raw(0)
  , 'function'=function(){})
> t(sapply(dat, types))
            class       typeof      mode        storage.mode
integer     "integer"   "integer"   "numeric"   "integer"   
real        "numeric"   "double"    "numeric"   "double"    
complex     "complex"   "complex"   "complex"   "complex"   
string      "character" "character" "character" "character" 
categorical "factor"    "integer"   "numeric"   "integer"   
binary      "raw"       "raw"       "raw"       "raw"       
function    "function"  "closure"   "function"  "function"

So the question is whether another "atomic" type designation, just for the benefit of stricter coercion rules, will help users. In my experience, it is already uncomfortable having to explain to novice users the difference between data frames and tibbles. This is necessary because they will run into both very quickly once they start doing actual work with R. When you are a new user, this is another thing to put on your mental stack that has nothing to do with getting things done. This is even more true for vector types. For example, until now, I can explain R's coercion rules fairly easily, by letting people experiment with a few lines of code like:

c(10, "hello")

Then, when I ask them to explain what happened and why, this usually clears things up pretty quickly.

With these new vector types new users would have to grok a second type system, especially if you automatically translate everything that is (implicitly) translated to a tibble. [By the way, beyond R's usual behavior, I think that type casts should be asked for by the user explicitly (Explicit is better than implicit)].

So in conclusion, I do not feel that the benefits of stricter coercion rules outweigh the burden of having to cope with an extra system for atomic types for end-users. I see there may be benefits for (package) developers. So it should be developer-facing rather than user-facing.

Extract hybrid internals out of dplyr

e.g. internal implementation of mean, sum, etc

Look at bigvis for other important summary functions

tibble::lst()

perhaps should live here.

Unfinished doc sentence in vec_cast documentation

This sentence looks unfinished to me, maybe an idea was lost while writing it:

vctrs/R/cast.R

Lines 18 to 22 in 6c42434

#' can only cast a subset of doubles back to integers. If a cast is lossy
#' for
#'
#' The rules for coercing from a list a fairly strict: each component of the
#' list must be of length 1, and must be coercible to type `to`.

There is an unfinished "If a cast is lossy for". Also later on there is a minor issue: The rules for coercing from a list a fairly strict: should be are fairly strict:.

It's nice to be reading a vctrs implementation!

Should vctr support names?

Check handled appropriately in print method
$ error message should match 1$a
Should names be unique? If so, what will rep() and [ do to preserve uniqueness?

Try ellipsis in generics

Partial types

It would be useful to supply the type of some columns in a data frame, or to specify that you wanted a factor without specifying the levels.

This requires some sort of "partial" type, which find_type() would handle specially

pattern-matching / 'map_when' helper?

@kevinushey commented on Aug 26, 2017, 4:02 AM UTC:

This would be something similar to case_when(), but for mapping conditions to operations -- something like a generalized switch.

As an example, the following code:

mtcars %>% map_when(
  starts_with("d") ~ prod(.),
  contains("a")    ~ sum(.),
  c("mpg", "cyl")  ~ cor(.)
)

would evaluate to something like:

> mtcars %>% map_when(
+   starts_with("d") ~ prod(.),
+   contains("a")    ~ sum(.),
+   c("mpg", "cyl")  ~ cor(.)
+ )
[[1]]
[1] 1.218225e+91

[[2]]
[1] 336.09

[[3]]
          mpg       cyl
mpg  1.000000 -0.852162
cyl -0.852162  1.000000

Would this be worth adding to rlang (or somewhere similar)? Or does this already exist as some function I'm not aware of?

This issue was moved by lionel- from r-lib/rlang#246.

Print method for vctrs

I recently inspected flights$arr_delay and hit getOption("max.print"). Does anyone ever want that? If I execute the below, depending on whether I'm in R Console, use RStudio's "Knit" button or use rmarkdown::render() I get anywhere from the first 1K to 10K elements. 😢

library(nycflights13)
flights$arr_delay

