
vctrs's Issues

Empty-default operator

@egnha commented on Aug 25, 2017, 6:42 AM UTC:

Would there be a place for an empty-default operator in rlang?

`%|||%` <- function(x, y) {
  if (is_empty(x)) y else x
}

This is handy in contexts where the notion of "emptiness" you want to check might not be type-consistent (e.g., "empty" names() is NULL, whereas "empty" paste() is character(0)).
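A minimal sketch of how the operator behaves on the two cases just mentioned, written in base R so it runs without rlang. It assumes that "empty" simply means zero length; the variable names are illustrative only.

```r
# Assumption: "empty" means length zero, which covers NULL and character(0).
`%|||%` <- function(x, y) {
  if (length(x) == 0L) y else x
}

no_names  <- names(1:3)            # NULL: an unnamed vector has no names
no_string <- paste(character(0))   # character(0): pasting nothing

a <- no_names %|||% "default"      # falls back to y
b <- no_string %|||% "default"     # falls back to y
c_kept <- "value" %|||% "default"  # non-empty x is returned unchanged
```

Both kinds of emptiness take the same branch, which is the type-inconsistency the operator is meant to smooth over.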

This issue was moved by lionel- from r-lib/rlang#244.

Promote integer and double to factor and character

Consider the use case of two CSV files, a.csv and b.csv, each with identical column names. One column is id. In a.csv, all id values are 10-digit codes. In b.csv, some id values contain letters. Concatenating the contents fails:

a <- readr::read_csv("id\n12345678901")
b <- readr::read_csv("id\nX100000")
vec_c(a$id, b$id)
#> Error: No common type for double and character
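Until such a promotion exists, one possible workaround is to cast explicitly before combining. The sketch below uses plain vectors as stand-ins for a$id and b$id (an assumption, so it runs without readr or the CSV files).

```r
# Stand-ins for a$id (parsed as double) and b$id (parsed as character).
a_id <- 12345678901
b_id <- "X100000"

# Cast the numeric ids to character explicitly, then combine.
ids <- c(as.character(a_id), b_id)
```

The explicit as.character() makes the intent visible, which is the same information an automatic double-to-character promotion would have to infer.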

A better summary

Data frame, and grouped_df methods

select() semantics

Return tibble. One row per var, then break into groups. Col of types. List col of summaries.

Logical gives number of T/F/NA.

Integer/numeric min-[Q1 [median] Q3]-max NA: Inf?: []. Extract out common multiple 10³ x 1-[5 [10] 11]-20

For date/time, just display range. Need to special case if min & max on same day (do it by year/month/day/hour/minute ?).

Factors to give compact freq table. In one line. How?? (print method based on width?)

Characters give summary of length (1-20 chars). Encoding?. Number of empty & missing.

For unknown type, use obj_sum

https://github.com/holman/spark

Create an empty Date vector

Sometimes it is useful to create empty vectors using commands like integer(), character(), etc.

Very often I use Date vectors created with the ymd function from lubridate, but I am not able to create empty Date vectors (maybe there is a way to do that, but I don't know the trick).

Could this package be the right place to expose this feature?
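For reference, base R can already produce a zero-length Date vector in a couple of ways. This is a small sketch, not a proposal for the vctrs API; the variable names are illustrative.

```r
# Two base R ways to get a zero-length Date vector.
empty_1 <- as.Date(character())                   # parse zero strings
empty_2 <- structure(numeric(), class = "Date")   # build the object directly
```

Both results have length 0 and class "Date", analogous to integer() and character().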

Hashing and equality functions

Related to:

  • split()
  • duplicated()
  • table() / count() (and need for vector version of count())
  • joins / lookup
  • match()
  • intersect(), setdiff() etc

Implement vec_proxy_compare

  • xtfrm() -> order(), sort()
  • <, <=, >, >=
  • min(), max()
  • median(), quantile() (change default type)

Replaces vec_proxy_order()

argument name antipattern

Having special names for some arguments when the other arguments are absorbed into everything else (the dots) is an antipattern, because the function can't tell whether an argument is meant as the special argument or as an ordinary named argument. Example:

vec_c(1, 2, .type = integer())
#> [1] 1 2

vec_c supports named vectors, so vec_c(a=1, b=2, .type=integer()) is valid, but vec_c(.print=1, .type=2, .screen=3) fails because the .type argument is treated specially.

The arguments for this antipattern are:

  • The probability of having an argument with that name is very low - but that probability is non-zero and increases linearly with the amount of usage. When it hits, it generates massive surprise.
  • Names starting with a dot are well-known as "special" and are documented - but this relies on people being introduced to the documentation at first usage.

The existence of specially-named arguments when there is a possibility of misinterpretation is an irregularity which seems incongruous with everything else being tidy.

You could write vec_c to return a function, and then pass the type argument as a further argument:

vec_c(a=1, .type=99, type=23)()  # returns c(a=1,.type=99, type=23)
vec_c(a=1, .type=99, type=23)(type=integer()) # as above but integer type.

This approach has the advantage that the argument no longer needs to be "dotted".
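A rough sketch of the two-step idea, using a hypothetical helper named combine_later() (not part of vctrs). It assumes plain atomic inputs and uses unlist() and storage.mode() in place of the real vctrs coercion machinery.

```r
# Hypothetical helper illustrating the two-step call; not part of vctrs.
# Step 1 collects every argument as data; step 2 takes the options.
combine_later <- function(...) {
  values <- list(...)
  function(type = NULL) {
    out <- unlist(values)
    if (!is.null(type)) {
      storage.mode(out) <- typeof(type)  # keeps names, changes storage type
    }
    out
  }
}

plain   <- combine_later(a = 1, .type = 99, type = 23)()
as_ints <- combine_later(a = 1, .type = 99, type = 23)(type = integer())
```

Because the first call has no formal arguments other than the dots, any name, dotted or not, is treated as data.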

`are_numeric`, `all_numeric`, and friends

@njtierney commented on Oct 3, 2017, 1:34 AM UTC:

Not sure if this should go in vctrs, please feel free to let me know if this should move.

It can be handy to test whether all or any things are numeric, and just like there is an rlang::are_na, I'm wondering if there should be this for other types?

For example

are_numeric <- function(x, ...){
  any(as.logical(lapply(x, is.numeric)))
}

all_numeric <- function(x, ...){
  all(as.logical(lapply(x, is.numeric)))
}

are_numeric(iris)
#> [1] TRUE
are_numeric(letters)
#> [1] FALSE
are_numeric(1:10)
#> [1] TRUE
all_numeric(iris)
#> [1] FALSE
all_numeric(letters)
#> [1] FALSE
all_numeric(1:10)
#> [1] TRUE
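A slightly stricter variant of the same idea, offered as a sketch: vapply() guarantees one logical per element instead of relying on as.logical(lapply(...)). The name any_numeric() is an assumption; it mirrors what are_numeric() does above.

```r
# Assumed names; any_numeric() mirrors the behaviour of are_numeric() above.
any_numeric <- function(x) any(vapply(x, is.numeric, logical(1)))
all_numeric <- function(x) all(vapply(x, is.numeric, logical(1)))

any_iris <- any_numeric(iris)     # iris has four numeric columns
all_iris <- all_numeric(iris)     # but Species is a factor
all_ints <- all_numeric(1:10)
any_chr  <- any_numeric(letters)
```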

This issue was moved by lionel- from r-lib/rlang#274.

"Ropes" for vectors

e.g. https://github.com/google/xi-editor/tree/master/doc/rope_science

This is probably out of scope for vctrs, but it would be interesting to think more about a tree like structure for vectors which would allow more efficient modification without having to copy the complete vector.

A finger-tree like structure would also make it very efficient to recompute algebraic statistics (i.e. counts, sums, means, sd) as the data changes.

Improvements to arithmetic generics

  • Binary generics should handle recycling for you
  • Add vec_grp_math(). Can we collapse with vec_grp_summary()?
  • vec_grp_unary() -> vec_grp_numeric1(), vec_grp_numeric() -> vec_grp_numeric2() ?
  • Need double dispatch for arithmetic to support (e.g.) date - date and date + 1

Coercion rules

Rules:

  • integer + double -> double
  • logical NA + anything -> anything
  • factor + factor (same levels) -> factor
  • factor + factor (diff levels) -> character (WARN)
  • factor + character -> character (WARN)
  • date + datetime -> datetime
  • datetime + datetime (different tz) -> datetime (first tz)
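The first two rules can be sketched as a lookup over a type hierarchy. The helper below is hypothetical and is not the vctrs implementation; it only covers bare logical, integer, and double vectors, and it assumes that any logical vector (not only a bare NA) yields to the other type.

```r
# Hypothetical helper, not the vctrs implementation.
# Only meant for bare vectors; classed objects such as factors are out of scope.
common_type <- function(x, y) {
  hierarchy <- c("logical", "integer", "double")
  tx <- typeof(x)
  ty <- typeof(y)
  if (!tx %in% hierarchy || !ty %in% hierarchy) {
    stop("No common type for ", tx, " and ", ty, call. = FALSE)
  }
  # The common type is whichever of the two sits higher in the hierarchy.
  hierarchy[max(match(tx, hierarchy), match(ty, hierarchy))]
}

int_dbl <- common_type(1L, 2.5)
na_int  <- common_type(NA, 1L)
```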

Implement in C, and provide replacement to dplyr::combine()

# Integer coerced to double
df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  summarise(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df <- data_frame(x = 1:2) %>%
  group_by(x) %>%
  mutate(y = if (x == 1) 1L else 1)
expect_type(df$y, "double")

df1 <- data_frame(x = 1L)
df2 <- data_frame(x = 1)
df <- bind_rows(df1, df2)
expect_type(df$x, "double")
expect_type(combine(df1$x, df2$x), "double")

df <- inner_join(df1, df2, by = "x")
expect_type(df$x, "double")

Figure out how to provide efficient extension mechanism.

A filter function for vectors

Filtering a vector by a condition, only returning the values for which condition is TRUE. Use x to indicate the vector in the condition so it is generic.

Rough sketch:

filter_vctr <- function(x, ..., na.rm = TRUE) {
  fun_c <- as.list(substitute(list(...)))[[2]]
  ret <- x[eval(fun_c)]
  if (isTRUE(na.rm)) {
    ret[!is.na(ret)]
  } else {
    ret
  }
}

Example:

rnorm(10) %>% round(1) %>% filter_vctr(abs(x) > 1)
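A possible variant of the sketch, under the assumption that the condition is passed as a single named argument rather than through the dots. The name filter_vec() is hypothetical. It evaluates the condition with x bound to the input and falls back to the caller's environment, so that names defined by the caller are also found.

```r
# Hypothetical variant of the sketch above; `x` inside `cond` is the vector.
filter_vec <- function(x, cond, na.rm = TRUE) {
  # Evaluate the unevaluated condition with `x` bound to the input,
  # falling back to the caller's environment for any other names.
  keep <- eval(substitute(cond), list(x = x), parent.frame())
  out <- x[keep]
  if (isTRUE(na.rm)) out[!is.na(out)] else out
}

threshold <- 1
res <- filter_vec(c(-2, 0.5, NA, 3), abs(x) > threshold)
```

With na.rm = TRUE the NA produced by the missing element is dropped, leaving only the values whose absolute value exceeds the threshold.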

No code in Repo?

Is this repo deprecated? I don't see any code when I install with devtools.

Vectorized isTRUE() and friends

Vectorized isTRUE() would be really helpful. And lots of people define isFALSE() in their package's utils. And then it's a slippery slope ...

x <- c(TRUE, NA, FALSE)
is_true <- Vectorize(isTRUE)
is_false <- Vectorize(function(x) identical(x, FALSE))
is_not_true <- function(x) !is_true(x)
is_not_false <- function(x) !is_false(x)
is_true(x)
#> [1]  TRUE FALSE FALSE
is_false(x)
#> [1] FALSE FALSE  TRUE
is_not_true(x)
#> [1] FALSE  TRUE  TRUE
is_not_false(x)
#> [1]  TRUE  TRUE FALSE
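For logical input, the same results can be had without Vectorize(), which loops element by element. A minimal sketch, assuming x is always a logical vector:

```r
# Assumes `x` is a logical vector; NA is treated as neither TRUE nor FALSE.
is_true  <- function(x) !is.na(x) & x
is_false <- function(x) !is.na(x) & !x

x <- c(TRUE, NA, FALSE)
true_flags  <- is_true(x)
false_flags <- is_false(x)
```

Unlike isTRUE(), these do not check the type of x, so non-logical input would need separate handling.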

Better error messages

vec_c(), vec_rbind(), and vec_cbind() need to create error messages that make it easier to determine the source of the error, even when nested.

e.g. how can we do better here?

library(vctrs)

vec_c(1, 2, vec_c(3, vec_c(4, "x")))
#> Error: No common type for double and character

vctr base class for vectors

Could provide method implementations for:

  • as.data.frame()
  • print() in terms of format()
  • [, [[ in terms of reconstruct()
  • rep() in terms of [
  • as.list() in terms of [[
  • [<-, [[<- in terms of vec_cast() ?
  • names<- and dim<-

A paste with NA handling

This might sound a bit out of scope but I think there's a place for this function somewhere.

I'm trying to do something very simple, which is to concatenate a bunch of character columns in a data frame. The go-to is tidyr::unite, but my data set has a lot of NA values. unite concatenates vectors with paste, which means the result will be full of "blah, NA, blah, ...". There's a discussion of this in tidyverse/tidyr#203.

The simple solution is just to sub in a paste function that handles NAs. It seemed weird to me that this didn't already exist, but googling just brought up more discussion with some hacky and slow pure R solutions.

So I think there's a need for a low level paste(..., sep = " ", collapse = NULL, na.rm = FALSE) and maybe vctrs is that place?
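To make the requested behaviour concrete, here is a pure R sketch of the semantics only. The name paste_na() is hypothetical, it always drops NAs, it ignores collapse, and it is exactly the kind of slow row-wise loop that a low-level implementation would replace.

```r
# Hypothetical helper showing the intended semantics; not optimised.
paste_na <- function(..., sep = " ") {
  cols <- lapply(list(...), as.character)
  n <- max(lengths(cols))
  cols <- lapply(cols, rep_len, length.out = n)   # recycle to a common length
  vapply(seq_len(n), function(i) {
    parts <- vapply(cols, `[`, character(1), i)   # i-th element of each input
    paste(parts[!is.na(parts)], collapse = sep)   # drop NAs before joining
  }, character(1))
}

joined <- paste_na(c("a", NA), c("b", "c"), sep = ", ")
```

A row in which every input is NA comes out as an empty string under this sketch; whether that or NA is the right answer is a design choice left open here.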

rescaling helpers

cf scale() which always returns a matrix. Look back to reshape(1) for other useful rescalers.
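As an illustration of a rescaler that returns a plain vector rather than a matrix, here is a sketch of a min-max helper. The name rescale01() is an assumption, not an existing vctrs function.

```r
# Hypothetical helper: linearly map x onto [0, 1], returning a plain vector.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- rescale01(c(0, 5, 10))
```

By contrast, scale(c(0, 5, 10)) returns a one-column matrix carrying centering and scaling attributes.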

Complete lubridate support

#' @examples
#' w1 <- lubridate::years(1)
#' w2 <- lubridate::ddays(7)
#' w3 <- lubridate::interval("2020-01-01", "2020-01-08")
#'
#' vec_ptype(w1)
#' vec_ptype(w2)
#' vec_ptype(w3)
#'
#' vec_c(w1, w3, .ptype = w3)
#' vec_c(w1, w2)
#' vec_c(w2, w3)
#'
#' library(lubridate)
#' vec_c(years(1), months(1), weeks(1), days(1))
#' vec_c(dyears(1), years(1))
NULL

# https://github.com/tidyverse/lubridate/issues/707
new_Period <- function() lubridate::seconds(integer())
new_Duration <- function() lubridate::dseconds(integer())
new_Interval <- function() lubridate::interval(character(), character())

# vec_type2 ---------------------------------------------------------------

vec_type2.Period <- function(x, y) UseMethod("vec_type2.Period")
vec_type2.Duration <- function(x, y) UseMethod("vec_type2.Duration")
vec_type2.Interval <- function(x, y) UseMethod("vec_type2.Interval")

vec_type2.Period.NULL <- function(x, y) x[0L]
vec_type2.Duration.NULL <- function(x, y) x[0L]
vec_type2.Interval.NULL <- function(x, y) x[0L]

vec_type2.Period.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Duration.default <- function(x, y) stop_incompatible_type(x, y)
vec_type2.Interval.default <- function(x, y) stop_incompatible_type(x, y)

vec_type2.Period.Period     <- function(x, y) new_Period()

vec_type2.Period.Duration   <- function(x, y) new_Period()
vec_type2.Duration.Period   <- function(x, y) new_Period()

vec_type2.difftime.Period   <- function(x, y) new_Period()
vec_type2.Period.difftime   <- function(x, y) new_Period()

vec_type2.Period.Interval   <- function(x, y) new_Period()
vec_type2.Interval.Period   <- function(x, y) new_Period()

vec_type2.Duration.Duration <- function(x, y) new_Duration()

vec_type2.difftime.Duration <- function(x, y) new_Duration()
vec_type2.Duration.difftime <- function(x, y) new_Duration()

vec_type2.Duration.Interval <- function(x, y) new_Duration()
vec_type2.Interval.Duration <- function(x, y) new_Duration()

# vec_type2.difftime.difftime <- function(x, y) new_Duration()
vec_type2.difftime.Interval <- function(x, y) new_difftime()
vec_type2.Interval.difftime <- function(x, y) new_difftime()

vec_type2.Interval.Interval <- function(x, y) new_Interval()


# vec_cast ----------------------------------------------------------------

vec_cast.Period <- function(x, to) UseMethod("vec_cast.Period")
vec_cast.Duration <- function(x, to) UseMethod("vec_cast.Duration")
vec_cast.Interval <- function(x, to) UseMethod("vec_cast.Interval")

vec_cast.Period.NULL <- function(x, to) x
vec_cast.Duration.NULL <- function(x, to) x
vec_cast.Interval.NULL <- function(x, to) x

vec_cast.Period.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Duration.default <- function(x, to) stop_incompatible_cast(x, to)
vec_cast.Interval.default <- function(x, to) stop_incompatible_cast(x, to)

vec_cast.Period.Period     <- function(x, to) x
vec_cast.Period.Duration   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.difftime   <- function(x, to) lubridate::as.period(x)
vec_cast.Period.Interval   <- function(x, to) lubridate::as.period(x)

vec_cast.Duration.Duration <- function(x, to) x
vec_cast.Duration.Period   <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.difftime <- function(x, to) lubridate::as.duration(x)
vec_cast.Duration.Interval <- function(x, to) lubridate::as.duration(x)

vec_cast.Interval.Interval <- function(x, to) x

# vec_cast.difftime.difftime <- function(x, to) new_Duration()

Some notes

Since comments are welcome, some thoughts.

I think that having a list type that forces all objects to be of the same class can be really useful. I do have questions about introducing a new set of atomic types.

R has a type system that can already be confusing. There is:

  • typeof, determining the R internal type
  • mode is a type designation that's closer to user experience (see below)
  • storage mode giving something closer to physical storage type (e.g. double for a value of type numeric)
  • class and inherits for basic types and (S3, S4, RC,...) class extensions and best for users.

Some examples

> types <- function(x) c(class=class(x)
  , typeof=typeof(x), mode=mode(x), storage.mode=storage.mode(x))
> dat <- list(int=1L,real=pi, complex=2+3i, string = character(), cat=factor(), binary=as.raw(0), fun=function(){})

> dat <- list(integer=1L,real=pi, complex=2+3i
  , string = character(), categorical=factor(), binary=as.raw(0)
  , 'function'=function(){})
> t(sapply(dat, types))
            class       typeof      mode        storage.mode
integer     "integer"   "integer"   "numeric"   "integer"   
real        "numeric"   "double"    "numeric"   "double"    
complex     "complex"   "complex"   "complex"   "complex"   
string      "character" "character" "character" "character" 
categorical "factor"    "integer"   "numeric"   "integer"   
binary      "raw"       "raw"       "raw"       "raw"       
function    "function"  "closure"   "function"  "function"  

So the question is whether another "atomic" type designation, just for the benefit of stricter coercion rules, will help users. In my experience, it is already uncomfortable having to explain to novice users the difference between data frames and tibbles. This is necessary because they will run into both very quickly once they start doing actual work with R. When you are a new user, this is another thing to put on your mental stack that has nothing to do with getting things done. This is even more true for vector types. For example, until now, I can explain R's coercion rules fairly easily, by letting people experiment with a few lines of code like:

c(10, "hello")

Then, when I ask them to explain what happened and why, this usually clears things up pretty quickly.

With these new vector types new users would have to grok a second type system, especially if you automatically translate everything that is (implicitly) translated to a tibble. [By the way, beyond R's usual behavior, I think that type casts should be asked for by the user explicitly (Explicit is better than implicit)].

So in conclusion, I do not feel that the benefits of stricter coercion rules outweigh the burden of having to cope with an extra system for atomic types for end-users. I see there may be benefits for (package) developers. So it should be developer-facing rather than user-facing.

Unfinished doc sentence in vec_cast documentation

This sentence looks unfinished to me, maybe an idea was lost while writing it:

vctrs/R/cast.R

Lines 18 to 22 in 6c42434

#' can only cast a subset of doubles back to integers. If a cast is lossy
#' for
#'
#' The rules for coercing from a list a fairly strict: each component of the
#' list must be of length 1, and must be coercible to type `to`.

There is an unfinished "If a cast is lossy for". Also later on there is a minor issue: The rules for coercing from a list a fairly strict: should be are fairly strict:.

It's nice to be reading a vctrs implementation!

Should vctr support names?

  • Check handled appropriately in print method
  • $ error message should match 1$a
  • Should names be unique? If so, what will rep() and [ do to preserve uniqueness?

Partial types

It would be useful to supply the type of some columns in a data frame, or to specify that you wanted a factor without specifying the levels.

This requires some sort of "partial" type, which find_type() would handle specially.

pattern-matching / 'map_when' helper?

@kevinushey commented on Aug 26, 2017, 4:02 AM UTC:

This would be something similar to case_when(), but for mapping conditions to operations -- something like a generalized switch.

As an example, the following code:

mtcars %>% map_when(
  starts_with("d") ~ prod(.),
  contains("a")    ~ sum(.),
  c("mpg", "cyl")  ~ cor(.)
)

would evaluate to something like:

> mtcars %>% map_when(
+   starts_with("d") ~ prod(.),
+   contains("a")    ~ sum(.),
+   c("mpg", "cyl")  ~ cor(.)
+ )
[[1]]
[1] 1.218225e+91

[[2]]
[1] 336.09

[[3]]
          mpg       cyl
mpg  1.000000 -0.852162
cyl -0.852162  1.000000

Would this be worth adding to rlang (or somewhere similar)? Or does this already exist as some function I'm not aware of?

This issue was moved by lionel- from r-lib/rlang#246.

Print method for vctrs

I recently inspected flights$arr_delay and hit getOption("max.print"). Does anyone ever want that? If I execute the below, depending on whether I'm in R Console, use RStudio's "Knit" button or use rmarkdown::render() I get anywhere from the first 1K to 10K elements. 😢

library(nycflights13)
flights$arr_delay
