tonyfischetti / assertr

Assertive programming for R analysis pipelines

Home Page: https://docs.ropensci.org/assertr

License: Other

R 99.54% Makefile 0.46%

predicate-functions analysis-pipeline assertions assertion-methods assertion-library r rstats r-package peer-reviewed

assertr's Introduction

assertr

What is it?

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

This package does not need to be used with the magrittr/dplyr piping mechanism but the examples in this README use them for clarity.

Installation

You can install the latest version on CRAN like this

    install.packages("assertr")

or you can install the bleeding-edge development version like this:

    install.packages("devtools")
    devtools::install_github("ropensci/assertr")

What does it look like?

This package offers five assertion functions, assert, verify, insist, assert_rows, and insist_rows, that are designed to be used shortly after data-loading in an analysis pipeline...

Let’s say, for example, that R’s built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding. Pretend we wanted to find the average miles per gallon for each number of engine cylinders. We might first want to confirm

that it has the columns "mpg", "vs", "am", and "wt"
that the dataset contains more than 10 observations
that the column for 'miles per gallon' (mpg) contains only positive numbers
that the column for 'miles per gallon' (mpg) does not contain a datum that is outside 4 standard deviations from its mean
that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
that each row contains at most 2 NAs
that each row is unique jointly between the "mpg", "am", and "wt" columns, and
that each row's mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

This could be written (in order) using assertr like this:

    library(dplyr)
    library(assertr)

    mtcars %>%
      verify(has_all_names("mpg", "vs", "am", "wt")) %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
      assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

If any of these assertions were violated, an error would have been raised and the pipeline would have been terminated early.

Let's see what the error message looks like when you chain a bunch of failing assertions together.

    > mtcars %>%
    +   chain_start %>%
    +   assert(in_set(1, 2, 3, 4), carb) %>%
    +   assert_rows(rowMeans, within_bounds(0,5), gear:carb) %>%
    +   verify(nrow(.)==10) %>%
    +   verify(mpg < 32) %>%
    +   chain_end
    There are 7 errors across 4 verbs:
    -
             verb redux_fn           predicate     column index value
    1      assert     <NA>  in_set(1, 2, 3, 4)       carb    30   6.0
    2      assert     <NA>  in_set(1, 2, 3, 4)       carb    31   8.0
    3 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    30   5.5
    4 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    31   6.5
    5      verify     <NA>       nrow(.) == 10       <NA>     1    NA
    6      verify     <NA>            mpg < 32       <NA>    18    NA
    7      verify     <NA>            mpg < 32       <NA>    20    NA

    Error: assertr stopped execution

What does `assertr` give me?

verify - takes a data frame (its first argument is provided by the %>% operator above), and a logical (boolean) expression. Then, verify evaluates that expression using the scope of the provided data frame. If any of the logical values of the expression's result are FALSE, verify will raise an error that terminates any further processing of the pipeline.
assert - takes a data frame, a predicate function, and an arbitrary number of columns to apply the predicate function to. The predicate function (a function that returns a logical/boolean value) is then applied to every element of the columns selected, and will raise an error if it finds any violations. Internally, the assert function uses dplyr's select function to extract the columns to test the predicate function on.
insist - takes a data frame, a predicate-generating function, and an arbitrary number of columns. For each column, the predicate-generating function is applied, returning a predicate. The predicate is then applied to every element of the columns selected, and will raise an error if it finds any violations. The reason for using a predicate-generating function to return a predicate to use against each value in each of the selected columns is so that, for example, bounds can be dynamically generated based on what the data look like; this is the only way to, say, create bounds that check if each datum is within x z-scores, since the standard deviation isn't known a priori. Internally, the insist function uses dplyr's select function to extract the columns to test the predicate function on.
assert_rows - takes a data frame, a row reduction function, a predicate function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate function is then applied to every element of the vector returned from the row reduction function, and will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the num_row_NAs() function to ensure that each row has fewer than a certain number of missing values. Internally, the assert_rows function uses dplyr's select function to extract the columns to test the predicate function on.
insist_rows - takes a data frame, a row reduction function, a predicate-generating function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate-generating function is then applied to the vector returned from the row reduction function and the resultant predicate is applied to each element of that vector. It will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the maha_dist() function to ensure that there are no flagrant outliers. Internally, the insist_rows function uses dplyr's select function to extract the columns to test the predicate function on.
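
The difference between a predicate (used with assert) and a predicate generator (used with insist) can be sketched as follows. This is a minimal illustration, not part of the package: is_even and within_iqr_fence are hypothetical helpers written for this example, and only assert, insist, and within_bounds come from assertr.

```r
library(assertr)

# Hypothetical predicate: TRUE when a single value is an even number.
is_even <- function(x) x %% 2 == 0

# Hypothetical predicate generator: takes a multiplier k and returns a
# function that, given a whole column, builds a bounds predicate from
# that column's quartiles.
within_iqr_fence <- function(k) {
  function(a_vector) {
    q <- unname(quantile(a_vector, c(0.25, 0.75), na.rm = TRUE))
    iqr <- q[2] - q[1]
    within_bounds(q[1] - k * iqr, q[2] + k * iqr)
  }
}

# assert applies the predicate to every element of cyl;
# insist first derives the bounds from mpg, then checks every element.
checked <- insist(assert(mtcars, is_even, cyl), within_iqr_fence(3), mpg)
```

Because the bounds are computed from the column itself, the same generator can be reused on any numeric column without hard-coding limits.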

assertr also offers four (so far) predicate functions designed to be used with the assert and assert_rows functions:

not_na - that checks if an element is not NA
within_bounds - that returns a predicate function that checks if a numeric value falls within the bounds supplied, and
in_set - that returns a predicate function that checks if an element is a member of the set supplied. (also allows inverse for "not in set")
is_uniq - that checks to see if each element appears only once

and predicate generators designed to be used with the insist and insist_rows functions:

within_n_sds - used to dynamically create bounds to check vector elements with based on standard z-scores
within_n_mads - better method for dynamically creating bounds to check vector elements with based on 'robust' z-scores (using median absolute deviation)

and the following row reduction functions designed to be used with assert_rows and insist_rows:

num_row_NAs - counts number of missing values in each row
maha_dist - computes the mahalanobis distance of each row (for outlier detection). It will coerce categorical variables into numerics if it needs to.
col_concat - concatenates the selected columns of each row into a single string
duplicated_across_cols - checking if a row contains a duplicated value across columns

and, finally, some other utilities for use with verify

has_all_names - check if the data frame or list has all supplied names
has_only_names - check that a data frame or list has only the names requested
has_class - checks if passed data has a particular class
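
Since the predicates and row reduction functions listed above are ordinary R functions, they can also be tried out directly, outside a pipeline. The snippet below is a small sketch of that; the data frame toy and the variable names are made up for illustration.

```r
library(assertr)

# in_set and within_bounds return predicate functions
only_binary   <- in_set(0, 1)
plausible_mpg <- within_bounds(0, 60)

only_binary(1)      # is 1 a member of {0, 1}?
only_binary(2)      # is 2 a member of {0, 1}?
plausible_mpg(21)   # is 21 within [0, 60]?
not_na(NA)          # is the value not missing?

# Row reduction functions take a data frame and return one value per row
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
na_counts <- num_row_NAs(toy)
```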

More info

For more info, check out the assertr vignette

    > vignette("assertr")

assertr's Issues

Allow parameter that will count failures (in assert or insist)

The default will always be to terminate execution at the first exception, but add an optional parameter that will override this behavior and count the number of failures

maha_dist's data.matrix doesn't convert non-factor strings into numerics

I discovered that data.matrix doesn't convert non-factor strings into numbers like I assumed it did.

Implement assertr::insist on dataframe grouped with dplyr::group_by

What about being able to run functions like within_n_mads() on a group wise basis, for example:

library(dplyr)
library(assertr)

mtcars %>%
  group_by(cyl) %>%
  insist(within_n_mads(2), mpg)

Unless I'm very much mistaken, this isn't currently implemented, but would be extremely powerful for testing datasets which are a mix of categorical and continuous variables.
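
Until something like this is built in, one possible workaround is to split the data by group and run insist on each piece, so that the bounds are computed per group. This is only a sketch using base R's split and lapply; the threshold of 5 MADs is an arbitrary choice for the example.

```r
library(assertr)

# Split mtcars into one data frame per cylinder count
by_cyl <- split(mtcars, mtcars$cyl)

# Run the check separately on each group, so the median and MAD
# are computed within the group rather than over the whole column
checked <- lapply(by_cyl, function(group) {
  insist(group, within_n_mads(5), mpg)
})
```

If any group violated the bounds, insist would stop at that group with its usual error.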

Implement "check_rows" verb

check_rows :: data.frame df : df -> (df -> [a]) -> ([a] -> ([a] -> Bool)) -> df

check_rows(a, b, c, ...)
a = data frame
b = function that takes data frame and returns nrow vector
c = predicate generating function
... optional "select" rows

apply predicate generating function to return value from b(a)
then map the resultant predicate [c(b(a))] over return value from b(a)
[map(c(b(a)), b(a))]

example:

iris %>%
   check_rows(cosine_dist, within_n_sds(5), -Species) %>%
    ...

new example for vignette

Hi what a nice package--

Perhaps you'd like to add to your vignette that it's easy to check that a data frame contains a column

data.frame(a=c(1,2,3)) %>%
    verify("a" %>% exists) %>%    # ok
    verify("b" %>% exists)        # fails
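
The has_all_names helper described in the README covers the same need. A short sketch, assuming a version of assertr that includes has_all_names (small_df is a made-up data frame):

```r
library(assertr)

small_df <- data.frame(a = c(1, 2, 3))

# Passes: column "a" is present, so the data frame is returned
ok <- verify(small_df, has_all_names("a"))

# Fails: column "b" is absent; the error is caught here so the
# script can continue
missing_b <- tryCatch(
  verify(small_df, has_all_names("b")),
  error = function(e) e
)
```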

feature request: adding attribute information for succeeded assertions

I am looking for ways to visualise the result of an assertr chain, e.g. with shiny. Therefore, I need a complete list of succeeded and failed assertions. Setting the option error_fun to error_append in chain_end gives error messages for failed assertions as attributes. I think it would be helpful to also get something similar for successful assertions, so at the end of the chain, I have a complete list of all assertions tested and whether they failed or not. I looked at the source code and it seems that just adding a new success_and_error_function is not sufficient to achieve my goal.

Option to receive warning instead of error

It would be helpful to allow for warnings rather than errors. I think the way to do this would be to give each testing function an argument like warn, with its default taken from a package-level option (options(assert_warn=FALSE)) that is FALSE by default.
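
In versions of assertr that accept an error_fun argument (as used elsewhere on this page), a warning can be requested per call. The sketch below assumes the just_warn error function is available in the installed version.

```r
library(assertr)

# Some mpg values exceed 20, so the assertion fails; with just_warn
# the failure is reported as a warning and the data is still returned
checked <- assert(mtcars, within_bounds(0, 20), mpg, error_fun = just_warn)
```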

Do data.tables really not work?!

Investigate this.

error list for verify function

Is it possible to output an error list for verify function as assert does?

It will be helpful to have the index and all used variables in the expression/function.

new no_duplicates predicate for verify

Here is a predicate for verify that tests if there are any duplicate keys in a data.frame. Feel free to include it with the package if you'd like.

#' Returns TRUE if no values are duplicated
#'
#' This function tests if values in the specified columns have no
#' duplicate values. This is the inverse of
#' \code{\link[base]{duplicated}}. This is a convenience function
#' meant to be used as a predicate in an \code{\link{assertr}}
#' verify statement.
#'
#' Warning: Since this uses \code{\link[base]{duplicated}}, the columns
#' are pasted together with '\t', this will fail if any of the
#' specified columns contain '\t' characters.
#'
#' @param ... columns from the data.frame
#' @return A vector of the same length that is TRUE when the values in
#' the specified columns are distinct
#' @seealso \code{\link{duplicated}}
#' @examples
#' z <- data.frame(a=c(1,2,3,4,5,6), b=c(1,1,1,3,3,3), c=c(1,2,3,1,2,3))
#' zz <- z %>% verify(no_duplicates(a))
#' zz <- z %>% verify(no_duplicates(b))    # verification failed! (4 failures)
#' zz <- z %>% verify(no_duplicates(c))    # verification failed! (3 failures)
#' zz <- z %>% verify(no_duplicates(a,b))
#' zz <- z %>% verify(no_duplicates(a,c))
#' zz <- z %>% verify(no_duplicates(b,c))
#' zz <- z %>% verify(no_duplicates(a,b,c))
#'
#' @export
no_duplicates <- function(...){
    args <- list(...)
    args$incomparables=F
    !duplicated.data.frame(args)
}

Implement either parallel execution or Rcpp execution

within_n_mads chokes on mtcars$vs

> mtcars %>% insist(within_n_mads(2), vs)
Error in within_bounds((dmed - (n * dmad)), (dmed + (n * dmad)), ...) :
  lower bound must be strictly lower than upper bound

Implement "assert_rows"

assert_rows :: data.frame df => df -> (df -> [a]) -> (a -> Bool) -> df

assert_rows(a, b, c, ...)
a = data frame
b = function that takes data frame and returns nrow vector
c = predicate
... optional "select" rows

map the predicate c over return value from b(a)
[map(c), b(a))]

example:

iris %>%
   assert_rows(cosine_dist, within_bounds(0,.5), -Species) %>%
    ...

the error message will read like....
Error: Assertion 'within_bounds' violated at row 4 of iris (value: 32.4)

Error chaining broken with assert_rows and insist_rows

If assert_rows or insist_rows is called during a chain, and it finds no new errors, and there are existing errors from the earlier parts of the chain, it will return an empty string. This breaks the chain. To continue the chain, these functions should always call either the success function or the error function and return the data set.

Here is a transcript of a session exhibiting the problem. I believe that the output from the first chain is correct, and that that from the second and third is incorrect. Indeed, all three chains should give the same output.

> library(assertr)
> library(magrittr)
> test.df = data.frame(x=c(1,0,2))
> test.df %>% chain_start %>% verify(x>0) %>% chain_end
There is 1 error:

- verification [x > 0] failed! (1 failure)

Error: assertr stopped execution
> test.df %>% chain_start %>% verify(x>0) %>% assert_rows(col_concat, is_uniq, x) %>% chain_end
[1] ""
> test.df %>% chain_start %>% verify(x>0) %>% assert_rows(col_concat, is_uniq, x) %>% assert_rows(col_concat, is_uniq, x) %>% chain_end
Error in UseMethod("select_") : 
  no applicable method for 'select_' applied to an object of class "character"
>

Patch file below.

First commit adds tests (which fail), second fixes bug (making tests pass). Consider squashing or rewriting them.

From 844789702f93338fabdaad222a4997bd56853cfc Mon Sep 17 00:00:00 2001
From: Peter Wicks Stringfield <[email protected]>
Date: Tue, 21 Mar 2017 16:20:38 -0500
Subject: [PATCH 1/2] add tests for chaining error

---
 tests/testthat/test-assertions.R | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/tests/testthat/test-assertions.R b/tests/testthat/test-assertions.R
index beb085a..84b791f 100644
--- a/tests/testthat/test-assertions.R
+++ b/tests/testthat/test-assertions.R
@@ -24,6 +24,7 @@ mnexmpl.data[12,1] <- NA
 nanmnexmpl.data <- mnexmpl.data
 nanmnexmpl.data[10,1] <- 0/0
 
+test.df <- data.frame(x = c(0,1,2))
 
 # custom error (or success) messages
 yell <- function(message){
@@ -560,3 +561,38 @@ test_that("insist_rows breaks appropriately (using se)", {
 ###########################################
 
 
+########## chaining works ############
+
+# A special error function for these tests, produces the error but no
+# standard output.
+error_no_output <- function (list_of_errors, data=NULL, ...) {
+  stop("assertr stopped execution", call.=FALSE)
+}
+
+test_that("assert_rows works with chaining", {
+  code_to_test <- function () {
+    test.df %>%
+      chain_start %>%
+      # This gives one error.
+      assert(within_bounds(1, Inf), x) %>%
+      # This gives no errors.
+      assert_rows(col_concat, is_uniq, x) %>%
+      assert_rows(col_concat, is_uniq, x) %>%
+      chain_end(error_fun = error_no_output)
+  }
+  expect_error(code_to_test(),
+               "assertr stopped execution")
+})
+
+test_that("insist_rows works with chaining", {
+  code_to_test <- function () {
+    test.df %>%
+      chain_start %>%
+      assert(within_bounds(1, Inf), x) %>%
+      insist_rows(col_concat, function (a_vector) {function (xx) TRUE}, x) %>%
+      insist_rows(col_concat, function (a_vector) {function (xx) TRUE}, x) %>%
+      chain_end(error_fun = error_no_output)
+  }
+  expect_error(code_to_test(),
+               "assertr stopped execution")
+})
-- 
1.9.1


From ff69f4ff6fa4537b97f811403290e4b739f4c1fa Mon Sep 17 00:00:00 2001
From: Peter Wicks Stringfield <[email protected]>
Date: Tue, 21 Mar 2017 16:20:52 -0500
Subject: [PATCH 2/2] fix chaining error

---
 R/assertions.R | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/R/assertions.R b/R/assertions.R
index cbb2173..b43cb48 100644
--- a/R/assertions.R
+++ b/R/assertions.R
@@ -253,7 +253,10 @@ assert_rows_ <- function(data, row_reduction_fn, predicate, ..., .dots,
 
   num.violations <- sum(!log.vec)
   if(num.violations==0)
-    return("")
+    # There are errors, just no new ones, so calling success
+    # is inappropriate, so we must call the error function.
+    # NOT calling either function would break the pipeline.
+    return(error_fun(list(), data=data))
   loc.violations <- which(!log.vec)
 
   error <- make.assertr.assert_rows.error(name.of.row.redux.fn,
@@ -517,7 +520,10 @@ insist_rows_ <- function(data, row_reduction_fn, predicate_generator, ...,
 
   num.violations <- sum(!log.vec)
   if(num.violations==0)
-    return("")
+    # There are errors, just no new ones, so calling success
+    # is inappropriate, so we must call the error function.
+    # NOT calling either function would break the pipeline.
+    return(error_fun(list(), data=data))
   loc.violations <- which(!log.vec)
 
   error <- make.assertr.assert_rows.error(name.of.row.redux.fn,
-- 
1.9.1

After patch:

> library(assertr)
> library(magrittr)
> test.df = data.frame(x=c(1,0,2))
> test.df %>% chain_start %>% verify(x>0) %>% chain_end
There is 1 error:

- verification [x > 0] failed! (1 failure)

Error: assertr stopped execution
> test.df %>% chain_start %>% verify(x>0) %>% assert_rows(col_concat, is_uniq, x) %>% chain_end
There is 1 error:

- verification [x > 0] failed! (1 failure)

Error: assertr stopped execution
> test.df %>% chain_start %>% verify(x>0) %>% assert_rows(col_concat, is_uniq, x) %>% assert_rows(col_concat, is_uniq, x) %>% chain_end
There is 1 error:

- verification [x > 0] failed! (1 failure)

Error: assertr stopped execution
>

Not return the data frame on success?

Is it possible to have assert() not return the data frame? I tried setting success_fun = function() {} but it complains
Error in success_fun(data) : unused argument (data)
I want to use assert() to implement a check but do nothing if it passes. Returning the data frame results in it being printed to the screen when I run my code from the command line. Wrapping the check inside capture.output(assert(...), file='/dev/null') causes problems when there is an error.
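
The error above arises because the success function is called with the data as its first argument. A success function that accepts that argument and returns nothing visible might look like the sketch below; quiet_success is a hypothetical helper, and the extra ... is there in case the installed version passes further arguments.

```r
library(assertr)

# Hypothetical success function: accept the data (and anything else
# assertr passes along) and return NULL invisibly
quiet_success <- function(data, ...) invisible(NULL)

result <- assert(mtcars, within_bounds(0, Inf), mpg,
                 success_fun = quiet_success)
```

On failure the default error behaviour is unchanged, so the check still stops the script.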

Vignette doesn't work

Update tested error message

Some error messages in dplyr have changed with the switch to tidyselect, and the tests in assertr now fail: https://github.com/krlmlr/dplyr/blob/r-0.7.5/revdep/problems.md#newly-broken

Generally I prefer to test only messages and output which are under control of the package that tests them, to avoid such breakages.

Implement row-wise distance functions in C++

Mahalanobis
cosine

`maha_dist` incompatible with `tbl_df`

Data frames with a tbl_df wrapper (e.g., data imported with the readr package) raise an error in the maha_dist function.

link to vignette is broken! http://www.onthelambda.com/wp-content/uploads/2015/03/assertr.html (linked from README)

in_set should be vectorized

something %>%
  filter(in_set(c("M", "F"))(Gender)) %>%
  something

Throws a "bounds must be checked on a single element"
which is the wrong error, anyway

Increase efficiency of in_set

increase efficiency of in_set by using any()
maybe make it vectorized

It doesn't make full predicates out of half ones anymore

assert_rows(mtcars, num_row_NAs, function(x) if(x==10) FALSE, everything())

Says

Error in vapply(a.column, predicate, logical(1)) : 
  values must be length 1,
 but FUN(X[[1]]) result is length 0

Add predicate for percentage of records with that value...

This would help identify misspecified categorical variables. If, for example, in iris a few versicolors were versecolor, the percentage of records with that value would be low.

I haven't thought this through yet. Can I use it with within_bounds? Or is it its own complete function?

Assertion about dates

This is a feature request - thank you for all your work on assertr so far.

Could there be assertions about dates?

I realise I can write a predicate function or convert to numeric. Just wondering about some built in approach.

Here are a couple of approaches I've used

within_dateRange <- function(x, date_min, date_max) {
  x > date_min & x < date_max
}

assert(within_bounds(as.numeric(as.Date('2015-01-01')), as.numeric(as.Date('2015-02-01'))), Sample_date)

mutate(Sample_date_num = as.numeric(Sample_date)) %>% 
assert(within_bounds(as.numeric(as.Date('2015-01-01')), as.numeric(as.Date('2015-02-01'))), Sample_date_num)
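
Another option that avoids the numeric conversion is to build the range check as a small function and evaluate it with verify, which works on whole columns. This is a sketch only: within_date_range and the samples data frame are made up for illustration, and the bounds here are inclusive.

```r
library(assertr)

# Hypothetical helper: returns a function that checks whether dates
# fall inside [date_min, date_max]; missing dates count as failures
within_date_range <- function(date_min, date_max) {
  function(x) !is.na(x) & x >= date_min & x <= date_max
}

samples <- data.frame(
  Sample_date = as.Date(c("2015-01-05", "2015-01-20"))
)

in_january <- within_date_range(as.Date("2015-01-01"), as.Date("2015-02-01"))

# verify evaluates the expression with the data frame's columns in scope
checked <- verify(samples, in_january(Sample_date))
```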

Detect if a data type check is being performed on a column

Detect if a data type check is being performed on a column so that assert doesn't have to check every element

Add within_n_mads predicate generator

Like within_n_sds predicate generator but better and more robust

Check if a specific "cell" has a value

I tried to write an assertion to check that a specific cell has a specific value (e.g. to check that a table from a web scrape via rvest has the right format), but somehow this doesn't work because the predicates assume that they are handled on a single element (and are vectorized by insist) and therefore have no knowledge about the row number :-(

The idea is this:

html %>% html_node(css=".table") %>% html_table() %>% insist(row_is(1, "Start"), X1)

This would check that the first element in column X1 is the string Start.

Here are two versions which do not work:

# Does not work because the innermost function does not know which row it is
row_is <- function(row_n, value){
    if (!is.numeric(row_n)) 
        stop("row_n must be numeric")
    function(a_vector){
        function(x){
            if (is.na(value)){
                is.na(a_vector[row_n])
            } else {
                a_vector[row_n] == value
            }
        }
    }
}
# Does not work because one function too little...    
row_is_not <- function(row_n, value){
    if (!is.numeric(row_n)) 
        stop("row_n must be numeric")
    fun <- function(a_vector){
        res = rep(TRUE, length(a_vector))
        if (is.na(value)){
            res[row_n] <- !is.na(a_vector[row_n])
        } else {
            res[row_n] <- a_vector[row_n] != value
        }
        return(res)
    }
    return(fun)
}
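
Since verify evaluates an arbitrary logical expression with the columns in scope, a single-cell check can be written by indexing the column directly. A minimal sketch, with a made-up data frame standing in for the scraped table:

```r
library(assertr)

# Stand-in for the result of html_table()
scraped <- data.frame(X1 = c("Start", "a", "b"), stringsAsFactors = FALSE)

# The expression yields a single logical value for the first cell of X1
checked <- verify(scraped, X1[1] == "Start")
```

For a cell that is expected to be missing, the expression could be is.na(X1[1]) instead.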

Vectorize not_na and within_bounds...

this should also make within_n_sds vectorized. confirm this!

Implement standard evaluation counterparts

Like the \w+_ functions in dplyr to allow standard evaluation

Feature request: support for .data pronoun

Thanks for the great package!

I'm wondering if you'd consider adding support for the .data and .env pronouns from the tidyverse's new tidyeval framework. (.data says "look in the dataframe for the column" while .env says something like "look in the environment for the value")

(The dplyr programming vignette has some background.) I think the rlang::eval_tidy function will be key here.

Here's a demo:

library(rlang)
library(assertr)

mtcars %>% verify(mpg > 30)  # verify works!
#> verification [mpg > 30] failed! (28 failures)

mgp <- 40
mtcars %>% verify(mgp > 30)  # fat fingers :(
# No error because mgp exists elsewhere

# I'd like to write:
mtcars %>% verify(.data$mpg > 30)
# and 
mtcars %>% verify(.data$mpg > .env$mgp)


# How it works in dplyr:
mtcars %>% dplyr::mutate(x = mgp)  # no error, but mgp always 40

# but if we're careful, we get an error
mtcars %>% dplyr::mutate(x = .data$mgp)
#> Error in mutate_impl(.data, dots) : 
#>  Evaluation error: Column `mgp`: not found in data.

Upgrade travis ci mechanism

Implement "insist_rows"

insist_rows :: data.frame df : df -> (df -> [a]) -> ([a] -> ([a] -> Bool)) -> df

insist_rows(a, b, c, ...)
a = data frame
b = function that takes data frame and returns nrow vector
c = predicate generating function
... optional "select" rows

apply predicate generating function to return value from b(a)
then map the resultant predicate [c(b(a))] over return value from b(a)
[map(c(b(a)), b(a))]

example:

iris %>%
   insist_rows(maha_dist, within_n_sds(5), -Species) %>%
    ...

the error message will read like....
Error: Assertion 'within_n_sds' violated at row 4 of iris (value: 32.4)

Add verbs related to missing values

Like disallowing above a certain percentage of missing values, etc....

Mention standard evaluation counterparts in vignette

Add support for vectorized predicates in assert by using comment

Need to figure out solution to use instead of "deparse"

because it makes a NOTE and then CRAN will never accept the package :(

Updated documentation when insist_rows comes out

this function is meant to be used with the \code{\link{insist}} function to

Read over all of it, actually

Assertion for duplicates

Hi thanks for putting together assertr. I am new to using it, sorry if this should be straightforward.

How would you approach writing a custom predicate for duplicates using assertr? It seems that some basic design choices of assertr make this hard.

The function should:

Test whether specified vars are jointly duplicates
Tell you something informative on failure (ideal would be the number of unique duplicate values, number of total duplicated rows, and an example set of row numbers with duplicates. Minimum would be the number of unique duplicate values)
Be extremely fast

My understanding is that this would be hard because assert applies the predicate function to each specified variable, and I would want all of the specified variables passed to the predicate function.

The ideal would be something like:

df <- data.frame(
    country = c("USA", "USA", "Canada", "Canada"), 
    region = c("Alabama", "Alabama", "Alberta", "Quebec")
)
df %>% assert(no_duplicates, country, region)
#Error: 
#country, region in df has 1 unique duplicated value and 2 duplicated rows (e.g., (1, 2))

Or maybe it's just assert(no_duplicates(country, region))?

Thanks.
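
The README's combination of assert_rows, col_concat, and is_uniq gets close to this, because the row reduction step sees all the selected columns at once. The sketch below applies it to the example above; the error is caught only so that the snippet runs to completion, and the exact wording of the report depends on the installed version.

```r
library(assertr)

regions <- data.frame(
  country = c("USA", "USA", "Canada", "Canada"),
  region  = c("Alabama", "Alabama", "Alberta", "Quebec")
)

# col_concat joins country and region per row; is_uniq then flags
# rows whose joined value appears more than once
result <- tryCatch(
  assert_rows(regions, col_concat, is_uniq, country, region),
  error = function(e) conditionMessage(e)
)
```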

Feature request: Add return_true option to verify

Would it be possible to add a return_true option to verify and assert that returns just a short "TRUE" if the expr or predicate is TRUE instead of returning the data?

Would a verify_if make sense?

Consider the following data.frame

df  <- mtcars  %>% tibble::rownames_to_column(.)  %>%  dplyr::select(cyl, rowname, vs:carb)

Suppose I want to verify whether all numeric/double columns are really integers. For one column I could do:

verify(df, all(cyl == floor(cyl)))

(See http://stackoverflow.com/a/10114392)

But if I want to verify this for multiple columns (I guess) I would have to use the (IMHO) rather verbose:

verify(df, all(sapply(Filter(is.numeric, df), function(col) all(col == floor(col)))))

So wouldn't a verify_if, along the lines of dplyr's select_if and mutate_if, be a good idea? :)
Or am I missing something?

verify_if(df, is.numeric, all(cyl == floor(cyl)))
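
A less verbose route that already exists is assert, which applies one predicate to several selected columns. The sketch below assumes the numeric columns are known by name; is_whole is a hypothetical predicate and cars_df is a simplified stand-in for the data frame above.

```r
library(assertr)

# Only the numeric columns from the example above
cars_df <- mtcars[, c("cyl", "vs", "am", "gear", "carb")]

# Hypothetical predicate: TRUE when a value has no fractional part
is_whole <- function(x) x == floor(x)

# The predicate is applied to every element of each selected column
checked <- assert(cars_df, is_whole, cyl, vs:carb)
```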

Adding print output to stop/warning functions

I have a use for this package where I'm iterating over many data frames using map and verifying their inputs. I want to be able to identify which ones are a success and which ones have an error with the error information included.

In the current setup of the package, the error information is not captured. Below is a reproducible example.

library(tidyverse) 
library(assertr)

safe_assert <- safely(assert)
current <- map(seq(20, 40, 10), ~{
  safe_assert(mtcars, within_bounds(0, .x), mpg, error_fun = error_report)})

current

[[1]]
[[1]]$result
NULL

[[1]]$error
<simpleError: assertr stopped execution>

[[2]]
[[2]]$result
NULL

[[2]]$error
<simpleError: assertr stopped execution>

[[3]]
[[3]]$result
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

I think it would be beneficial to include the print information in the stop and warning functions of the error_report function (see below). Have you considered this?

error_report <- function(errors, data=NULL, warn=FALSE, ...){
  if(!is.null(data) && !is.null(attr(data, "assertr_errors")))
    errors <- append(attr(data, "assertr_errors"), errors)
  num.of.errors <- length(errors)
  head <- paste0(sprintf("There %s %d error%s:\n",
                         if (num.of.errors==1) "is" else "are",
                         num.of.errors,
                         if (num.of.errors==1) "" else "s"))
  body <- sapply(errors, function(x) paste0("\n- ", x))
  if(!warn)
    stop(paste0("assertr stopped execution", "\n\n", head, body, "\n"), call.=FALSE)
  warning(paste0("assertr encountered errors", "\n\n", head, body, "\n"), call.=FALSE)
  return(data)
}

Rerunning the same example with the changes to error_report the new output is below.

library(tidyverse) 
library(assertr)

# with change to error_report function()
safe_assert <- safely(assert)
current <- map(seq(20, 40, 10), ~{
  safe_assert(mtcars, within_bounds(0, .x), mpg, error_fun = error_report)})

current

[[1]]
[[1]]$result
NULL

[[1]]$error
<simpleError: assertr stopped execution

There is 1 error:

Error in "within_bounds(0, .x)": Column 'mpg' violates assertion 'within_bounds(0, .x)' 14 times

[[2]]
[[2]]$result
NULL

[[2]]$error
<simpleError: assertr stopped execution

There is 1 error:

Error in "within_bounds(0, .x)": Column 'mpg' violates assertion 'within_bounds(0, .x)' 4 times

[[3]]
[[3]]$result
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

If you are open to this, let me know and I can make a pull request. And if you don't want to make the change to the default settings, a solution could be having a global package option to include or not include in the stop and warning functions.

Thanks. This is a great package!

put NEWS bits in releases please

hey @tonyfischetti - We want all rOpenSci pkgs to consistently keep track of changes, following https://github.com/ropensci/onboarding/blob/master/packaging_guide.md#-news

you already keep a NEWS file, thanks! but looks like it doesn't include news for your latest CRAN version?
thanks for git tagging! but can you do a git tag for each CRAN release?
You have for some releases, but could you please use the releases tab on this repo to include the associated NEWS items for every tag/version ? thanks 😄

Add verb/grammar that will allow for observation-wise assertions

To ensure that, for example, Mahalanobis distance isn't crazy for each particular observation

Option to list all violating rows instead of using e.g.

Excellent package!

I'm wondering if there currently exists (or if there might be in the future) an option in assert() that could list each violation of a predicate in the given column, rather than simply listing the number of times and one example?
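
Later versions of assertr report one row per violation, as in the chained example in the README. Assuming the installed version provides the error_df_return error function, the violations can also be collected as a data frame instead of stopping execution:

```r
library(assertr)

# Instead of raising an error, return a data frame describing
# every element of carb that is not in the allowed set
violations <- assert(mtcars, in_set(1, 2, 3, 4), carb,
                     error_fun = error_df_return)
```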

Steal validation ideas from the following sources...

https://github.com/dataproofer/Dataproofer

it looks like we already have all of these

num_row_NAs doesn't work when "data.frame" is not in the first position of class(my_data)

When num_row_NAs gets an object whose class vector does not have "data.frame" in first position (e.g. output from dplyr/tibble), it throws the error "data" must be a data.frame (or matrix).

mtcars <- tibble::as_tibble(mtcars)
class(mtcars)
[1] "tbl_df"     "tbl"        "data.frame"
assertr::num_row_NAs(mtcars)
Error: "data" must be a data.frame (or matrix)

It is due to a line in row-redux.R:

if(!(class(data) %in% c("matrix", "data.frame")))

It can be fixed, e.g., as follows:

if(!(any(class(data) %in% c("matrix", "data.frame"))))
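
Base R's inherits() expresses the same test more directly, since it checks the whole class vector. A small sketch in base R only; is_tabular is a hypothetical helper name and tbl_like imitates a tibble's class vector without needing the tibble package.

```r
# Hypothetical helper: TRUE for anything that is a matrix or inherits
# from data.frame, wherever "data.frame" sits in the class vector
is_tabular <- function(data) inherits(data, c("matrix", "data.frame"))

# Imitate the class vector of a tibble
tbl_like <- data.frame(x = 1:3)
class(tbl_like) <- c("tbl_df", "tbl", "data.frame")

is_tabular(tbl_like)          # data.frame is last in the class vector
is_tabular(matrix(1:4, 2))    # a plain matrix
is_tabular(1:3)               # an ordinary vector
```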

error function that returns data that triggered errors

Would it be possible to have an error function that returns the data that failed at least one test?

Parameterize error function

Right now assertr uses "stop" and a message to halt execution. I should allow the user to specify their own function that will be called when an assertion is violated.

For example, one could make a function that accepts the error string and emails the error message before calling stop

a generalized solution is to make a function that takes an email address and an error string, emails the error string to the email address and then halts execution. Then, a partially applied function can be used as the custom error function...

email.and.error <- function(email, message){
  ...
}

mtcars %>% assert(in_set(c(1,0)), vs, on_error=partial(email.and.error, email="[email protected]"))
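
With the error_fun argument that appears elsewhere on this page, the same idea can be sketched as a function factory. Everything here is illustrative: notify_and_stop is a hypothetical name, the address is a placeholder, and the notification is only logged with message() rather than actually emailed.

```r
library(assertr)

# Hypothetical factory: returns an error function bound to a recipient
notify_and_stop <- function(recipient) {
  function(errors, data = NULL, ...) {
    # A real implementation would send an email here; this sketch only logs
    message("would notify ", recipient, " about ",
            length(errors), " failed assertion(s)")
    stop("assertr stopped execution", call. = FALSE)
  }
}

# gear contains 3, 4, and 5, so this assertion fails and the custom
# error function runs; the error is caught so the snippet completes
result <- tryCatch(
  assert(mtcars, in_set(0, 1), gear,
         error_fun = notify_and_stop("analyst@example.com")),
  error = function(e) conditionMessage(e)
)
```

Closing over the recipient plays the role of the partially applied function described above.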

`df %>% assertr::assert(assertr::not_na, var)` gives incorrect message

Thanks for the great package!

Here's an incorrect message I got when calling the not_na() function through namespace:

df = data_frame(var = c(1, NA))
df %>% assertr::assert(assertr::not_na, var)
#> Error: 
#> Vector 'var' violates assertion '::' 1 time (value [NA] at index 2)

If assertr is loaded beforehand, the message is fine:

library(assertr)
df %>% assert(not_na, var)
#> Error: 
#> Vector 'var' violates assertion 'not_na' 1 time (value [NA] at index 2)

Also, is it possible to have a short cut function such as assertr::assert_not_na(df, var)?
