elbersb / tidylog

Tidylog provides feedback about dplyr and tidyr operations. It provides wrapper functions for the most common functions, such as filter, mutate, select, and group_by, and provides detailed output for joins.

License: Other

R 100.00%

dplyr r tidyr tidyverse wrapper-functions

tidylog's Introduction

tidylog

The goal of tidylog is to provide feedback about dplyr and tidyr operations. It provides simple wrapper functions for almost all dplyr and tidyr functions, such as filter, mutate, select, full_join, and group_by.

Example

Load tidylog after dplyr and/or tidyr:

library("dplyr")
library("tidyr")
library("tidylog", warn.conflicts = FALSE)

Tidylog will give you feedback, for instance when filtering a data frame or adding a new variable:

filtered <- filter(mtcars, cyl == 4)
#> filter: removed 21 rows (66%), 11 rows remaining
mutated <- mutate(mtcars, new_var = wt ** 2)
#> mutate: new variable 'new_var' (double) with 29 unique values and 0% NA

Tidylog reports detailed information for joins:

joined <- left_join(nycflights13::flights, nycflights13::weather,
    by = c("year", "month", "day", "origin", "hour", "time_hour"))
#> left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)
#>            > rows only in nycflights13::flights    1,556
#>            > rows only in nycflights13::weather (  6,737)
#>            > matched rows                        335,220
#>            >                                    =========
#>            > rows total                          336,776

In this case, we see that 1,556 rows from the flights dataset do not have weather information.

Tidylog can be especially helpful in longer pipes:

summary <- mtcars %>%
    select(mpg, cyl, hp, am) %>%
    filter(mpg > 15) %>%
    mutate(mpg_round = round(mpg)) %>%
    group_by(cyl, mpg_round, am) %>%
    tally() %>%
    filter(n >= 1)
#> select: dropped 7 variables (disp, drat, wt, qsec, vs, …)
#> filter: removed 6 rows (19%), 26 rows remaining
#> mutate: new variable 'mpg_round' (double) with 15 unique values and 0% NA
#> group_by: 3 grouping variables (cyl, mpg_round, am)
#> tally: now 20 rows and 4 columns, 2 group variables remaining (cyl, mpg_round)
#> filter (grouped): no rows removed

Here, it might have been accidental that the last filter command had no effect.

Installation

Download from CRAN:

install.packages("tidylog")

Or install the development version:

devtools::install_github("elbersb/tidylog")

Benchmarks

Tidylog will add a small overhead to each function call. This can be relevant for very large datasets and especially for joins. If you want to switch off tidylog for a single long-running command, simply prefix dplyr:: or tidyr::, such as in dplyr::left_join. See this vignette for more information.

More examples

filter, distinct, drop_na

a <- filter(mtcars, mpg > 20)
#> filter: removed 18 rows (56%), 14 rows remaining
b <- filter(mtcars, mpg > 100)
#> filter: removed all rows (100%)
c <- filter(mtcars, mpg > 0)
#> filter: no rows removed
d <- filter_at(mtcars, vars(starts_with("d")), any_vars((. %% 2) == 0))
#> filter_at: removed 19 rows (59%), 13 rows remaining
e <- distinct(mtcars)
#> distinct: no rows removed
f <- distinct_at(mtcars, vars(vs:carb))
#> distinct_at: removed 18 rows (56%), 14 rows remaining
g <- top_n(mtcars, 2, am)
#> top_n: removed 19 rows (59%), 13 rows remaining
i <- sample_frac(mtcars, 0.5)
#> sample_frac: removed 16 rows (50%), 16 rows remaining

j <- drop_na(airquality)
#> drop_na: removed 42 rows (27%), 111 rows remaining
k <- drop_na(airquality, Ozone)
#> drop_na: removed 37 rows (24%), 116 rows remaining

mutate, transmute, replace_na, fill

a <- mutate(mtcars, new_var = 1)
#> mutate: new variable 'new_var' (double) with one unique value and 0% NA
b <- mutate(mtcars, new_var = runif(n()))
#> mutate: new variable 'new_var' (double) with 32 unique values and 0% NA
c <- mutate(mtcars, new_var = NA)
#> mutate: new variable 'new_var' (logical) with one unique value and 100% NA
d <- mutate_at(mtcars, vars(mpg, gear, drat), round)
#> mutate_at: changed 28 values (88%) of 'mpg' (0 new NAs)
#>            changed 31 values (97%) of 'drat' (0 new NAs)
e <- mutate(mtcars, am_factor = as.factor(am))
#> mutate: new variable 'am_factor' (factor) with 2 unique values and 0% NA
f <- mutate(mtcars, am = as.ordered(am))
#> mutate: converted 'am' from double to ordered factor (0 new NA)
g <- mutate(mtcars, am = ifelse(am == 1, NA, am))
#> mutate: changed 13 values (41%) of 'am' (13 new NAs)
h <- mutate(mtcars, am = recode(am, `0` = "zero", `1` = NA_character_))
#> mutate: converted 'am' from double to character (13 new NA)

i <- transmute(mtcars, mpg = mpg * 2, gear = gear + 1, new_var = vs + am)
#> transmute: dropped 9 variables (cyl, disp, hp, drat, wt, …)
#>            changed 32 values (100%) of 'mpg' (0 new NAs)
#>            changed 32 values (100%) of 'gear' (0 new NAs)
#>            new variable 'new_var' (double) with 3 unique values and 0% NA

j <- replace_na(airquality, list(Solar.R = 1))
#> replace_na: changed 7 values (5%) of 'Solar.R' (7 fewer NAs)
k <- fill(airquality, Ozone)
#> fill: changed 37 values (24%) of 'Ozone' (37 fewer NAs)

joins

For joins, tidylog provides more detailed information. For any join, tidylog will show the number of rows that are only present in x (the first dataframe), only present in y (the second dataframe), and rows that have been matched. Numbers in parentheses indicate that these rows are not included in the result. Tidylog will also indicate whether any rows were duplicated (which is often unintentional):

x <- tibble(a = 1:2)
y <- tibble(a = c(1, 1, 2), b = 1:3) # 1 is duplicated
j <- left_join(x, y, by = "a")
#> left_join: added one column (b)
#>            > rows only in x  0
#>            > rows only in y (0)
#>            > matched rows    3    (includes duplicates)
#>            >                ===
#>            > rows total      3

More examples:

a <- left_join(band_members, band_instruments, by = "name")
#> left_join: added one column (plays)
#>            > rows only in band_members      1
#>            > rows only in band_instruments (1)
#>            > matched rows                   2
#>            >                               ===
#>            > rows total                     3
b <- full_join(band_members, band_instruments, by = "name")
#> full_join: added one column (plays)
#>            > rows only in band_members      1
#>            > rows only in band_instruments  1
#>            > matched rows                   2
#>            >                               ===
#>            > rows total                     4
c <- anti_join(band_members, band_instruments, by = "name")
#> anti_join: added no columns
#>            > rows only in band_members      1
#>            > rows only in band_instruments (1)
#>            > matched rows                  (2)
#>            >                               ===
#>            > rows total                     1

Because tidylog needs to perform two additional joins behind the scenes to report this information, the overhead will be larger than for the other tidylog functions (especially with large datasets).
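
As a rough illustration of where these numbers come from, the three counts can be reproduced in base R for a single key column. This is only a sketch under that assumption, not tidylog's internal code; join_counts is a hypothetical helper introduced here.

```r
# Hypothetical helper (not part of tidylog): reproduce the join summary
# counts for two data frames that share one key column.
join_counts <- function(x, y, key) {
  only_x  <- sum(!(x[[key]] %in% y[[key]]))  # rows of x without a match in y
  only_y  <- sum(!(y[[key]] %in% x[[key]]))  # rows of y without a match in x
  matched <- nrow(merge(x, y, by = key))     # inner join, duplicates included
  c(only_x = only_x, only_y = only_y, matched = matched)
}

x <- data.frame(a = 1:2)
y <- data.frame(a = c(1, 1, 2), b = 1:3)  # 1 is duplicated
counts <- join_counts(x, y, "a")
print(counts)
```

If matched is larger than the number of rows of x that have a partner in y, some rows were duplicated by the join.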

select, relocate, rename

a <- select(mtcars, mpg, wt)
#> select: dropped 9 variables (cyl, disp, hp, drat, qsec, …)
b <- select_if(mtcars, is.character)
#> select_if: dropped all variables
c <- relocate(mtcars, hp)
#> relocate: columns reordered (hp, mpg, cyl, disp, drat, …)
d <- select(mtcars, a = wt, b = mpg)
#> select: renamed 2 variables (a, b) and dropped 9 variables

e <- rename(mtcars, miles_per_gallon = mpg)
#> rename: renamed one variable (miles_per_gallon)
f <- rename_with(mtcars, toupper)
#> rename_with: renamed 11 variables (MPG, CYL, DISP, HP, DRAT, …)

summarize

a <- mtcars %>%
    group_by(cyl, carb) %>%
    summarize(total_weight = sum(wt))
#> group_by: 2 grouping variables (cyl, carb)
#> summarize: now 9 rows and 3 columns, one group variable remaining (cyl)

b <- iris %>%
    group_by(Species) %>%
    summarize_all(list(min, max))
#> group_by: one grouping variable (Species)
#> summarize_all: now 3 rows and 9 columns, ungrouped

tally, count, add_tally, add_count

a <- mtcars %>% group_by(gear, carb) %>% tally
#> group_by: 2 grouping variables (gear, carb)
#> tally: now 11 rows and 3 columns, one group variable remaining (gear)
b <- mtcars %>% group_by(gear, carb) %>% add_tally()
#> group_by: 2 grouping variables (gear, carb)
#> add_tally (grouped): new variable 'n' (integer) with 5 unique values and 0% NA

c <- mtcars %>% count(gear, carb)
#> count: now 11 rows and 3 columns, ungrouped
d <- mtcars %>% add_count(gear, carb, name = "count")
#> add_count: new variable 'count' (integer) with 5 unique values and 0% NA

pivot_longer, pivot_wider

longer <- mtcars %>%
    mutate(id = 1:n()) %>%
    pivot_longer(-id, names_to = "var", values_to = "value")
#> mutate: new variable 'id' (integer) with 32 unique values and 0% NA
#> pivot_longer: reorganized (mpg, cyl, disp, hp, drat, …) into (var, value) [was 32x12, now 352x3]
wider <- longer %>%
    pivot_wider(names_from = var, values_from = value)
#> pivot_wider: reorganized (var, value) into (mpg, cyl, disp, hp, drat, …) [was
#> 352x3, now 32x12]

Tidylog also supports gather and spread.

Turning logging off, registering additional loggers

To turn off the output for just a particular function call, you can simply call the dplyr and tidyr functions directly, e.g. dplyr::filter or tidyr::drop_na.

To turn off the output more permanently, set the global option tidylog.display to an empty list:

options("tidylog.display" = list())  # turn off
a <- filter(mtcars, mpg > 20)

options("tidylog.display" = NULL)    # turn on
a <- filter(mtcars, mpg > 20)
#> filter: removed 18 rows (56%), 14 rows remaining

This option can also be used to register additional loggers. The option tidylog.display expects a list of functions. By default (when tidylog.display is set to NULL), tidylog will use the message function to display the output, but if you prefer a more colorful output, simply overwrite the option:

library("crayon")  # for terminal colors
crayon <- function(x) cat(red$bold(x), sep = "\n")
options("tidylog.display" = list(crayon))
a <- filter(mtcars, mpg > 20)
#> filter: removed 18 rows (56%), 14 rows remaining

To print the output both to the screen and to a file, you could use:

log_to_file <- function(text) cat(text, file = "log.txt", sep = "\n", append = TRUE)
options("tidylog.display" = list(message, log_to_file))
a <- filter(mtcars, mpg > 20)
#> filter: removed 18 rows (56%), 14 rows remaining
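
Along the same lines, a logger can collect the messages in memory, for instance to inspect them at the end of a script. The sketch below assumes, as in the examples above, that each registered function receives the message text as its single argument; log_lines and collect are names made up for this illustration.

```r
# In-memory logger: every message passed to collect() is appended to log_lines.
log_lines <- character()
collect <- function(text) log_lines <<- c(log_lines, text)

# Keep the default console output and add the collector.
old_opts <- options("tidylog.display" = list(message, collect))

# This part only runs when dplyr and tidylog are installed.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("tidylog", quietly = TRUE)) {
  a <- tidylog::filter(mtcars, mpg > 20)
  b <- tidylog::select(a, mpg, wt)
}

options(old_opts)  # restore the previous setting
print(log_lines)
```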

Namespace conflicts

Tidylog redefines several of the functions exported by dplyr and tidyr, so it should be loaded last, otherwise there will be no output. A more explicit way to resolve namespace conflicts is to use the conflicted package:

library("dplyr")
library("tidyr")
library("tidylog")
library("conflicted")
for (f in getNamespaceExports("tidylog")) {
    conflicted::conflict_prefer(f, "tidylog", quiet = TRUE)
}
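
To check which package a function comes from, you can look at the namespace it was defined in. The helper below is a small base R sketch; defined_in is a hypothetical name, not a tidylog function.

```r
# Report the namespace a function object was defined in.
defined_in <- function(f) environmentName(environment(f))

# The filter() from the stats package that dplyr (and tidylog) mask.
print(defined_in(stats::filter))

# Only runs when tidylog is installed.
if (requireNamespace("tidylog", quietly = TRUE)) {
  print(defined_in(tidylog::filter))
}
```

In an interactive session, defined_in(filter) shows which version is found first on the search path after all packages are loaded.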

tidylog's Issues

Tidylog non-functional in quarto documents

Hi!

Love the package, and I wanted to continue using it in Quarto documents, where it seemingly doesn't do anything, at least for me.

Anything I can do to help debug?

arrange is not supported

Considering arrange() is part of dplyr, I would have expected support for it. For example:

iris %>%
  group_by(Species) %>%
  summarise(mean_Sepal_Length = mean(Sepal.Length), .groups = "drop") %>%
  arrange(desc(mean_Sepal_Length))

now gives:

group_by: one grouping variable (Species)
summarise: now 3 rows and 2 columns, ungrouped

while I would expect something like

group_by: one grouping variable (Species)
summarise: now 3 rows and 2 columns, ungrouped
arrange: now arranged in descending order by one variable (mean_Sepal_Length)

I think that would be a nice addition, if it would fit within the logic of this package.

support tally, count, add_tally, add_count

tally, count should probably go to summarize
add_tally, add_count should probably go to mutate

https://dplyr.tidyverse.org/reference/tally.html

use rlang::inform instead of message

have to wait for this bug to be fixed: r-lib/rlang#880

Print filter again when removing rows?

Thanks for this awesome package!

I was wondering if it would be possible to not only print

filter: removed 29800 out of 74790 rows (40%)

but, for example:

filter: removed 29800 out of 74790 rows (40%), used filter(s): !is.na(user_id)

That is, the log could print what was actually being filtered. This would make it much easier to debug sources of errors in long pipes.
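
One way such a message could be produced is to capture the unevaluated filter expressions inside the wrapper. The following is a base R sketch of that idea only; filter_verbose is a hypothetical function, not how tidylog is implemented, and it ignores grouping.

```r
# Hypothetical wrapper: filter rows and report the conditions that were used.
filter_verbose <- function(.data, ...) {
  conds <- as.list(substitute(list(...)))[-1]  # unevaluated conditions
  keep <- rep(TRUE, nrow(.data))
  for (cond in conds) {
    result <- eval(cond, .data, parent.frame())
    keep <- keep & !is.na(result) & result     # NA counts as "do not keep"
  }
  labels <- vapply(conds, function(e) paste(deparse(e), collapse = " "),
                   character(1))
  n_removed <- sum(!keep)
  message(sprintf("filter: removed %d out of %d rows (%d%%), used filter(s): %s",
                  n_removed, nrow(.data),
                  round(100 * n_removed / nrow(.data)),
                  paste(labels, collapse = ", ")))
  .data[keep, , drop = FALSE]
}

res <- filter_verbose(airquality, !is.na(Ozone))
```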

Write tidylog messages to log file?

I have some scripts where it would be useful to skim a hypothetical log of the messages.

Perhaps via options or an analogous {logger} function? E.g.:

library(tidylog, include.only = c("filter", "mutate", "left_join"))
tidylog_log <- "tidylog"
tidylog_appender(appender_file(tidylog_log))
...
<script>

format "X fewer NA"

tibble(x = rep(NA_real_, 1000000)) %>% mutate(x = 1)
# mutate: changed 1,000,000 values (100%) of 'x' (1000000 fewer NA)

1000000 is not formatted

Tidylog masks dplyr warnings/messages

Tidylog masks dplyr warnings/messages. Can it be updated to reproduce these? For example

library(nycflights13)
library(tidyverse)
library(tidylog)

flights |> group_by(year, month, day) %>% dplyr::summarise(n = n())
This provides the following message:
summarise() has grouped output by 'year', 'month'. You can override using the .groups argument.

flights |> group_by(year, month, day) %>% tidylog::summarise(n = n())
This provides the following message:
group_by: 3 grouping variables (year, month, day)
summarise: now 365 rows and 4 columns, 2 group variables remaining (year, month)

Is this an error?

In one of the examples provided,

b <- filter(mtcars, mpg > 100)
#> filter: removed all rows (100%)

seems to be an error. Zero should be removed

Rd warnings when installing

I like this package so far.
I am receiving a warning during installation regarding file links to 'inner_join' and 'transmute' in dplyr.

This seems related to an issue opened in roxygen:
r-lib/roxygen2#707

My setup:

R version 3.5.2
Windows 10
Installing with packrat on

Install messages:

* installing *source* package 'tidylog' ...
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'tidylog'
    finding HTML links ... done
    filter                                  html  
    group_by                                html  
    inner_join                              html  
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/inner_join.Rd:26: file link 'inner_join' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/inner_join.Rd:28: file link 'inner_join' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/inner_join.Rd:31: file link 'inner_join' in package 'dplyr' does not exist and so has been treated as a topic
    mutate                                  html  
    select                                  html  
    tidylog                                 html  
    transmute                               html  
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/transmute.Rd:20: file link 'transmute' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/transmute.Rd:22: file link 'transmute' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/transmute.Rd:25: file link 'transmute' in package 'dplyr' does not exist and so has been treated as a topic
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (tidylog)
In R CMD INSTALL

Little histograms in terminal output

skimr uses these, seen here: https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html

Could also be nice for mutate calls.

relevant functions from tidyr

see https://tidyr.tidyverse.org/reference/index.html

pivoting
uncount (-> summarize)

Other functions (hoist, unnest_longer, unnest_wider, unnest_auto, nest, unnest, pack, chop) I'd not consider for now.

Format large numbers

For an output such as:
filter: removed 7375373 out of 10541429 rows (70%)
it would be nice for this to display as
filter: removed 7,375,373 out of 10,541,429 rows (70%)
to make reading easier. This would probably have to be locale-dependent.
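
A minimal sketch of the requested formatting in base R, using formatC with a fixed comma as the thousands separator. The separator is hard-coded here, so this does not address the locale question; format_count is a hypothetical helper.

```r
# Format a whole number with a thousands separator.
format_count <- function(n) formatC(n, format = "d", big.mark = ",")

removed <- 7375373
total <- 10541429
msg <- sprintf("filter: removed %s out of %s rows (%d%%)",
               format_count(removed), format_count(total),
               round(100 * removed / total))
print(msg)
```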

Mutate doesn't work with set_units()

The title says it all. See code below. Also, thanks for the great package. You have no idea how frustrating debugging pipes had been as a newbie until I discovered tidylog :)

library(tidyverse)
library(units)

df <- tribble(~A, ~B,
        1, 2,
        2, 3)

df %>%
  mutate(B = set_units(B , mg)) %>%
  print()
# A tibble: 2 x 2

library(tidylog)

df %>%
  mutate(B = set_units(B , mg)) %>%
  print() 
# Error in Ops.units(new, old) : 
#  both operands of the expression should be "units" objects

Overwriting of nested columns not possible with tidylog::mutate

When I overwrite a nested column with mutate it works with dplyr::mutate() but not with tidylog::mutate()

Check example below

library(tidyverse)
library(tidylog)

# This doesn't work: Error in new != old : comparison of these types is not implemented
as_tibble(iris) %>% nest(-Species) %>% tidylog::mutate(data = data %>% map(function(data) {data$Sepal.Length + data$Sepal.Width}))

# This works
as_tibble(iris) %>% nest(-Species) %>% dplyr::mutate(data = data %>% map(function(data) {data$Sepal.Length + data$Sepal.Width}))
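
The error message suggests that the old and new columns are compared with !=, which is not defined for every column type. A more defensive comparison could look like the sketch below; count_changed is a hypothetical helper and an assumption about how such a check might be written, not tidylog's code.

```r
# Count how many values differ between an old and a new column,
# without failing on list columns or on types that do not support `!=`.
count_changed <- function(old, new) {
  # List columns: compare element by element with identical().
  if (is.list(old) || is.list(new)) {
    return(sum(!mapply(identical, old, new)))
  }
  tryCatch({
    differs <- old != new
    # Where the comparison is NA, count a change only if exactly one side is NA.
    na_idx <- is.na(differs)
    differs[na_idx] <- xor(is.na(old), is.na(new))[na_idx]
    sum(differs)
  }, error = function(e) NA_integer_)  # comparison not defined for these types
}

count_changed(c(1, 2, NA), c(1, 3, NA))
count_changed(list(1:2, 3), list(1:2, 4))
```

Returning NA when the comparison fails lets the wrapper still hand back the mutated data and simply omit the count from the message.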

tidylog output in html markdown

I'm creating an HTML document with R-Markdown. The resulting file includes all the messages from tidylog. Is there an option to turn these off in html/output mode?

Display actual dataset names in join message

It would be neat if the message for joins displayed the actual dataset names rather than the generic x and y.

joined <- left_join(nycflights13::flights, nycflights13::weather,
    by = c("year", "month", "day", "origin", "hour", "time_hour"))
#> left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)
#>            > rows only in nycflights13::flights     1,556
#>            > rows only in nycflights13::weather  (  6,737)
#>            > matched rows                         335,220
#>            >                                     =========
#>            > rows total                           336,776

Can/Should tidylog::filter throw a message or warning on 'missing' filter group?

Suppose I want to filter some data for three groups ('A', 'B', 'C'). I assume the presence of all three, but my data for whatever reason only has 'A' and 'B' and 'D'. It would be useful to receive immediate feedback that part of my filter wasn't there (maybe because of a typo or other data pipeline reasons). A warning might be too strong, but a message as part of the tidylog message maybe? See the reprex below.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
suppressPackageStartupMessages(library(tidylog))

df <- data.frame(
  x = sample(c("A", "B", "D"), 10, replace = TRUE),
  y = rnorm(10)
)

df %>% 
  filter(x %in% c("A", "B", "C")) %>% # Warning that there are no 'C' rows?
  summarize(my = mean(y))
#> filter: removed 2 rows (20%), 8 rows remaining
#> summarize: now one row and one column, ungrouped
#>           my
#> 1 -0.2840769

Created on 2022-01-24 by the reprex package (v0.3.0)
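
Until something like this exists, the check can be done by hand before filtering. A small base R sketch, with a deterministic stand-in for the sampled data above:

```r
wanted <- c("A", "B", "C")
df <- data.frame(x = c("A", "B", "D", "A", "B"), y = 1:5)

# Values we intend to filter for that never occur in the data.
missing_groups <- setdiff(wanted, unique(df$x))
if (length(missing_groups) > 0) {
  message("filter: no rows found for ", paste(missing_groups, collapse = ", "))
}
```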

Renaming while selecting gives confusing message.

The code

tibble(x=1,y=2,z=3) %>% 
  select(a=x,b=y)

results in select: dropped 3 variables (x, y, z). While technically true, I would rather call it "renamed 2 variables, dropped 1 variable".

`mutate` does not report dropped variables

Currently if you execute the following, we do not get information about the dropped mass column:

require(tidyverse)
require(tidylog)
starwars %>%
 select(name, height, mass, homeworld) %>%
 mutate(
  mass = NULL,
  height = height * 0.0328084 # convert to feet
)

We only get back:

mutate: converted height from integer to double (0 new NA)

I believe some of the same logic used to generate the dropped variables message for transmute could be used to cover this case as well.

Include dplyr function rename in tidylog package

While using this very useful package, I find it would be nice to get information also about column renaming while working with dplyr::rename() in pipes.

any chance the package messes up the computation/ objects in memory?

Hi,

This package is really a good idea. I just wonder, are there any possible weird side effects that can happen? Is the logging interfering with the dplyr computations at any point?

Thanks!

Function masking workaround

I saw your post on Twitter about version 1.0.0 and I wanted to thank you for your work on this package! Inspired by one of the comments that expressed concern about overloading the dplyr and tidylog functions, I started work on a package, https://github.com/TylerGrantSmith/mask, that would allow you to use tidylog without loading it into the search path.

It is still just an early concept, but I wanted to get your opinion on functionality.

Here is an example reprex:

library(magrittr)

# clean searchpath
searchpaths()
#>  [1] ".GlobalEnv"                                            
#>  [2] "C:/Users/e014307/Documents/R/R-3.6.2/library/magrittr" 
#>  [3] "C:/Users/e014307/Documents/R/R-3.6.2/library/stats"    
#>  [4] "C:/Users/e014307/Documents/R/R-3.6.2/library/graphics" 
#>  [5] "C:/Users/e014307/Documents/R/R-3.6.2/library/grDevices"
#>  [6] "C:/Users/e014307/Documents/R/R-3.6.2/library/utils"    
#>  [7] "C:/Users/e014307/Documents/R/R-3.6.2/library/datasets" 
#>  [8] "C:/Users/e014307/Documents/R/R-3.6.2/library/methods"  
#>  [9] "Autoloads"                                             
#> [10] "tools:callr"                                           
#> [11] "C:/Users/e014307/DOCUME~1/R/R-36~1.2/library/base"

# mask expressions
mask::tidylog_mask(mtcars %>% 
  dplyr::select(mpg, cyl) %>% 
  dplyr::filter(mpg < 15) %>% 
  dplyr::group_by(cyl) %>%
  dplyr::summarise(mean = mean(mpg)))
#> select: dropped 9 variables (disp, hp, drat, wt, qsec, …)
#> filter: removed 27 rows (84%), 5 rows remaining
#> group_by: one grouping variable (cyl)
#> summarise: now one row and 2 columns, ungrouped
#> # A tibble: 1 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     8  12.6

# mask individual functions
select <- mask::tidylog_mask(dplyr::select)
mtcars %>% 
  select(mpg, cyl) %>% 
  dplyr::filter(mpg < 15)
#> select: dropped 9 variables (disp, hp, drat, wt, qsec, …)
#>    mpg cyl
#> 1 14.3   8
#> 2 10.4   8
#> 3 10.4   8
#> 4 14.7   8
#> 5 13.3   8

Created on 2020-01-08 by the reprex package (v0.3.0)

In addition, there is an RStudio addin that allows you to run selected code with tidylog using a keybinding.

Crayon the information

I can't express how much this package is making my workflow better! Brilliant.

I was wondering if you might consider crayoning the logs so they are more easily distinguishable from errors and warnings? This would at least help me a lot when running larger scripts where I have known warnings I ignore, but want to catch errors.

report type for new variables

currently:

mutate: new variable 'NAME' with X unique values and Y% NA

better:

mutate: new variable 'NAME' (logical) with X unique values and Y% NA
mutate: new variable 'NAME' (numeric) with X unique values and Y% NA
mutate: new variable 'NAME' (date) with X unique values and Y% NA
mutate: new variable 'NAME' (factor) with X unique values and Y% NA
# etc.

format_list() unicode character does not print correctly

tidylog 0.1.0
R 3.5.2

The character used to indicate a truncated list in format_list() is a unicode character that does not print correctly with any of the fonts I've tried. When I render my document to an HTML Notebook, it creates this:

㠼㸵

And in RStudio it just appears as the 'missing symbol' diamond glyph.

Interest in expanding this to other tidyverse packages?

Ben: love the approach. I've been focused on hacking on the pipe operator itself for far too long.

Before I go forking around with your package, I'm thinking of such things as tidyr and purrr. Feedback such as you provide for dplyr would be very helpful for spread, gather, map_df, etc...

Would you be open to such pull requests and expanding the scope of tidylog beyond just dplyr?

add number of remaining rows

For a lazy person, who does not want to do the math :), is it possible to add the number of remaining rows in the data (in addition to the current numbers?)

What I often care about is the number of remaining rows:

filter: removed 287 out of 761 rows (38%), 474 remaining
filter: removed 230 out of 474 rows (49%), 244 remaining

Thanks!

Negative new NAs

Loving the package!

Very minor issue: when mutate is used to replace NA values with non-NA values, tidylog will report (-X new NA), where X is the number of values that were NA but no longer are.

It's still providing all the information it should, but the message is a little strange.

Example code:

library(tidyverse)
library(tidylog)
df <- tibble(a = NA,
                 b = rnorm(100)) %>%
  mutate(a=ifelse(b<0,1,a))
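
A possible wording fix is to report the sign of the change in words. The sketch below shows the idea in base R; describe_na_change is a hypothetical helper, not the function tidylog uses.

```r
# Describe the change in the number of NA values between two vectors.
describe_na_change <- function(old, new) {
  delta <- sum(is.na(new)) - sum(is.na(old))
  if (delta < 0) sprintf("%d fewer NA", -delta) else sprintf("%d new NA", delta)
}

old <- c(NA, NA, NA, 4)
new <- ifelse(c(-1, 2, -3, 4) < 0, 1, old)  # replaces two of the three NAs
print(describe_na_change(old, new))
```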

print indicator when data frame is grouped?

Obviously, some operations in dplyr depend on whether the data frame is grouped. a will have different values in this example:

mtcars %>% group_by(mpg) %>% mutate(a = mean(wt)) 
mtcars %>% mutate(a = mean(wt))

Usually, this should be pretty clear, but when the data frame is grouped earlier in the code, one might have forgotten about this.

mutate, transmute and filter could indicate whether the data frame is grouped. Maybe just like this:

mutate (grouped): new variable 'a' with 23 unique values and 0% NA

mutate: new variable 'a' with 23 unique values and 0% NA (within 25 groups)
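
To see why the indicator matters, the two computations can be mimicked in base R, where ave() plays the role of a grouped mutate. This is only an illustration of the difference, not of the proposed log output.

```r
# Ungrouped: one overall mean, repeated for every row.
ungrouped <- rep(mean(mtcars$wt), nrow(mtcars))

# Grouped by mpg: the mean of wt within each mpg value.
grouped <- ave(mtcars$wt, mtcars$mpg)

length(unique(ungrouped))
length(unique(grouped))
```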

error when variable name is f, fu, fun etc.

library("tidyverse")
library("tidylog", warn.conflicts = FALSE)
mutate(mtcars, f = 1)
#> Error in log_mutate(.data, dplyr::mutate, "mutate", ...): argument 4 matches multiple formal arguments
mutate(mtcars, fu = 1)
#> Error in log_mutate(.data, dplyr::mutate, "mutate", ...): argument 4 matches multiple formal arguments
mutate(mtcars, fun = 1)
#> Error in fun(.data, ...): could not find function "fun"
mutate(mtcars, g = 1)
#> mutate: new variable 'g' with one unique value and 0% NA

Created on 2019-05-24 by the reprex package (v0.2.1)
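
The errors look like R's partial argument matching: a named argument such as f is matched against the wrapper's own formal arguments before it reaches the dots. The sketch below reproduces this with two hypothetical wrappers; the argument names fun and funname are assumptions made for the illustration.

```r
# Formals named `fun` and `funname`: a user argument `f` partially matches both.
wrapper_plain <- function(.data, fun, funname, ...) names(list(...))
res <- tryCatch(
  wrapper_plain(mtcars, identity, "mutate", f = 1),
  error = function(e) conditionMessage(e)
)
print(res)

# Dot-prefixed formals: `f` is no longer a prefix of any formal name,
# so it is passed through `...` as intended.
wrapper_dotted <- function(.data, .fun, .funname, ...) names(list(...))
passed <- wrapper_dotted(mtcars, identity, "mutate", f = 1)
print(passed)
```

Placing ... before the internal arguments would also prevent partial matching, since arguments after the dots must be matched exactly.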

Add log about ungroup()

While doing this:

df %>% 
  group_by() %>% 
  summarize() %>% 
  ungroup()

I get this as last message: summarize: now xx rows and yy columns, zz group variables remaining, where xx, yy and zz are numbers.
It would be nice, I think, to have a message for ungroup() as well.
Something like: ungroup: no grouping variables left or just ungrouped.
What do you think? Thanks.

log for bind_*() functions

Hi @elbersb!
Have you ever considered to implement logs for dplyr functions bind_rows() and bind_cols()?
I am using them in pipes in combination with other dplyr functions and it seems the pipe instructions didn't work as expected, although it is just a missing step in the log. What do you think about it?

relevant functions from dplyr

These all have filter semantics, so should be straightforward.

top_frac (top_n already implemented)
sample_n, sample_frac (simple extension of filter)
slice (simple extension of filter)

Also:

rename and variants (see #27)

And that should be it for dplyr.

Explicit messages even when no changes result

I'm not sure I have a strong opinion about it, but I would like to hear from others about the following example: mtcars %>% select(hp, everything()). tidylog produces no message because nothing changed except the column order, but I still think it should produce a notification like select: no variables were dropped, consistent with how filter currently reports when no rows are dropped.
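
For discussion, the three cases (dropped, reordered, unchanged) can be told apart from the column names alone. A base R sketch with a hypothetical describe_select helper and made-up message wording:

```r
# Classify the effect of a select() call from the old and new column names.
describe_select <- function(old_names, new_names) {
  dropped <- setdiff(old_names, new_names)
  if (length(dropped) > 0) {
    sprintf("select: dropped %d variables", length(dropped))
  } else if (!identical(old_names, new_names)) {
    "select: columns reordered"
  } else {
    "select: no variables dropped"
  }
}

old <- names(mtcars)
new <- c("hp", setdiff(old, "hp"))  # same order as select(hp, everything())
print(describe_select(old, new))
```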

improve test coverage

Hints in the autocomplete

I love tidylog, but a caveat is that we don't have the same hints about the parameters, when you are coding and hit ctrl+space, for instance, that you have when you use the original tidyverse packages.

Do you plan to improve this?

error in join when pipe passes to y

I'm really liking this package, thank you! I did come across an error: if you pass the data from the pipe to the y argument of a join (as opposed to x), it throws an error. You can work around it by calling the dplyr version.

# ## merge data sets to include all visits (even those w/o Rx)
analysis <- analysis %>% 
   left_join(x = patient, y = .) %>% 
   mutate_at(vars(prescribed_opioid), factor, levels = c(NA, TRUE, FALSE), 
             labels = c('No Rx', 'Prescribed Opioid', 'Not Prescribed Opioid'), 
             exclude = NULL)
# Error in log_join(.data, dplyr::left_join, "left_join", ...) :  
#   argument ".data" is missing, with no default

### Works:
analysis <- analysis %>% 
   dplyr::left_join(x = patient, y = .) %>% 
  mutate_at(vars(prescribed_opioid), factor, levels = c(NA, TRUE, FALSE), 
             labels = c('No Rx', 'Prescribed Opioid', 'Not Prescribed Opioid'), 
             exclude = NULL)
# Joining, by = "Encounter_id"

### Also works:
analysis <- patient %>% 
  left_join(analysis) %>% 
  mutate_at(vars(prescribed_opioid), factor, levels = c(NA, TRUE, FALSE), 
             labels = c('No Rx', 'Prescribed Opioid', 'Not Prescribed Opioid'), 
             exclude = NULL)
# Joining, by = "Encounter_id"
# left_join: added 0 rows and added one column (prescribed_opioid)
# mutate_at: converted 'prescribed_opioid' from logical to factor (-101770 new NA)

The latter also works, but it tells you how many rows were added to patient, not analysis.
This is my first public issue and I'm sorry if I've done anything incorrectly.

Minor issue: uncount() breaks if data argument is explicit

It appears tidyr::uncount() breaks if data = argument is explicit. See example below:

library(tidyr)
library(tidylog)

library(conflicted)
for (f in getNamespaceExports("tidylog")) {
    conflicted::conflict_prefer(f, "tidylog", quiet = TRUE)
}

df <- tibble(x = c("a", "b"), n = c(1, 2))
uncount(df, n) # works fine
# uncount: now 3 rows and one column, ungrouped

uncount(data = df, weights = n)
# Error in log_summarize(.data, .fun = tidyr::uncount, .funname = "uncount",  : 
#   argument ".data" is missing, with no default
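
The error is consistent with the wrapper naming its first argument .data while the wrapped function calls it data, so the explicitly named argument ends up in the dots. A base R sketch with two hypothetical wrappers:

```r
# Wrapper whose first formal is `.data`: `data = ...` falls into the dots,
# leaving `.data` missing.
wrapper_dot <- function(.data, ...) nrow(.data)
res <- tryCatch(
  wrapper_dot(data = mtcars),
  error = function(e) conditionMessage(e)
)
print(res)

# Wrapper that mirrors the argument name of the wrapped function.
wrapper_same <- function(data, ...) nrow(data)
n <- wrapper_same(data = mtcars)
print(n)
```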

Make ungroup log message more explicit

Hi folks, firstly: I LOVE this package, it's so helpful!

Small one, relating to #32 : ungrouping a grouped object tells you:

df %<>% ungroup

ungroup: no grouping variables

Which (IMHO) is unclear/ambiguous as to whether the input object HAD no grouping variables to begin with (i.e. "ungroup didn't have anything to do"), or if it HAS no grouping variables NOW, thanks to the ungroup call (i.e. "ungrouped successfully").

Maybe:

ungroup: previously grouped by ID [count], now ungrouped

Cheers!

Making output optional

I think I really could use this in my classes for demos.
But I would prefer to have an option to switch the additional output on and off.
Can this be done (or added)?

overwriting ordered factors using fct_recode fails

library(tidyverse)
d = tibble(x = ordered(c("apple", "bear", "banana", "dear"))) 
# works
fct_recode(d$x, fruit = "apple", fruit = "banana")
#> [1] fruit bear  fruit dear 
#> Levels: fruit < bear < dear

# works when using a new variable
tidylog::mutate(d, y = fct_recode(x, fruit = "apple", fruit = "banana"))
#> mutate: new variable 'y' with 3 unique values and 0% NA
#> # A tibble: 4 x 2
#>   x      y    
#>   <ord>  <ord>
#> 1 apple  fruit
#> 2 bear   bear 
#> 3 banana fruit
#> 4 dear   dear

# fails when overwriting
tidylog::mutate(d, x = fct_recode(x, fruit = "apple", fruit = "banana"))
#> Error in Ops.factor(new, old): level sets of factors are different

# works with factor
d = tibble(x = factor(c("apple", "bear", "banana", "dear"))) 
tidylog::mutate(d, x = fct_recode(x, fruit = "apple", fruit = "banana"))
#> mutate: changed 2 values (50%) of 'x' (0 new NA)
#> # A tibble: 4 x 1
#>   x    
#>   <fct>
#> 1 fruit
#> 2 bear 
#> 3 fruit
#> 4 dear
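
The error message itself comes from base R: comparing two factors (ordered or not) with `==` or `!=` fails when their level sets differ. The sketch below reproduces that in base R and shows that comparing the character values does not have this restriction. It illustrates the base behaviour only and is not meant as a description of how tidylog compares old and new columns.

```r
old <- ordered(c("apple", "bear", "banana", "dear"))
new <- ordered(c("fruit", "bear", "fruit", "dear"),
               levels = c("fruit", "bear", "dear"))

# Direct comparison fails because the level sets are different.
res <- tryCatch(new != old, error = function(e) conditionMessage(e))
print(res)

# Comparing the labels as character vectors works.
changed <- as.character(new) != as.character(old)
n_changed <- sum(changed)
print(n_changed)  # 2
```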

Mutating column to date

Maybe it's intentional, but shouldn't this say "Date" and not "double"?

data.frame(x = c("2016-01-01", "2016-01-02")) %>% 
  tidylog::mutate(x = as.Date(x))
#> mutate: converted 'x' from factor to double (0 new NA)
#>            x
#> 1 2016-01-01
#> 2 2016-01-02
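
A likely explanation, stated as an assumption about where the label comes from: a Date is stored as a double with a class attribute, so `typeof()` reports "double" while `class()` reports "Date". The base R sketch below shows the difference.

```r
x <- as.Date(c("2016-01-01", "2016-01-02"))

storage_type <- typeof(x)  # "double": days since 1970-01-01
object_class <- class(x)   # "Date"

print(storage_type)
print(object_class)
```

A message based on `class()` rather than `typeof()` would read "Date" here.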

print message when summarizing?

The tidylog package could also print a message after a summarize command to let the user know which groups are remaining, for instance:

data.frame(Titanic) %>%  
        group_by(Age, Class) %>%  
        summarize(Freq = sum(Freq)) %>%  
        mutate(Class = reorder(Class, Freq))                                                              
#> group_by: 8 groups [Age, Class] 
#> summarize: 2 groups remaining [Age]
#> mutate: changed 8 values (100%) of 'Class', factor levels updated

https://community.rstudio.com/t/understanding-group-by-order-matters/22685/6

New dplyr join_by breaks with tidylog

Love the package. Has been immensely helpful for verifying operations are running as expected when walking through long chains.

I noticed I now receive an error when running the new join_by functionality in dplyr 1.1.0 while tidylog is loaded. Reproducible example:

library(tidylog)
library(dplyr)

sales <- tibble(
    id = c(1L, 1L, 1L, 2L, 2L),
    sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)

promos <- tibble(
    id = c(1L, 1L, 2L),
    promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)

by <- join_by(id, sale_date == promo_date)
left_join(sales, promos, by)

Produces the following error:

Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not a <dplyr_join_by> object.

Can `tidylog::filter` also report the number of groups

Hope this is a simple question: can tidylog::filter also report the number of groups? Or what would be a workaround?

For example, current
filter (grouped): removed 78,720 rows (40%), 118,956 rows remaining
and desired
filter (grouped): removed 78,720 rows (40%), 118,956 rows remaining; removed 100 groups (10%), 900 groups remaining

group_by logs incorrect value

Great idea! I was planning on using it to teach the tidyverse next week and noticed that group_by() logs an incorrect value. I am using your code example too, so I'm not sure what is going on here. sessionInfo() output follows, in case that helps.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidylog)
#> 
#> Attaching package: 'tidylog'
#> The following objects are masked from 'package:dplyr':
#> 
#>     anti_join, distinct, filter, filter_all, filter_at, filter_if,
#>     full_join, group_by, group_by_all, group_by_at, group_by_if,
#>     inner_join, left_join, mutate, mutate_all, mutate_at,
#>     mutate_if, right_join, select, select_all, select_at,
#>     select_if, semi_join, transmute, transmute_all, transmute_at,
#>     transmute_if
#> The following object is masked from 'package:stats':
#> 
#>     filter
summary <- mtcars %>%
  select(mpg, cyl, hp) %>%
  filter(mpg > 15) %>%
  mutate(mpg_round = round(mpg)) %>%
  group_by(cyl, mpg_round) %>%
  tally() %>%
  filter(n >= 1)
#> select: dropped 8 variables (disp, drat, wt, qsec, vs, …) 
#> filter: removed 6 rows (19%) 
#> mutate: new variable 'mpg_round' with 15 unique values and 0% NA 
#> group_by: 0 groups [] 
#> filter: no rows removed


sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
#>  [9] highr_0.7       knitr_1.21      stringr_1.3.1   xfun_0.4       
#> [13] digest_0.6.18   evaluate_0.12

Fewer NA bug on character conversion

Hi,

I believe I have installed the most recent dev version. I am having an issue where tidylog reports that NAs were removed ("n fewer NA") when overwriting a variable, or that 0% of values are NA when creating a new variable, but this is not the case.

Here is my code to reproduce the issue:

library(tidyverse)
library(tidylog)
# set up df with some NA's in 'codes' variable
id <- 1:10
codes <- 1:10
codes[8:10] <- NA
df <- data.frame(id,codes)

#sometimes converting to character results in 'negative' new NA
# creating a new variable codes2 reports 0% NA when that is not the case
df2 <- df %>% 
  mutate(codes2 = sprintf('%02d',codes)) %>% 
  mutate(codes  = sprintf('%02d',codes))
identical(df2$codes,df2$codes2) 
df3 <- df %>% 
  mutate(codes2 = formatC(codes)) %>% 
  mutate(codes  = formatC(codes))
identical(df3$codes,df3$codes2) 

#but as.character reports correctly so bug isn't for all conversion functions
df4 <- df %>% 
  mutate(codes2 = as.character(codes)) %>% 
  mutate(codes  = as.character(codes))
identical(df4$codes,df4$codes2)
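
One thing worth checking before calling this a counting error: in base R, `sprintf()` and `formatC()` turn a missing number into the string "NA", which `is.na()` does not treat as missing, whereas `as.character()` keeps a real `NA`. The sketch below counts missing values after each conversion, using the same `codes` vector as above. Whether tidylog's message should flag this case is a separate question for the maintainer.

```r
codes <- 1:10
codes[8:10] <- NA

# sprintf() and formatC() produce the literal string "NA" for missing input
na_sprintf <- sum(is.na(sprintf("%02d", codes)))
na_formatC <- sum(is.na(formatC(codes)))

# as.character() keeps NA as a missing value
na_as_char <- sum(is.na(as.character(codes)))

print(c(sprintf = na_sprintf, formatC = na_formatC, as.character = na_as_char))
```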

Report merge stats in joins (enhancement)

Hey Benjamin,

Thanks for the cool package!

Wouldn't it be useful to report what share of rows have been dropped?

So for this:

x <- inner_join(band_members, band_instruments, by = "name")

show:
33% of left dataframe and 33% of right dataframe's rows dropped.

I find that these joins are often a great source of bugs in my programs, and it often has to do with losing many rows in either the left or the right dataframe.

Best,
Lukas
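
A rough sketch of the requested statistic in base R, for a single key column. The helper `share_dropped()` is hypothetical, and the two data frames are typed in by hand to mirror dplyr's band_members and band_instruments example data so that the snippet runs without dplyr.

```r
# Hand-typed copies of dplyr's example datasets (assumed to match the originals).
band_members <- data.frame(
    name = c("Mick", "John", "Paul"),
    band = c("Stones", "Beatles", "Beatles")
)
band_instruments <- data.frame(
    name  = c("John", "Paul", "Keith"),
    plays = c("guitar", "bass", "guitar")
)

# Hypothetical helper: share of rows in x whose key has no match in y,
# i.e. the rows an inner join would drop from x.
share_dropped <- function(x, y, by) {
    mean(!(x[[by]] %in% y[[by]]))
}

left_dropped  <- share_dropped(band_members, band_instruments, "name")
right_dropped <- share_dropped(band_instruments, band_members, "name")

message(sprintf("%.0f%% of left and %.0f%% of right dataframe's rows dropped",
                100 * left_dropped, 100 * right_dropped))
```

Mick has no instrument and Keith has no band entry, so one of three rows is lost on each side, matching the 33% figures in the request.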

Tidylog function masking using `...` means RStudio auto-complete and function definition help is not available.

tidylog has been hugely helpful for me in my data analysis workflows (thank you!). However, its masking of tidyr and dplyr functions using ... notation means that I can no longer use RStudio auto-complete for function arguments, or hover over the function name to see the arguments.

I used to use this all the time with pivot_wider and pivot_longer where I always forget names_to vs names_from, for example. Now in order to do this I have to explicitly type tidyr::pivot_wider( to start the autocomplete or search for the help.

Is there a way to perform the masking without replacing the help functionality?