elbersb / tidylog

Tidylog provides feedback about dplyr and tidyr operations. It provides wrapper functions for the most common functions, such as filter, mutate, select, and group_by, and provides detailed output for joins.

License: Other

R 100.00%

dplyr r tidyr tidyverse wrapper-functions

tidylog's Issues

Include dplyr function rename in tidylog package

While using this very useful package, I find it would be nice to also get information about column renaming when using dplyr::rename() in pipes.

format_list() unicode character does not print correctly

tidylog 0.1.0
R 3.5.2

The character used to indicate a truncated list in format_list() is a unicode character that does not print correctly with any of the fonts I've tried. When I render my document to an HTML Notebook, it creates this:

<85>

And in RStudio it just appears as the 'missing symbol' diamond glyph.

print indicator when data frame is grouped?

Obviously, some operations in dplyr depend on whether the data frame is grouped. a will have different values in this example:

mtcars %>% group_by(mpg) %>% mutate(a = mean(wt)) 
mtcars %>% mutate(a = mean(wt))

Usually, this should be pretty clear, but when the data frame is grouped earlier in the code, one might have forgotten about this.

mutate, transmute and filter could indicate whether the data frame is grouped. Maybe just like this:

mutate (grouped): new variable 'a' with 23 unique values and 0% NA

mutate: new variable 'a' with 23 unique values and 0% NA (within 25 groups)

Is this an error?

In one of the examples provided,

b <- filter(mtcars, mpg > 100)
#> filter: removed all rows (100%)

the output seems to be an error. Zero rows should be removed.

any chance the package messes up the computation/ objects in memory?

Hi,

This package is really a good idea. I just wonder, are there any possible weird side effects that can happen? Is the logging interfering with the dplyr computations at any point?

Thanks!

relevant functions from dplyr

These all have filter semantics, so should be straightforward.

top_frac (top_n already implemented)
sample_n, sample_frac (simple extension of filter)
slice (simple extension of filter)

Also:

rename and variants (see #27)

And that should be it for dplyr.

Format large numbers

For output such as:
filter: removed 7375373 out of 10541429 rows (70%)
it would be nice for this to display as
filter: removed 7,375,373 out of 10,541,429 rows (70%)
to make reading easier. This would probably have to be locale-dependent.
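A minimal base R sketch of the requested formatting, assuming a comma as the thousands separator. `format_count` is a hypothetical helper name, not a tidylog function, and a locale-aware version would have to choose `big.mark` differently.

```r
# Hypothetical helper: format a count with a thousands separator.
format_count <- function(n, big_mark = ",") {
  format(n, big.mark = big_mark, scientific = FALSE, trim = TRUE)
}

removed <- 7375373
total <- 10541429

# Build a message in the style of the filter output quoted above.
msg <- sprintf(
  "filter: removed %s out of %s rows (%d%%)",
  format_count(removed),
  format_count(total),
  as.integer(round(100 * removed / total))
)
print(msg)
```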

Interest in expanding this to other tidyverse packages?

Ben: love the approach. I've been focused on hacking on the pipe operator itself for far too long.

Before I go forking around with your package, I'm thinking of such things as tidyr and purrr. Feedback such as you provide for dplyr would be very helpful for spread, gather, map_df, etc...

Would you be open to such pull requests and expanding the scope of tidylog beyond just dplyr?

Rd warnings when installing

I like this package so far.
I am receiving a warning during installation regarding file links to 'inner_join' and 'transmute' in dplyr.

This seems related to an issue opened in roxygen:
r-lib/roxygen2#707

My setup:

R version 3.5.2
Windows 10
Installing with packrat on

Install messages:

* installing *source* package 'tidylog' ...
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'tidylog'
    finding HTML links ... done
    filter                                  html  
    group_by                                html  
    inner_join                              html  
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/inner_join.Rd:26: file link 'inner_join' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/inner_join.Rd:28: file link 'inner_join' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/inner_join.Rd:31: file link 'inner_join' in package 'dplyr' does not exist and so has been treated as a topic
    mutate                                  html  
    select                                  html  
    tidylog                                 html  
    transmute                               html  
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/transmute.Rd:20: file link 'transmute' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/transmute.Rd:22: file link 'transmute' in package 'dplyr' does not exist and so has been treated as a topic
Rd warning: C:/Users/Byron/AppData/Local/Temp/Rtmp4K2WMi/R.INSTALL4a2837776241/tidylog/man/transmute.Rd:25: file link 'transmute' in package 'dplyr' does not exist and so has been treated as a topic
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (tidylog)
In R CMD INSTALL

Fewer NA bug on character conversion

Hi,

I believe I have loaded the most recent dev version and I am having an issue where tidylog thinks NAs are removed (n fewer NA) or that 0% are NAs when creating a new variable, but this is not the case.

Here is my code to reproduce the issue

library(tidyverse)
library(tidylog)
# set up df with some NA's in 'codes' variable
id <- 1:10
codes <- 1:10
codes[8:10] <- NA
df <- data.frame(id,codes)

#sometimes converting to character results in 'negative' new NA
# creating a new variable codes2 reports 0% NA when that is not the case
df2 <- df %>% 
  mutate(codes2 = sprintf('%02d',codes)) %>% 
  mutate(codes  = sprintf('%02d',codes))
identical(df2$codes,df2$codes2) 
df3 <- df %>% 
  mutate(codes2 = formatC(codes)) %>% 
  mutate(codes  = formatC(codes))
identical(df3$codes,df3$codes2) 

#but as.character reports correctly so bug isn't for all conversion functions
df4 <- df %>% 
  mutate(codes2 = as.character(codes)) %>% 
  mutate(codes  = as.character(codes))
identical(df4$codes,df4$codes2)
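One possible explanation, offered as a sketch rather than a confirmed diagnosis: in base R, sprintf() and formatC() turn a missing value into the literal string "NA", while as.character() keeps a true missing value. If the log counts missing values with is.na() (an assumption), the converted columns would contain no NA at all.

```r
codes <- c(1L, 2L, NA)

# sprintf() and formatC() produce the string "NA" for a missing value ...
from_sprintf <- sprintf("%02d", codes)
from_formatC <- formatC(codes)
# ... whereas as.character() keeps a real NA.
from_as_character <- as.character(codes)

# Count real missing values after each conversion.
na_counts <- c(
  sprintf = sum(is.na(from_sprintf)),
  formatC = sum(is.na(from_formatC)),
  as.character = sum(is.na(from_as_character))
)
print(na_counts)
```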

Little histograms in terminal output

skimr uses these, seen here: https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html

Could also be nice for mutate calls.

Add log about ungroup()

While doing this:

df %>% 
  group_by() %>% 
  summarize() %>% 
  ungroup()

I get this as last message: summarize: now xx rows and yy columns, zz group variables remaining, where xx, yy and zz are numbers.
It would be nice, I think, to have a message for ungroup() as well.
Something like: ungroup: no grouping variables left or just ungrouped.
What do you think? Thanks.

improve test coverage

Minor issue: uncount() breaks if data argument is explicit

It appears tidyr::uncount() breaks if data = argument is explicit. See example below:

library(tidyr)
library(tidylog)

library(conflicted)
for (f in getNamespaceExports("tidylog")) {
    conflicted::conflict_prefer(f, "tidylog", quiet = TRUE)
}

df <- tibble(x = c("a", "b"), n = c(1, 2))
uncount(df, n) # works fine
# uncount: now 3 rows and one column, ungrouped

uncount(data = df, weights = n)
# Error in log_summarize(.data, .fun = tidyr::uncount, .funname = "uncount",  : 
#   argument ".data" is missing, with no default
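The error can be reproduced in base R without tidylog. The sketch below assumes a wrapper whose first formal is named `.data`, as the error message suggests. Both wrapper names are hypothetical, and the second variant is only one conceivable remedy, not necessarily the one tidylog adopted.

```r
# Hypothetical wrapper whose first formal is `.data`.
wrapper_dot <- function(.data, ...) {
  nrow(.data)
}

# Hypothetical alternative that mirrors the wrapped function's argument name.
wrapper_same_name <- function(data, ...) {
  nrow(data)
}

df <- data.frame(x = c("a", "b"), n = c(1, 2))

# `data = df` does not match `.data`, so it lands in `...` and `.data` stays missing.
dot_result <- tryCatch(
  wrapper_dot(data = df),
  error = function(e) conditionMessage(e)
)

# With a matching formal name, the explicit argument is picked up as intended.
same_name_result <- wrapper_same_name(data = df)

print(dot_result)
print(same_name_result)
```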

Crayon the information

I can't express how much this package is improving my workflow! Brilliant.

I was wondering if you might consider crayoning the logs so they are more easily distinguishable from errors and warnings? This would help me a lot when running larger scripts where I have known warnings that I ignore, but want to catch errors.

Print filter again when removing rows?

Thanks for this awesome package!

I was wondering if it would be possible to not only print

filter: removed 29800 out of 74790 rows (40%)

but, for example:

filter: removed 29800 out of 74790 rows (40%), used filter(s): !is.na(user_id)

That is, the log could print what was actually being filtered. This would make it much easier to debug sources of errors in long pipes.

Mutating column to date

Maybe it's intentional, but shouldn't this say "Date" and not "double"?

data.frame(x = c("2016-01-01", "2016-01-02")) %>% 
  tidylog::mutate(x = as.Date(x))
#> mutate: converted 'x' from factor to double (0 new NA)
#>            x
#> 1 2016-01-01
#> 2 2016-01-02
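The difference is visible in base R: a Date is stored as a double but carries the class "Date". The sketch below assumes the log message is built from the storage type, which is a guess based on the output above.

```r
x <- as.Date(c("2016-01-01", "2016-01-02"))

storage_type <- typeof(x)  # how the values are stored internally
user_class <- class(x)     # what most users think of as the column type

print(c(typeof = storage_type, class = user_class))
```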

Tidylog non-functional in quarto documents

Hi!

Love the package, so I wanted to keep using it in Quarto documents, but there it seemingly doesn't do anything, at least for me.

Anything I can do to help debug?

tidylog output in html markdown

I'm creating an HTML document with R-Markdown. The resulting file includes all the messages from tidylog. Is there an option to turn these off in html/output mode?

log for bind_*() functions

Hi @elbersb!
Have you ever considered implementing logs for the dplyr functions bind_rows() and bind_cols()?
I am using them in pipes in combination with other dplyr functions, and it looks as if the pipe instructions didn't work as expected, although it is just a missing step in the log. What do you think about it?

Making output optional

I think I really could use this in my classes for demos.
But I would prefer to have an option to switch the additional output on and off.
Can this be done (or added)?

relevant functions from tidyr

see https://tidyr.tidyverse.org/reference/index.html

pivoting
uncount (-> summarize)

Other functions (hoist, unnest_longer, unnest_wider, unnest_auto, nest, unnest, pack, chop) I'd not consider for now.

Function masking workaround

I saw your post on Twitter about version 1.0.0 and I wanted to thank you for your work on this package! Inspired by one of the comments that expressed concern about overloading the dplyr and tidylog functions, I started work on a package, https://github.com/TylerGrantSmith/mask, that would allow you to use tidylog without loading it into the search path.

It is still just an early concept, but I wanted to get your opinion on functionality.

Here is an example reprex:

library(magrittr)

# clean searchpath
searchpaths()
#>  [1] ".GlobalEnv"                                            
#>  [2] "C:/Users/e014307/Documents/R/R-3.6.2/library/magrittr" 
#>  [3] "C:/Users/e014307/Documents/R/R-3.6.2/library/stats"    
#>  [4] "C:/Users/e014307/Documents/R/R-3.6.2/library/graphics" 
#>  [5] "C:/Users/e014307/Documents/R/R-3.6.2/library/grDevices"
#>  [6] "C:/Users/e014307/Documents/R/R-3.6.2/library/utils"    
#>  [7] "C:/Users/e014307/Documents/R/R-3.6.2/library/datasets" 
#>  [8] "C:/Users/e014307/Documents/R/R-3.6.2/library/methods"  
#>  [9] "Autoloads"                                             
#> [10] "tools:callr"                                           
#> [11] "C:/Users/e014307/DOCUME~1/R/R-36~1.2/library/base"

# mask expressions
mask::tidylog_mask(mtcars %>% 
  dplyr::select(mpg, cyl) %>% 
  dplyr::filter(mpg < 15) %>% 
  dplyr::group_by(cyl) %>%
  dplyr::summarise(mean = mean(mpg)))
#> select: dropped 9 variables (disp, hp, drat, wt, qsec, …)
#> filter: removed 27 rows (84%), 5 rows remaining
#> group_by: one grouping variable (cyl)
#> summarise: now one row and 2 columns, ungrouped
#> # A tibble: 1 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     8  12.6

# mask individual functions
select <- mask::tidylog_mask(dplyr::select)
mtcars %>% 
  select(mpg, cyl) %>% 
  dplyr::filter(mpg < 15)
#> select: dropped 9 variables (disp, hp, drat, wt, qsec, …)
#>    mpg cyl
#> 1 14.3   8
#> 2 10.4   8
#> 3 10.4   8
#> 4 14.7   8
#> 5 13.3   8

Created on 2020-01-08 by the reprex package (v0.3.0)

In addition, there is an RStudio addin that allows you to run selected code with tidylog using a keybinding.

print message when summarizing?

The tidylog package could also print a message after a summarize command to let the user know which groups are remaining, for instance:

data.frame(Titanic) %>%  
        group_by(Age, Class) %>%  
        summarize(Freq = sum(Freq)) %>%  
        mutate(Class = reorder(Class, Freq))                                                              
#> group_by: 8 groups [Age, Class] 
#> summarize: 2 groups remaining [Age]
#> mutate: changed 8 values (100%) of 'Class', factor levels updated

https://community.rstudio.com/t/understanding-group-by-order-matters/22685/6

Explicit messages even when no changes result

I'm not sure I have a strong opinion about it, but I would like to hear what others think about the following example: mtcars %>% select(hp, everything()). tidylog produces no message because nothing changed except the column order, but I still think it should produce a notification like select: no variables were dropped, consistent with how filter currently reports when no rows are dropped.

Tidylog function masking using `...` means RStudio auto-complete and function definition help is not available.

tidylog has been hugely helpful for me in my data analysis workflows (thank you!). However, its masking of tidyr and dplyr functions using ... notation means that I can no longer use RStudio auto-complete for function arguments, or hover over the function name to see the arguments.

I used to use this all the time with pivot_wider and pivot_longer where I always forget names_to vs names_from, for example. Now in order to do this I have to explicitly type tidyr::pivot_wider( to start the autocomplete or search for the help.

Is there a way to perform the masking without replacing the help functionality?

Tidylog masks dplyr warnings/messages

Tidylog masks dplyr warnings/messages. Can it be updated to reproduce these? For example

library(nycflights13)
library(tidyverse)
library(tidylog)

flights |> group_by(year, month, day) %>% dplyr::summarise(n = n())
This provides the following message:
summarise() has grouped output by 'year', 'month'. You can override using the .groups argument.

flights |> group_by(year, month, day) %>% tidylog::summarise(n = n())
This provides the following message:
group_by: 3 grouping variables (year, month, day)
summarise: now 365 rows and 4 columns, 2 group variables remaining (year, month)

Hints in the autocomplete

I love tidylog, but a caveat is that we don't have the same hints about the parameters, when you are coding and hit ctrl+space, for instance, that you have when you use the original tidyverse packages.

Do you plan to improve this?

error in join when pipe passes to y

I'm really liking this package - thank you! I did come across an error: if you pass the data from the pipe to y in a join (as opposed to x), it throws an error. You can work around it by calling the dplyr version.

# ## merge data sets to include all visits (even those w/o Rx)
analysis <- analysis %>% 
   left_join(x = patient, y = .) %>% 
   mutate_at(vars(prescribed_opioid), factor, levels = c(NA, TRUE, FALSE), 
             labels = c('No Rx', 'Prescribed Opioid', 'Not Prescribed Opioid'), 
             exclude = NULL)
# Error in log_join(.data, dplyr::left_join, "left_join", ...) :  
#   argument ".data" is missing, with no default

### Works:
analysis <- analysis %>% 
   dplyr::left_join(x = patient, y = .) %>% 
  mutate_at(vars(prescribed_opioid), factor, levels = c(NA, TRUE, FALSE), 
             labels = c('No Rx', 'Prescribed Opioid', 'Not Prescribed Opioid'), 
             exclude = NULL)
# Joining, by = "Encounter_id"

### Also works:
analysis <- patient %>% 
  left_join(analysis) %>% 
  mutate_at(vars(prescribed_opioid), factor, levels = c(NA, TRUE, FALSE), 
             labels = c('No Rx', 'Prescribed Opioid', 'Not Prescribed Opioid'), 
             exclude = NULL)
# Joining, by = "Encounter_id"
# left_join: added 0 rows and added one column (prescribed_opioid)
# mutate_at: converted 'prescribed_opioid' from logical to factor (-101770 new NA)

The latter also works, but it tells you how many rows were added to patient, not analysis.
This is my first public issue and I'm sorry if I've done anything incorrectly.

format "X fewer NA"

 tibble(x=rep(NA_real_, 1000000)) %>% mutate(x = 1)                                                                           
# mutate: changed 1,000,000 values (100%) of 'x' (1000000 fewer NA)

1000000 is not formatted

Write tidylog messages to log file?

I have some scripts where it would be useful to skim a hypothetical log of the messages.

Perhaps via options or an analogous {logger} function? E.g.:

library(tidylog, include.only = c("filter", "mutate", "left_join"))
tidylog_log <- "tidylog"
tidylog_appender(appender_file(tidylog_log))
...
<script>
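Until something like this exists, messages can be redirected with base R alone. The sketch below assumes tidylog emits its output via message(). `log_messages_to` is a hypothetical helper, and the message inside the example is a stand-in for real tidylog output.

```r
# Hypothetical helper: evaluate `expr`, appending every message to `path`
# instead of printing it to the console.
log_messages_to <- function(path, expr) {
  withCallingHandlers(
    expr,
    message = function(m) {
      cat(conditionMessage(m), file = path, append = TRUE)
      invokeRestart("muffleMessage")
    }
  )
}

log_file <- tempfile(fileext = ".log")

result <- log_messages_to(log_file, {
  # Stand-in for a message that a logged pipeline step would emit.
  message("filter: removed 2 rows (20%), 8 rows remaining")
  42
})

logged <- readLines(log_file)
print(logged)
```

The value of the expression is returned unchanged, so the helper can wrap a whole pipeline without altering its result.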

Mutate doesn't work with set_units()

The title says it all. See code below. Also, thanks for the great package. You have no idea how frustrating debugging pipes had been as a newbie until I discovered tidylog :)

library(tidyverse)
library(units)

df <- tribble(~A, ~B,
        1, 2,
        2, 3)

df %>%
  mutate(B = set_units(B , mg)) %>%
  print()
# A tibble: 2 x 2

library(tidylog)

df %>%
  mutate(B = set_units(B , mg)) %>%
  print() 
# Error in Ops.units(new, old) : 
#  both operands of the expression should be "units" objects

add number of remaining rows

For a lazy person who does not want to do the math :), is it possible to add the number of remaining rows in the data (in addition to the current numbers)?

What I often care about is the number of remaining rows:

filter: removed 287 out of 761 rows (38%), 474 remaining
filter: removed 230 out of 474 rows (49%), 244 remaining

Thanks!

Make ungroup log message more explicit

Hi folks, firstly: I LOVE this package, it's so helpful!

Small one, relating to #32 : ungrouping a grouped object tells you:

df %<>% ungroup

ungroup: no grouping variables

Which (IMHO) is unclear/ambiguous as to whether the input object HAD no grouping variables to begin with (i.e. "ungroup didn't have anything to do"), or if it HAS no grouping variables NOW, thanks to the ungroup call (i.e. "ungrouped successfully").

Maybe:

ungroup: previously grouped by ID [count], now ungrouped

Cheers!

Negative new NAs

Loving the package!

Very minor issue: when mutate is used to replace NA values with non-NA values, tidylog will report (-X new NA), where X is the number of values that were NA but no longer are.

It's still providing all the information it should, but the message is a little strange.

Example code:

library(tidyverse)
library(tidylog)
df <- tibble(a = NA,
                 b = rnorm(100)) %>%
  mutate(a=ifelse(b<0,1,a))
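A small base R sketch of how the wording could depend on the sign of the change. `describe_na_change` is a hypothetical helper, not tidylog code, and the "fewer NA" phrasing is borrowed from messages quoted in other issues in this list.

```r
# Hypothetical helper: describe how the number of missing values changed.
describe_na_change <- function(old, new) {
  diff <- sum(is.na(new)) - sum(is.na(old))
  if (diff < 0) {
    sprintf("%d fewer NA", -diff)
  } else {
    sprintf("%d new NA", diff)
  }
}

old <- c(NA, NA, NA, 1)
new <- c(1, NA, 1, 1)

fewer <- describe_na_change(old, new)  # two NAs were filled in
more <- describe_na_change(new, old)   # two NAs were introduced
print(c(fewer, more))
```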

Overwriting of nested columns not possible with tidylog::mutate

When I overwrite a nested column with mutate it works with dplyr::mutate() but not with tidylog::mutate()

Check example below

library(tidyverse)
library(tidylog)

# This doesn't work: Error in new != old : comparison of these types is not implemented
as_tibble(iris) %>% nest(-Species) %>% tidylog::mutate(data = data %>% map(function(data) {data$Sepal.Length + data$Sepal.Width}))

# This works
as_tibble(iris) %>% nest(-Species) %>% dplyr::mutate(data = data %>% map(function(data) {data$Sepal.Length + data$Sepal.Width}))

group_by logs incorrect value

Great idea! I was planning on using it to teach the tidyverse next week and noticed that group_by() logs an incorrect value. I am using your code example too, so I'm not sure what is going on here. sessionInfo() output follows, in case that helps.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidylog)
#> 
#> Attaching package: 'tidylog'
#> The following objects are masked from 'package:dplyr':
#> 
#>     anti_join, distinct, filter, filter_all, filter_at, filter_if,
#>     full_join, group_by, group_by_all, group_by_at, group_by_if,
#>     inner_join, left_join, mutate, mutate_all, mutate_at,
#>     mutate_if, right_join, select, select_all, select_at,
#>     select_if, semi_join, transmute, transmute_all, transmute_at,
#>     transmute_if
#> The following object is masked from 'package:stats':
#> 
#>     filter
summary <- mtcars %>%
  select(mpg, cyl, hp) %>%
  filter(mpg > 15) %>%
  mutate(mpg_round = round(mpg)) %>%
  group_by(cyl, mpg_round) %>%
  tally() %>%
  filter(n >= 1)
#> select: dropped 8 variables (disp, drat, wt, qsec, vs, …) 
#> filter: removed 6 rows (19%) 
#> mutate: new variable 'mpg_round' with 15 unique values and 0% NA 
#> group_by: 0 groups [] 
#> filter: no rows removed


sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
#>  [9] highr_0.7       knitr_1.21      stringr_1.3.1   xfun_0.4       
#> [13] digest_0.6.18   evaluate_0.12

New dplyr join_by breaks with tidylog

Love the package. Has been immensely helpful for verifying operations are running as expected when walking through long chains.

I noticed I now receive an error when running the new join_by functionality in dplyr 1.1.0 while tidylog is loaded. Reproducible example:

library(tidylog)
library(dplyr)

sales <- tibble(
    id = c(1L, 1L, 1L, 2L, 2L),
    sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)

promos <- tibble(
    id = c(1L, 1L, 2L),
    promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)

by <- join_by(id, sale_date == promo_date)
left_join(sales, promos, by)

Produces the following error:

Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not a <dplyr_join_by> object.

use rlang::inform instead of message

have to wait for this bug to be fixed: r-lib/rlang#880

report type for new variables

currently:

mutate: new variable 'NAME' with X unique values and Y% NA

better:

mutate: new variable 'NAME' (logical) with X unique values and Y% NA
mutate: new variable 'NAME' (numeric) with X unique values and Y% NA
mutate: new variable 'NAME' (date) with X unique values and Y% NA
mutate: new variable 'NAME' (factor) with X unique values and Y% NA
# etc.

support tally, count, add_tally, add_count

tally, count should probably go to summarize
add_tally, add_count should probably go to mutate

https://dplyr.tidyverse.org/reference/tally.html

Report merge stats in joins (enhancement)

Hey Benjamin,

Thanks for the cool package!

Wouldn't it be useful to report what share of rows have been dropped?

So for this:

x <- inner_join(band_members, band_instruments, by = "name")

show:
33% of left dataframe and 33% of right dataframe's rows dropped.

I find that these joins are often a great source for bugs in my programs and it often has to do with losing many rows in the either left or right dataframe.

Best,
Lukas
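The requested shares can be computed from the join keys alone. This is a base R sketch under stated assumptions: the two data frames are small stand-ins modeled on dplyr's band_members and band_instruments, and `share_unmatched` is a hypothetical helper.

```r
# Stand-in data modeled on dplyr's example tables.
band_members <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles")
)
band_instruments <- data.frame(
  name = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar")
)

# Hypothetical helper: percentage of keys without a match on the other side.
share_unmatched <- function(keys, other_keys) {
  as.integer(round(100 * mean(!(keys %in% other_keys))))
}

left_dropped <- share_unmatched(band_members$name, band_instruments$name)
right_dropped <- share_unmatched(band_instruments$name, band_members$name)

msg <- sprintf(
  "%d%% of left dataframe and %d%% of right dataframe's rows dropped.",
  left_dropped, right_dropped
)
print(msg)
```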

error when variable name is f, fu, fun etc.

library("tidyverse")
library("tidylog", warn.conflicts = FALSE)
mutate(mtcars, f = 1)
#> Error in log_mutate(.data, dplyr::mutate, "mutate", ...): argument 4 matches multiple formal arguments
mutate(mtcars, fu = 1)
#> Error in log_mutate(.data, dplyr::mutate, "mutate", ...): argument 4 matches multiple formal arguments
mutate(mtcars, fun = 1)
#> Error in fun(.data, ...): could not find function "fun"
mutate(mtcars, g = 1)
#> mutate: new variable 'g' with one unique value and 0% NA

Created on 2019-05-24 by the reprex package (v0.2.1)
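The errors above look like R's partial argument matching: a named argument that is a prefix of a formal listed before `...` gets matched to that formal. The sketch below uses hypothetical wrappers to illustrate the mechanism. The dotted names follow the `.fun` and `.funname` formals visible in error messages from other issues in this list.

```r
# Hypothetical wrapper with undotted formals before `...`.
fragile_wrapper <- function(.data, fun, funname, ...) {
  names(list(...))
}

# Hypothetical wrapper with dotted formals.
robust_wrapper <- function(.data, .fun, .funname, ...) {
  names(list(...))
}

# `f` is a prefix of both `fun` and `funname`, so the match is ambiguous.
fragile_result <- tryCatch(
  fragile_wrapper(mtcars, identity, "mutate", f = 1),
  error = function(e) conditionMessage(e)
)

# `f` is not a prefix of `.fun` or `.funname`, so it passes through `...`.
robust_result <- robust_wrapper(mtcars, identity, "mutate", f = 1)

print(fragile_result)
print(robust_result)
```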

Display actual dataset names in join message

It would be neat if the message for joins displayed the actual dataset names rather than the generic x and y.

joined <- left_join(nycflights13::flights, nycflights13::weather,
    by = c("year", "month", "day", "origin", "hour", "time_hour"))
#> left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)
#>            > rows only in nycflights13::flights     1,556
#>            > rows only in nycflights13::weather  (  6,737)
#>            > matched rows                         335,220
#>            >                                     =========
#>            > rows total                           336,776

Can/Should tidylog::filter throw a message or warning on 'missing' filter group?

Suppose I want to filter some data for three groups ('A', 'B', 'C'). I assume the presence of all three, but my data for whatever reason only has 'A' and 'B' and 'D'. It would be useful to receive immediate feedback that part of my filter wasn't there (maybe because of a typo or other data pipeline reasons). A warning might be too strong, but a message as part of the tidylog message maybe? See the reprex below.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
suppressPackageStartupMessages(library(tidylog))

df <- data.frame(
  x = sample(c("A", "B", "D"), 10, replace = TRUE),
  y = rnorm(10)
)

df %>% 
  filter(x %in% c("A", "B", "C")) %>% # Warning that there are no 'C' rows?
  summarize(my = mean(y))
#> filter: removed 2 rows (20%), 8 rows remaining
#> summarize: now one row and one column, ungrouped
#>           my
#> 1 -0.2840769

Created on 2022-01-24 by the reprex package (v0.3.0)
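As a workaround in the meantime, the missing values can be checked by hand before filtering. This base R sketch uses fixed data instead of sample() so that the result is deterministic. The wording of the message is only a suggestion.

```r
df <- data.frame(
  x = c("A", "B", "D", "A", "B"),
  y = c(1, 2, 3, 4, 5)
)
wanted <- c("A", "B", "C")

# Values requested in the filter that never occur in the column.
missing_values <- setdiff(wanted, unique(df$x))

if (length(missing_values) > 0) {
  message("filter: no rows found for ", toString(missing_values))
}
```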

Can `tidylog::filter` also report the number of groups

Hope this is a simple question: can tidylog::filter also report the number of groups? Or what would be a workaround?

For example, current
filter (grouped): removed 78,720 rows (40%), 118,956 rows remaining
and desired
filter (grouped): removed 78,720 rows (40%), 118,956 rows remaining; removed 100 groups (10%), 900 groups remaining

Renaming while selecting gives confusing message.

The code

tibble(x=1,y=2,z=3) %>% 
  select(a=x,b=y)

results in select: dropped 3 variables (x, y, z). While technically true, I would rather call it "renamed 2 variables, dropped 1 variable".

`mutate` does not report dropped variables

Currently if you execute the following, we do not get information about the dropped mass column:

require(tidyverse)
require(tidylog)
starwars %>%
 select(name, height, mass, homeworld) %>%
 mutate(
  mass = NULL,
  height = height * 0.0328084 # convert to feet
)

We only get back:

mutate: converted height from integer to double (0 new NA)

I believe some of the same logic used to generate the dropped variables message for transmute could be used to cover this case as well.

overwriting ordered factors using fct_recode fails

library(tidyverse)
d = tibble(x = ordered(c("apple", "bear", "banana", "dear"))) 
# works
fct_recode(d$x, fruit = "apple", fruit = "banana")
#> [1] fruit bear  fruit dear 
#> Levels: fruit < bear < dear

# works when using a new variable
tidylog::mutate(d, y = fct_recode(x, fruit = "apple", fruit = "banana"))
#> mutate: new variable 'y' with 3 unique values and 0% NA
#> # A tibble: 4 x 2
#>   x      y    
#>   <ord>  <ord>
#> 1 apple  fruit
#> 2 bear   bear 
#> 3 banana fruit
#> 4 dear   dear

# fails when overwriting
tidylog::mutate(d, x = fct_recode(x, fruit = "apple", fruit = "banana"))
#> Error in Ops.factor(new, old): level sets of factors are different

# works with factor
d = tibble(x = factor(c("apple", "bear", "banana", "dear"))) 
tidylog::mutate(d, x = fct_recode(x, fruit = "apple", fruit = "banana"))
#> mutate: changed 2 values (50%) of 'x' (0 new NA)
#> # A tibble: 4 x 1
#>   x    
#>   <fct>
#> 1 fruit
#> 2 bear 
#> 3 fruit
#> 4 dear
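The failure can be reproduced in base R: comparing two ordered factors with different level sets raises an error. One conceivable way around it, sketched here with a hypothetical helper and not taken from tidylog, is to compare the character representations instead.

```r
# Hypothetical helper: count changed values without comparing factors directly.
count_changed <- function(old, new) {
  old_chr <- as.character(old)
  new_chr <- as.character(new)
  both_present <- !is.na(old_chr) & !is.na(new_chr)
  # A value counts as changed if it became NA or stopped being NA ...
  differs <- is.na(old_chr) != is.na(new_chr)
  # ... or if both values are present and their labels differ.
  differs[both_present] <- old_chr[both_present] != new_chr[both_present]
  sum(differs)
}

old <- ordered(c("apple", "bear", "banana", "dear"))
new <- ordered(c("fruit", "bear", "fruit", "dear"),
               levels = c("fruit", "bear", "dear"))

# The direct comparison fails because the level sets differ.
direct <- tryCatch(sum(new != old), error = function(e) conditionMessage(e))

changed <- count_changed(old, new)
print(direct)
print(changed)
```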

arrange is not supported

Considering arrange() is part of dplyr, I would have expected support for it. For example:

iris %>%
  group_by(Species) %>%
  summarise(mean_Sepal_Length = mean(Sepal.Length), .groups = "drop") %>%
  arrange(desc(mean_Sepal_Length))

now gives:

group_by: one grouping variable (Species)
summarise: now 3 rows and 2 columns, ungrouped

while I would expect something like

group_by: one grouping variable (Species)
summarise: now 3 rows and 2 columns, ungrouped
arrange: now arranged in descending order by one variable (mean_Sepal_Length)

I think that would be a nice addition, if it would fit within the logic of this package.
