
visdat's Introduction

visdat

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

How to install

visdat is available on CRAN

install.packages("visdat")

If you would like to use the development version, install from github with:

# install.packages("devtools")
devtools::install_github("ropensci/visdat")

What does visdat do?

Initially inspired by csv-fingerprint, visdat helps you visualise a dataframe and “get a look at the data” by displaying the variable classes in a dataframe as a plot with vis_dat(), and getting a brief look into missing data patterns using vis_miss().

visdat has 8 functions:

  • vis_dat() visualises a dataframe showing you what the classes of the columns are, and also displaying the missing data.

  • vis_miss() visualises just the missing data, and allows for missingness to be clustered and columns rearranged. vis_miss() is similar to missing.pattern.plot from the mi package. Unfortunately missing.pattern.plot is no longer in the mi package (as of 14/02/2016).

  • vis_compare() visualises differences between two dataframes of the same dimensions.

  • vis_expect() visualises where certain conditions hold true in your data.

  • vis_cor() visualises the correlation of variables in a nice heatmap.

  • vis_guess() visualises the individual class of each value in your data.

  • vis_value() visualises the value of each cell in your data on a 0 to 1 scale.

  • vis_binary() visualises the occurrence of binary values in your data.

You can read more about visdat in the vignette, “using visdat”: https://docs.ropensci.org/visdat/articles/using_visdat.html

Code of Conduct

Please note that the visdat project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Examples

Using vis_dat()

Let’s see what’s inside the airquality dataset from base R, which contains information about daily air quality measurements in New York from May to September 1973. More information about the dataset can be found with ?airquality.

library(visdat)

vis_dat(airquality)

The plot above tells us that R reads this dataset as having numeric and integer values, with some missing data in Ozone and Solar.R. The classes are represented on the legend, and missing data represented by grey. The column/variable names are listed on the x axis.

The vis_dat() function also has a facet argument, so you can create small multiples of a similar plot for a level of a variable, e.g., Month:

vis_dat(airquality, facet = Month)

The facet argument is currently also available in vis_miss() and vis_cor().

Using vis_miss()

We can explore the missing data further using vis_miss():

vis_miss(airquality)

Percentages of missing/complete in vis_miss() are accurate to the integer (whole number). To get more accurate and thorough exploratory summaries of missingness, I would recommend the naniar R package.

You can cluster the missingness by setting cluster = TRUE:

vis_miss(airquality, 
         cluster = TRUE)

Columns can also be ordered so that those with the most missingness come first, by setting sort_miss = TRUE:

vis_miss(airquality,
         sort_miss = TRUE)

vis_miss indicates when there is a very small amount of missing data at <0.1% missingness:

test_miss_df <- data.frame(x1 = 1:10000,
                           x2 = rep("A", 10000),
                           x3 = c(rep(1L, 9999), NA))

vis_miss(test_miss_df)

vis_miss will also indicate when there is no missing data at all:

vis_miss(mtcars)

To further explore the missingness structure in a dataset, I recommend the naniar package, which provides more general tools for graphical and numerical exploration of missing values.

Using vis_compare()

Sometimes you want to see what has changed in your data. vis_compare() displays the differences in two dataframes of the same size. Let’s look at an example.

Let’s make some changes to the chickwts, and compare this new dataset:

set.seed(2019-04-03-1105)
chickwts_diff <- chickwts
chickwts_diff[sample(1:nrow(chickwts), 30),sample(1:ncol(chickwts), 2)] <- NA

vis_compare(chickwts_diff, chickwts)

Here the differences are marked in blue.

If you try and compare differences when the dimensions are different, you get an ugly error:

chickwts_diff_2 <- chickwts
chickwts_diff_2$new_col <- chickwts_diff_2$weight*2

vis_compare(chickwts, chickwts_diff_2)
# Error in vis_compare(chickwts, chickwts_diff_2) : 
#   Dimensions of df1 and df2 are not the same. vis_compare requires dataframes of identical dimensions.
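
One way to avoid hitting this error is to check the dimensions before calling vis_compare(). A minimal base R sketch; same_dims() is a hypothetical helper used for illustration and is not part of visdat:

```r
# Hypothetical helper (not part of visdat): TRUE when two data frames
# have the same number of rows and columns.
same_dims <- function(df1, df2) {
  identical(dim(df1), dim(df2))
}

chickwts_diff_2 <- chickwts
chickwts_diff_2$new_col <- chickwts_diff_2$weight * 2

same_dims(chickwts, chickwts)         # TRUE
same_dims(chickwts, chickwts_diff_2)  # FALSE, so vis_compare() would error
```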

Using vis_expect()

vis_expect() visualises certain conditions or values in your data. For example, if you are not sure whether to expect values greater than or equal to 25 in your data (airquality), you could write vis_expect(airquality, ~.x >= 25) to see where the values in your data are greater than or equal to 25:

vis_expect(airquality, ~.x >= 25)

This shows the proportion of times that there are values greater than or equal to 25, as well as the missings.
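
If you want numbers to go with the plot, the same condition can be checked in base R. This is only a rough cross-check, and I am assuming (not guaranteeing) that it lines up with what the plot reports:

```r
# Logical matrix: TRUE where a cell is >= 25, NA where the cell is missing
meets_expectation <- as.matrix(airquality) >= 25

# Proportion of non-missing cells that meet the expectation
prop_true <- mean(meets_expectation, na.rm = TRUE)

# Proportion of cells that are missing
prop_missing <- mean(is.na(meets_expectation))
```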

Using vis_cor()

To make it easy to plot correlations of your data, use vis_cor:

vis_cor(airquality)

Using vis_value()

vis_value() visualises the values of your data on a 0 to 1 scale.

vis_value(airquality)
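
To give a feel for what a 0 to 1 scale means, here is a small base R sketch. It assumes each column is rescaled independently with min-max scaling, which may not be exactly what vis_value() does internally; rescale01() is a hypothetical helper, not a visdat function:

```r
# Assumption: each column is rescaled independently with min-max scaling.
# rescale01() is a hypothetical helper, not a visdat function.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- as.data.frame(lapply(airquality, rescale01))
range(scaled$Temp)  # 0 1
```

Missing values stay missing after rescaling, and a constant column would give NaN with this sketch.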

It only works on numeric data, so you might get strange results if you are using factors:

library(ggplot2)
vis_value(iris)
data input can only contain numeric values, please subset the data to the numeric values you would like. dplyr::select_if(data, is.numeric) can be helpful here!

So you might need to subset the data beforehand like so:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>%
  select_if(is.numeric) %>%
  vis_value()

Using vis_binary()

vis_binary() visualises binary values. See below for use with the example data, dat_bin.

vis_binary(dat_bin)

If your data contains anything other than binary values, an error will be shown.

vis_binary(airquality)
Error in test_if_all_binary(data) : 
  data input can only contain binary values - this means either 0 or 1, or NA. Please subset the data to be binary values, or see ?vis_value.
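
You can check whether a dataset qualifies before plotting. A minimal base R sketch; all_binary() is a hypothetical helper and is not the internal check that visdat uses:

```r
# Hypothetical helper (not the visdat internal): TRUE when every value
# in the data frame is 0, 1, or NA.
all_binary <- function(data) {
  all(vapply(data, function(col) all(col %in% c(0, 1, NA)), logical(1)))
}

all_binary(data.frame(a = c(0, 1, NA), b = c(1L, 1L, 0L)))  # TRUE
all_binary(airquality)                                      # FALSE
```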

Using vis_guess()

vis_guess() takes a guess at what each cell is. It’s best illustrated using some messy data, which we’ll make here:

messy_vector <- c(TRUE,
                  T,
                  "TRUE",
                  "T",
                  "01/01/01",
                  "01/01/2001",
                  NA,
                  NaN,
                  "NA",
                  "Na",
                  "na",
                  "10",
                  10,
                  "10.1",
                  10.1,
                  "abc",
                  "$%TG")

set.seed(2019-04-03-1106)
messy_df <- data.frame(var1 = messy_vector,
                       var2 = sample(messy_vector),
                       var3 = sample(messy_vector))
vis_guess(messy_df)
vis_dat(messy_df)

Here vis_guess() shows that there are many different kinds of data in the dataframe, whereas vis_dat() reports a single class for each column. As an analyst this might be a depressing finding.

Thank yous

Thank you to Ivan Hanigan who first commented this suggestion after I made a blog post about an initial prototype ggplot_missing, and Jenny Bryan, whose tweet got me thinking about vis_dat, and for her code contributions that removed a lot of errors.

Thank you to Hadley Wickham for suggesting the use of the internals of readr to make vis_guess work. Thank you to Miles McBain for his suggestions on how to improve vis_guess. This resulted in making it at least 2-3 times faster. Thanks to Carson Sievert for writing the code that combined plotly with visdat, and to Noam Ross for suggesting this in the first place. Thank you also to Earo Wang and Stuart Lee for their help with capturing expressions in vis_expect.

Finally, thank you to rOpenSci and its amazing onboarding process. This process has made visdat a much better package, thanks to the editor Noam Ross (@noamross), and the reviewers Sean Hughes (@seaaan) and Mara Averick (@batpigandme).


visdat's People

Contributors

anthonyraborn, batpigandme, cpsievert, cregouby, jeroen, jimhester, jrosell, maelle, muschellij2, njtierney, noamross, rekyt, romainfrancois, seaaan, thisisnic, tierneyn, zeehio


visdat's Issues

vis_dat_ly currently doesn't work

I'm probably not doing a great job of following the instructions from the plotly site, and I haven't used pure plotly syntax that much, and I don’t work with plotting matrices that often, so I'm probably missing something obvious here, but when I try:

 # apply the fingerprint function to get the class
  d <- x %>% purrr::dmap(fingerprint) %>% as.matrix()

  # heatmap fails due to not being a numeric matrix
  # heatmap(d)

  # plotly fails due to the number of colours being too many?
  plotly::plot_ly(z = d,
                  type = "heatmap")

I get the error message

Warning messages:
1: In RColorBrewer::brewer.pal(N, "Set2") :
  n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors

2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

I get the feeling that I might need to convert the classes into numbers, and provide distinct colours for each class, and perhaps a link/dictionary for each class. Or perhaps be a little more descriptive with the axes.
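
As a rough sketch of what that conversion could look like in base R (using airquality as stand-in data, and class() in place of the fingerprint function, so this is an assumption about the shape of the solution rather than working visdat code):

```r
# Build a character matrix holding the class of every cell, with
# missing cells labelled "NA". Uses airquality as example data.
x <- airquality
col_classes <- vapply(x, function(col) class(col)[1], character(1))
class_mat <- matrix(rep(col_classes, each = nrow(x)),
                    nrow = nrow(x),
                    dimnames = list(NULL, names(x)))
class_mat[is.na(x)] <- "NA"

# Dictionary of the distinct classes, and a numeric matrix of codes
class_lookup <- unique(as.vector(class_mat))
z <- matrix(match(class_mat, class_lookup),
            nrow = nrow(class_mat),
            dimnames = dimnames(class_mat))
```

Here z is a numeric matrix that a heatmap should accept, and class_lookup[z] recovers the class labels for a legend or hover text.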

@cpsievert , any chance you could point me in the right direction to fix this?

make `vis_guess` fast

Currently vis_guess is a bit slow and it'd be nice to make it faster.

We found that using vapply was much faster than using purrr, which is a bit of a shame as I think that the current code doesn't read as nice as purrr code (dare I say it, it doesn't purrr).

The way I see it we have two options to really speed things up.

  1. Write some C++ code based on the readr:::collectorGuess code, so that it deals with a vector, rather than a single element.
  2. Use one of the functions from the parallel package, like mclapply.
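
For reference, the vapply pattern looks roughly like this. guess_cell() is a hypothetical stand-in for the real guessing code, so the sketch only shows the shape of the loop, not the speed-up:

```r
# Hypothetical single-cell guesser, standing in for the real guessing code
guess_cell <- function(value) {
  if (is.na(value)) {
    "NA"
  } else if (!is.na(suppressWarnings(as.numeric(value)))) {
    "number"
  } else {
    "character"
  }
}

cells <- c("10", "10.1", "abc", NA)

# vapply() states the type and length of each result up front
guesses <- vapply(cells, guess_cell, character(1), USE.NAMES = FALSE)
guesses
```

Because the work is one independent call per cell, the same function could in principle be handed to parallel::mclapply() on Unix-alike systems, at the cost of getting a list back instead of a character vector.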

add ORCID to DESCRIPTION

For example:

Title: Generalised Management Strategy Evaluation Simulator 
Description: Integrates game theory and ecological theory to construct social-ecological models that simulate the management of populations and stakeholder actions. These models build off of a previously developed management strategy evaluation (MSE) framework to simulate all aspects of management: population dynamics, manager observation of populations, manager decision making, and stakeholder responses to management decisions. The newly developed generalised management strategy evaluation (GMSE) framework uses genetic algorithms to mimic the decision-making process of managers and stakeholders under conditions of change, uncertainty, and conflict. Simulations can be run using gmse(), gmse_apply(), and gmse_gui() functions. 
Author: A. Bradley Duthie [aut, cre] (0000-0001-8343-4995), Nils Bunnefeld [ctb, fnd] (0000-0002-1349-4463), Jeremy Cusack [ctb] (0000-0003-3004-1586), Isabel Jones [ctb] (0000-0002-8361-1370), Erlend Nilsen [ctb] (0000-0002-5119-8331), Rocio Pozo [ctb] (0000-0002-7546-8076), Sarobidy Rakotonarivo [ctb] (0000-0002-8032-1431), Bram Van Moorter [ctb] (0000-0002-3196-1993) 
Maintainer: A. Bradley Duthie <[email protected]> 

Column headers not aligning with many/long names present.

It looks like the col headers need to be moved right and upward to align with the columns. It gets a bit tricky with long names as they might approach the plot boundary.

As a workaround I replaced the names with indices.

library(tibble)
library(visdat)
library(magrittr)
tibble::tribble(
  ~left_eye_center_x, ~left_eye_center_y, ~right_eye_center_x, ~right_eye_center_y, ~left_eye_inner_corner_x, ~left_eye_inner_corner_y, ~left_eye_outer_corner_x, ~left_eye_outer_corner_y, ~right_eye_inner_corner_x, ~right_eye_inner_corner_y, ~right_eye_outer_corner_x, ~right_eye_outer_corner_y, ~left_eyebrow_inner_end_x, ~left_eyebrow_inner_end_y, ~left_eyebrow_outer_end_x, ~left_eyebrow_outer_end_y, ~right_eyebrow_inner_end_x, ~right_eyebrow_inner_end_y, ~right_eyebrow_outer_end_x, ~right_eyebrow_outer_end_y,   ~nose_tip_x,   ~nose_tip_y, ~mouth_left_corner_x, ~mouth_left_corner_y, ~mouth_right_corner_x, ~mouth_right_corner_y, ~mouth_center_top_lip_x, ~mouth_center_top_lip_y, ~mouth_center_bottom_lip_x, ~mouth_center_bottom_lip_y,
       66.0335639098,      39.0022736842,       30.2270075188,       36.4216781955,             59.582075188,            39.6474225564,            73.1303458647,            39.9699969925,             36.3565714286,             37.3894015038,             23.4528721805,             37.3894015038,             56.9532631579,             29.0336481203,             80.2271278195,             32.2281383459,              40.2276090226,              29.0023218045,              16.3563789474,              29.6474706767, 44.4205714286, 57.0668030075,        61.1953082707,        79.9701654135,         28.6144962406,         77.3889924812,           43.3126015038,           72.9354586466,              43.1307067669,              84.4857744361,
       64.3329361702,      34.9700765957,       29.9492765957,       33.4487148936,            58.8561702128,            35.2743489362,            70.7227234043,            36.1871659574,             36.0347234043,             34.3615319149,             24.4725106383,             33.1444425532,             53.9874042553,             28.2759489362,              78.634212766,             30.4059234043,              42.7288510638,              26.1460425532,              16.8653617021,              27.0588595745, 48.2062978723, 55.6609361702,        56.4214468085,               76.352,         35.1223829787,         76.0476595745,           46.6845957447,           70.2665531915,              45.4679148936,              85.4801702128,
       65.0570526316,      34.9096421053,       30.9037894737,       34.9096421053,                   59.412,            36.3209684211,            70.9844210526,            36.3209684211,             37.6781052632,             36.3209684211,             24.9764210526,             36.6032210526,             55.7425263158,             27.5709473684,             78.8873684211,             32.6516210526,              42.1938947368,              28.1354526316,              16.7911578947,              32.0871157895, 47.5572631579, 53.5389473684,        60.8229473684,        73.0143157895,         33.7263157895,                72.732,           47.2749473684,           70.1917894737,              47.2749473684,              78.6593684211,
       65.2257391304,       37.261773913,       32.0230956522,        37.261773913,            60.0033391304,            39.1271791304,            72.3147130435,            38.3809669565,             37.6186434783,             38.7541147826,             25.3072695652,             38.0079026087,             56.4338086957,             30.9298643478,             77.9102608696,             31.6657252174,              41.6715130435,              31.0499895652,              20.4580173913,              29.9093426087, 51.8850782609, 54.1665391304,        65.5988869565,        72.7037217391,         37.2454956522,         74.1954782609,           50.3031652174,           70.0916869565,              51.5611826087,              78.2683826087,
       66.7253006135,      39.6212613497,        32.244809816,       38.0420319018,            58.5658895706,            39.6212613497,            72.5159263804,            39.8844662577,             36.9823803681,             39.0948515337,             22.5061104294,             38.3052368098,             57.2495705521,             30.6721766871,             77.7629447853,             31.7372466258,              38.0354355828,              30.9353815951,              15.9258699387,              30.6721766871, 43.2995337423, 64.8895214724,        60.6714110429,        77.5232392638,         31.1917546012,         76.9973006135,           44.9627484663,           73.7073865031,              44.2271411043,              86.8711656442,
       69.6807476636,      39.9687476636,       29.1835514019,        37.563364486,            62.8642990654,             40.169271028,            76.8982429907,            41.1718878505,              36.401046729,             39.3676261682,             21.7655327103,             38.5655327103,             59.7662803738,             31.6512897196,             83.3136448598,             35.3580560748,                     39.408,              30.5463925234,              14.9490841121,              32.1501308411, 52.4684859813,          58.8,        64.8690841121,        82.4711775701,         31.9904299065,         81.6690841121,           49.3081121495,           78.4876261682,              49.4323738318,              93.8987663551
  ) %>%
  vis_dat()

should vis_miss and vis_dat be consistent with each other?

Make some design decisions regarding whether or not vis_miss and vis_dat should be consistent.

E.g., the default column order for vis_miss is different to the column order for vis_dat.

Should it be the same? I like that vis_dat automatically orders by column type, but that's just what I think.

Anyone?

Suggestion: return the percentage of missing data for each column

Hi,

I think a nice addition to vis_dat() would be the possibility of returning a named vector (or list) containing the names of the columns which have missing data, and the percentage of them. This extra object would not be returned by default, in order not to break existing code which assumes that vis_dat() only returns a ggplot object. However, you could turn it on through an argument.
The same addition would be useful for vis_miss(): in this case maybe adding a last element in the vector/list containing the total percentage of misses may be useful.

I often have to recompute these quantities by hand in my R Markdown reports which use visdat, and it sounds like a waste, given that visdat already computes them.
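
For context, this is roughly what I compute by hand at the moment, in base R:

```r
# Percentage of missing values in each column
pct_miss_col <- colMeans(is.na(airquality)) * 100

# Keep only the columns that have missing data
pct_miss_col[pct_miss_col > 0]

# Overall percentage of missing cells
pct_miss_total <- mean(is.na(airquality)) * 100
```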

Error: cannot convert object to a data frame

Hi. Whenever I try to call either function I get this error:

vis_dat(airquality)
Error: cannot convert object to a data frame
vis_miss(airquality)
Error: cannot convert object to a data frame

I tried with several different data.frames but always get the same error. What's wrong?

Changes to be made in Examples

Examples

As kindly described in the rOpenSci onboarding: addressing review in ropensci/software-review#87

  • Don't need to call library(visdat) in examples (1afb83e)
  • palette is misspelled (31909e2)
  • fix warning messages in vis_compare (6564ea2)
  • fix warning messages in vis_guess (6564ea2)
  • replace purrr::dmap with purrr::map_df (6564ea2)
  • does vis_compare cluster by default?

From what I can see, vis_compare does not cluster by default, that was just due to the example that I gave.

library(visdat)
# make a new dataset of iris that contains some NA values
iris_diff <- iris
iris_diff[sample(x = 1:nrow(iris), 10), sample(1:4,2)] <- NA

vis_compare(iris, iris_diff)
#> vis_compare is in BETA! If you have suggestions or errors
#> post an issue at https://github.com/njtierney/visdat/issues
#> Warning: attributes are not identical across measure variables; they will
#> be dropped

#> Warning: attributes are not identical across measure variables; they will
#> be dropped

Ooops, this reminds me I need to suppress those warnings

fork the repo into "dev" and "master"

in master I just want vis_dat and vis_miss, and in "dev" there will be:

  • vis_guess
  • vis_dat_ly
  • vis_miss_ly
  • vis_compare

Also, I think that vis_dat_ly and vis_miss_ly should be their own function, but they will get triggered inside an option, e.g. vis_miss(data, interactive = TRUE)

implement vis_dat_ly, extending from this code

vis_dat_ly is not working at the moment, for reasons that I don't fully understand, so I'm going to dump the code here so I don't forget it. I would like to avoid unused, untested code in visdat.

#' Produces an interactive visualisation of a data.frame to tell you what it contains.
#'
#' \code{vis_dat_ly} uses plotly to provide an interactive version of vis_dat, providing an at-a-glance plotly object of what is inside a dataframe. Cells are coloured according to what class they are and whether the values are missing.
#'
#' @param x a \code{data.frame}
#'
#' @return a \code{plotly} object
#'
#' @examples
#'
#' \dontrun{
#' # currently does not work, some problems with palettes and other weird messages.
#' vis_dat_ly(airquality)
#'
#'}
#'
#'
vis_dat_ly <- function(x) {

  # x = data.frame(x = 1L:10L,
  #                y = letters[1:10],
  #                z = runif(10))

  # apply the fingerprint function to get the class
  d <- x %>% purrr::dmap(fingerprint) %>% as.matrix()

  # heatmap fails due to not being a numeric matrix
  # heatmap(d)

  # plotly fails due to the number of colours being too many?
  plotly::plot_ly(z = d,
                  type = "heatmap")


}

incorporate "expectation" statements into visdat

Sometimes you are looking for specific features in your data and I think that stating some expectations on the matter could be handy.

For example, you might want to show all the negative 1s in the dataset, so you might write some code like this:

vis_dat(data, expectation = "== -1")

Perhaps there are some better verbs that could be used around this. It shouldn't be too hard to write the code, it'll probably end up being something similar to guess_type in vis_guess, where the data is mutated like so:

x %>%
tidyr_code %>%
mutate(value = expectation) %>%
plotting_code

But I'll need to test it out.

And perhaps more importantly, determine whether "expectation" is the right sorta name for this process, and decide whether this should exist in vis_guess or not.

Or perhaps it's another verb in itself: vis_expect ?

Any thoughts on the matter, people of the internet?

create a vignette

Expand on the README and demonstrate how vis_dat and vis_miss can be used with dplyr, tidyr, etc.

Changes to be made in the vignette

Vignettes

Changes to the vignettes to address reviewer comments in onboarding review

  • Link to vignettes (4351546)
  • introduce airquality at first usage (58529ef)
  • sentence about <0.1% of missingness and >1% missingness (58529ef)
  • Change figure size in experimental vignette: knitr::opts_chunk$set(fig.width = 5, fig.height = 4) (bcb01a7)
  • move visdat_ly() and vis_miss_ly() to the experimental vignette. (4351546)
  • Make sure that the excessive warning messages are fixed
  • Decide whether I really need to discuss the figure from R4DS.

Address changes in functionality

These were kindly provided in the rOpenSci onboarding review - ropensci/software-review#87

  • change purrr::dmap to purrr::map_df to remove warnings (6564ea2)

  • suppress the warning message and add helper function for: (8cd0bfd)

  d$value <- tidyr::gather_(x, "variables", "value", names(x))$value
  • suppress warning message: ("Warning message: attributes are not identical across measure variables; they will be dropped") from the call tidyr::gather_(x, "variables", "value", names(x))$value. This message is not relevant to the end user and should be suppressed. (8cd0bfd)
  • duplication in code: (0a6b459)
dplyr::mutate(rows = seq_len(nrow(.))) %>%
  tidyr::gather_(key_col = "variables",
                 value_col = "valueType",
                 gather_cols = names(.)[-length(.)])
  • remove duplication of flipping rows:
 if (flip == TRUE){
    suppressMessages({
      vis_dat_plot +
        ggplot2::scale_y_reverse() +
        ggplot2::scale_x_discrete(position = "top",
                                  limits = type_order_index) +
        ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.5)) +
        ggplot2::labs(x = "")
    })
  } else if (flip == FALSE){
    vis_dat_plot
  }
  
  • Add helper function for the plots, since they seem to be the same or very similar. Perhaps this could be named something like: create_vis_
  • Put the rows in vis_ and friends as they would be in a dataframe, so it is more intuitive. Note: This was the original use of flip, so that by default the visualisations "read" from top to bottom. This should extend to all plots.
  • Should the rows and columns appear in the same order as the input? If so, sort_type should be off by default.
  • Should the titles at the bottom of the columns make sense?
  • turn if (sort_type == TRUE) into if (sort_type), and if (x == FALSE) into if (!x).
  • Should vis_compare and vis_guess move to a development version?

Affixing colours for R classes in `vis_dat`

So vis_dat uses the default ggplot colours, which are great! But, I wonder if perhaps it might be more informative to use specific colours for specific classes. For example, if the colour red was always associated with characters, and blue was integers, etc.
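
To make the idea concrete, a fixed mapping could be a named character vector like the one below. The colours are placeholders I picked for illustration, not a proposed palette:

```r
# Hypothetical mapping from class to colour; the colours are placeholders
class_colours <- c(
  character = "red",
  integer   = "blue",
  numeric   = "darkgreen",
  factor    = "orange",
  logical   = "purple",
  "NA"      = "grey"
)

# Look up the colour for each column of a data frame
col_classes <- vapply(iris, function(col) class(col)[1], character(1))
unname(class_colours[col_classes])
```

A named vector like this is the form that ggplot2::scale_fill_manual(values = ...) accepts, so each class would keep the same colour no matter which classes appear in a given dataset.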

I'm not sure if there's an established palette for this sort of thing, but I guess I could look into using a nice text editor scheme for a starting point.

Thoughts are very welcome!

Internal error -3 in R_decompress1

Hey, Dude!

I'm getting this error :
Error in get(Info[i, 1], envir = env) : lazy-load database '/usr/local/Cellar/r/3.3.3_1/R.framework/Versions/3.3/Resources/library/tidyr/R/tidyr.rdb' is corrupt
In addition: Warning message:
In get(Info[i, 1], envir = env) : internal error -3 in R_decompress1

Even in the example code :
vis_dat(airquality)

remove pipes

As much as it pains me to say it, I think that in a developer setting like when writing a package, pipes should be avoided, as they tend to produce weird error messages.

The code will still be readable, but it would just be changed like so:

From

d <- 
x %>%
    purrr::dmap(fingerprint) %>%
    mutate(rows = row_number()) %>%
    tidyr::gather_(key_col = "variables",
                   value_col = "valueType",
                   gather_cols = names(.)[-length(.)])

To

d <- purrr::dmap(x, fingerprint)
d <- mutate(d, rows = row_number())
# without the pipe there is no `.`, so pass d explicitly
d <- tidyr::gather_(d,
                    key_col = "variables",
                    value_col = "valueType",
                    gather_cols = names(d)[-length(d)])

Question: error about varying attributes

An error message:

Warning message:
attributes are not identical across measure variables; they will be dropped

Can you explain what's happening here? Is it that some variables have different metadata or classes?
And does "dropped" mean the values become NA, or that they are removed?

Changes to documentation

Documentation

Address suggested changes to documentation, as kindly provided in ropensci/software-review#87

  • correct use of roxygen links, code, emphasis, and seeAlso (\code{\link[visdat]{vis_miss}}) (05ab392)
  • add return value in roxygen, e.g., @return A \code{ggplot} object displaying the type of values in the data frame and the position of any missing values. (05ab392)
  • Update vis_compare documentation, change from list of ideas to description (dbdbb7e)
  • vis_guess, fix first sentence (b1e33d7)
  • Add content to roxygen file for visdat so that ?visdat provides documentation (fb29a39)

create and add `vis_compare`

Just a note to myself so I don't forget.

vis_compare would show the difference between two dataframes. It would be very similar to vis_miss, where you are basically colouring cells according to "match" and "non-match". The code for this would be pretty crazy simple...maybe

x <- 1:10

y <- c(1:5, 10:14)

x == y

returns

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Similar to vis_miss, black could indicate a match, and white a non-match. Although, it would be interesting to maybe use another colour.

One of the challenges with this would be cases where the datasets have different dimensions. One option could be to return as a message the columns that are not in both datasets. Matching rows could be done by row number, just lopping off the trailing ones and spitting out a note. Then, if the user wants, it could use an ID/key to match by.
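
A rough base R sketch of that idea, so I don't forget it. compare_common() is a hypothetical name, rows are matched by position only, and the ID/key matching is left out:

```r
# Hypothetical helper (not part of visdat): compare two data frames that
# may differ in size, using only the shared columns and leading rows.
compare_common <- function(df1, df2) {
  shared_cols <- intersect(names(df1), names(df2))
  unmatched_cols <- setdiff(union(names(df1), names(df2)), shared_cols)
  if (length(unmatched_cols) > 0) {
    message("Columns not in both datasets: ",
            paste(unmatched_cols, collapse = ", "))
  }

  n_rows <- min(nrow(df1), nrow(df2))
  if (nrow(df1) != nrow(df2)) {
    message("Row counts differ; comparing the first ", n_rows, " rows")
  }

  a <- df1[seq_len(n_rows), shared_cols, drop = FALSE]
  b <- df2[seq_len(n_rows), shared_cols, drop = FALSE]

  # A cell matches when both values are NA, or both are present and equal
  cell_match <- function(x, y) {
    both_na <- is.na(x) & is.na(y)
    both_equal <- !is.na(x) & !is.na(y) & as.character(x) == as.character(y)
    both_na | both_equal
  }

  # Logical matrix: one column per shared variable
  mapply(cell_match, a, b)
}
```

The result is a logical matrix of matches that could be plotted in the same way as the missingness matrix in vis_miss.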

missing values can be hidden in the presence of large (enough) N

Sometimes if there is only one cell missing in a large dataset of a few thousand, you cannot see the missing cell.

So I think that a little message for vis_miss and vis_dat that just spits out:

There are X number of missing values in dataset

this could just be

paste("There are", sum(is.na(df)), "number of missing values in dataset")

And perhaps if there are ZERO missing values, it could state that "No missing values found".
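
Putting both cases together, something like this would do. missing_message() is a hypothetical name, and the wording of the message is only a suggestion:

```r
# Hypothetical helper: build the message described above
missing_message <- function(df) {
  n_miss <- sum(is.na(df))
  if (n_miss == 0) {
    "No missing values found"
  } else {
    paste("There are", n_miss, "missing values in dataset")
  }
}

missing_message(airquality)  # "There are 44 missing values in dataset"
missing_message(mtcars)      # "No missing values found"
```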

fix `vis_miss_ly` legend etc

Make vis_miss_ly similar to vis_miss in terms of style / syntax. Would be good to have clustering, etc.

I want vis_dat to be simple, so I'm not sure how I feel about adding a separate plotly graphic - another option would be to have vis_dat gain an argument like vis_dat(data, interactive = TRUE). But I'm not sure if it is better or worse to do it this way.

I'm not sure if I'll continue developing vis_dat_ly, as the colours / legend make it more complicated than just having a missing data legend.

Refuses to build

When building visdat I get the following error:

** preparing package for lazy loading
Error in as_function(.f, ...) : object 'compare_print' not found
Error : unable to load R code in package ‘visdat’
ERROR: lazy loading failed for package ‘visdat’
* removing ‘/usr/local/lib/R/site-library/visdat’
Error: Command failed (1)

I believe that "compare_print" no longer exists?

vis_miss() column %s NA when % = 0.1

I have a column with 0.1% missing, and the column percentage is NA instead of 0.1%. I think the issue is that the case_when() call in label_col_missing_pct doesn't account for the case when x == 0.1; guessing it should be fixed by changing line 80 to >= instead of >. My attempts at a reprex were failing, but happy to try again or submit a PR if either would help!

visdat/R/internals.R

Lines 78 to 82 in de3186b

dplyr::case_when(
x == 0 ~ "0%",
x > 0.1 ~ paste0(x,"%"),
x < 0.1 ~ "<0.1%"
)
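
To illustrate the suggested change without depending on the package internals, here is the same labelling logic sketched in base R with the boundary included. label_pct() is a hypothetical name, not the visdat function:

```r
# Base R sketch of the same labelling logic with the boundary included.
# label_pct() is a hypothetical name, not the visdat internal.
label_pct <- function(x) {
  ifelse(x == 0, "0%",
         ifelse(x >= 0.1, paste0(x, "%"), "<0.1%"))
}

label_pct(c(0, 0.05, 0.1, 2.5))
```

With >= in place of >, a value of exactly 0.1 gets the label "0.1%" rather than falling through every condition.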

Individual cells do not have an individual class

Because R coerces a vector to a single type, you cannot have something like c("a", 1L, 10.555) together as a vector, as it will just convert this to [1] "a" "1" "10.555". This means that you don't get the nice feature of picking up on nuances such as individual cells that are different classes in the dataframe. Perhaps there is a way to read in a csv as a list so that these features are preserved?
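
A quick demonstration of the coercion, and of how a list avoids it:

```r
# An atomic vector is coerced to a single type
class(c("a", 1L, 10.555))
#> [1] "character"

# A list keeps the class of each element
cells <- list("a", 1L, 10.555)
vapply(cells, class, character(1))
#> [1] "character" "integer"   "numeric"
```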

colourblind palette shows false NA values

When using palette = "cb_safe" in vis_dat, it shows missing values where there are none

library(visdat)
library(naniar)
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

# shows missing values where there are none
pedestrian %>% 
  filter(month=="January", 
         year==2016) %>%
  vis_dat(palette = "cb_safe") +
  theme(aspect.ratio=1)

But if you use regular vis_dat, then there are no missing values shown.

# missing values not shown
pedestrian %>% 
  filter(month=="January", 
         year==2016) %>%
  vis_dat() +
  theme(aspect.ratio=1)

Need to dig into what happens here.

Thanks @dicook for pointing this out!

Use readr:::collectorGuess

readr:::collectorGuess("abc", locale_ = readr::locale())
#> [1] "character"
readr:::collectorGuess("1,500", locale_ = readr::locale())
#> [1] "number"
readr:::collectorGuess("2010/10/10", locale_ = readr::locale())
#> [1] "date"

It's currently internal, but if you find it useful, I can export it.
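
For comparison, a rough sketch of guessing a type for every cell using readr's exported `guess_parser()`. `guess_cells()` is a hypothetical helper written for illustration, and it assumes an installed readr version that exports `guess_parser()`; the guessed labels may differ between readr versions, so no output is shown.

```r
library(readr)

# Hypothetical helper: guess a type for each cell of a data frame,
# treating every cell as a length-one character vector.
guess_cells <- function(data) {
  out <- lapply(data, function(col) {
    vapply(
      as.character(col),
      function(cell) guess_parser(cell),
      character(1),
      USE.NAMES = FALSE
    )
  })
  as.data.frame(out, stringsAsFactors = FALSE)
}

messy <- data.frame(
  x = c("abc", "1,500", "2010/10/10"),
  stringsAsFactors = FALSE
)

guess_cells(messy)
```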

Using plotly

I wonder if you might consider using plotly because of core support for interaction - it allows you to hover over a point to get the row number and variable. Plotly has support for plots created in ggplot, but it's not perfect.

Here's my quick hack using the internals from vis_miss. Unfortunately plotly's ggplotly() doesn't yet support geom_raster() or geom_tile(), so you need to do things manually with geom_rect() and do some messing with labels:

library(dplyr)
library(plotly) # also attaches ggplot2

# Long format: one row per cell, TRUE where the value is missing
data <- airquality %>%
  is.na() %>%
  as.data.frame() %>%
  mutate(rows = row_number()) %>%
  tidyr::pivot_longer(cols = -rows, names_to = "variables", values_to = "value") %>%
  mutate(variables = as.factor(variables))

var_levels <- levels(data$variables)

plot <- data %>%
  ggplot(aes(xmin = as.numeric(variables) - 0.5, xmax = as.numeric(variables) + 0.5,
             ymin = rows - 0.5, ymax = rows + 0.5)) +
  # `text` is not a ggplot2 aesthetic; ggplotly() uses it for the hover label
  geom_rect(aes(text = paste(variables, rows, sep = ", "), fill = value)) +
  # x positions are numeric, so label them with a continuous scale
  scale_x_continuous(breaks = seq_along(var_levels), labels = var_levels, expand = c(0, 0)) +
  scale_fill_grey(name = "", labels = c("Present", "Missing")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
  labs(x = "Variables in Dataset", y = "Rows / observations")

plot

ggplotly(plot)

I'm sure one can do better, I'm no plotly expert. It might require making the plots in base plotly rather than using their ggplot conversion mechanism.

Missing Data not listed in legend

When running the example2 through vis_dat, the grey bars indicate missing values, but these are currently not listed as missing values in the legend.

write `vis_dat` and `vis_miss` in `plotly` and give them good names

It's awesome that I can run vis_dat and then ggplotly(), and get an interactive plot.

But, it's really slow to do that, and from what I understand of plotly, it'll be way quicker if it's written in regular plotly code.

Possible names for these interactive plots?

vis_dat_ly? vis_miss_ly?

Or perhaps I include an interactive = TRUE argument inside vis_dat and vis_miss? I think I like this option more.
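
A rough sketch of what that option might look like. `vis_dat_maybe_interactive()` is a hypothetical wrapper, not part of visdat, and it still goes through `ggplotly()`, so it illustrates the interface rather than the speedup from native plotly code.

```r
# Hypothetical wrapper: one function, with interactivity as an argument.
# Assumes the visdat and plotly packages are installed.
vis_dat_maybe_interactive <- function(data, interactive = FALSE) {
  p <- visdat::vis_dat(data)
  if (isTRUE(interactive)) {
    # Convert the ggplot object to an interactive plotly widget
    plotly::ggplotly(p)
  } else {
    p
  }
}
```

Calling `vis_dat_maybe_interactive(airquality, interactive = TRUE)` would then return a plotly widget, while the default keeps the static ggplot.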

Any thoughts, @cpsievert ?

Changes to be made in README

These are the changes to be made to the README as described in rOpenSci onboarding.

  • Introduce airquality at first usage, not after second plot (1985f6d)
  • Fix wording around <0.1% missingness: "When there is <0.1% of missingness, vis_miss indicates that there is >1% missingness." (1985f6d)
  • Link to the vignettes from the README. (1985f6d) (822cc87)
  • Link "experimental" parts of visdat README to "experimental" vignette (1985f6d)
  • Demonstrate differences between vis_dat and vis_guess, by putting them side by side by adding chunk options: fig.show='hold', out.width='50%'. (1985f6d)
  • Elaborate on what vis_dat_ly is doing. (385b52a)
  • Construct a small example of what might happen when visualising expectations. (445b974)
  • Add guidelines for contribution in README, and CONTRIBUTION.md (c7ea5bb)
  • Make sure that the excessive warning messages are fixed

Speeding up visdat

After some discussion with Mike, here are some ways to speed up visdat:

  • Revisit fingerprint - change so that I don't paste in every element (minor speedup)
  • Draw visdat as a series of rectangles with segment lines drawn over the top to show the missing values. This would then require specifying two datasets - one for the coordinates of the rectangles, and one for the positions of the NA rows.
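
The second idea could be sketched roughly as follows, assuming ggplot2. The data frame names `rects` and `segs` are illustrative only, and this is not how visdat currently draws its plots.

```r
library(ggplot2)

df <- airquality
n_rows <- nrow(df)

# Dataset 1: one rectangle per column, coloured by the column's class
rects <- data.frame(
  variable = names(df),
  xmin = seq_along(df) - 0.5,
  xmax = seq_along(df) + 0.5,
  ymin = 0.5,
  ymax = n_rows + 0.5,
  class = vapply(df, function(col) class(col)[1], character(1))
)

# Dataset 2: one horizontal segment per missing cell
na_pos <- which(is.na(df), arr.ind = TRUE)
segs <- data.frame(
  x = na_pos[, "col"] - 0.5,
  xend = na_pos[, "col"] + 0.5,
  y = na_pos[, "row"]
)

p <- ggplot() +
  geom_rect(data = rects,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = class)) +
  geom_segment(data = segs,
               aes(x = x, xend = xend, y = y, yend = y), colour = "grey30") +
  scale_x_continuous(breaks = seq_along(df), labels = names(df)) +
  scale_y_reverse() +
  labs(x = "", y = "Observations")

p
```

The plot then needs only one rectangle per column plus one segment per missing cell, rather than one tile per cell.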

allow `vis_dat` and `vis_miss` to use variable names with spaces

Just a little note here for now. I'll add in some reproducible code soon, but basically I have some new data I read in with readxl::read_excel() that has variables named with spaces in them.

Normally these are handled with backticks - e.g., `a bad variable name`, but it looks like aes_string() doesn't really like that.

As an important feature of visdat is to visualise data when it is hot off the presses, it should be able to handle poorly named variables.
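
One possible way around this, sketched under the assumption of a ggplot2 version that supports the `.data` pronoun inside `aes()`: look the column up by its name as a string, so no backticks are needed. The data frame here is made up for illustration.

```r
library(ggplot2)

# check.names = FALSE keeps the spaces in the column names
df <- data.frame(
  `row id` = 1:3,
  `a bad variable name` = c(1.5, NA, 3),
  check.names = FALSE
)

x_var <- "row id"
y_var <- "a bad variable name"

# .data[[name]] looks up the column by string, spaces and all
p <- ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]])) +
  geom_point(na.rm = TRUE)

p
```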
