nischalshrestha / unravel

A fluent code explorer for R. 🔍

License: Other

R 74.53% CSS 8.30% JavaScript 17.17%

shiny data-science datawrangling r tidyverse dplyr tidyr rstats

unravel's Introduction

Unravel: A fluent code explorer for R.

Unravel, inspect, and explore fluent code in R.

Unravel is an R package / Addin designed to help data scientists understand and explore tidyverse R code which makes use of the fluent interface (function composition via pipes). You can read about the tool in my paper which covers its motivation, design, and results of a user study. Optionally, you can watch the talk I gave at UIST 2021.

NOTE: The package is early in its lifecycle and still under development, but you can install it with:

# install.packages('devtools')
devtools::install_github('nischalshrestha/Unravel')

Usage

With Unravel, you can unravel dplyr or tidyr code, which opens a Shiny app in RStudio. You can then click on the lines to inspect the intermediate outputs (typically dataframes) of the tidyverse code. Both the code and output will be highlighted according to the type of change that occurred (no change, visible change, internal change, error).

Unravel also produces automated function summaries, accessed through the dataframe box. Each function summary (if supported, see below) describes how the function transformed the previous dataframe in terms of dimensions (shape) and whether the changes were visible or internal (e.g. grouping).

You can also perform structural edits to the code via toggles (comment/uncomment) and reorder lines with drag-and-drop interactions.

Demo

The easiest way to use Unravel is to highlight the tidyverse code you want to unravel, then go to Addins -> Unravel code.

This will open the app in the Viewer pane in RStudio by default. If you would rather use your currently chosen browser window, pass viewer = FALSE when invoking Unravel programmatically, as shown below.

This style of coding always starts with a source of data. So, the first expression or line is "locked": you can't enable/disable or reorder it, and other operations can't be moved before it (as shown at the end of the GIF above).

You can also invoke it programmatically using the unravel function by wrapping or piping your code to the function:

library(dplyr)  # provides %>%, group_by() and summarise()

# wrapped
Unravel::unravel(
  mtcars %>%
    group_by(cyl) %>% 
    summarise(mean_mpg = mean(mpg))
)
# piped
mtcars %>%
  group_by(cyl) %>% 
  summarise(mean_mpg = mean(mpg)) %>%
  Unravel::unravel()

Data Details (new)

For any intermediate step, a Data Details view is now available that provides a brief overview of the stats of each variable and some warnings about potential problems such as sneaky missing value representations like -99. This feature was added to provide a glimpse of the characteristics of the data as you examine transformations and their effects:

Other data types

It's also possible to unravel code where steps may produce non-dataframe outputs such as lists or vectors. For example, if we unravel the following code:

mtcars %>%
  names() %>%
  map(~ count(mtcars, .data[[.x]]))

The UI now visualizes lists/vectors as a slimmer, wider rectangle with only the length reported on the left:

The summaries for lists currently report only the number of elements, but in the future they will include more details, especially as we add support for {purrr} functions.

Chain outputs

You can also programmatically collect the intermediate outputs of the tidyverse code into a list structure with get_chain_outputs:

get_chain_outputs(rlang::expr(
  mtcars %>%
    group_by(cyl) %>% 
    summarise(mean_mpg = mean(mpg))
))

which returns:

[[1]]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

[[2]]
# A tibble: 32 x 11
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
...

[[3]]
# A tibble: 3 x 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1

What verbs have summaries?

Currently, any dplyr/tidyr piped code working on single tables will work execution-wise, but only a handful of the functions in each package have explicit support for summaries or have been tested. The summaries are generated by an extension package of the amazing original tidylog package.

The extension adds some enhancements (like a data shape summary for every verb and rephrased summaries) and is specially designed to work with Unravel so that I can access the messages in a convenient cache. All verbs supported by tidylog besides joins will work, along with a few more I added, like arrange and rowwise.

Performance limitations

Unravel currently starts to lag when handling dataframes larger than 100K rows. In the future, I will find ways to optimize the app so it can start up and respond faster for larger datasets. For now, try using Unravel on smaller datasets or on subsets, since the tool is geared more towards learning the tidyverse than towards being a super scalable tool used in 'production' systems.

Contributions

Currently, Unravel is maintained only by me, which means limited capacity to reliably maintain and evolve the project. So please feel free to open issues and suggest changes to improve Unravel!

Please note that the Unravel project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Related tools

There are several other similar tools that provide inspection and/or summary of tidyverse code operations and intermediates, which you may find useful as well:

tidylog: a drop-in solution that logs summaries of steps through message() on console output
ViewPipeSteps: an RStudio Addin that opens up tabs of intermediate outputs
breakerofchains: an RStudio/VSCode Addin that allows inspection of steps through cursor placement in editor
datamations: a framework to generate and visualize pipeline steps through explanations/animations
Tidy Data Tutor: visualizations of tidyverse code focusing on visualizing how each step transforms dataframes

unravel's Issues

External storage

Places to look at and extend:

Implementation for the local filesystem: https://github.com/rstudio/learnr/blob/51b4fcb26f128bc7f8617bd1b04b097ba8a0034c/R/storage.R#L342
Need to extend learnr to accept stepper "submissions": https://github.com/rstudio/learnr/blob/51b4fcb26f128bc7f8617bd1b04b097ba8a0034c/R/storage.R#L120

Reset R/JS state upon a Reset button click

We'd like to be able to go back to the original state of things: a Reset button that restores the original state of the explorable code.

Stepper: text annotation with styling using css

Stepper: custom knitr chunks for code + list of explanations.

It would be ideal for the instructor to be able to write a chunk that's just for the code:


Setup code:

```{python nba_stepper_setup, include=FALSE}
import pandas as pd
import numpy as np
nba = pd.read_csv(r.nba_data_path)
column_names = {
  'Date': 'date',
  'Start (ET)': 'start',
  'Unnamed: 2': 'box',
  'Visitor/Neutral': 'away_team',
  'PTS': 'away_points',
  'Home/Neutral': 'home_team',
  'PTS.1': 'home_points',
  'Unnamed: 7': 'n_ot'
}
```

Code to step through:

```{python nba_code, include=FALSE, eval=FALSE}
nba = (nba.rename(columns=column_names)
  .dropna(thresh=4)
  [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
  .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
  .set_index('date', append=True)
  .rename_axis(["game_id", "date"])
  .sort_index()
)
```

Stepper:

```{r nba_stepper, echo=FALSE}
stepper(
  setup_label = "nba_stepper_setup",
  code_label = "nba_code",
  explanations = list(
    "The column names of `nba` are quite messy, so we first rename the columns with `rename` using our earlier `column_names` dict.",
    "Then, we drop any rows that have fewer than 4 non-NA values."
  )
)
```
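
The second explanation depends on how `dropna(thresh=4)` decides which rows to keep: pandas keeps a row only when it has at least `thresh` non-missing values. Below is a minimal plain-Python sketch of that rule, using `None` as a stand-in for NA. The helper name `drop_sparse_rows` and the sample rows are made up for illustration; they are not part of pandas or the stepper.

```python
def drop_sparse_rows(rows, thresh):
    """Keep rows that have at least `thresh` non-missing (non-None) values."""
    return [row for row in rows if sum(v is not None for v in row) >= thresh]


# Hypothetical rows with five fields each (None marks a missing value).
rows = [
    ("d1", "team_a", 101, "team_b", 99),  # 5 non-missing values: kept
    ("month header", None, None, None, None),  # 1 non-missing value: dropped
    (None, "team_c", 87, "team_d", 92),  # 4 non-missing values: kept
]

kept = drop_sparse_rows(rows, thresh=4)
for row in kept:
    print(row)
```

Note that the threshold counts values that are present, not values that are missing, which is easy to get backwards when writing the explanation text.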

TODOs:

Setup/Code: the setup code and the stepping code
Explanations: thinking of list structure corresponding to relevant lines
Stepper: a Shiny stepper widget that renders the widget

Connector between callout text and code

Ideas:

Stepper: try to modularize the stepper widget

Read: be able to plug in stepper widget that's based on particular knitr chunks.

Examples of multioperations

Explaining method chaining:

tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            "jack"),
        "crown"),
    "jill"
)

In R:

jack_jill %>%
    went_up("hill") %>%
    fetch("water") %>%
    fell_down("jack") %>%
    broke("crown") %>%
    tumble_after("jill")

In Python:

# assuming you own the JackAndJill class:
jack_jill = JackAndJill()
(jack_jill.went_up('hill')
    .fetch('water')
    .fell_down('jack')
    .broke('crown')
    .tumble_after('jill')
)

# but if you don't own the class, as in the case of pandas, you need to use `pipe`
jack_jill = pd.DataFrame()
(jack_jill.pipe(went_up, 'hill')
    .pipe(fetch, 'water')
    .pipe(fell_down, 'jack')
    .pipe(broke, 'crown')
    .pipe(tumble_after, 'jill')
)

src: https://tomaugspurger.github.io/method-chaining
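
The `pipe` pattern above works because `pipe` simply calls the given function with the object as its first argument. Here is a minimal plain-Python sketch of that mechanism. The `Table` class and its `steps` attribute are invented stand-ins for a class you do not own (this is not pandas), and `went_up`/`fetch` are toy functions in the spirit of the example.

```python
class Table:
    """A tiny stand-in for a class whose methods you cannot extend."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def pipe(self, func, *args, **kwargs):
        # Call func with this object as the first argument and return its result.
        return func(self, *args, **kwargs)


def went_up(table, where):
    return Table(table.steps + (f"went_up({where})",))


def fetch(table, what):
    return Table(table.steps + (f"fetch({what})",))


# Chained form: reads top to bottom in execution order.
result = (Table()
    .pipe(went_up, "hill")
    .pipe(fetch, "water")
)

# Nested form: the same computation, read inside out.
nested = fetch(went_up(Table(), "hill"), "water")

print(result.steps)
```

Both forms produce the same sequence of steps; the chained version just lets free functions participate in a fluent chain.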

munging:

import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
         np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df

first        bar                 baz                 foo                 qux          
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

src: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#creating-a-multiindex-hierarchical-index-object

split-apply-combine:

df = pd.DataFrame({'CID':[2,2,3],
                   'FE':[5,5,6],
                   'FID':[1,7,9]})

print (df)
   CID  FE  FID
0    2   5    1
1    2   5    7
2    3   6    9

df = (df.groupby(by=['CID','FE'])['FID']
        .count()
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))

print (df)    
   CID    5    6
0    2  2.0  NaN
1    3  NaN  1.0

src: https://stackoverflow.com/questions/44023770/pandas-getting-rid-of-the-multiindex

Stepper: render markdown text for code summary

As before, for the new stepper cadence, we need to render strings that may contain markdown-formatted text via stepper_text.

learnrhash: use the useful bits for our purposes

Places to look at:

encode/decode: https://github.com/rundel/learnrhash/blob/master/R/encode.R
extracting from question/submission: https://github.com/rundel/learnrhash/blob/master/R/extract.R (we would need an extract_stepper for extracting data from the hash)

Look into bdb/pdb debugger module to simplify Python trace execution

Ideas:
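
One possible direction, sketched with the standard-library `bdb.Bdb` base class: subclass it and override the `user_line` hook to record each line before it runs. The class name `LineRecorder` and the `steps` attribute are hypothetical, and this sketch only traces a small code string rather than a full knitr chunk.

```python
import bdb


class LineRecorder(bdb.Bdb):
    """Record (line number, visible variables) just before each snippet line runs."""

    def __init__(self):
        super().__init__()
        self.steps = []

    def user_line(self, frame):
        # Bdb calls this hook before a line executes; a code string passed to
        # run() is compiled under the filename "<string>", so filter on that.
        if frame.f_code.co_filename != "<string>":
            return
        snapshot = {k: v for k, v in frame.f_locals.items() if not k.startswith("__")}
        self.steps.append((frame.f_lineno, snapshot))


snippet = "x = 1\ny = x + 1\nz = y * 2\n"
recorder = LineRecorder()
namespace = {}
recorder.run(snippet, namespace)

for lineno, variables in recorder.steps:
    print(lineno, variables)
```

Each recorded snapshot shows the state before the line executes, so the effect of a line appears in the next step's snapshot (or in `namespace` for the final line).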

Stepper: don't put `render()` calls inside `observeEvent`

Summary generation for tidyverse verb action

Make a tidylog-like summary, but enhanced with annotations and better text/info.

TODO:

capture tidylog verb summary
annotate [row x col] change text
generate the data tidylog(data) summary by default (if indeed the line is the data line)
helper function to color text with data change semantics
annotate column change text
annotate standalone row or column number change

Stepper: use an arrow on the left so that on-demand highlighting works better visually.

User-activated decorations for focusing on text/snippet.

The idea behind this is that instructors need to highlight certain parts of the text explanation, the code snippet, or the output. For example, suppose there are distinct terms X, Y, and Z to highlight across code (action) <-> output (effect) <-> explanation. It would be useful to accentuate the decoration for X while muting the decorations for Y and Z. This would help highlight the parts that matter when the user activates the related items.

Stepper: highlight code syntax on-demand (either on hover/click on code text).

knitr hooks: try to use reticulate capabilities as much as possible

E.g., could we just get the last_value somehow instead of doing the extra work of parsing text, eval'ing, getting the last line, etc.?

Places to look:

https://github.com/rstudio/reticulate/blob/a8701771fe759161c76a8e4a81d4ccc506269de1/R/utils.R#L130

Bug where reordering lines while all lines are toggled off causes crash

Refactor: the pretty print knitr hook for dataframes.

Currently, the knitr custom hook for pretty printing Python dataframes is pretty disorganized and fragile. I propose the following:

resolve #19 and #7
use the new solution to detect type of object for last expression
add tests

Handle wizard of oz for Python data structures pretty print

Some things I'd like to handle/include in pprints:

MultiIndex (useful for (un)stack and pivot/pivot_table agg operations); multiple row names

E.g., this block of code eventually sets the index, which makes the df a MultiIndex one with the other columns raised up (as if grouped).
.rename_axis(["game_id", "date"]) introduces two named row names

nba = (nba.rename(columns=column_names)
  .dropna(thresh=4)
  [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
  .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
  .set_index('date', append=True)
  .rename_axis(["game_id", "date"])
  .sort_index()
)

Plot stepper

See if you can make headway with hooking into python engine for plotly output

Improve authoring interface for explanations to instead use `block2` chunks

It feels odd to have to write the explanation text in strings, especially since nothing distinguishes the code bits from the narrative bits.

We could instead write knitr chunks that use block2 and contain information for callouts and other things like the "starting expression".

JS: Explore "cells" of tidyverse verbs

Todos:

reorder
add/remove or enable/disable
summary box
dim/brighten on enable/disable
data prompt tooltip
code text annotation
update summary box on rearrange/enable/disable
update pipes on rearrange/enable/disable
update outputs on rearrange/enable/disable

Stepper: make it work for R too.

Upon detecting the language engine, the same functionality (parse, execute, store outputs, and stepping) should work with R as well.

This could also be easier to do because R is easier to introspect and output rendering is less of a worry.

Condense data diffs for larger datasets

Ideas:

Hover tooltip uses markdown text so we can include things like images and links.

Stepper: move event recorder setup to stepper.R itself

Stepper: explanations to support markdown/custom HTML.

It would be nice to be able to customize the text within explanation summaries.

E.g.:

"The column names of `nba` are quite messy, so we first rename the columns with `rename` using our earlier `column_names` dict."

<div id="rename-hint">
**Note:** dictionaries (dict) are closest to R's environment structure in which we have keys to value mappings. This is a very common data structure in Python that can be handy for renaming columns such as above.
</div>

One can imagine us embedding custom elements within the explanations, so it seems like a good idea to support that from the get-go.

Error state omits empty spaces and lacks error message

E.g., in

diamonds %>%
  select(-everything()) %>%
  group_by(color)

we get this:

This might be specific to group_by because it is attempting to group a tibble with 0 columns and failing. Strangely, the error message does not ripple up to the frontend either, even though I do see it in the console log:

$ :List of 4
  ..$ line  : int 3
  ..$ code  : chr "\tgroup_by(color) %>%"
  ..$ change: chr "error"
  ..$ err   : chr "<strong>Error:</strong> Must group by variables found in `.data`.\n* Column `color` is not found."

Consider using {reactable} instead

{reactable} has some nice hooks for JS / CSS styling and might be more performant than {kableExtra}, so consider using it for our tables: https://glin.github.io/reactable/articles/conditional-styling.html

reactable also would allow us to style header on data diff tables: https://glin.github.io/reactable/articles/examples.html#header-rendering

Stepper: switch to new code summary function

Switch to the new way of producing the summary, based on a function that takes a variable number of character or callout objects. This will also clean up the summary input on the stepper via explain, since we won't have to use inline R functions. Instead, we use the functions directly, very similar to callout_ for dataframes.

Use python ast and astunparse to eval last expression for last value

Currently, we are using an extremely hacky function, get_last_line, which is not going to be robust.

Instead, let's use ast + astunparse to get the very last ast.Expr and evaluate it as the "last result".

This will help to fix the following issues:

#10 (use a small python script to return last Expr from code string)
#6 (we won't need to manually handle comments etc.)
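
A minimal sketch of that idea, assuming the chunk source is available as a string. It uses only the standard-library `ast` module and compiles the AST directly, so `astunparse` is not needed for this variant. `run_and_get_last_value` is a hypothetical helper name, not an existing function in the project.

```python
import ast


def run_and_get_last_value(source, namespace=None):
    """Execute `source`; return the value of its last expression statement, if any."""
    namespace = {} if namespace is None else namespace
    tree = ast.parse(source)

    last_expr = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Detach the trailing expression so it can be evaluated separately.
        last_expr = ast.Expression(body=tree.body.pop().value)

    # Run everything before the trailing expression (assignments, imports, ...).
    exec(compile(tree, "<chunk>", "exec"), namespace)
    if last_expr is None:
        return None  # the chunk ends in a statement, so there is no "last value"
    return eval(compile(last_expr, "<chunk>", "eval"), namespace)


chunk = """
column_names = {'a': 'x',
  'b': 'y'}
renamed = (
  sorted(column_names.values())
)
renamed
#
"""
value = run_and_get_last_value(chunk)
print(value)
```

Because the parser works on statements rather than text lines, multi-line expressions and trailing comment lines need no special handling here.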

Lesson for selecting/deriving data via tidyverse

Use three functions for isolating data within a table:

select()
filter()
arrange()

Use three functions for deriving new data from a table:

summarise()
group_by()
mutate()

source: RStudio Cloud Primers: Work with data

Bug for invalid order of verbs

The app does not gracefully handle invalid orders, e.g.:

  select(...) %>%
diamonds

Text tooltips

Stepper: syntax highlight the code text

For this to work, we need to make sure that we highlight the code text in the Stepper shiny module. There are 2 viable options:

use the prism.js library with Prism.highlightAll()
even better would be to somehow make use of highlight.js (but this is not working currently)

Detect columns used in a verb

Be able to detect columns used in a verb so we can allow data transparency of them in UI as well.

Hyperlink code text for functions to trigger help pages

~~Check out https://github.com/r-lib/downlit.~~

Allow user to invoke the help page for a particular function within the code in Unravel.

Shiny: hook up R with JS

TODO

Integrate the poc codemirror stepper + twin highlight callouts into stepper function

POC for these work:

codemirror editor in Shiny
codemirror with gutter markers in Shiny
linked highlights for text -> code
linked highlights for code -> text
function to specify these callouts
final integration into stepper

Sticky header does not work yet for kableExtra

Stepper: don't show paginated tables since it's confusing

With Pagination on tables, we have too many next/previous going on, so remove it for stepper case.

Last line as # needs to be handled for Python

Ex:

```{python ex-nba-load, exercise=TRUE, exercise.setup="pew_setup", opts.label="pprinter"}
nba
#

```
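
One way to make a trailing `#` line (and, similarly, a trailing bare string block) harmless is to pick the last expression from the parsed AST rather than from the raw text. The sketch below uses only the standard-library `ast` module. `last_expression_source` is a hypothetical helper name, and skipping bare string literals is a design choice made here, not existing project behavior.

```python
import ast


def last_expression_source(source):
    """Return the source text of the last expression statement, or None.

    Comments never reach the AST, and trailing bare string literals
    (docstring-style blocks) are skipped by choice.
    """
    for node in reversed(ast.parse(source).body):
        if not isinstance(node, ast.Expr):
            return None  # the chunk ends in a statement, e.g. an assignment
        value = node.value
        is_bare_string = isinstance(value, ast.Constant) and isinstance(value.value, str)
        if not is_bare_string:
            return ast.get_source_segment(source, node)
    return None


print(last_expression_source("nba\n#\n"))
print(last_expression_source('nba\n"""notes"""\n'))
print(last_expression_source("x = 1\n# done\n"))
```

The first two calls recover `nba` despite the trailing comment or string, while the third reports that there is no expression to display.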

Cover:

#
""" docstrings

Same table output for Python and R

Right now, I'm a bit iffy on this. The pro is that we work with a similar-looking presentation, but the con is that we are sort of "lying" about the true repr.

I think we could go for this to get a consistent look and feel for dataframes, while still being honest about the repr for components of the data (indexes, NaNs, etc.).

Some challenges:

groupby object repr

DataFrame callouts

TODO for reactable:

columns highlighting
column labels highlighting

Stepper: stepper function should accept an expand param for expanded vs condensed layout

This is to make it easy to write in 2 different styles depending on author's desire of compact vs spread versions of stepper.

Stepper: automatically retrieve stepper setup and code chunks

kableExtra issue rendering on `set_index`

Stack:

Warning: Error in UseMethod: no applicable method for 'mutate_' applied to an object of class "c('pandas.core.frame.DataFrame', 'pandas.core.generic.NDFrame', 'pandas.core.base.PandasObject', 'pandas.core.accessor.DirNamesMixin', 'pandas.core.base.SelectionMixin', 'pandas.core.indexing.IndexingMixin', 'python.builtin.object')"
  95: mutate_
  94: mutate.default
  92: function_list[[i]]
  91: freduce
  90: _fseq
  89: eval
  88: eval
  86: %>%
  85: python_df [/Users/nischal/Documents/rstudio/DataTutor/R/utils.R#64]
  84: kable_pandas [/Users/nischal/Documents/rstudio/DataTutor/R/kable_pandas.R#32]
  83: df_kable [/Users/nischal/Documents/rstudio/DataTutor/R/stepper.R#305]
  82: output$nba_stepper-line_table [/Users/nischal/Documents/rstudio/DataTutor/R/stepper.R#410]
   3: <Anonymous>
   1: rmarkdown::run

Multi-line code is failing to be parsed and evaluated properly

I believe this was happening because of the hacky solution for getting the "last expression/value", which should be replaced with a more robust approach, either via reticulate or a Python script.

Ex:

column_names = {'Date': 'date', 'Start (ET)': 'start',
  'Unnamed: 2': 'box', 'Visitor/Neutral': 'away_team',
  'PTS': 'away_points', 'Home/Neutral': 'home_team',
  'PTS.1': 'home_points', 'Unnamed: 7': 'n_ot'}
nba = (
  nba.rename(columns=column_names)
  .dropna(thresh=4)
  [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
  .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
  .set_index('date', append=True)
  .rename_axis(["game_id", "date"])
  .sort_index()
)
nba

Better data diff that's more representative of Python dfs

This could be solved with the python version of Daff, but it looked a bit clunky: https://github.com/paulfitz/daff/blob/a2a013cef4c842bef1e8babacdc99fc6e5a1be07/scripts/example.py

Stepper: better layout to accommodate dynamic text summary.

The problem with placing the text summary in the center is that the size of the area is fixed, so a long text would overflow. In the dynamic case, it's also annoying because it would keep pushing the output downwards.

An alternate layout would place code and output right on top of each other, almost fused as one entity (see this and this). The output area can then be fixed for dataframe outputs, and for other types of outputs as well.
