unravel's Introduction

Unravel: A fluent code explorer for R.

Unravel, inspect, and explore fluent code in R.

Unravel is an R package / Addin designed to help data scientists understand and explore tidyverse R code which makes use of the fluent interface (function composition via pipes). You can read about the tool in my paper which covers its motivation, design, and results of a user study. Optionally, you can watch the talk I gave at UIST 2021.

NOTE: The package is early in its lifecycle and still under development, but you can install it with:

# install.packages('devtools')
devtools::install_github('nischalshrestha/Unravel')

Usage

With Unravel, unraveling dplyr or tidyr code opens a Shiny app in RStudio. You can then click on the lines to inspect the intermediate outputs (typically dataframes) of the tidyverse code. Both the code and the output are highlighted according to the type of change that occurred (no change, visible change, internal change, error).

Unravel also produces automated function summaries, accessed through the dataframe box. Each function summary (if supported; see below) describes how the function transformed the previous dataframe in terms of dimensions (shape) and whether the changes were visible or internal (e.g. grouping).

You can also make structural edits to the code by toggling lines (comment/uncomment) and reordering them with drag-and-drop.

Demo

The easiest way to use Unravel is to highlight the tidyverse code you want to unravel, then go to Addins -> Unravel code.

Demo of Unravel showing a user highlighting code, clicking on Addins and selecting Unravel. The user then interacts with the app by clicking lines, toggling and reordering lines.

This opens the app in the RStudio Viewer pane by default. To open it in your currently chosen browser instead, pass viewer = FALSE using the programmatic approach shown below.

This style of coding always involves starting with a source of data. So, the first expression or line is "locked" such that you can't enable/disable or reorder it and other operations can't be reordered before the first line (as shown at the end of the GIF above).
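
The locking rule can be written down as a small check. The Python sketch below only illustrates the rule and is not Unravel's implementation; `move_step` and its arguments are hypothetical names.

```python
def move_step(steps, src, dst):
    """Return a new list with the step at index `src` moved to index `dst`.

    Index 0 holds the data source and is treated as locked: it cannot be
    moved, and no other step can be moved in front of it.
    """
    if not (0 <= src < len(steps)) or not (0 <= dst < len(steps)):
        raise IndexError("step index out of range")
    if src == 0 or dst == 0:
        raise ValueError("the first line (data source) is locked")
    reordered = list(steps)          # leave the caller's list untouched
    step = reordered.pop(src)
    reordered.insert(dst, step)
    return reordered


pipeline = ["mtcars", "group_by(cyl)", "summarise(mean_mpg = mean(mpg))"]
print(move_step(pipeline, 2, 1))
# ['mtcars', 'summarise(mean_mpg = mean(mpg))', 'group_by(cyl)']
```

Rejecting both `src == 0` and `dst == 0` covers the two halves of the rule: the data line itself stays put, and nothing else can be dropped ahead of it.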

You can also invoke it programmatically using the unravel function by wrapping or piping your code to the function:

# wrapped
Unravel::unravel(
  mtcars %>%
    group_by(cyl) %>% 
    summarise(mean_mpg = mean(mpg))
)
# piped
mtcars %>%
  group_by(cyl) %>% 
  summarise(mean_mpg = mean(mpg)) %>%
  Unravel::unravel()

Data Details (new)

For any intermediate step, a Data Details view is now available that provides a brief overview of the stats of each variable and some warnings about potential problems such as sneaky missing value representations like -99. This feature was added to provide a glimpse of the characteristics of the data as you examine transformations and their effects:

Demo of Unravel's Data Details view where clicking on the tab opens a view that shows stats for each variable in the intermediate output as well as potential data quality issues such as missing values.
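
As a rough illustration of the kind of check behind the "sneaky missing value" warning, the Python sketch below scans columns for common sentinel codes. The sentinel list and the `find_sentinels` helper are assumptions made for this example; the checks Unravel actually runs are not described here.

```python
# Hypothetical sentinel codes, chosen for illustration only.
SENTINELS = (-99, -999, 9999)


def find_sentinels(columns, sentinels=SENTINELS):
    """Map each column name to the sentinel values it contains and their counts.

    `columns` is a dict of column name -> list of values.
    """
    report = {}
    for name, values in columns.items():
        counts = {}
        for value in values:
            # Skip non-numeric values; bools are excluded because True == 1.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            if value in sentinels:
                counts[value] = counts.get(value, 0) + 1
        if counts:
            report[name] = counts
    return report


data = {"age": [34, -99, 51, -99], "city": ["Oslo", "Lima", "Pune", "Rome"]}
print(find_sentinels(data))
# {'age': {-99: 2}}
```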

Other data types

It's also possible to unravel code where steps may produce non-dataframe outputs such as lists or vectors. For example, if we unravel the following code:

mtcars %>%
  names() %>%
  map(~ count(mtcars, .data[[.x]]))

The UI now visualizes lists/vectors as a slimmer, wider rectangle with only the length reported on the left:

The summaries for lists currently only report the number of elements, but in the future will include more details especially as we add support for {purrr} functions.

Chain outputs

You can also programmatically collect the intermediate outputs of the tidyverse code into a list structure with get_chain_outputs:

get_chain_outputs(rlang::expr(
  mtcars %>%
    group_by(cyl) %>% 
    summarise(mean_mpg = mean(mpg))
))

which returns:

[[1]]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

[[2]]
# A tibble: 32 x 11
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
...

[[3]]
# A tibble: 3 x 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1
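
The idea of evaluating a chain one step at a time and keeping every intermediate result can be sketched in a few lines. The Python below is an illustration of that idea only, not the package's R implementation; `collect_chain_outputs`, `group_by_cyl`, and `mean_mpg` are hypothetical helpers, and the rows are the first three cars from the output above.

```python
from statistics import mean


def collect_chain_outputs(data, steps):
    """Apply each function in `steps` in order and return the starting data
    followed by the output of every step."""
    outputs = [data]
    for step in steps:
        outputs.append(step(outputs[-1]))
    return outputs


def group_by_cyl(rows):
    """Group row dicts by their `cyl` value."""
    groups = {}
    for row in rows:
        groups.setdefault(row["cyl"], []).append(row)
    return groups


def mean_mpg(groups):
    """Summarise each group by the mean of its `mpg` values."""
    return {cyl: mean(r["mpg"] for r in rows) for cyl, rows in sorted(groups.items())}


cars = [
    {"name": "Mazda RX4", "cyl": 6, "mpg": 21.0},
    {"name": "Datsun 710", "cyl": 4, "mpg": 22.8},
    {"name": "Hornet 4 Drive", "cyl": 6, "mpg": 21.4},
]

for output in collect_chain_outputs(cars, [group_by_cyl, mean_mpg]):
    print(output)
```

As with the R output, the first element is the untouched source data, so the list always has one more entry than there are steps.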

What verbs have summaries?

Currently, any dplyr/tidyr piped code working on single tables will work execution-wise, but only a handful of the functions in each package have explicit support for summaries and have been tested. The summaries are generated by an extension package of the amazing original tidylog package.

The extension adds some enhancements (like a data shape summary for every verb and rephrased summaries) and is specially designed to work with Unravel so that I can access the messages in a convenient cache. All verbs supported by tidylog besides joins will work, along with a few more I added, like arrange and rowwise.

Performance limitations

Unravel currently starts to lag when handling dataframes larger than 100K rows. In the future, I will find ways to optimize the app so it can start up and respond faster for larger datasets. For now, try using Unravel on smaller datasets or on subsets, since the tool is geared more towards learning the tidyverse than towards being a highly scalable tool used in 'production' systems.

Contributions

Currently Unravel is only maintained by me, and that means limited capacity to reliably maintain and evolve the project. So, please feel free to open up issues, and suggest changes to improve Unravel!

Please note that the Unravel project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Related tools

There are several other similar tools that provide inspection and/or summary of tidyverse code operations and intermediates, which you may find useful as well:

  • tidylog: a drop-in solution that logs summaries of steps through message() on console output
  • ViewPipeSteps: an RStudio Addin that opens up tabs of intermediate outputs
  • breakerofchains: an RStudio/VSCode Addin that allows inspection of steps through cursor placement in editor
  • datamations: a framework to generate and visualize pipeline steps through explanations/animations
  • Tidy Data Tutor: visualizations of tidyverse code focusing on visualizing how each step transforms dataframes

unravel's People

Contributors

myko101, nischalshrestha

unravel's Issues

Stepper: custom knitr chunks for code + list of explanations.

It would be ideal for the instructor to be able to write a chunk that's just for the code:


Setup code:

```{python nba_stepper_setup, include=FALSE}
import pandas as pd
import numpy as np
nba = pd.read_csv(r.nba_data_path)
column_names = {
  'Date': 'date',
  'Start (ET)': 'start',
  'Unamed: 2': 'box',
  'Visitor/Neutral': 'away_team',
  'PTS': 'away_points',
  'Home/Neutral': 'home_team',
  'PTS.1': 'home_points',
  'Unamed: 7': 'n_ot'
}
```

Code to step through:

```{python nba_code, include=FALSE, eval=FALSE}
nba = (nba.rename(columns=column_names)
  .dropna(thresh=4)
  [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
  .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
  .set_index('date', append=True)
  .rename_axis(["game_id", "date"])
  .sort_index()
)
```

Stepper:

```{r nba_stepper, echo=FALSE}
stepper(
  setup_label = "nba_stepper_setup",
  code_label = "nba_code",
  explanations = list(
    "The column names of `nba` are quite messy, so we first rename them with `rename` using our earlier `column_names` dict.",
    "Then, we drop any rows that have fewer than 4 non-NA values."
  )
)
```

TODOs:

  • Setup/Code: the setup code and the stepping code
  • Explanations: thinking of list structure corresponding to relevant lines
  • Stepper: a Shiny stepper widget that renders the widget

Examples of multioperations

Explaining method chaining:

tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            jack),
        "crown"),
    "jill"
)

In R:

jack_jill %>%
    went_up("hill") %>%
    fetch("water") %>%
    fell_down("jack") %>%
    broke("crown") %>%
    tumble_after("jill")

In Python:

# assume you own the JackAndJill object:
jack_jill = JackAndJill()
(jack_jill.went_up('hill')
    .fetch('water')
    .fell_down('jack')
    .broke('crown')
    .tumble_after('jill')
)

# but if you don't own the object, as in the case of pandas, you need to use `pipe`
jack_jill = pd.DataFrame()
(jack_jill.pipe(went_up, 'hill')
    .pipe(fetch, 'water')
    .pipe(fell_down, 'jack')
    .pipe(broke, 'crown')
    .pipe(tumble_after, 'jill')
)

src: https://tomaugspurger.github.io/method-chaining

munging:

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
         np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

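# `index` was not defined in the snippet; build it from `arrays` with the level names shown in the output
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)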
df

first        bar                 baz                 foo                 qux          
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

src: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#creating-a-multiindex-hierarchical-index-object

split-apply-combine:

df = pd.DataFrame({'CID':[2,2,3],
                   'FE':[5,5,6],
                   'FID':[1,7,9]})

print (df)
   CID  FE  FID
0    2   5    1
1    2   5    7
2    3   6    9

# wrap the chain in parentheses so it can span multiple lines
df = (df.groupby(by=['CID','FE'])['FID']
        .count()
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1)
)

print (df)    
   CID    5    6
0    2  2.0  NaN
1    3  NaN  1.0

src: https://stackoverflow.com/questions/44023770/pandas-getting-rid-of-the-multiindex

Summary generation for tidyverse verb action

Make a tidylog-like summary, enhanced with annotations and better text/info.

TODO:

  • capture tidylog verb summary
  • annotate [row x col] change text
  • generate the tidylog(data) summary by default when the line is the data line
  • helper function to color text with data change semantics
  • annotate column change text
  • annotate standalone row or column number change

User-activated decorations for focusing on text/snippet.

The idea is that instructors need to highlight certain parts of the text explanation, the code snippet, or the output. For example, suppose there are distinct terms X, Y, and Z to highlight across code (action) <-> output (effect) <-> explanation. It would be useful to accentuate the decoration for X while muting the decorations for Y and Z. This would help highlight the parts that matter when the user activates the related items.

Handle wizard of oz for Python data structures pretty print

Some things I'd like to handle/include in pprints:

  • MultiIndex (useful for (un)stack and pivot/pivot_table agg operations); multiple row names

e.g. this block of code eventually sets the index, which gives the df a MultiIndex with the other columns raised up (as if grouped).
.rename_axis(["game_id", "date"]) introduces two named row names

nba = (nba.rename(columns=column_names)
    .dropna(thresh=4)
    [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
    .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
    .set_index('date', append=True)
    .rename_axis(["game_id", "date"])
    .sort_index()
)
  • grouped dfs
  • meta info like # rows etc
  • retain NaN for empty values
  • convert data properly
    • NAs
    • ints / doubles / floats / strings
    • pd.datetime64

Plot stepper

See if you can make headway with hooking into python engine for plotly output

JS: Explore "cells" of tidyverse verbs

Todos:

  • reorder
  • add/remove or enable/disable
  • summary box
  • dim/brighten on enable/disable
  • data prompt tooltip
  • code text annotation
  • update summary box on rearrange/enable/disable
  • update pipes on rearrange/enable/disable
  • update outputs on rearrange/enable/disable

Stepper: make it work for R too.

Once the language engine is detected, the same functionality (parse, execute, store outputs, and stepping) should work with R as well.

This could also be easier to do, because R is easier to introspect and output rendering is less of a concern.

Stepper: explanations to support markdown/custom HTML.

It would be nice to be able to customize the text within explanation summaries.

E.g.:

"The column names of `nba` are quite messy so, we first rename column names with `rename` using our earlier `column_names` dict."

<div id="rename-hint">
**Note:** dictionaries (dict) are closest to R's environment structure in which we have keys to value mappings. This is a very common data structure in Python that can be handy for renaming columns such as above.
</div>

One can imagine us embedding custom elements within the explanations, so it seems like a good idea to support that from the get-go.

Error state omits empty spaces and lacks error message

For example, in

diamonds %>%
  select(-everything()) %>%
  group_by(color) 

we get this:

Screen Shot 2021-03-07 at 2 54 47 PM

This might be specific to group_by, because it attempts to group a zero-column tibble and fails. Strangely, the error message does not propagate to the frontend either, even though I do see it in the console log:

$ :List of 4
  ..$ line  : int 3
  ..$ code  : chr "\tgroup_by(color) %>%"
  ..$ change: chr "error"
  ..$ err   : chr "<strong>Error:</strong> Must group by variables found in `.data`.\n* Column `color` is not found."

Stepper: switch to new code summary function

Switch to the new way of producing the summary, based on a function that takes a variable number of character or callout objects. This will also clean up the summary input on the stepper via explain, since we won't have to use R inline functions. Instead, we use the functions directly, very similar to callout_ for dataframes.
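
To make the shape of such a function concrete, here is a minimal Python sketch of a summary builder that accepts any mix of plain strings and callout objects. It is an assumption-laden illustration, not the stepper's code: the `Callout` class, its `change` field, the `code_summary` name, and the HTML markup are all hypothetical.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Callout:
    """A piece of summary text tied to a kind of change (hypothetical shape)."""
    text: str
    change: str  # e.g. "visible", "internal", "none"


def code_summary(*parts):
    """Join plain strings and Callout objects into one HTML summary string."""
    rendered = []
    for part in parts:
        if isinstance(part, Callout):
            rendered.append(
                f'<span class="callout {escape(part.change)}">{escape(part.text)}</span>'
            )
        elif isinstance(part, str):
            rendered.append(escape(part))
        else:
            raise TypeError(f"unsupported summary part: {type(part).__name__}")
    return "".join(rendered)


print(code_summary("group_by() added ", Callout("1 grouping variable", "internal"), " (cyl)"))
# group_by() added <span class="callout internal">1 grouping variable</span> (cyl)
```

Taking the parts as positional arguments keeps call sites close to how the sentence reads, which is the main appeal over building the summary with inline expressions.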

Use python ast and astunparse to eval last expression for last value

Currently, we are using an extremely hacky function get_last_line which is not going to be robust.

Instead, let's use ast + astunparse to get the very last ast.Expr to evaluate as the "last result"

This will help to fix the following issues:

  • #10 (use a small python script to return last Expr from code string)
  • #6 (we won't need to manually handle comments etc.)
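
A minimal sketch of this approach, using only the standard-library `ast` module, is shown below. It compiles the trailing `ast.Expr` node directly instead of unparsing it back to text, so `astunparse` is not needed here (Python 3.9+ also ships `ast.unparse` if source text is wanted). The function name `run_and_get_last_value` is hypothetical.

```python
import ast


def run_and_get_last_value(source, env=None):
    """Execute `source` and return the value of its last top-level expression.

    Returns None when the code is empty or ends with a statement
    (assignment, import, ...) rather than an expression.
    """
    env = {} if env is None else env
    tree = ast.parse(source)  # comments and blank lines are dropped by the parser
    if not tree.body:
        return None
    *leading, last = tree.body
    if isinstance(last, ast.Expr):
        # Run everything before the last expression, then evaluate it separately.
        exec(compile(ast.Module(body=leading, type_ignores=[]), "<code>", "exec"), env)
        return eval(compile(ast.Expression(body=last.value), "<code>", "eval"), env)
    exec(compile(tree, "<code>", "exec"), env)
    return None


code = """
# multi-line literals and comments need no special handling
values = {
    'a': 1,
    'b': 2,
}
total = sum(values.values())
total  # last expression
"""
print(run_and_get_last_value(code))
# 3
```

Because the split happens on parsed statements rather than on text lines, a final expression that spans several lines is still picked up as a single unit.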

Stepper: syntax highlight the code text

For this to work, we need to make sure that we highlight the code text in the Stepper shiny module. There are 2 viable options:

  • use the prism.js library with Prism.highlightAll()
  • even better would be to somehow make use of highlight.js (but this is not working currently)

Shiny: hook up R with JS

TODO

  • basic mockup of JS + R
  • dynamically create verb editors
  • user input 🎉
  • dynamically render row/col
  • handle updates after toggle updates
  • handle updates after reorder updates

Same table output for Python and R

Right now, I'm a bit iffy on this. The pro is that we work with a similar-looking presentation, but the con is that we are sort of "lying" about the true repr.

I think we could go for supporting this for a consistent look and feel of dataframes, while still being honest about the repr for components of the data (indexes, NaNs, etc.)

Some challenges:

  • groupby object repr

kableExtra issue rendering on `set_index`

Screen Shot 2020-11-22 at 8 40 51 PM

Stack:

Warning: Error in UseMethod: no applicable method for 'mutate_' applied to an object of class "c('pandas.core.frame.DataFrame', 'pandas.core.generic.NDFrame', 'pandas.core.base.PandasObject', 'pandas.core.accessor.DirNamesMixin', 'pandas.core.base.SelectionMixin', 'pandas.core.indexing.IndexingMixin', 'python.builtin.object')"
  95: mutate_
  94: mutate.default
  92: function_list[[i]]
  91: freduce
  90: _fseq
  89: eval
  88: eval
  86: %>%
  85: python_df [/Users/nischal/Documents/rstudio/DataTutor/R/utils.R#64]
  84: kable_pandas [/Users/nischal/Documents/rstudio/DataTutor/R/kable_pandas.R#32]
  83: df_kable [/Users/nischal/Documents/rstudio/DataTutor/R/stepper.R#305]
  82: output$nba_stepper-line_table [/Users/nischal/Documents/rstudio/DataTutor/R/stepper.R#410]
   3: <Anonymous>
   1: rmarkdown::run

Multi-line code is failing to be parsed and evaluated properly

I believe this was happening because of the hacky solution to get the "last expression/value" and should be updated with a more robust way to get that either via reticulate or python script.

Ex:

column_names = {'Date': 'date', 'Start (ET)': 'start',
  'Unamed: 2': 'box', 'Visitor/Neutral': 'away_team',
  'PTS': 'away_points', 'Home/Neutral': 'home_team',
  'PTS.1': 'home_points', 'Unamed: 7': 'n_ot'}
nba = (
  nba.rename(columns=column_names)
  .dropna(thresh=4)
  [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
  .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
  .set_index('date', append=True)
  .rename_axis(["game_id", "date"])
  .sort_index()
)
nba
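
One possible way to make multi-line constructs like these robust is to let Python's own parser find the statement boundaries instead of working line by line. The sketch below is an illustration under that assumption; `split_statements` is a hypothetical helper, and the sample source is a shortened version of the example above that is only parsed, never executed.

```python
import ast


def split_statements(source):
    """Return the source text of each top-level statement, in order."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


source = '''column_names = {'Date': 'date', 'Start (ET)': 'start',
  'PTS': 'away_points'}
nba = (
  nba.rename(columns=column_names)
  .dropna(thresh=4)
  .sort_index()
)
nba
'''

for statement in split_statements(source):
    lines = statement.splitlines()
    print(f"{len(lines)} line(s), starting with: {lines[0]}")
```

Each multi-line dict or parenthesized chain comes back as one statement, so the trailing `nba` can be identified as the last expression without any manual line handling.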

Stepper: better layout to accommodate dynamic text summary.

The problem with the way text summary is placed in the center is that the size of the area is fixed. This is a problem because what if text takes up a lot of space? It would overflow. In the dynamic case, it's also annoying because it would keep pushing the output downwards.

An alternate layout would be code and output are right on top of each other and almost fused as one entity (see this and this). The output df can then be fixed for dataframe outputs and other types of outputs can also be fixed.
