vjd / satchel Goto Github PK
View Code? Open in Web Editor NEWcarry your data with you
Home Page: http://devinpastoor.com/satchel
License: Other
carry your data with you
Home Page: http://devinpastoor.com/satchel
License: Other
Satchel =========== [![Travis build status](https://travis-ci.org/dpastoor/satchel.svg?branch=master)](https://travis-ci.org/dpastoor/satchel) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/dpastoor/satchel?branch=master&svg=true)](https://ci.appveyor.com/project/dpastoor/satchel) Carry around your data See a project example in action at [satchel_examples](https://github.com/dpastoor/satchel_examples) ## Philosophy The driving factor behind satchel is helping to solve two issues that become more and more painful as projects get larger: 1) a project should be split into multiple files to keep knit times manageable and a semblance of organization. This fits nicely with [bookdown](https://github.com/rstudio/bookdown) as a mechanism for organizing Rmd's. 2) Reproducible research should strive to keep code embedded in reports #### How does `satchel` mitigate these pain points? First, it allows an easy mechanism for transferring data or objects between Rmd files so that there is only a single source of 'truth' for any piece of code (minimize copy and pasting code between files). For example, given raw data that needs consistent modifications before use in multiple files, the modifications should occur once, and the intermediate object be shared to the other files. This allows a single source of the change, so any future modifications will be properly propogated to all other instances. Second, an 'elephant in the room' of literate programming, is the the progression of a project does not proceed in a straight line, instead more like: ![success](assets/success.jpg) The total progression should be tracked in the `lab-notebook` to document false leads etc. When trying to create a final report though, only specific components will be retained. Satchel allows the necessary data/objects/etc to be easily saved from the lab-notebook Rmd files into a location that is suitable to be pulled into a separate final-report. This further allows the lab-notebook format to be separated from the final reports (eg Rmd files in lab-notebook, and Rnw files for final report for more control over layout, references, etc.) ## Using Satchel ### Initialization Satchel's must be initialized by calling `Satchel$new(...)` For example: ```{r eval = F} library(satchel) satchel <- Satchel$new("<namespace>", "<path/to/satchel/tree>") ``` The resulting object can have any name, though the convention is to use `satchel` if only creating one, and `satchel_<additional_identifier>` if using multiple in a single file ```{r eval = F} other_name <- Satchel$new("<namespace>", "<path/to/satchel/tree>") ``` Multiple satchel nodes can be created in a single file ```{r eval=FALSE} library(satchel) satchel_data <- Satchel$new("<namespace>", "<path/to/satchel/tree>") satchel_plots <- Satchel$new("<other-namespace>", "<path/to/satchel/tree>") # completely different tree report_satchel <- Satchel$new("<namespace>", "<finalreport_path/to/satchel/tree>") ``` This can be useful to save certain objects to pass around between documents in the lab-notebook, but separately save out key objects for report(s), presentation(s), etc ### Saving objects ```{r eval = FALSE} satchel$save(<obj>) satchel$save(<obj>, data_name = "<other name>") ``` ```{r eval=FALSE} satchel$save(Theoph) # will be saved as Theoph satchel$save(Theoph, "pk_theoph") # will be saved as pk_theoph ``` The `data_name` is key for use in pipelines, which do not retain the object names being passed ```{r eval = FALSE} # these will break as the name cannot be deparsed satchel$save(Theoph %>% mutate(REP = 1)) # would try to save with the literal name `Theoph %>% mutate(REP = 1) Theoph %>% mutate(REP = 1) %>% satchel$save() # would save with the name `.` # Instead, specify a name Theoph %>% mutate(REP = 1) %>% satchel$save("Theoph") ``` At the end of an Rmarkdown file, you can get the information for all saved object during the session ```{r eval=FALSE} satchel$report() ``` ### Using objects from other places Objects saved from other files can be accessed with `satchel$use(<object_name>, <namespace>)` Given an data frame called safety_data saved in the safety namespace specified in another file, saved by: ``` satchel <- Satchel$new("safety", "satchel") ... satchel$save(safety_data) ``` In the new file, the safety_data could be acquired via: ```{r eval=FALSE} satchel <- Satchel$new("analysis", "satchel") ... safety_data <- satchel$use("safety_data") ``` Satchel is smart enough to scan all the namespaces for an object called safety_data and pull it in. However, if there are multiple safety_data objects saved, an error will occur, and the namespace must be explicitly set. For example, in a third file, given: ``` satchel <- Satchel$new("other_safety", "satchel") ... satchel$save(safety_data) ``` As `safety_data` is present in both `safety` and `other_safety` namespaces, to use it you must specify which namespace you want to refer to: ```{r eval=FALSE} safety_data <- satchel$use("safety_data", "safety") ``` To see what is available across the entire tree, use `satchel$available()` ### Previewing objects For ongoing analyses, sometimes it is convenient to get a look at the data you are interested in, without loading the entire object, especially when dealing with larger datasets. `Satchel` automatically keeps a `head()` of certain objects (dataframes, matrices, vectors) so that they can be previewed, rather than pulling in the entire object. ```{r eval=FALSE} satchel$preview("safety_data") ``` As this does not need to pull in the entire file, access will be fast and light on memory. ## Patterns ### saving all files in a folder into a satchel TODO ## Advanced ### access behavior when 'using' an object, the default behavior is to re-scan the satchel tree to make sure the object is available and current. This is especially important if using multiple satchel's concurrently, when new objects are generated after another is initialized, it may not be aware of those updates. This continual scanning may not be ideal if the desire is to load a number of objects on after another (for example, pulling in a bunch to a list). To manually control the behavior, you can turn off refreshing via `satchel$auto_refresh(FALSE)` The overall, optimized flow might be something like: ``` satchel$auto_refresh(FALSE) satchel$available() # manually query the tree to update it satchel$use(...) satchel$use(...) satchel$use(...) satchel$use(...) satchel$auto_refresh(TRUE) # turn auto-refresh back on ``` Keep in mind, this is almost guaranteed to be a unneeded optimization, as scanning the file tree takes microseconds in most cases. However for certain very large projects, or projects with thousands of very small objects saved out, it could play a role. HINT: if you find yourself saving many small, related objects, it may be better to collapse the objects into a list then just save the single list object.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.