Using pipeline tools to ensure analyses are reproducible and understandable, using R's targets package
This half-day course explains why pipeline tools such as make, snakemake and targets are indispensable in (reproducible) data analyses.
If you have taken this course and are willing to answer a short (1-2 minute) survey, please fill it in here.
There is a short lecture here.
Applied example: there are two folders:
- this contains a blank R project along with the data in a subdirectory, which can be a starting point for developing a reproducible data analysis
- this contains a completed R project along with the required `_targets.R` file
The example analysis has four steps:
- get and clean data
- fit a model: `ozone ~ temperature`
- plot the model's fit versus data
- diagnose any issues with model fit; if necessary, change the model and rerun the above steps
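
The four steps above can be sketched as plain R functions before any pipeline tool is involved. This is a minimal sketch under stated assumptions, not the course's solution: R's built-in `airquality` dataset stands in for the course CSV, the function names other than `clean_data` are hypothetical, and the temperature column is called `temp` here because the built-in dataset names it `Temp`.

```r
# Assumption: the built-in airquality dataset stands in for the course CSV.
csv_file <- tempfile(fileext = ".csv")
write.csv(datasets::airquality, csv_file, row.names = FALSE)

# Step 1: read the data, lowercase the column names, drop rows containing NA
clean_data <- function(filename) {
  data <- read.csv(filename)
  names(data) <- tolower(names(data))
  na.omit(data)
}

# Step 2: fit a simple linear model of ozone against temperature
fit_model <- function(data) {
  lm(ozone ~ temp, data = data)
}

# Step 3: plot the observations with the fitted line, saved to a PDF file
plot_fit <- function(data, model, file) {
  pdf(file)
  on.exit(dev.off())
  plot(data$temp, data$ozone, xlab = "Temperature", ylab = "Ozone")
  abline(model)
  invisible(file)
}

# Step 4: plot residuals against fitted values to look for problems with the fit
plot_residuals <- function(model, file) {
  pdf(file)
  on.exit(dev.off())
  plot(fitted(model), residuals(model), xlab = "Fitted", ylab = "Residual")
  abline(h = 0, lty = 2)
  invisible(file)
}

cleaned <- clean_data(csv_file)
model <- fit_model(cleaned)
fit_file <- plot_fit(cleaned, model, tempfile(fileext = ".pdf"))
residual_file <- plot_residuals(model, tempfile(fileext = ".pdf"))
```

Each function takes the output of the previous step as its input. A pipeline tool tracks exactly this chain of dependencies, so that changing the model only reruns the steps that depend on it.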
To work through the example yourself:
- clone the repo and double click on the `r_project_blank_slate.Rproj` icon to launch RStudio
- install the `targets` package via `install.packages("targets")`
- type `use_targets()` in the console, which should create a `_targets.R` file
- add `tidyverse` as a package under the `packages` item in the `_targets.R` file
- comment out `tar_source()`
- remove the list of example targets that have been generated in the `_targets.R` file (but keep the `list` which had these targets within it)
- create a folder called `scripts` that has in it a file called `clean_data.R`
- in `clean_data.R`, write a function that takes the `data/raw/airquality.csv` file, renames the columns using only lowercase letters and removes any rows that have `NA` values in them
- in the preamble of `_targets.R`, add `source("scripts/clean_data.R")` to ensure that `targets` has access to your function
- in the `_targets.R` file, add a target for the cleaned data via `tar_target(data_airquality_cleaned, clean_data(filename))`
- in the `_targets.R` file, create a leaf target for the file itself via `tar_target(filename, "data/raw/airquality.csv", format = "file")`
- visualise the network via `tar_visnetwork(names = data_airquality_cleaned)`
- clean the data using `tar_make()`
- revisualise the network to check that everything is up to date
- type `tar_read(data_airquality_cleaned)` in the console to view the cleaned data
- continue with this methodology to create a full data analysis
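
Taken together, the steps above should leave `_targets.R` looking roughly like the sketch below. It assumes that `scripts/clean_data.R` defines `clean_data()` as described in the steps; the file generated by `use_targets()` also contains comments and options that are omitted here, so treat this as an outline rather than the exact expected file.

```r
# _targets.R (sketch of the file after following the steps above)
library(targets)

# Packages that targets loads before building each target
tar_option_set(packages = c("tidyverse"))

# tar_source()  # commented out; the function is sourced explicitly instead
source("scripts/clean_data.R")

# The pipeline: the file must end with a list of targets
list(
  # Leaf target: tracks the raw CSV so that edits to the file are detected
  tar_target(filename, "data/raw/airquality.csv", format = "file"),
  # Cleaned data: rebuilt whenever the file or clean_data() changes
  tar_target(data_airquality_cleaned, clean_data(filename))
)
```

Because `filename` is declared with `format = "file"`, `targets` watches the contents of the CSV rather than just the path string, which is what allows `tar_make()` to skip the cleaning step when nothing has changed.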
Prerequisites:
- basic R skills (although not essential)
- some experience of having done data analysis