Containerised R workflow template

Project Status: WIP (initial development is in progress, but there has not yet been a stable, usable release suitable for the public). Lifecycle: experimental.

This is a template repository for a containerised R workflow built on the targets framework, made portable using renv, and run manually or automatically using GitHub Actions. To use this template, click the “Use this template” button and then select “Create a new repository”.

Check out the containerTemplateUtils package for handling common tasks related to this repo (sending emails, uploading files to AWS, etc.).

Note that git-crypt is not part of the template repo. See the EHA M&A handbook for how to add git-crypt.

Follow the links for more information about:

Recommendations:

  • One function per file in R/
  • Non-function R scripts in another directory like scripts/
  • Use the same names for targets and function arguments for those targets unless a function
  • Nouns for targets, verbs for functions
  • Use common suffixes for target types: _file for files, _raw for read-in but unprocessed data
  • Use fnmate and tflow RStudio Add-Ins to make this easy, create shortcuts for these add-ins (talk), or the usethis package
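
The first recommendation above can be checked mechanically. The sketch below is a rough heuristic, not part of the template: it assumes function files live in R/ and define functions at the top level as `name <- function(...)`, and the helper name `check_one_function_per_file` is hypothetical. It matches text with a regular expression rather than parsing R, so unusual formatting may be miscounted.

```shell
#!/bin/sh
# Hypothetical helper: report R files that do not define exactly one
# top-level function. Heuristic only (regex match, not an R parser).
# Usage: check_one_function_per_file [dir]   (dir defaults to R)
check_one_function_per_file() {
  dir=${1:-R}
  status=0
  for f in "$dir"/*.R; do
    [ -e "$f" ] || continue   # glob matched nothing
    # Count unindented lines of the form: name <- function(  or  name = function(
    n=$(grep -cE '^[A-Za-z.][A-Za-z0-9._]*[[:space:]]*(<-|=)[[:space:]]*function[[:space:]]*\(' "$f")
    if [ "$n" -ne 1 ]; then
      echo "$f: $n top-level function definitions"
      status=1
    fi
  done
  return "$status"
}

# Example (run from the project root):
#   check_one_function_per_file R
```

The function returns a non-zero status when any file fails the check, so it could be used in a pre-commit hook or a CI step if that suits your project.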

Quick start

  • Create a repo from the template
  • Rename the .Rproj file
  • Streamline the packages in packages.R
  • Modify .gitattributes to include any files that may need encryption
  • Initialize git-crypt for the repo
  • Add relevant environment variables to the .env file
  • Rename the GitHub Actions workflows
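
A few of these steps can be scripted. The sketch below is only an illustration and is not shipped with the template: the helper names (`rename_rproj`, `mark_encrypted`, `add_env_var`) and the example path and variable name are hypothetical. The `filter=git-crypt diff=git-crypt` attributes are the ones git-crypt uses in .gitattributes to mark files for encryption.

```shell
#!/bin/sh
# Hypothetical helpers sketching three of the quick-start steps.

# Rename the .Rproj file in a project directory.
# Usage: rename_rproj <project_dir> <new_name>
rename_rproj() {
  dir=$1
  new=$2
  for f in "$dir"/*.Rproj; do
    [ -e "$f" ] || return 1    # no .Rproj file found
    [ "$f" = "$dir/$new.Rproj" ] || mv "$f" "$dir/$new.Rproj"
    return 0                   # only the first match is renamed
  done
}

# Mark a path pattern for encryption in .gitattributes (skips duplicates).
# Usage: mark_encrypted <project_dir> <pattern>
mark_encrypted() {
  line="$2 filter=git-crypt diff=git-crypt"
  touch "$1/.gitattributes"
  grep -qxF -- "$line" "$1/.gitattributes" ||
    printf '%s\n' "$line" >> "$1/.gitattributes"
}

# Append NAME=value to the project's .env file.
# Usage: add_env_var <project_dir> <name> <value>
add_env_var() {
  printf '%s=%s\n' "$2" "$3" >> "$1/.env"
}

# Example (placeholder values, run from the project root):
#   rename_rproj . my-project
#   mark_encrypted . "auth/**"
#   add_env_var . EXAMPLE_VAR example-value
```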

GitHub Actions

GitHub Actions allows automation, customisation, and execution of your research project workflows right in your GitHub repository.

In short, a GitHub Actions workflow is composed of one or more jobs. Each job is composed of steps, which control the order in which actions are run to complete the job. The workflow is scheduled or triggered by a specific event and runs on a runner, a server with the GitHub Actions runner application installed, that is either hosted by GitHub or self-hosted on your own machines.

The whole workflow, including the event trigger and the runner it runs on, is specified in a workflow .yml file saved in the .github/workflows directory of the GitHub repository in which you want to use GitHub Actions.

This repository contains a template GitHub Actions workflow, with its corresponding .yml file, that illustrates how GitHub Actions can be used to run and maintain an R workflow that uses targets and renv.
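
GitHub only picks up workflow files stored under .github/workflows/, so it can help to confirm what is there after creating a repository from the template. A minimal sketch, assuming a local clone; the helper name `list_workflows` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list the workflow files GitHub Actions would see.
# Usage: list_workflows <repo_dir>
list_workflows() {
  wf_dir="$1/.github/workflows"
  if [ ! -d "$wf_dir" ]; then
    echo "no $wf_dir directory" >&2
    return 1
  fi
  # Workflow files use a .yml or .yaml extension
  find "$wf_dir" -maxdepth 1 -type f \( -name '*.yml' -o -name '*.yaml' \) | sort
}

# Example (run from the repository root):
#   list_workflows .
```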

Using containers in GitHub Actions workflow

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

Containers can be used within a GitHub Actions workflow and can be specified either at the job level or at the step level. If specified at the job level, all the steps within that job will be run inside that container. When specified at the step level, different containers can be used for each step.

The example/template workflow can be found inside the .github/workflows folder and is shown below:

name: container-workflow-template

on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:
    branches:
      - '*'
  #schedule:
  #  - cron: "0 8 * * *"
      
jobs:
  container-workflow-template:
    runs-on: ubuntu-latest                                # Run on GitHub Actions runner
    #runs-on: [self-hosted, linux, x64, onprem-aegypti]   # Run the workflow on EHA aegypti runner
    #runs-on: [self-hosted, linux, x64, onprem-prospero]  # Run the workflow on EHA prospero runner
    container:
      image: rocker/verse:4.1.2
      
    steps:
      - uses: actions/checkout@v2
    
      - name: Install system dependencies
        run: |
          apt-get update && apt-get install -y --no-install-recommends \
          libcurl4-openssl-dev \
          libssl-dev
      
      - name: Restore R packages
        run: |
          renv::restore()
        shell: Rscript {0}
    
      - name: Run targets workflow
        run: |
          targets::tar_make()
        shell: Rscript {0}

In this example, we show a data quality check workflow report for a nutrition survey of children 6-59 months old.

The trigger

The trigger for GitHub Actions is specified in these lines in the workflow YAML file:

on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:
    branches:
      - '*'
  #schedule:
  #  - cron: "0 8 * * *"

This workflow runs automatically on a push or pull request event targeting the main or master branch of the repository. It can also be run manually from the GitHub Actions page for any branch of the repository, through the workflow_dispatch specification in the workflow YAML file.

GitHub Actions workflows can also be scheduled to run at specific times and frequencies using the schedule specification in the workflow YAML file, which takes POSIX cron syntax. Scheduled workflows run on the latest commit on the default or base branch, and cron times are interpreted in UTC. The shortest interval at which you can run scheduled workflows is once every 5 minutes. In the example workflow, the schedule has been set to run at 8 am every day, but it is commented out. To schedule your workflow runs, remove the hash characters and set the cron expression to the frequency you require. Note that, while GitHub Actions is highly reliable, GitHub does not guarantee that a scheduled job will run on GitHub-hosted runners, and jobs are less likely to run at a popular time (generally on the hour).
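
To make the cron syntax concrete, the sketch below splits a five-field expression into its named fields. It is an illustration only: the helper name `describe_cron` is hypothetical, and it checks just the field count, not whether each value is valid.

```shell
#!/bin/sh
# Hypothetical helper: name the five fields of a POSIX cron expression.
# Usage: describe_cron "<minute> <hour> <day-of-month> <month> <day-of-week>"
describe_cron() {
  # read splits on whitespace without glob-expanding "*"
  read -r minute hour dom month dow extra <<EOF
$1
EOF
  if [ -z "$dow" ] || [ -n "$extra" ]; then
    echo "expected exactly 5 fields: minute hour day-of-month month day-of-week" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
}

# The schedule from the example workflow: minute 0 of hour 8, every day
describe_cron "0 8 * * *"
# prints: minute=0 hour=8 day-of-month=* month=* day-of-week=*

# Every 5 minutes, the shortest interval mentioned above
describe_cron "*/5 * * * *"
# prints: minute=*/5 hour=* day-of-month=* month=* day-of-week=*
```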

The job

The job for GitHub Actions is specified in these lines in the workflow YAML file:

jobs:
  container-workflow-template:
    runs-on: ubuntu-latest                                # Run on GitHub Actions runner
    #runs-on: [self-hosted, linux, x64, onprem-aegypti]   # Run the workflow on EHA aegypti runner
    #runs-on: [self-hosted, linux, x64, onprem-prospero]  # Run the workflow on EHA prospero runner
    container:
      image: rocker/verse:4.1.2

The job named container-workflow-template is specified to run on GitHub-hosted runners. These runners are identified by a label that names the operating system followed by its version. In the example workflow, the line runs-on: ubuntu-latest runs the workflow on a GitHub-hosted machine with the latest Ubuntu operating system.

The job can also be run on a self-hosted GitHub Actions runner installed on EHA’s high performance computing machines, again using the runs-on specification. Labels unique to each runner identify the specific machine to use. The syntax for specifying these runners is shown in the workflow but commented out.

To make the GitHub Actions workflow more robust and reproducible, we set up a container at the job level. The container specified is a versioned R image that has tidyverse and other R publishing tools installed. This container image is generally adequate for workflows that require data wrangling and manipulation using the tidyverse tools and reporting using rmarkdown. Some projects/workflows (like those using spatial packages such as sf) may benefit from a different R image, so change the container specification accordingly. To read more about available R images, see https://www.rocker-project.org/images/.

Using this GitHub Actions workflow template

This repository has been set up as a private template repository. This means that EHA staff can use it to create new repositories with the same file structure.

This can be done as follows:

  1. In your GitHub account, go to the EcoHealth Alliance organisation (https://github.com/ecohealthalliance) then click on the green button labeled New.

  2. You will now be directed to the Create new repository page. Here, right at the top, you will see the Repository template heading. Click on the drop down button right below this that says No template. You will then see all the available templates within EHA. Select the template named ecohealthalliance/container-template.

  3. Give your new repository a name, set the appropriate repository visibility, and then click on Create repository.

  4. You will now have a new repository the contents of which are the same files and structure as this template repository.

  5. You can now make the necessary changes and additions that are specific to your workflow.

Using git-crypt to encrypt files in your workflow

Your project may contain a mix of public and private content. Being able to encrypt the private contents of your project is very useful. It is recommended that you use PGP (Pretty Good Privacy) encryption, implemented by the program git-crypt. It takes a bit to set up, but once activated it makes sharing secure and seamless. To set up PGP and git-crypt on a project based on this template, see the Encryption chapter of the EHA Modeling and Analytics Handbook.

Once you have enabled git-crypt on your project, you will need to make the following edits to the container-workflow-template.yml file to be able to perform symmetric key decryption described here. Here is the container-workflow-template.yml file updated to allow and perform symmetric key decryption:

name: container-workflow-encrypted-template

on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:
    branches:
      - '*'
  #schedule:
  #  - cron: "0 8 * * *"

env:
  GIT_CRYPT_KEY64: ${{ secrets.GIT_CRYPT_KEY64 }}
      
jobs:
  container-workflow-encrypted-template:
    runs-on: ubuntu-latest                                # Run on GitHub Actions runner
    #runs-on: [self-hosted, linux, x64, onprem-aegypti]   # Run the workflow on EHA aegypti runner
    #runs-on: [self-hosted, linux, x64, onprem-prospero]  # Run the workflow on EHA prospero runner
    container:
      image: rocker/verse:4.1.2
      
    steps:
      - uses: actions/checkout@v2
    
      - name: Install system dependencies
        run: |
          apt-get update && apt-get install -y --no-install-recommends \
          git-crypt \
          libcurl4-openssl-dev \
          libssl-dev
          
      - name: Decrypt repository using symmetric key
        run: |
          echo $GIT_CRYPT_KEY64 > git_crypt_key.key64 && base64 -di git_crypt_key.key64 > git_crypt_key.key && git-crypt unlock git_crypt_key.key
          rm git_crypt_key.key git_crypt_key.key64
      
      - name: Restore R packages
        run: |
          renv::restore()
        shell: Rscript {0}
    
      - name: Run targets workflow
        run: |
          targets::tar_make()
        shell: Rscript {0}

Once you have edited your workflow YAML file, and before you push the changes to GitHub, you will need to add the symmetric key to your GitHub repository as a secret.

First, export the symmetric key by running this in your project directory:

git-crypt export-key git_crypt_key.key

git_crypt_key.key can now be used to decrypt the repository, and you can provide it to GitHub Actions as a secret environment variable (see https://docs.github.com/en/actions/security-guides/encrypted-secrets). However, since it is binary data, you’ll need to convert it to base64 first. So run something like:

cat git_crypt_key.key | base64 | pbcopy

to convert this file to base64 data and copy it to the clipboard (pbcopy is the macOS clipboard command), then paste it into GitHub’s secret field under the name GIT_CRYPT_KEY64.
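
Before pasting the secret, it can be reassuring to confirm that the base64 text decodes back to the original bytes, since that is what the workflow's decryption step relies on. The sketch below is an assumption-laden illustration: it assumes GNU coreutils `base64` (as found in the Linux container), the helper name `key_roundtrip_ok` is hypothetical, and the demo uses random bytes rather than a real git-crypt key.

```shell
#!/bin/sh
# Hypothetical helper: succeed if decode(encode(file)) matches the file.
# Assumes GNU coreutils base64.
# Usage: key_roundtrip_ok <key_file>
key_roundtrip_ok() {
  key_file=$1
  work=$(mktemp -d) || return 1
  # Encode onto a single line, as it would be pasted into the secret field
  base64 < "$key_file" | tr -d '\n' > "$work/key.b64"
  # Decode the way the workflow step does (-d decode, -i ignore non-alphabet characters)
  base64 -di "$work/key.b64" > "$work/key.bin"
  cmp -s "$key_file" "$work/key.bin"
  status=$?
  rm -rf "$work"
  return "$status"
}

# Demo with a random stand-in file (NOT a real git-crypt key)
demo_key=$(mktemp)
head -c 148 /dev/urandom > "$demo_key"
key_roundtrip_ok "$demo_key" && echo "round trip OK"
rm -f "$demo_key"
```

On Linux, where pbcopy is not available, `base64 -w 0 git_crypt_key.key` (GNU coreutils) prints the key as a single line that can be copied by hand.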

Contributors

collinschwantes