r-docker-tutorial's Introduction

r-docker-tutorial

A docker tutorial for reproducible research.

This is an introduction to Docker designed for participants who know R and RStudio. It is intended to help people who need Docker for a project. We first explain what Docker is and why it is useful. Then we go into the details of how to use it for a reproducible, transportable project.

Lessons:

  1. Lesson 01 What and Why
  2. Lesson 02 Launching Docker
  3. Lesson 03 Install packages
  4. Lesson 04 Dockerhub
  5. Lesson 05 Dockerfiles
  6. Lesson 06 Sharing your analysis

Credit

Formerly an rOpenSci project started at the 2016 Unconference. Credit to all the contributors!

r-docker-tutorial's People

Contributors

ajstewartlang, bkatiemills, christophm, emaasit, heidiseibold, jsta, lcpalm, lmguzman, osorensen, suzanbaert, ttimbers, vezy, wking

r-docker-tutorial's Issues

Update Launching Docker section

In chapter 2, the Launching section is a bit out of date, as one no longer needs to use eval "$(docker-machine env default)" or the quickstart terminal with the latest editions of Docker for Mac or Docker for Windows. Instead, you just need to start the Docker application, which runs in the background.

Not sure if anyone is currently maintaining this lesson. It might make sense to add an appropriate http://www.repostatus.org/ badge.

Two different section 4's

Both 04-commit.Rmd and 04-Dockerhub.Rmd exist. 04-commit.Rmd has a link to the "next lesson" 05-Dockerhub, which is not a thing.

More Explicit Docker Install Instructions

The first step in Lesson 02 Launching Docker is to install docker. As SWC/DC workshop participants often have installation problems, it might be efficient to be explicit regarding the installation of docker before the lesson gets started.

I suggest either:

  1. Include the install instructions on the main page of the lesson, or
  2. Add a Lesson 00 Installing Docker section

Either of these options would help make sure that participants have installed it before the lesson begins.
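
A pre-lesson check could be as small as the snippet below. This is only a sketch: the function name `check_docker` is made up for illustration, and it relies on nothing beyond `command -v`, `docker info` and `docker --version`.

```shell
# check_docker: hypothetical helper that reports whether Docker is usable.
check_docker() {
  # Is the docker command-line client on the PATH?
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker: not installed"
    return 1
  fi
  # `docker info` exits non-zero when the client cannot reach the daemon.
  if ! docker info >/dev/null 2>&1; then
    echo "docker: installed, but the Docker service is not running"
    return 1
  fi
  echo "docker: ready ($(docker --version))"
}

check_docker || echo "Please fix the problem above before the lesson starts."
```

Participants could paste this into a terminal the day before the workshop and report back whatever it prints.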

Static files in Dockerfile cannot be loaded

Greetings!!

I am stuck on the Dockerfile chapter.

My Dockerfile contains 3 commands which are the same as the tutorial:
FROM rocker/verse:latest
RUN R -e "install.packages('gapminder', repos = 'http://cran.us.r-project.org')"
ADD data/gapminder-FiveYearData.csv /home/rstudio/

However, the gapminder dataset does not show up in the RStudio session run by Docker. I am using Windows and wonder if the directory path needs to be modified.

Please help me with it

Overlap lessons 3 and 4

We already show how to commit in lesson 3, and then lesson 4 is supposed to introduce committing.

We have the following options:

  • Show it twice (3+4), then we have to make sure there is at least some new info in 4
  • Not show it in 3, which makes it difficult with the closing of the terminal with exit
  • Delete lesson 4, which makes committing seem less important than it is

What do you think?

Sample full Dockerfile

It would be nice to have a full sample Dockerfile to see, including any operations that remove temporary files to help clean up the container after package installs.

It might also be helpful to touch on how a Makevars file can be incorporated into your Dockerfile, such as compiling rstan, which requires you to have compiler flags set up in the Makevars file.
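
As a starting point for discussion, here is a rough sketch of what such a sample might look like, written out from the shell with a heredoc. Several details are assumptions rather than anything from the lessons: `libxml2-dev` is just a stand-in system library, the `CXXFLAGS=-O3` line is a placeholder for whatever flags a package like rstan needs, `/root/.R/Makevars` assumes the build runs as root, and `/tmp/downloaded_packages` is assumed to be where `install2.r` leaves its downloads.

```shell
# Write a sample Dockerfile into the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > Dockerfile <<'EOF'
# Base image used elsewhere in this tutorial
FROM rocker/verse:latest

# System library (libxml2-dev is only an example); remove the apt cache in
# the same RUN step so it never ends up in an image layer
RUN apt-get update \
 && apt-get install -y --no-install-recommends libxml2-dev \
 && rm -rf /var/lib/apt/lists/*

# Compiler flags for packages built from source (placeholder flag, an assumption)
RUN mkdir -p /root/.R \
 && echo "CXXFLAGS=-O3" >> /root/.R/Makevars

# Install an R package, then delete the downloaded sources (assumed path)
RUN install2.r --error gapminder \
 && rm -rf /tmp/downloaded_packages

# Copy the data into the default RStudio user's home directory
ADD data/gapminder-FiveYearData.csv /home/rstudio/
EOF
```

Doing the cleanup in the same `RUN` step as the install matters, because files deleted in a later step still take up space in the earlier layer.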

section 06: docker pull needs tag

I tried to run section 06 with my own research project.
For pulling I needed to add the tag. So
docker pull yourhubusername/gapminder_my_analysis:firsttry
should be correct.

Links to the next tutorial

It would be nice if there were links to the next and previous tutorials at the end of the current one.

Thanks for the good work, by the way 👍

max-width not working for Firefox 44.0

In Firefox 44.0, the max-width property doesn't seem to be working, and images end up unusably wide:

[screenshot: wide-image]

I'm not clear on how Rmd is converted to HTML, but I expect the max-width setting comes from here. The HTML for the page nests the image inside a figure div:

$ curl -s http://ropenscilabs.github.io/r-docker-tutorial/01-what-and-why.html |
>   grep -1 Computerception | sed 's/src="[^"]*"/src="..."/g'
<div class="figure">
<img src="..." alt="Computerception" />
<p class="caption">Computerception</p>
</div>

The figure class is configured (I'm not sure how) to display as a table:

[screenshot: figure-table]

Disabling (via the inspector checkbox) that table-display property gets max-width working again:

[screenshot: figure-table-disabled]

MDN mentions max-width with table being undefined, although whether that applies to display: table is unclear to me. I'm not clear on why you'd want display: table anyway, so if someone can find where it's coming from and how to disable it, that would probably be a reasonable fix for the Firefox rendering.

For what it's worth, the current page renders fine in Chromium 48.0:

[screenshot: chromium]

Section 03 challenges

03-install-packages is good-to-go IMO, but it of course needs challenges. Ideas?

The thing that makes me a bit anxious about the whole docker commit workflow is that it leaves you with an image with goodness-knows-what inside it, as we get super philosophical about in #11. I'd like to see a challenge problem that walks people through generating some docs about what's in this image, and then baking that into the image. Something like

  • add the response of installed.packages() to a file
  • ________ (something to capture whatever got apt-get'ed or whatever else outside of R)
  • save the file as a README in your image.
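
One possible shape for this challenge, as a sketch to be run inside the container before `docker commit`. The file name `README_image.txt` is made up, and using `dpkg-query` for the non-R part assumes a Debian-based image; other base images would need a different command.

```shell
# Run inside the container before `docker commit`.
# README_image.txt is a hypothetical file name.
readme="README_image.txt"

{
  echo "Image contents recorded on $(date)"
  echo
  echo "== R packages =="
  if command -v Rscript >/dev/null 2>&1; then
    # installed.packages() returns a matrix; keep name and version only
    Rscript -e 'write.table(installed.packages()[, c("Package", "Version"), drop = FALSE], quote = FALSE, row.names = FALSE)'
  else
    echo "(Rscript not found)"
  fi
  echo
  echo "== System packages =="
  if command -v dpkg-query >/dev/null 2>&1; then
    # One line per installed Debian package: name, tab, version
    dpkg-query -W
  else
    echo "(dpkg-query not found)"
  fi
} > "$readme"
```

The resulting file only records what is installed, not how it got there, so it complements a Dockerfile rather than replacing one.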

Maybe add some instructions on autobuilds

In my readings on Docker, I've seen numerous references to "don't push images to Docker Hub; use autobuilds on Docker Hub instead". This is mostly a trust issue - if people can look at the Dockerfile and see what's in the image they're more inclined to trust it for a pull than if it's just "somebody with an account pushed an image".

There's also a bandwidth issue - many ISPs provide vastly lower upload speeds than download speeds. For example, my Comcast has 60 megabits down and only 6 megabits up.

Once you have an image building locally it's easy to set up an autobuild on Docker Hub or Quay.io.

Docker Hub: https://docs.docker.com/docker-hub/builds/
Quay.io is a bit more complicated; I use it because they track GitLab and Docker Hub only tracks GitHub and Bitbucket.

motivation for dockerfiles

In lesson #5 we lay out the motivation for Dockerfiles as wanting to add more things to our container, like R packages and data, that will be pre-installed and ready to go as soon as we start up. I am not sure that this is really why you would want to make a Dockerfile. I think a Dockerfile is important for reproducibility and security.

You can take a Docker image and add data and R packages interactively, and then commit that version of the container to update or make a new image. But what an image alone doesn't give you is the information about what is inside that container. The Dockerfile is where you get all that info. So essentially it is an install recipe for the environment you need to run your code. From a security standpoint, a Dockerfile also tells you whether there would be any malicious code in the image it would build (a real worry for HPC, Compute Canada for example, which is why they won't let you use Docker with root access on their systems).

@BillMills, thoughts?

Consider use of littler `install2.r` scripts in the "Writing Dockerfiles" docs?

install2.r just provides a more concise syntax for adding R packages in Dockerfiles; it's available on all rocker images, since we use it in the Rocker Dockerfiles too. One can just do:

RUN install2.r -e bookdown rticles rmdshower

to install those packages.

Note that the repo is set by default (the RStudio CRAN mirror on :latest; versioned images instead pull from the MRAN snapshots by default to ensure reproducibility).

The -e flag is optional; it makes install.packages() throw an error if the package installation fails (which will cause the docker build command to fail). By default, install.packages() only throws a warning, which means that a Dockerfile can build successfully even if it has failed to install the package.

Rethinking the "Sharing your analysis" section.

I don't think we should recommend docker commit and docker push as the way to 'publish' a Dockerfile. This results in just a binary image blob being uploaded to the hub, which is not particularly transparent; a user really has no way of knowing / trusting that the image contains what it says.

I think it would be better to simply put the Dockerfile on GitHub and link it as an automated build. (In this case, users should probably use one of the version-specific tags instead of :latest in their FROM line to ensure long-term reproducibility / stability of the build.)
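
For illustration, a Dockerfile with a pinned base image might look like the sketch below. The tag `3.3.1` is only an example and assumes such a tag is published for `rocker/verse`; use whichever version-specific tag the analysis was actually tested with.

```shell
# Write a Dockerfile whose base image is pinned to an R version tag.
# 3.3.1 is only an illustrative tag, not a recommendation.
cat > Dockerfile <<'EOF'
FROM rocker/verse:3.3.1
RUN install2.r --error gapminder
EOF
```

Committing this file to the repository linked to the automated build means anyone can read exactly what goes into the image.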

section 06 minor comments

I really like the capstone in 06-sharing-all-your-analysis - it's creating a picture of docker being this tool to freeze your code, data and deps in a little blob of amber, which I think could be a really compelling way to wrap up (and we might want to call that out more). A couple tweaks:

  • I'd love to explicitly include some data as well, rather than the artificial teaching case of getting it from gapminder. Related:
  • I'd also like to install something other than gapminder, to create a slightly more authentic experience.
  • One pitfall that we don't deal with anywhere in this encased-in-amber approach to reproducibility is versioning. ie, what if I do exactly what this section says, then do something else and push it to my dockerhub repo, and then later you want to run my original version. There's got to be a way to roll back to earlier tags, and it can't be that tough (right? famous last words).
  • challenge problems: does anyone have a super cool thing following this model we can get them to download and run / play with? Or something else showing off the stack in action.
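
On the versioning pitfall, one approach is to give every pushed state its own tag, so earlier versions stay addressable. A minimal sketch, with a hypothetical repository name and tag names; the commands are printed rather than executed, so remove the leading `echo` to run them.

```shell
# Hypothetical Docker Hub repository name.
repo="yourhubusername/gapminder_my_analysis"

# Print the tag and push commands for one version of the image.
tag_and_push() {
  # $1 = local image (name or ID), $2 = tag to publish it under
  echo docker tag "$1" "$repo:$2"
  echo docker push "$repo:$2"
}

tag_and_push gapminder_my_analysis firsttry    # original analysis
tag_and_push gapminder_my_analysis secondtry   # later changes

# The original stays available to anyone who names its tag explicitly
echo docker pull "$repo:firsttry"
```

This only works as long as nobody pushes a different image under an existing tag, since tags can be overwritten.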

There might be room for some context here, too; I think the encased-in-amber approach makes a lot of sense for reproducing that paper you wrote that time, but there's a whole other paradigm too - I use docker as a tool to quickly stand up analysis frameworks, that I might want to run on new data and new code, rather than on a fixed thing from history. As such, rather than baking everything in, it's interesting to think about infrastructure to integrate your docker container with your github repo, your figshare data,... Probably (way) too much for this lesson, but maybe in the supplementals or as a high-level, qualitative comment.

Document native apps in place of 'Quickstart Terminal'

The native apps Docker now provides for Mac and Windows avoid the whole need for Quickstart Terminal and the cumbersome docker-machine calls to get the IP. The whole thing works basically as it does on Linux: just have the Docker service running in the background and you can access Docker through any terminal. The RStudio session is available directly from localhost:8787 (or whatever port you bind), with no need to look up IPs.

Links are not highlighted

All the links are with the same style as the normal text. This is coming from the dcTemplate: https://github.com/karthik/dcTemplate/blob/master/inst/rmarkdown/templates/dc_lesson_template/resources/style.css

Not sure if you want it like that, but I found it to be confusing.

Also, it would make sense to have the links open in new tabs. To my knowledge that is not possible with Rmd, so the only solution would be to write HTML code directly into the .Rmd files with target=_blank in the a tags.

TODOs as comments in the Rmd files

I put TODOs or questions as comments in the Rmd files. Is that okay or should I open issues for each?

I always wrote TODO, so you can easily find it.

Make it easier for people to contribute

It would be nice to make it easier for people to contribute.

  • How to contribute?
  • What are the rules?
  • What are the technical details you need to know?
  • ...?

Do you want to include instructions on sharing images via "docker save" and "docker load"?

Pushing to Docker Hub is great, but it does have some disadvantages:

  1. Bandwidth - many ISPs have much lower upload bandwidth than download bandwidth. For example, my home Comcast has a luxurious 60 mbits / second down but only 6 mbits / second up!
  2. Unless you're paying extra for the private repositories, pushing equals publishing.

For local interchange of Docker images, there's an alternative - docker save saves an image to a tar archive and docker load loads an image from a tar archive. One collaborator can save an image to a tar archive on a USB stick and the second can load the image, or the archives can be stored on a local shared file server.
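
A minimal sketch of that exchange, using a hypothetical image name and archive file name. The commands are printed rather than executed so the snippet is safe to paste anywhere; remove the leading `echo` to run them.

```shell
# Hypothetical image and archive names.
image="yourhubusername/gapminder_my_analysis:firsttry"
archive="gapminder_my_analysis.tar"

# First collaborator: write the image to a tar archive (e.g. on a USB stick)
save_image() { echo docker save -o "$archive" "$image"; }

# Second collaborator: read the image back in from the archive
load_image() { echo docker load -i "$archive"; }

save_image
load_image
```

After loading, the image appears in `docker images` under the same name and tag it was saved with.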

Bug BBQ Goals

Hi team,

The Software Carpentry Bug BBQ is coming up! If we want to participate, we need to lay out a roadmap of goals for the event. I think it would make sense to try and get things to a testable beta state by the end of the BBQ - what do all y'all think?

To participate in the BBQ, we need to make issues noting all the things we need to do to get from where we are now, to where we want to be (ie a decent draft). Maybe we (I'm looking at @Emaasit @wking @ttimbers @HeidiSeibold @lmguzman and myself, and anyone else who wants to jump in) can all have a read and point out:

  • what parts are outright missing?
  • are there any inconsistencies between sections (ie telling people to install r packages one way at one time and another way later)?
  • do the commands actually work as written?
  • are there adequate challenge problems in all sections?
  • _________ (other mission-critical questions I forgot)

Plus anything else you want to raise - but if we identify all those places that need work (like, just identify them - you don't need to fix them, that's what we'll do on the Bug BBQ) say by June 8 or so, we can put together a clear picture of what we want to achieve on the 13th. Sound good?

Section 02 minor comments

I think overall 02-Launching-Docker is in good shape for our first release. Here's what I would change, though none of this is mission-critical:

  • In 'Launching R Studio in Docker', move docker run --rm -p 8787:8787 rocker/hadleyverse to before the optional comments. This encourages the instructor to get something on the screen asap - they can editorialize about it while the helpers go around and get people unstuck, rather than keeping students in 'wait and listen' mode before getting the first command of the section.
  • In the same section, most students won't know what a 'port' is. Something like this would help: "-p stands for publish, and tells Docker to make RStudio reachable from our web browser. The 8787:8787 will be part of the URL RStudio will be visible at, as we'll see right away."
  • When introducing the -v flag, we currently ask the student to figure out what directory on their physical machine they want to mount - maybe remind them to pwd? Actually, they can even go -v $PWD:/whatever/mount/point, though maybe that syntax is a bit much; in any case, asking students to deftly wield the shell knowledge they just learned is usually a stretch, so some reminder about pwd is probably in order.
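
The `pwd` reminder and the `$PWD` shortcut could be shown together, roughly like this. The mount point `/home/rstudio/work` and the function name `run_cmd` are assumptions for illustration; the command is printed rather than executed, so remove the `echo` to launch the container.

```shell
# Directory inside the container; /home/rstudio/work is an assumed mount point
mount_point="/home/rstudio/work"

# Remind ourselves which host directory is about to be mounted
pwd

# Build the launch command with the current directory as the mounted volume
run_cmd() {
  echo docker run --rm -p 8787:8787 -v "$PWD:$mount_point" rocker/hadleyverse
}

run_cmd
```

Quoting `"$PWD:$mount_point"` keeps the command working when the current directory has spaces in its name, which is common on student laptops.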

Also - challenge problems. Can we rely on these folks to have RStudio installed on their local machines? If so, they could buddy up, open local RStudio, and compare the results of installed.packages(), then do the same in their docker-rstudio - or something similar that illustrates the perfect consistency of a shared docker container versus the wild west of whatever people happen to have installed. Is anyone aware of a behavior change between a couple of different versions of any given r package we could illustrate in a similar exercise?

Section 04 challenges

04-Dockerhub is in good shape, but also needs challenges. Off the top of my head:

  • download your partner's image. How did that compare speed-wise to downloading the hadleyverse image the first time? (It was super fast since docker is smart enough to not re-download the base hadleyverse image, and only had to grab your changes).
  • something about searching dockerhub for an interesting project?

also I assume 04-commit can be abandoned as 04-dockerhub covers that material?

Consider updated / versioned rocker images

Hey @BillMills @HeidiSeibold ,

We have a new stack of rocker images, aimed more specifically at reproducibility, that it would be great to get your feedback on, and maybe consider migrating to for these lessons; see the README here: https://github.com/rocker-org/rocker-versioned.

Most notable is the use of tags. In Docker, if you do not specify a tag, you're implicitly using the tag :latest, which in the rocker images corresponds to the most recent version of R and R packages. We now have tags specific to R versions, e.g. 3.3.1. In this stack, using the most recent R version as the tag (e.g. 3.3.2 at this time) is the same as asking for :latest, but using an earlier version like 3.3.1 not only locks in the R version, but also installs all R packages using an MRAN snapshot with the date fixed at the last day that version was current (e.g. 2016-10-31 for 3.3.1). This is also set as the default CRAN repo. This effectively freezes the versions of all packages and should result in a more stable, reproducible build, though it won't have the latest and greatest.

I imagine a workflow in which users still tend to try things out and develop on :latest, but when archiving projects or their own Dockerfiles, would test and use the latest R release version (patch versions of R are released every few months, minor versions 1-2 a year), for greater reproducibility down the road. Thoughts on this?

One other notable change -- we're moving to deprecate rocker/hadleyverse in favor of rocker/tidyverse (which is much smaller and more explicitly defined as just R+RStudio+tidyverse+devtools packages). Another new image, rocker/verse adds LaTeX and a few minor things to tidyverse, corresponding more closely to the old hadleyverse image. If your examples use LaTeX (e.g. via Rmd) it might be good to migrate to verse, otherwise, migrate to tidyverse for a more compact image (423 MB compressed)

Initial run of "docker run rocker/rstudio" results in a unique password error.

https://github.com/ropenscilabs/r-docker-tutorial/blob/a57625fdc111ba2c18311012324b214b82e18338/02-Launching-Docker.Rmd#L49

Repro:
launch docker
copy/paste command to terminal
docker downloads the appropriate image files
receive error

[cont-init.d] userconf: executing...

ERROR: You must set a unique PASSWORD (not 'rstudio') first! e.g. run with:
docker run -e PASSWORD=<YOUR_PASS> -p 8787:8787 rocker/rstudio

[cont-init.d] userconf: exited 1.
[cont-finish.d] executing container finish scripts...
[cont-finish.d] done.
[s6-finish] syncing disks.
[s6-finish] sending all processes the TERM signal.
[s6-finish] sending all processes the KILL signal and exiting.

Two suggestions:

  1. Change the text to instruct users to come up with a unique password for the command.
  2. Update line 49 to reflect the command supplied by the error message.

teaching tips

Most SWC/DC lessons come with a supplemental 'teaching tips' file, which provides guidance to the instructor on pitfalls to avoid, techniques for getting people involved, etc. Since no one has actually taught this yet, there might not be a ton to put here, but even just a stub will provide a target for people to push their tips at in future. For example:

  • section 1
    • lead a 5 min discussion to help students imagine where they'll use Docker. Places to guide the conversation towards if they're stuck: department clusters, cloud services, their new computer etc.
