ypriverol / containers-rules-manuscript

Recommendations to containerize your bioinformatics software

biocontainers docker rkt conda bioconda

containers-rules-manuscript's Introduction

Recommendations to package and containerize bioinformatics software

Description

This repository contains the manuscript entitled: Recommendations to package and containerize bioinformatics software. Our aim is to describe how the bioinformatics community can produce better software containers and packages to improve the reproducibility of original results.

Pre-print: Pending
Final Paper: Pending

Commenting and contributing

Feel free to comment on, fork, and send pull requests against the current version of the manuscript. If you want to discuss issues or topics around the manuscript, please use the issues and/or pull requests.

Feel free to browse through existing/past issues and, if one seems related, comment on it. If no existing issue seems appropriate, a new issue can be opened to discuss the suggestion. In particular, we would appreciate discussing more substantial changes (for example, suggestions of new rules) in a dedicated issue before sending a pull request.

If you are new to Git, read the manuscript or Quick Guidelines to Git and GitHub - your input would be most valuable.

If, based on your contribution, you would like to be added as a co-author, please open an issue and provide your name and affiliation and a short description of your contribution or a link to the relevant issue and pull request.

Conversion to AsciiDoc

Any modifications to the text should be made to the manuscript.adoc file. This file is then converted to pdf and doc files automatically using pandoc and included in the main tex file.

Build document

Please be sure you have installed Docker. Then you can run the following command:

bash build.sh

This creates a folder named manuscript-draft.
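
The Docker prerequisite can also be checked before the build starts. The following is only a sketch and is not part of the repository: the helper names require_cmd and build_manuscript are hypothetical, and it assumes build.sh sits in the current directory.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper, not part of this repository.
# It succeeds only if the named command can be found on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Run the documented build command only when Docker is available.
build_manuscript() {
    if ! require_cmd docker; then
        echo "Docker not found on PATH; please install it first." >&2
        return 1
    fi
    bash build.sh
}
```

Calling build_manuscript from the repository root then either runs the build or stops with a message on stderr.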

Disclaimer

The authors have no affiliation with Docker or Conda, nor any commercial entity mentioned in this article. The views described here reflect our own views without input from any third party organisation.

containers-rules-manuscript's People

pcm32 manabuishii rajido biomadeira vdda hroest susheel blankenberg timosachsenberg hmenager

containers-rules-manuscript's Issues

5. Eschew ENTRYPOINT

Do not impose hard paths

I would also add a note about not imposing hard-coded paths when using containerised software. I have seen many people expect data to be mapped to a specific directory in the container, e.g. /data or /inputs. This is bad practice because it limits the portability of the container and makes it hard to reuse. Along the same lines, a custom WORKDIR in the container definition should be avoided.

As general advice, container execution should be transparent: the containerised software should behave the same way whether or not it runs inside a container. In other words, the user should be able to use the software without having to know about the container.
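
One way to keep execution transparent is to mount the caller's current directory at the same path inside the container and make it the working directory. This is a minimal sketch, not a recommendation from the manuscript: run_here is a hypothetical helper, example/tool:1.0 is a placeholder image, and the helper only prints the command (it does not quote arguments that contain spaces).

```shell
#!/bin/sh
# Sketch: print a docker invocation that mounts the current directory at
# the same path inside the container and uses it as the working
# directory, so relative paths mean the same thing on both sides.
# run_here is a hypothetical helper name.
run_here() {
    image=$1
    shift
    printf 'docker run --rm -v %s:%s -w %s %s' "$PWD" "$PWD" "$PWD" "$image"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Placeholder image and tool names, for illustration only.
run_here example/tool:1.0 tool --in reads.fq --out result.txt
```

With this layout the image does not need to assume /data, /inputs, or any particular WORKDIR.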

Note about "One tool, one container"

Just some thoughts about the first point "One tool, one container".

While this is a good practice that maximises container usability and makes perfect sense in some scenarios, for example the BioContainers project, in other contexts it is a principle that should be relaxed, in my opinion. Real-world applications often need to compose many tools together; just think of bwa .. | samtools .. | etc.

In this scenario, having one tool per container is a limiting factor that prevents that idiom or, more generally, the use of multiple tools in the same task.

Facetious Issue: GDPR Compliance

Based on a recent EBI internal conversation about source code authorship in Git repositories: is our recommendation to add the maintainer label compliant with the new GDPR regulations? The data that would be stored are:

Full name
Email Address

More a question for the Bioinformatics container registries (bio.tools and biocontainers.pro)

It could be argued that there is a legitimate interest in asking for this information, for the sake of reproducibility. But will we need to comply with the right to be forgotten, which would mean deleting all maintainer labels of the user, even in downstream multi-stage builds?

I'm not assuming the right to be forgotten trumps legitimate interest, but do we have a legitimate reason for processing this information? We could argue that we can keep personal information indefinitely in the public interest, but we would need to define why it is in the public interest.

No Data

No data should be included in the container.

Switch to containers and not only docker

I have switched the topic to containers in general, not only Dockerfiles, because it makes a lot of sense to include other technologies for creating containers.

Upload your container image to a public registry or collection

A container should always be distributed along with the Dockerfile/recipe used to create it, for transparency and documentation purposes.

However, the availability of the Dockerfile does not guarantee the reproducibility of the container image and, consequently, of the associated data analyses. When a container image is re-created, one or more software packages may no longer be available.

To protect against software decay, upload your container images to a public registry such as DockerHub or Quay. Even better, use a community collection such as BioContainers, which manages the versioning and long-term archiving of container images.
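
Uploading usually comes down to tagging the local image with its registry name and pushing it. The sketch below assumes you are already logged in to the registry; publish_image and the quay.io/example namespace are hypothetical, and with DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Sketch: tag a locally built image for a registry and push it.
# With DRY_RUN=1 the docker commands are printed instead of executed.
# publish_image and the example names below are hypothetical.
publish_image() {
    local_image=$1      # e.g. comet:2016012
    remote_image=$2     # e.g. quay.io/<namespace>/comet:2016012
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker tag $local_image $remote_image"
        echo "docker push $remote_image"
    else
        docker tag "$local_image" "$remote_image" &&
            docker push "$remote_image"
    fi
}

# Show what would be run without touching the registry.
DRY_RUN=1 publish_image comet:2016012 quay.io/example/comet:2016012
```

Pushing an explicit version tag rather than a moving tag keeps the archived image tied to one tool version.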

One tool, one container

For maximum reusability and smaller containers.

Multi-stage builds

I think that multi-stage builds [1] make most of these recommendations redundant:
https://github.com/ypriverol/containers-rules-manuscript/blob/master/manuscript.adoc#6-reduce-the-size-of-your-container-as-much-as-possible

I consider some of these to be actually harmful: combining multiple RUN commands makes the files hard to read and debug. Since multi-stage builds are used specifically to address point 6, I think we should recommend them, or at least add them to the list of suggestions.

[1] https://docs.docker.com/develop/develop-images/multistage-build/#name-your-build-stages
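
To make the suggestion concrete, here is a minimal sketch of a two-stage Dockerfile, written out by a small shell helper. Everything specific in it is an assumption for illustration: the tool mytool, its source file mytool.c, the debian:bookworm-slim base, and the helper name write_multistage_dockerfile.

```shell
#!/bin/sh
# Sketch: write a two-stage Dockerfile for a hypothetical C tool "mytool".
# The builder stage carries the compiler; the final stage copies only the
# resulting binary, so build steps can stay as separate, readable RUN lines.
write_multistage_dockerfile() {
    cat > "$1" <<'EOF'
# Stage 1: build environment (compiler, headers)
FROM debian:bookworm-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc6-dev
COPY mytool.c /src/mytool.c
RUN gcc -O2 -o /usr/local/bin/mytool /src/mytool.c

# Stage 2: runtime image with only the compiled binary
FROM debian:bookworm-slim
COPY --from=builder /usr/local/bin/mytool /usr/local/bin/mytool
CMD ["mytool", "--help"]
EOF
}

# Write the example to a scratch location.
write_multistage_dockerfile "${TMPDIR:-/tmp}/Dockerfile.multistage"
```

Only the final stage ends up in the published image, so the size of the builder stage does not matter.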

4. For permissions management, should be runnable as any user within the container, not just root
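
A quick way to exercise this rule is to start the container under the caller's own UID and GID instead of root. This is a sketch only: run_as_me is a hypothetical helper that prints the command, and example/tool:1.0 is a placeholder image.

```shell
#!/bin/sh
# Sketch: compose a docker invocation that runs as the calling user's
# UID:GID instead of root. run_as_me is a hypothetical helper name.
run_as_me() {
    printf 'docker run --rm --user %s:%s' "$(id -u)" "$(id -g)"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Placeholder image and tool names, for illustration only.
run_as_me example/tool:1.0 tool --version
```

If the tool fails when started this way, the image probably depends on files or directories that only root can read or write.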

First version to be submitted

@rajido @susheel @pcm32 @prvst @hroest @bgruening @blankenberg @timosachsenberg @osallou @mr-c @biomadeira and all contributors. I have reviewed the current version of the manuscript and it looks almost ready for the first submission to F1000 to the ELIXIR channel. Please take a final look and let me know. If you are happy with the current version, give me a +1 on this issue.

Regards
Yasset

3. Permissible output directories: /tmp, $TMPDIR, the current working directory, and a user specifiable directory
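
A wrapper script inside a container could follow this rule roughly as sketched below. The helper names resolve_outdir and scratch_dir are hypothetical, and the policy shown is one reading of the rule: results go to a user-specified directory or else the current working directory, and scratch files go to $TMPDIR or else /tmp.

```shell
#!/bin/sh
# Sketch: choose where a containerised tool writes its files.
# resolve_outdir and scratch_dir are hypothetical helper names.

# Final results: a user-specified directory if given, else the current
# working directory.
resolve_outdir() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        pwd
    fi
}

# Scratch files: $TMPDIR if set and non-empty, else /tmp.
scratch_dir() {
    printf '%s\n' "${TMPDIR:-/tmp}"
}
```

For example, resolve_outdir /results prints /results, while calling it without arguments prints the current directory.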

Docker base images should be versioned too

Duplicate @bgruening's todo manuscript.adoc#L130

Need to update manuscript.adoc#L134 too.
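
Unversioned base images can be spotted mechanically. The following is a rough sketch, not an official linter: find_unpinned_bases is a hypothetical helper that reports FROM lines with no tag or with the latest tag. It does not track build-stage names, so a line such as FROM builder that refers to an earlier stage would also be reported.

```shell
#!/bin/sh
# Sketch: list FROM lines whose image has no tag or uses ":latest".
# Returns non-zero if any are found. find_unpinned_bases is a
# hypothetical helper and does not know about build-stage names.
find_unpinned_bases() {
    awk '
        BEGIN { bad = 0 }
        toupper($1) == "FROM" {
            img = $2
            if (img ~ /^--/) img = $3          # skip a --platform=... flag
            if (img == "scratch") next         # empty base, nothing to pin
            if (img ~ /@sha256:/) next         # pinned by digest
            n = split(img, parts, "/")         # tag lives in the last path part
            if (parts[n] !~ /:/ || parts[n] ~ /:latest$/) {
                print NR ": " img
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}
```

Running find_unpinned_bases Dockerfile prints the line number and image of each offending FROM line, and prints nothing when every base image is versioned.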

Add link to Dockerfile best practices

See https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Review other guidance

And be prepared to justify where we made different choices

Proposal Recommendation: Use ARG for buildtime and ENV for runtime environment variables

This would give a container flexibility at build time (ARG) and at runtime (ENV), whilst still being reproducible. Just a thought.
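
As a sketch of how the two instructions divide the work, the helper below writes a small Dockerfile fragment. TOOL_VERSION, TOOL_THREADS, the image name example/tool, and the helper name write_arg_env_dockerfile are all hypothetical; only the base image is taken from the examples in this repository.

```shell
#!/bin/sh
# Sketch: ARG is fixed when the image is built, ENV can still be
# overridden when a container starts. TOOL_VERSION, TOOL_THREADS and the
# image name are hypothetical.
write_arg_env_dockerfile() {
    cat > "$1" <<'EOF'
FROM biocontainers/biocontainers:v1.0.0
# Build-time only: not visible in the running container
ARG TOOL_VERSION="1.0.0"
# Recorded in the image so the chosen version stays traceable
LABEL software.version="${TOOL_VERSION}"
# Runtime default: can be overridden with `docker run -e`
ENV TOOL_THREADS="1"
EOF
}

# Build with a different version, then run with a different thread count:
#   docker build --build-arg TOOL_VERSION=1.2.0 -t example/tool:1.2.0 .
#   docker run --rm -e TOOL_THREADS=4 example/tool:1.2.0
```

Recording the ARG value in a LABEL is one way to keep the build reproducible even when the default is overridden.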

7. Make your binaries/scripts visible and easy to find

Suggestion to reorder recommendations

Just an enhancement suggestion: reorder the recommendations to follow the logical flow of a typical Dockerfile. The idea would be to have an example (Box) for each recommendation and a final example box, at the thirteenth recommendation (Provide reproducible builds), that brings every recommendation together.

New suggested order:

Choose base image wisely

FROM biocontainers/biocontainers:v1.0.0

Tool and container versions should be explicit

LABEL base_image="biocontainers:v1.0.0"
LABEL version="3"
LABEL software="Comet"
LABEL software.version="2016012"

[Proposal] Add appropriate LABELs to point to software documentation, keywords and tags (Can be merged in existing recommendations)

LABEL about.summary="an open source tandem mass spectrometry sequence database search tool"
LABEL about.home="http://comet-ms.sourceforge.net"
LABEL about.documentation="http://comet-ms.sourceforge.net/parameters/parameters_2016010"
LABEL extra.identifiers.biotools="comet"
LABEL about.tags="Proteomics"

Check the license of the software and add Maintainer information

LABEL about.license_file="http://comet-ms.sourceforge.net"
LABEL about.license="SPDX:Apache-2.0"
LABEL maintainer="Felipe da Veiga Leprevost <[email protected]>"

[Proposal] Use ARG for build-time and ENV for runtime environment variables (Can be merged in existing recommendations)

ARG COMMET_VERSION="2016012"
ENV PATH /home/biodocker/bin/Comet:$PATH

[Proposal] Add explicit WORKDIR (Can be merged in existing recommendations)

WORKDIR /data/

Reduce the size of your container as much as possible

RUN ZIP=comet_binaries_${COMMET_VERSION}.zip && \
  wget https://github.com/BioDocker/software-archive/releases/download/Comet/$ZIP -O /tmp/$ZIP && \
  unzip /tmp/$ZIP -d /home/biodocker/bin/Comet/ && \
  chmod -R 755 /home/biodocker/bin/Comet/* && \
  rm /tmp/$ZIP

Note the use of build-time ARGs in the RUN instruction.

Relevant tools and software should be executable and in the PATH

RUN mv /home/biodocker/bin/Comet/comet_binaries_${COMMET_VERSION}/comet.${COMMET_VERSION}.linux.exe /home/biodocker/bin/Comet/comet

Document the build files
TODO: Mention the possibility of interspersing the Dockerfile with # comments
Add functional testing logic
Avoid using ENTRYPOINT
Provide helpful usage message via CMD
Provide reproducible builds
Make your package or container discoverable

1. When possible: package software & use auto-containerization

Packaging options:

(Bio)Conda
Debian(Med)
Brew

2. Mandatory metadata: License, …

reference the biocontainers spec https://github.com/BioContainers/specs/blob/master/container-specs.md

Which fields are mandatory?