
Refine.bio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.

This README file is about building and running the refine.bio project source code.

If you're interested in simply using the service, you should go to the website or read the documentation.

Refine.bio currently has four sub-projects contained within this repo:

  • common Contains code needed by both foreman and workers.
  • foreman Discovers data to download/process and manages jobs.
  • workers Runs Downloader and Processor jobs.
  • infrastructure Manages infrastructure for Refine.bio.

Table of Contents

Development

Git Workflow

refinebio uses a feature branch based workflow. New features should be developed on new feature branches, and pull requests should be sent to the dev branch for code review. Merges into master happen at the end of sprints, and tags in master correspond to production releases.

Installation

To run Refine.bio locally, you will need to have the prerequisites installed onto your local machine. This will vary depending on whether you are developing on a Mac or a Linux machine. Linux instructions have been tested on Ubuntu 16.04 or later, but other Linux distributions should be able to run the necessary services. Microsoft Windows is currently unsupported by this project.

Note: The install_all.sh script will configure a git pre-commit hook to auto-format your python code. This will format your code in the same way as the rest of the project, allowing it to pass our linting check.

Automatic

The easiest way to run Refine.bio locally is to run ./scripts/install_all.sh to install all of the necessary dependencies. As long as you are using a recent version of Ubuntu or macOS it should work. If you are using another version of Linux it should still install most of the dependencies as long as you set the appropriate INSTALL_CMD environment variable, but some dependencies may be named differently in your package manager than in Ubuntu's.
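For example, on a non-Ubuntu distribution you might run something like the following (the INSTALL_CMD value shown here is only an illustration; use your own package manager's install command):

INSTALL_CMD='sudo dnf install -y' ./scripts/install_all.sh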

Linux (Manual)

The following services will need to be installed:

Instructions for installing Docker and Terraform can be found by following the link for each service. jq, iproute2, and shellcheck can be installed via sudo apt-get install jq iproute2 shellcheck.

Mac (Manual)

The following services will need to be installed:

Instructions for installing Docker and Homebrew can be found on their respective homepages.

Once Homebrew is installed, the other required applications can be installed by running: brew install iproute2mac terraform jq black shellcheck.

Many of the computational processes are very memory intensive. You will need to raise the amount of memory available to Docker from the default of 2GB to at least 12GB (24GB if possible).

Virtual Environment

Run ./scripts/create_virtualenv.sh to set up the virtualenv. It will activate the dr_env for you the first time. This virtualenv is valid for the entire refinebio repo. Sub-projects each have their own environments managed by their containers. When returning to this project you should run source dr_env/bin/activate to reactivate the virtualenv.
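For reference, the commands named above, in order:

./scripts/create_virtualenv.sh   # first time only; activates dr_env for you
source dr_env/bin/activate       # on subsequent sessions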

Services

refinebio also depends on Postgres, which can be run in a local Docker container.

Postgres

To start a local Postgres server in a Docker container, use:

./scripts/run_postgres.sh

Then, to initialize the database, run:

./scripts/install_db_docker.sh

If you need to access a psql shell for inspecting the database, you can use:

./scripts/run_psql_shell.sh

or, if you have psql installed locally, this command will give you a better shell experience:

source scripts/common.sh && PGPASSWORD=mysecretpassword psql -h $(get_docker_db_ip_address) -U postgres -d data_refinery

Common Dependencies

The common sub-project contains common code that the other sub-projects depend on, so before anything else you should prepare the distribution directory common/dist with this script:

./scripts/update_models.sh

(Note: This step requires the postgres container to be running and initialized.)

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have to manually change permissions on the volumes directory with sudo chmod -R 740 volumes_postgres and then re-run the migrations.
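If you hit that error, the recovery looks roughly like this (assuming, as the note above implies, that re-running ./scripts/update_models.sh re-runs the migrations):

sudo chmod -R 740 volumes_postgres
./scripts/update_models.sh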

ElasticSearch

One of the API endpoints is powered by ElasticSearch. ElasticSearch must be running for this functionality to work. A local ElasticSearch instance in a Docker container can be executed with:

./scripts/run_es.sh

And then the ES Indexes (akin to Postgres 'databases') can be created with:

./scripts/rebuild_es_index.sh

Testing

To run the entire test suite:

./scripts/run_all_tests.sh

(Note: Running all the tests can take some time, especially the first time because it downloads a lot of files.)

For more granular testing, you can just run the tests for specific parts of the system.

API

To just run the API tests:

./api/run_tests.sh

Common

To just run the common tests:

./common/run_tests.sh

Foreman

To just run the foreman tests:

./foreman/run_tests.sh

Workers

To just run the workers tests:

./workers/run_tests.sh

If you only want to run tests with a specific tag, you can do that too. For example, to run just the salmon tests:

./workers/run_tests.sh -t salmon

All of our worker tests are tagged, generally based on the Docker image required to run them. Possible values for worker test tags are:

  • affymetrix
  • agilent
  • downloaders
  • illumina
  • no_op
  • qn (short for quantile normalization)
  • salmon
  • smasher
  • transcriptome

Style

R files in this repo follow Google's R Style Guide. Python Files in this repo follow PEP 8. All files (including Python and R) have a line length limit of 100 characters.

In addition to following PEP 8, Python files must also conform to the formatting style enforced by black. black is a highly opinionated auto-formatter whose style is a strict subset of PEP 8. The easiest way to conform to this style is to run black . --line-length=100, which will auto-format your code. Running the ./scripts/install_all.sh script will install a pre-commit git hook that runs this formatter on every commit you make locally. Under the hood this uses pre-commit, which you can also install directly by running pip3 install pre-commit && pre-commit install. Then, if you want to run pre-commit without making a git commit, you can use pre-commit run --all-files. To install black, see its installation instructions. Any pull requests that do not conform to the style enforced by black will be flagged by our continuous integration and will not be accepted until that check passes.
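Putting that together, a typical formatting workflow using only the commands described above is:

pip3 install pre-commit && pre-commit install   # one-time setup (install_all.sh also does this)
black . --line-length=100                       # auto-format the code
pre-commit run --all-files                      # run the hooks without making a commit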

All user-facing scripts have been linted with shellcheck for common warnings and POSIX-correctness. If a script is user-facing, it should ideally be POSIX-compliant and have the extension .sh, but if bashisms are necessary it should have the extension .bash. To install shellcheck, you can run apt-get install shellcheck or brew install shellcheck. Then, you can lint scripts with shellcheck FILE.

Gotchas

During development, you may encounter some occasional strangeness. Here are some things to watch out for:

  • Since we use multiple Docker instances, don't forget to re-run ./scripts/update_models.sh.
  • If builds are failing, increase the size of Docker's memory allocation. (Mac only.)
  • If Docker images are failing mysteriously during creation, it may be the result of Docker's Docker.qcow2 or Docker.raw file filling up. You can prune old images with docker system prune -a.
  • If killed abruptly, the containerized Postgres instance can be left in an unrecoverable state. Annoying.

R

We have created some utilities to help keep R stable and reliable, and to prevent it from periodically causing build errors related to version incompatibilities. The primary goal of these is to pin the version of every R package that we have. The R package devtools is useful for this, but in order to be able to install a specific version of it, we've created the R script common/install_devtools.R.

There is another gotcha to be aware of should you ever need to modify versions of R or its packages. In Dockerfiles for images that need the R language, we install apt packages that look like r-base-core=3.4.2-1xenial1. It's unclear why the version string for these is so unusual, but it was determined by visiting the package list at https://cran.revolutionanalytics.com/bin/linux/ubuntu/xenial/. If it needs to be updated then a version should be selected from that list.

Additionally there are two apt packages, r-base and r-base-core, which seem to be very similar except that r-base-core is slimmed down by excluding some additional packages. For a while we were using r-base, but we switched to r-base-core when we pinned the version of the R language because the r-base package caused an apt error.
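As a rough illustration (the exact version string should be taken from the package list linked above rather than copied from here), the pinned install inside a Dockerfile looks something like:

apt-get update && apt-get install -y r-base-core=3.4.2-1xenial1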

Running Locally

Once you've built the common/dist directory and have the Postgres service running, you're ready to run jobs. To run the API you also need the ElasticSearch service running.

There are three kinds of jobs within Refine.bio.

API

The API can be run with:

./api/serve.sh

Surveyor Jobs

Surveyor Jobs discover samples to download/process and record metadata about them. A Surveyor Job should queue Downloader Jobs to download the data it discovers. However, at the moment there is no automated way for the Downloader Jobs to be run. This will be resolved ASAP; see #2775 for more information.

The Surveyor can be run with the ./foreman/run_management_command.sh script. The first argument to this script is the type of Surveyor Job to run, which will always be survey_all.

Details on these expected arguments can be viewed by running:

./foreman/run_management_command.sh survey_all -h

The Surveyor can accept a single accession code from any of the source data repositories (e.g., Sequencing Read Archive, ArrayExpress, Gene Expression Omnibus):

./foreman/run_management_command.sh survey_all --accession <ACCESSION_CODE>

Example for a GEO experiment:

./foreman/run_management_command.sh survey_all --accession GSE85217

Example for an ArrayExpress experiment:

./foreman/run_management_command.sh survey_all --accession E-MTAB-3050 # AFFY
./foreman/run_management_command.sh survey_all --accession E-GEOD-3303 # NO_OP

Transcriptome indices are a bit special. For species within the "main" Ensembl division, the species name can be provided like so:

./foreman/run_management_command.sh survey_all --accession "Homo sapiens"

However for species that are in other divisions, the division must follow the species name after a comma like so:

./foreman/run_management_command.sh survey_all --accession "Caenorhabditis elegans, EnsemblMetazoa"

The possible divisions that can be specified are:

  • Ensembl (this is the "main" division and is the default)
  • EnsemblPlants
  • EnsemblFungi
  • EnsemblBacteria
  • EnsemblProtists
  • EnsemblMetazoa

If you are unsure which division a species falls into, unfortunately the only way to tell is to check ensembl.com. (Although searching for the species name plus "ensembl" may work pretty well.)

You can also supply a newline-delimited file to survey_all, which will dispatch survey jobs based on accession codes, like so:

./foreman/run_management_command.sh survey_all --file MY_BIG_LIST_OF_CODES.txt
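A hypothetical MY_BIG_LIST_OF_CODES.txt contains one accession code per line, for example:

GSE85217
E-MTAB-3050
ERP006872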

The main foreman job loop can be started with:

./foreman/run_management_command.sh retry_jobs

This must actually be running for jobs to move forward through the pipeline.

Sequence Read Archive

When surveying SRA, you can supply either run accession codes (e.g., codes beginning in SRR, DRR, or ERR) or study accession codes (SRP, DRP, ERP).

Run example (single read):

./foreman/run_management_command.sh survey_all --accession DRR002116

Run example (paired read):

./foreman/run_management_command.sh survey_all --accession SRR6718414

Study example:

./foreman/run_management_command.sh survey_all --accession ERP006872

Ensembl Transcriptome Indices

Building transcriptome indices used for quantifying RNA-seq data requires us to retrieve genome information from Ensembl. The Surveyor expects a species' scientific name in the main Ensembl division as the accession:

./foreman/run_management_command.sh survey_all --accession "Homo sapiens"

See the Ensembl Transcriptome Index section for additional usage examples, including surveying additional Ensembl divisions.

Downloader Jobs

Downloader Jobs will be queued automatically when Surveyor Jobs discover new samples. However, if you just want to queue a Downloader Job yourself rather than having the Surveyor do it for you, you can use the ./workers/run_job.sh script:

./workers/run_job.sh run_downloader_job --job-name=<EXTERNAL_SOURCE> --job-id=<JOB_ID>

For example:

./workers/run_job.sh run_downloader_job --job-name=SRA --job-id=12345

or

./workers/run_job.sh run_downloader_job --job-name=ARRAY_EXPRESS --job-id=1

Or for more information run:

./workers/run_job.sh -h

Processor Jobs

Processor Jobs will be queued automatically by successful Downloader Jobs. However, if you just want to run a Processor Job yourself without having a Downloader Job queue it for you, the following command will do so:

./workers/run_job.sh -i <IMAGE_NAME> run_processor_job --job-name=<JOB_NAME> --job-id=<JOB_ID>

For example:

./workers/run_job.sh -i affymetrix run_processor_job --job-name=AFFY_TO_PCL --job-id=54321

or

./workers/run_job.sh -i no_op run_processor_job --job-name=NO_OP --job-id=1

or

./workers/run_job.sh -i salmon run_processor_job --job-name=SALMON --job-id=1

or

./workers/run_job.sh -i transcriptome run_processor_job --job-name=TRANSCRIPTOME_INDEX_LONG --job-id=1

Or for more information run:

./workers/run_job.sh -h

Creating Quantile Normalization Reference Targets

If you want to quantile normalize combined outputs, you'll first need to create a reference target for a given organism or organisms. This can be done in a production environment by running the following on the Foreman instance:

./run_management_command.sh dispatch_qn_jobs --organisms=DANIO_RERIO,HOMO_SAPIENS

To create QN targets for all organisms with enough processed samples:

./run_management_command.sh dispatch_qn_jobs

This will at some point move to the foreman and then it will take a list of organisms to create QN targets for.

Creating Compendia

Creating species-wide compendia for a given species can be done in a production environment by running the following on the Foreman instance:

./run_management_command.sh create_compendia --organisms=DANIO_RERIO --svd-algorithm=ARPACK

or for a list of organisms:

./run_management_command.sh create_compendia --organisms=DANIO_RERIO,HOMO_SAPIENS --svd-algorithm=ARPACK

or for all organisms with sufficient data:

./run_management_command.sh create_compendia --svd-algorithm=ARPACK

Alternatively, a compendium can be created which only includes quant.sf files by using the create_quantpendia command:

./run_management_command.sh create_quantpendia --organisms=DANIO_RERIO

Compendia jobs run on the smasher instance; however, they require a very large amount of RAM to complete. Our smasher instance does not generally have enough RAM to run them, so if you need to run a compendia job you should temporarily increase the size of the smasher instance. This can be done by changing the terraform variable smasher_instance_type, which can be found in infrastructure/variables.tf. Select an AWS instance type that has enough RAM to run the compendia jobs. At the time of writing, compendia jobs require 180GB of RAM, and m5.12xlarge has 192GB of RAM so it is sufficiently large to run them.

Running Tximport Early

Normally we wait until every sample in an experiment has had Salmon run on it before we run Tximport. However, Salmon won't work on every sample, so some experiments are doomed to never reach 100% completion. Tximport can be run on such experiments by running the following on the Foreman instance:

To run tximport on all eligible experiments:

./run_management_command.sh run_tximport

To run tximport on a single experiment if it is eligible:

./run_management_command.sh run_tximport --accession-codes=SRP095529

To run tximport on the eligible experiments in a list:

./run_management_command.sh run_tximport --accession-codes=SRP095529,ERP006872

Note that if the experiment does not have at least 25 samples with at least 80% of them processed, this will do nothing.

Development Helpers

It can be useful to have an interactive Python interpreter running within the context of the Docker container. The scripts/run_shell.sh script has been provided for this purpose. It is in the top level directory so that if you wish to reference it in any integrations its location will be constant. However, it is configured by default for the Foreman project. The interpreter will have all the environment variables, dependencies, and Django configurations for the Foreman project. There are instructions within the script describing how to change this to another project.

Cloud Deployment

Refine.bio requires an active, credentialed AWS account with appropriate permissions to create network infrastructure, users, compute instances and databases.

Deploys are automated to run via CircleCI whenever a signed tag starting with a v is pushed to either the dev or master branches (v as in version, i.e. v1.0.0). Tags intended to trigger a staging deploy MUST end with -dev, i.e. v1.0.0-dev. CircleCI runs a deploy on a dedicated AWS instance so that the Docker cache can be preserved between runs.

Instructions for setting up that instance can be found in the infrastructure/deploy_box_instance_data.sh script.

To trigger a new deploy, first see what tags already exist with git tag --list | sort --version-sort. We have two different version counters, one for dev and one for master, so the list will include entries like:

  • v1.1.2
  • v1.1.2-dev
  • v1.1.3
  • v1.1.3-dev

You may see that the dev counter is way ahead, because we often need more than one staging deploy to be ready for a production deploy. This is okay; just find the latest version of the type you want to deploy and increment it to get your version. For example, if you wanted to deploy to staging and the above versions were the largest that git tag --list output, you would increment v1.1.3-dev to get v1.1.4-dev.

Once you know which version you want to deploy, say v1.1.4-dev, you can trigger the deploy with these commands:

git checkout dev
git pull origin dev
git tag -s v1.1.4-dev
git push origin v1.1.4-dev

git tag -s v1.1.4-dev will prompt you to write a tag message; please try to make it descriptive.

We use semantic versioning for this project, so the last number should correspond to bug fixes and patches, the middle number to minor changes that don't break backwards compatibility, and the first number to major changes that do break backwards compatibility. Please try to keep the dev and master versions in sync for major and minor versions so only the patch version gets out of sync between the two.

Docker Images

Refine.bio uses a number of different Docker images to run different pieces of the system. By default, refine.bio will pull images from the Dockerhub repo ccdlstaging. If you would like to use images you have built and pushed to Dockerhub yourself you can pass the -r option to the deploy.sh script.

To make building and pushing your own images easier, the scripts/update_docker_images.sh script has been provided. The -r option allows you to specify which repo you'd like to push to. If the Dockerhub repo requires you to be logged in, do so before running the script using docker login. The -v option allows you to specify the version, which will both be set on the Docker images you build as the SYSTEM_VERSION environment variable and be used as the Docker tag for the images.

scripts/update_docker_images.sh will not build the dr_affymetrix image, because this image requires a lot of resources and time to build. It can instead be built with ./scripts/prepare_image.sh -i affymetrix -r <YOUR_DOCKERHUB_REPO>. WARNING: The affymetrix image installs a lot of data-as-R-packages and needs a lot of disk space to build the image. It's not recommended to build the image with less than 75GB of free space on the disk that Docker runs on.
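For example, a full build-and-push session might look like this, where my-dockerhub-repo and v1.0.0 are placeholders for your own repo and version:

docker login
./scripts/update_docker_images.sh -r my-dockerhub-repo -v v1.0.0
./scripts/prepare_image.sh -i affymetrix -r my-dockerhub-repo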

Terraform

There are a few extra things that you need to install before deploying the stack:

  • Terraform, if you haven't already installed it
  • awscli, for interacting with AWS
  • boto3, which is necessary for some of our deployment scripts
  • PostgreSQL, which is necessary for some of our deployment scripts

The easiest way to install Terraform is by running ./scripts/install_all.sh, or you can install it manually by following the directions on the website. We currently use version 0.13.5.

For awscli and boto3, you need to install them using pip3 install awscli boto3. Ubuntu's repositories contain outdated versions of both packages which do not work with our deploy script.

Postgres can be installed using either apt install postgresql-client or brew install postgresql as appropriate.

Once you have all of the dependencies installed, you're almost ready to deploy a dev stack. The only thing remaining is to make sure that you can authenticate properly. To authenticate awscli, run aws configure and follow the directions. For ssh access to the servers, which is used during the deploy, copy the RefinebioSSHKey from LastPass and save it to the file infrastructure/data-refinery-key.pem. If you do not have access to this key in LastPass, ask another developer.
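In other words, the one-time credential setup looks roughly like this (the chmod step is an assumption on our part; ssh generally refuses keys with overly open permissions):

pip3 install awscli boto3
aws configure   # enter your AWS credentials when prompted
# Save the RefinebioSSHKey from LastPass to infrastructure/data-refinery-key.pem, then:
chmod 600 infrastructure/data-refinery-key.pem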

The correct way to deploy to the cloud is by running the deploy.sh script. This script will perform additional configuration steps, such as setting environment variables, setting up Batch job specifications, and performing database migrations. It can be used from the infrastructure directory like so:

./deploy.sh -u myusername -e dev -d us-east-1 -v v1.0.0 -r my-dockerhub-repo

This will spin up the whole system. It will usually take about 15 minutes, most of which is spent waiting for the Postgres instance to start. The command above would spin up a development stack in the us-east-1 region where all the resources' names would end with -myusername-dev. All of the images used in that stack would come from my-dockerhub-repo and would be tagged with v1.0.0.

The -e option specifies the environment you would like to spin up. You may specify dev, staging, or prod. dev is meant for individuals to test infrastructure changes or to run large tests. staging is for testing the overall system before re-deploying to prod.

To see what's been created at any time, you can:

terraform state list

If you want to change a single entity in the state, you can use

terraform taint <your-entity-from-state-list>

And then rerun deploy.sh with the same parameters you originally ran it with.

AWS Batch

refine.bio relies on AWS Batch as its job queue and uses it to provision instances. AWS Batch has three primary components:

  • Compute Environments: These are what provision EC2 instances for refine.bio. In this project each Compute Environment can have either one or zero instances. The goal is to have jobs that run in the same Compute Environment run on the same instance, so that data stored on the local disk by a Downloader Job will be available to the Processor Job. Only allowing a maximum of one instance per Compute Environment almost ensures this; however, it is possible for an instance to be cycled in between jobs, so sometimes the Downloader Job has to be rerun.
  • Job Queues: These are what track the jobs submitted to AWS Batch and assign them to compute environments. In refine.bio each Job Queue uses a single compute environment, so if two jobs are placed in the same job queue they will be run in the same Compute Environment.
  • Job Definitions: These are what specify the configuration to be used for each job type including what Docker Image will be used, what environment variables will be passed to it, what secrets it can access, and how many vCPUs and RAM it requires.

refine.bio uses three types of job queues:

  • Compendia Job Queue: This job queue is for running very large compendia-building jobs that require a large instance. The Compute Environment assigned to this queue is configured to provision very large instances.
  • Smasher Job Queue: This job queue is used for running smashing jobs. Having a dedicated queue for smasher jobs is useful because it ensures they won't be blocked by processing jobs, and the instance provisioned by its Compute Environment has enough resources to run one of these jobs at a time and no more.
  • Worker Job Queues: This is the only type of job queue with multiple instances. These do the general processing, so if there is a sufficient volume of work to necessitate more than one instance, the Foreman will distribute jobs to more and more queues until all the queues are in use. The lowest-index queue will be assigned Surveyor and Downloader Jobs if it has capacity for them; if not, the next-lowest-index queue with capacity will be chosen. Processor Jobs will always be assigned to the same job queue that ran their Downloader Job.

Running Jobs

Jobs can be submitted by running the following commands on the Foreman instance.

To start a job for a single accession code:

./run_management_command.sh survey_all --accession E-GEOD-3303

You can also supply a newline-delimited file which resides in S3 to surveyor_dispatcher, which will dispatch survey jobs based on accession codes like so:

./run_management_command.sh surveyor_dispatcher --file s3://data-refinery-test-assets/MY_BIG_LIST_OF_CODES.txt

The surveyor_dispatcher command will submit SurveyJobs to the AWS Batch queue, so it's more appropriate for running a large number of survey jobs in production.

See the Running Locally section for additional examples of survey_all usage.

Note that there is a run_management_command.sh included in the foreman directory that is completely different from the one that is created on the Foreman instance. The two scripts share a name so that the commands work in either place.

Log Consumption

All of the different Refine.bio subservices log to the same AWS CloudWatch Log Group. If you want to consume these logs, you can use the awslogs tool, which can be installed from pip like so:

pip install awslogs

or, for OSX El Capitan:

pip install awslogs --ignore-installed six

Once awslogs is installed, you can find your log group with:

awslogs groups

Then, to see all of the logs in that group for the past day, watching as they come in:

awslogs get <your-log-group> ALL --start='1 days' --watch

You can also apply a filter on these logs like so:

awslogs get <your-log-group> ALL --start='1 days' --watch --filter-pattern="DEBUG"

Or, look at a named log stream (with or without a wildcard). For instance (unfortunately this feature seems to be broken at the moment: jorgebastida/awslogs#158):

awslogs get data-refinery-log-group-myusername-dev log-stream-api-nginx-access-* --watch

will show all of the API access logs made by Nginx.

Dumping and Restoring Database Backups

Snapshots are created automatically by RDS. Manual database dumps can be created by privileged users with the following instructions. The Postgres version on the host (I suggest the PGBouncer instance) must match the RDS instance version:

sudo add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main"
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.6

Archival dumps can also be provided upon request.

Dumps can be restored locally by copying the backup.sql file to the volumes_postgres directory, then executing:

docker exec -it drdb /bin/bash
psql --user postgres -d data_refinery -f /var/lib/postgresql/data/backup.sql

This can take a long time (>30 minutes)!

Tearing Down

A stack that has been spun up via deploy.sh -u myusername -e dev can be taken down with destroy_terraform.sh -u myusername -e dev -d us-east-1. The same username and environment must be passed into destroy_terraform.sh as were used to run deploy.sh either via the -e and -u options or by specifying TF_VAR_stage or TF_VAR_user so that the script knows which to take down. Note that this will prompt you for confirmation before actually destroying all of your cloud resources.
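For example, to tear down the stack created by the deploy.sh example above:

./destroy_terraform.sh -u myusername -e dev -d us-east-1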

Support

Refine.bio is supported by Alex's Lemonade Stand Foundation, with some initial development supported by the Gordon and Betty Moore Foundation via GBMF 4552 to Casey Greene.

Meta-README

The table of contents for this README is generated using doctoc. doctoc can be installed with sudo npm install -g doctoc. Once doctoc is installed, the table of contents can be re-generated with doctoc README.md.
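That is:

sudo npm install -g doctoc
doctoc README.md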

License

BSD 3-Clause License.


Contributors

arielsvn, arkid15r, bew111, bug-sam, cgreene, davidsmejia, dependabot[bot], dongbohu, dvenprasad, jaclyn-taroni, kurtwheeler, miserlou, p-, ramenhog, tosemml, willv19, wvauclain


refinebio's Issues

Investigate batchless DownloaderJobs

There are records in the downloader_jobs table which do not have an entry in the downloader_jobs_batches table. I don't understand how this is possible, but I should see if I can find a way to replicate it so I can prevent it.

Figure out spot instance nuances

Context

Spot instances cost a fraction of the normal price for AWS instances. The downside is that they can be preempted if the demand for AWS instances becomes too high. However this is not a big deal for our system because we don't need to have data processed ASAP, it's worth the delay to save money.

Problem or idea

We should use spot instances instead of normal AWS instances.

Solution or next step

Switch the instances used for Nomad clients to spot instances. The best way to do this will probably require a bit of research to determine, but there are some blog posts and there is also a specific section in the terraform docs about it.

The one thing that is known is that we definitely should use the auto-scaling group created in #61.

Create auto-scaling group based on cloudwatch metric

Context

Our system has a lot of work to do. We will probably not do it all in a single shot, but rather slowly expand as we build out more surveyors, downloaders, and processors. Therefore the size of our cluster will need to be elastic so it can scale to the size needed to keep up with the work we want it to do. Additionally, it's unclear exactly how many nodes we'll need in the cluster to handle all the jobs we'll be queuing at any given point in time.

Problem or idea

We should use an auto-scaling group to scale based off of the depth of the work queue. This will dovetail nicely with our planned usage of spot instances (#62) because an auto-scaling group is the recommended way to manage them via terraform anyway.

Solution or next step

Use terraform to create an auto-scaling group based off of the work queue depth metric created by #60. While we will eventually want that to be using spot instances, there may be additional concerns and/or gotchas associated with using them. Therefore to keep the size of the PR addressing this issue down spot instances do not need to be included in it.

Remove `boto` dependency

Currently there are dependencies on both boto and boto3. I think the dependency on boto is due to a now-out-of-date version of Celery. However Celery will be removed from the project when switching to Nomad, so as part of doing that I can resolve this issue rather than upgrading Celery just to rip it out shortly after.

Rebuild workers docker image

Currently the Dockerfile has some things out of order because I didn't want to trigger a full rebuild while I was developing it since it takes over an hour. At some point I should do that (like at the end of a day or something).

Change `size_in_bytes` to `raw_size_in_bytes` and `processed_size_in_bytes`

Currently there is just one size field for each file and we only store the raw size. This means we won't know the size of the processed data.

Alternatively this could be addressed by changing the way files are used so that a new File object is created and saved once it is generated via processing. File objects could then be immutable. This would clarify a lot of the operations that happen, and is probably actually the correct way to remedy this.

Move file management code into Batch class

I'm 90% sure this is something I want to do. Very low priority though because it doesn't strictly matter where the code is, but I think it might make more sense from an organizational standpoint.

Switch index-refinery from storing .tar.gz files to .tar.xz files

The Homo_sapiens_short.tar.gz file is 2.5 GB, whereas the same file compressed using xz is less than one GB. As these files don't actually feed into salmon in the gzipped format there's not much reason to use gz instead of xz other than the fact that gz may be easier for other users of the Index Refinery to use. However @jaclyn-taroni doesn't think that it would be a significant factor so xz seems superior.

Figure out how to determine work queue depth

Context

We're using Nomad not just as an orchestration tool, but also as a way of managing our work queue.

Problem or idea

We'll need a way to see what the work queue depth is so that we can build alerts if it starts to get too large and so that we can scale our cluster up or down as needed. This will be necessary for #60 and #61.

Solution or next step

Determine how to calculate the Nomad work queue depth.

Get off of development version of rpy2

For some reason the rpy2 dockerfile installs the dev version of rpy2. Doing the same thing works for me, but trying to install a fixed version of it via pip didn't... I'm not sure why, but this should be investigated and resolved.

Make version of R packages explicit in R_dependencies.R

I meant to do this when setting up the workers' Docker image, but I guess I got so happy everything actually worked that I forgot to do so. It doesn't necessarily seem easy, but @jaclyn-taroni has helpfully provided the following R code which allows brainarray packages' versions to be specified:

InstallBrainarray <- function(platform, org.code, ba.version) {
    # This function makes use of devtools::install_url to install Brainarray
    # packages for the annotation of Affymetrix data. Specifically, the packages
    # required for use with SCAN.UPC and affy (RMA) are installed.
    #
    # Args:
    #   platform: The Affymetrix platform for which brainarray
    #             packages are to be installed (e.g., "hgu133plus2")
    #   org.code: Two letter organism code -- human would be "hs"
    #   ba.version: What version of brainarray should be used? (e.g., "21.0.0")
    #
    # Returns:
    #   NULL - this function completes installation of these packages and does not
    #          return any values

    # make sure platform and org.code are all lowercase and lack punctuation
    platform <- tolower(gsub("[[:punct:]]", "", platform))
    org.code <- tolower(gsub("[[:punct:]]", "", org.code))

    # probe version for use with SCAN.UPC
    probe.pkg.name <- paste0(platform, org.code, "entrezgprobe_",
                             ba.version, ".tar.gz")
    probe.url <- paste0("http://mbni.org/customcdf/21.0.0/entrezg.download/",
                        probe.pkg.name)
    devtools::install_url(probe.url)

    # cdf version for use with affy::RMA
    cdf.pkg.name <- paste0(platform, org.code, "entrezgcdf_",
                           ba.version, ".tar.gz")
    cdf.url <- paste0("http://mbni.org/customcdf/21.0.0/entrezg.download/",
                      cdf.pkg.name)
    devtools::install_url(cdf.url)
}

#### install brainarray main ---------------------------------------------------

# HGU133Plus2
InstallBrainarray(platform = "hgu133plus2",
                  org.code = "hs",
                  ba.version = "21.0.0")

Benchmark the best way to run salmon on interleaved paired read files.

When downloading fastq files from NCBI that are paired end reads, the files get interleaved into one .fastq.gz file. Salmon expects two separate files. @rob-p has written a bash script which will split an interleaved fastq file into two streams and then feed the streams into salmon, which can be found here. There is an alternative method though which would use python's gzip library to split the gzip file without gunzipping it. Which will be faster is not clear, so some benchmarking is in order. The benchmarking should be done on an AWS instance to best match the environment in which the production code will be run, because apparently the speed of the HDD can impact Salmon's performance on .fastq vs .fastq.gz files.

The code for the alternative python method is currently only at a POC level and has not been committed to any repo, so it is included here:

import gzip
import re

# Split an interleaved fastq.gz into separate read_1/read_2 files without
# fully gunzipping it. Read headers look like "<X>RR<run>.<spot>.1" or ".2".
with gzip.open("sra_data.fastq.gz", "r") as interleaved:
    with gzip.open("read_1.fastq.gz", "w") as out_1:
        with gzip.open("read_2.fastq.gz", "w") as out_2:
            for line in interleaved:
                line = str(line, "utf-8")
                if re.match(r".*RR\d+\.\d+\.1.*", line) is not None:
                    out_1.write(bytes(line, "utf-8"))
                    out_1.write(interleaved.readline())
                elif re.match(r".*RR\d+\.\d+\.2.*", line) is not None:
                    out_2.write(bytes(line, "utf-8"))
                    out_2.write(interleaved.readline())
                else:
                    print(line)
                    print("AAAAAAAHHHHHH NO MATCHES")
                    exit()

Remove temp dir in `utils.end_job`.

Currently this is done by everything that can fail, but as a rule of thumb we should always do this for every completed job, whether it is successful or not, so this should be put in that util function.

Merge data_models and common projects

I have found myself wanting to include common in data_models, and data_models is already in common. They do seem to serve a similar purpose, which is to be common code included in other data_refinery projects.

I could potentially just move the file_management namespace into the Batch object, as that is the only namespace in common which uses any data_models. In fact #30 already suggests doing so. However, this would only be a short-term fix if any other code is written for common which relies on data_models, and an argument could potentially be made that any common code relying on data_models should in fact live in that project.

I should either address this issue or #30, but not both (at least for the time being).

Fix multiple platform downloader jobs

DownloaderJobs dealing with batches that have varied platforms currently fail to download because the path generation code gets tripped up. This is caused because platform name is used as part of the URL for batches.

Create common util module.

So far the only thing that should definitely be common is get_env_variable, however there will surely be more throughout the lifespan of the project.

Push work queue depth metric into cloudwatch

Context

We're using Nomad not just as an orchestration tool, but also as a way of managing our work queue.

Problem or idea

We'll need a way to see what the work queue depth is so that we can build alerts if it starts to get too large and so that we can scale our cluster up or down as needed. This will be necessary for #61.

Solution or next step

It's possible there's a better solution, but one possible solution is to create a cron job on the lead Nomad server to calculate the work queue depth using the one liner provided by #59 and push it to cloudwatch. This will both allow us to see a graph of that metric over time and also use that metric for the autoscaling group in #61.

Codify processor RAM usage into nomad job specs

Context

#55 creates different Nomad job specifications for different processor job types. One benefit of this is that we can specify the resource requirements (probably just RAM/CPU) for each job type so that Nomad can schedule the work in a (hopefully) intelligent way.

Problem or idea

Assuming this works well, we'll want to do this for all job types. However we should make sure that Nomad does in fact do a good job of scheduling work before we do this for everything. Therefore we should start with just one, so why not SCAN.UPC?

Solution or next step

We need to test out SCAN.UPC on a variety of file sizes and see what a reasonable upper bound for RAM and CPU is. Once we've determined these, they should be encoded into the Nomad job specification for SCAN.UPC jobs created in #55.

Split workers image into job-specific images and configure Nomad to use the correct one for each job.

Context

As we continue to create different Processors the ccdl/dr_workers Docker image will continue to grow to contain all the dependencies and (at least some of the data) the different jobs need. This makes for a very bulky image that takes a long time to build, upload, and download along with requiring a non-trivial amount of disk space to store.

Problem or idea

We should have a Docker image for every type of job we have. We should also have different Nomad job specs for each type of job, rather than having a single processor.nomad job spec for all of them.

Solution or next step

Split the workers Docker image into separate images for each job type and create Nomad job specifications for each job type. Note that some jobs may still be able to share an image; for example, the SCAN.UPC image should be usable by both Affymetrix- and Illumina-specific job types.

Finally, change the data_refinery_common.message_queue.send_job function to specify job types directly to Nomad rather than specifying the job type via the job parameters.

Run postgres within a container during development

I originally chose not to do this because I wanted to match production more closely, but I have since been convinced that the difference isn't really meaningful. The benefits to this are that new developers will have an easier time getting up and running with the Data Refinery. It will also be a good intro project for @dongbohu.

Implement tximport for RNA-seq pipeline

Use tximport along with the gene-to-transcript mapping already contained within the processor to implement this part of the salmon pipeline:

gene_level_counts

This code does essentially what we need, we just need to include this into the Data Refinery's salmon pipeline: https://github.com/jaclyn-taroni/ref-txome/blob/79e2f64ffe6a71c5103a150bd3159efb784cddeb/4-athaliana_tximport.R
(Note that this script contains a link to a tutorial.)

Note that this should be done on a per-experiment basis, rather than a per-sample basis.

Create additional fields on jobs tables for each project's version number.

Currently there is a single field for the version of the data refinery, however it would probably be better to have a different field for each sub-project so they can increase their versions independently.

Also it would be nice to add the git commit hash into docker images so that it can be added to jobs as an additional field as well.
