alexslemonade / refinebio Goto Github PK

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.

Home Page: https://www.refine.bio/

License: Other

Python 80.98% Shell 5.65% HCL 3.78% R 3.77% Smarty 0.08% HTML 5.66% jq 0.04% Awk 0.01% Rez 0.04%

refinebio's Issues

Download both raw and processed data for types we don't know how to handle

If we don't know how to process a given raw file, then we should download it and any available processed file for the sample.

Create interface via which jobs can be submitted to Nomad

Utilize the interface in place of sending celery messages

Salmon -- Alignment-based method and feature counts -- STAR

TBD whether STAR or Hisat2 will be more appropriate. Will require coordination with @jaclyn-taroni and the Patro lab to clarify which to use and how to do so.

Terraform configuration for 3 Nomad servers

Reduce output from R warnings

rpy2 is very very chatty. It'd be good to reduce the amount of log statements it generates.

Create a logging context which has as much information as possible

Every log message should contain the following if they are applicable:

Batch ID
Downloader/Processor Job ID
Instance ID/IP

These Should be in a standard format at the beginning of each log message to make it easy to search for logs.

Add convenience to creating a downloader job for a single batch.

As per @mhuyck's comment here the DownloaderJob model can take a batch as an optional argument and add it to the DownloaderJobsToBatches table.

Clean up top level requirements.txt file

I think there's a lot of stuff which was needed before but no longer is.

Logging/monitoring for Nomad servers

Re-test SCAN.UPC RAM usage with ensembl

After switching to using ensembl instead of brainarray we should retest the RAM usage so we can create a Nomad job specification for SCAN.UPC jobs.

Startup scripts for the Nomad clients

Benchmark the best way to run salmon on interleaved paired read files.

When downloading fastq files from NCBI that are paired end reads, the files get interleaved into one .fastq.gz file. Salmon expects two separate files. @rob-p has written a bash script which will split an interleaved fastq file into two streams and then feed the streams into salmon, which can be found here. There is an alternative method though which would use python's gzip library to split the gzip file without gunzipping it. Which will be faster is not clear, so some benchmarking is in order. The benchmarking should be done on an AWS instance to best match the environment in which the production code will be run, because apparently the speed of the HDD can impact Salmon's performance on .fastq vs .fastq.gz files.

The code for the alternative python method is currently only at a POC level and has not been committed to any repo, so it is included here:

import gzip
import re

with gzip.open("sra_data.fastq.gz", "r") as interleaved:
    with gzip.open("read_1.fastq.gz", "w") as out_1:
        with gzip.open("read_2.fastq.gz", "w") as out_2:
            for line in interleaved:
                line = str(line, "utf-8")
                if re.match(".*RR\d+\.\d+\.1.*", line) is not None:
                    out_1.write(bytes(line, "utf-8"))
                    out_1.write(interleaved.readline())
                elif re.match(".*RR\d+\.\d+\.2.*", line) is not None:
                    out_2.write(bytes(line, "utf-8"))
                    out_2.write(interleaved.readline())
                else:
                    print(line)
                    print("AAAAAAAHHHHHH NO MATCHES")
                    exit()

Create a script that can be used to run a Nomad job.

File extensions aren't treated in a case insensitive manner

This causes a abc.cel file to not be treated as a .CEL type file. Fixing this would fix some edge cases.

Codify processor RAM usage into nomad job specs

Context

#55 creates different Nomad job specifications for different processor job types. One benefit of this is that we can specify the resource requirements (probably just RAM/CPU) for each job type so that Nomad can schedule the work in a (hopefully) intelligent way.

Problem or idea

Assuming this works well, we'll want to do this for all job types. However we should make sure that Nomad does in fact do a good job of scheduling work before we do this for everything. Therefore we should start with just one, so why not SCAN.UPC?

Solution or next step

We need to test out SCAN.UPC on a variety of file sizes and see what a reasonable upper bound for RAM and CPU is. Once we've determined these, they should be encoded into the Nomad job specification for SCAN.UPC jobs created in #55.

Create a script that can be used to run a Nomad job.

Fix multiple platform downloader jobs

DownloaderJobs dealing with batches that have varied platforms currently fail to download because the path generation code gets tripped up. This is caused because platform name is used as part of the URL for batches.

Make create_job_and_relationships happen within a transaction.

I have found myself always wanting to call it within a transaction to make sure everything gets created together. I should just do it in that function itself.

Get off of development version of rpy2

For some reason the rpy2 dockerfile installs the dev version of rpy2. Doing the same thing works for me, but trying to install a fixed version it via pip didn't... I'm not sure why but this should be investigated and resolved.

Add field to jobs to mark why they failed

This also needs to be set everywhere a job can fail.

Reorganize project to be run via management command

Consider using a JSON postgres field for batch key/values

It may be more space efficient to do it this way instead along with potentially making the data easier to use.

For reference see https://www.postgresql.org/docs/9.6/static/datatype-json.html

Improve metadata for batches that were replicated directly from NCBI GEO

For experiments that Array Express replicated from NCBI GEO, GEO has better metadata than Array Express does. We should add a new implementation of ExternalSourceSurveyor which replicates these directly from NCBI and populates the metadata from their API.

Gene fusions are of interest in the childhood cancer space.

We may consider a processor that will convert fastq files -> gene fusion calls. @drtamermansour shared this paper that has a current benchmark.
https://www.nature.com/articles/srep21597

As this is not an immediate-term goal, I'm creating this to track thoughts in this space.

Figure out spot instance nuances

Context

Spot instances cost a fraction of the normal price for AWS instances. The downside is that they can be preempted if the demand for AWS instances becomes too high. However this is not a big deal for our system because we don't need to have data processed ASAP, it's worth the delay to save money.

Problem or idea

We should use spot instances instead of normal AWS instances.

Solution or next step

Switch the instances used for Nomad clients to spot instances. The best way to do this will probably require a bit of research to determine, but there are some blog posts and there is also a specific section in the terraform docs about it.

The one thing that is known is that we definitely should use the auto-scaling group created in #61.

Split workers image into job-specific images and configure Nomad to use the correct one for each job.

Context

As we continue to create different Processors the ccdl/dr_workers Docker image will continue to grow to contain all the dependencies and (at least some of the data) the different jobs need. This makes for a very bulky image that takes a long time to build, upload, and download along with requiring a non-trivial amount of disk space to store.

Problem or idea

We should have a Docker image for every type of job we have. We should also have different Nomad job specs for each type of job, rather than having a single processor.nomad job spec for all of them.

Solution or next step

Split the workers Docker image into separate images for each job type and create Nomad job specifications for each job type. Note that some jobs may still be able to an image, for example the SCAN.UPC image should be usable by both Affymetrix and Illumina specific jobs types.

Finally, change data_refinery_common.message_queue.send_job function to specify job types directly to Nomad rather than specifying the job type via the job parameters.

Need to handle case where there is no brainarray package for a given organism.

When this is the case, we should just not pass one into SCAN.UPC.

Tagging @jaclyn-taroni, as she realized this was an issue.

Push work queue depth metric into cloudwatch

Context

We're using Nomad not just as an orchestration tool, but also as a way of managing our work queue.

Problem or idea

We'll need a way to see what the work queue depth is so that we can build alerts if it starts to get too large and so that we can scale our cluster up or down as needed. This will be necessary for #61.

Solution or next step

It's possible there's a better solution, but one possible solution is to create a cron job on the lead Nomad server to calculate the work queue depth using the one liner provided by #59 and push it to cloudwatch. This will both allow us to see a graph of that metric over time and also use that metric for the autoscaling group in #61.

Use BrainArray custom CDFs for AFFY_TO_PCL processor

Changes `size_in_bytes` to `raw_size_in_bytes` and `processed_size_in_bytes`

Currently there is just one size field for each file and we only store the raw size. This means we won't know the size of the processed data.

Alternatively this could be addressed by changing the way files are used so that a new File object is created and saved when once it is generated via processing. File objects could then be immutable. This would clarify a lot of the operations that happen, and is actually probably the correct way to remedy this.

Remove temp dir in `utils.end_job`.

Currently this is done by everything that can fail, but we should just as a rule of thumb always be doing this for every completed job, whether it is successful or not, so this should be put in that util function.

Figure out how to handle/store/use production secrets

Currently these include:

Django secret keys
Database credentials

Rebuild workers docker image

Currently the Dockerfile has some things out of order because I didn't want to trigger a full rebuild while I was developing it since it takes over an hour. At some point I should do that (like at the end of a day or something).

Create additional fields on jobs tables for each project's version number.

Currently there is a single field for the version of the data refinery, however it would probably be better to have a different field for each sub-project so they can increase their versions independently.

Also it would be nice to add the git commit hash into docker images so that it can be added to jobs as an additional field as well.

Implement tximport for RNA-seq pipeline

Use tximport along with the gene-to-transcript mapping already contained within the processor to implement this part of the salmon pipeline:

This code does essentially what we need, we just need to include this into the Data Refinery's salmon pipeline: https://github.com/jaclyn-taroni/ref-txome/blob/79e2f64ffe6a71c5103a150bd3159efb784cddeb/4-athaliana_tximport.R
(Note that this script contains a link to a tutorial.)

Note that this should be done on a per-experiment basis, rather than a per-sample basis.

Remove `boto` dependency

Currently there are dependencies on both boto and boto3. I think the dependency on boto is due to a now-out-of-date version of Celery. However Celery will be removed from the project when switching to Nomad, so as part of doing that I can resolve this issue rather than upgrading Celery just to rip it out shortly after.

Create common util module.

So far the only thing that should definitely be common is get_env_variable, however there will surely be more throughout the lifespan of the project.

Investigate batchless DownloaderJobs

There are records in the downloader_jobs table which do not have an entry in the downloader_jobs_batches table. I don't understand how this is possible, but I should see if I can find a way to replicate it so I can prevent it.

Merge data_models and common projects

I have found myself wanting to include common in data_models, and data_models is already in common. They do seem to serve a similar purpose, which is to be common code included in other data_refinery projects.

I could potentially just move the file_management namespace into the Batch object as that is the only namespace in common which uses any data_models. In fact #30 already suggests doing so. However this could be a short term fix if any other code is written for common which relies on data_models. However an argument could potentially be made that any common code relying on data_models should in fact live in that project.

I either address this issue or #30, but not both (at least for the time being).

Make version of R packages explicit in R_dependencies.R

I meant to do this when setting up the workers' Docker image, but I guess I got so happy everything actually worked that I forgot to do so. It doesn't necessarily seem easy, but @jaclyn-taroni has helpfully provided the following R code which allows brainarray packages' versions to be specified:

InstallBrainarray <- function(platform, org.code, ba.version) {
    # This function makes use of devtools::install_url to install Brainarray
    # packages for the annotation of Affymetrix data. Specifically, the packages
    # required for use with SCAN.UPC and affy (RMA) are installed.
    #
    # Args:
    #   platform: The Affymetrix platform for which brainarray
    #             packages are to be installed (e.g., "hgu133plus2")
    #   org.code: Two letter organism code -- human would be "hs"
    #        ba.version: What version of brainarray should be used? (e.g., "21.0.0")
    #
    #    Returns:
    #      NULL - this package completes installation of these packages and does not
    #                     return any values 

    # make sure platform and org.code are all lowercase and lack punctuation
    platform <- tolower(gsub("[[:punct:]]", "", platform))
    org.code <- tolower(gsub("[[:punct:]]", "", org.code))

    # probe version for use with SCAN.UPC
    probe.pkg.name <- paste0(platform, org.code, "entrezgprobe_",
                                         ba.version, ".tar.gz")
    probe.url <- paste0("http://mbni.org/customcdf/21.0.0/entrezg.download/",
                                            probe.pkg.name)
    devtools::install_url(probe.url)

    # cdf version for use with affy::RMA
    cdf.pkg.name <- paste0(platform, org.code, "entrezgcdf_",
                                         ba.version, ".tar.gz")
    cdf.url <- paste0("http://mbni.org/customcdf/21.0.0/entrezg.download/",
                                        cdf.pkg.name)
    devtools::install_url(cdf.url)


}

#### install brainarray main ---------------------------------------------------

# HGU133Plus2
InstallBrainarray(platform = "hgu133plus2",
                                    org.code = "hs",
                                    ba.version = "21.0.0")

Move file management code into Batch class

I'm 90% sure this is something I want to do. Very low priority though because it doesn't strictly matter where the code is, but I think it might make more sense from an organizational standpoint.

Startup scripts for Nomad servers

Figure out how to determine work queue depth

Context

We're using Nomad not just as an orchestration tool, but also as a way of managing our work queue.

Problem or idea

Solution or next step

Determine how to calculate the Nomad work queue depth.

Utilize soft deletes

Using https://github.com/scoursen/django-softdelete looks like it might do the trick.

Implement extraction of unmapped read sequences (unmapped by Salmon)

Use salmontools extract-unmapped command.

Switch index-refinery from storing .tar.gz files to .tar.xz files

The Homo_sapiens_short.tar.gz file is 2.5 GB, whereas the same file compressed using xz is less than one GB. As these files don't actually feed into salmon in the gzipped format there's not much reason to use gz instead of xz other than the fact that gz may be easier for other users of the Index Refinery to use. However @jaclyn-taroni doesn't think that it would be a significant factor so xz seems superior.

Run postgres within a container during development

I originally chose not to do this because I wanted to match production more closely, but I have since been convinced that the difference isn't really meaningful. The benefits to this are that new developers will have an easier time getting up and running with the Data Refinery. It also will be a good intro project for @dongbohu

Create auto-scaling group based on cloudwatch metric

Context

Our system has a lot of work to do. We will probably not do it all in a single shot, but rather slowly expand as we build out more surveyors, downloaders, and processors. Therefore the size of our cluster will need to be elastic so it can scale to the size needed to keep up with the work we want it to do. Additionally, it's unclear exactly how many nodes we'll need in the cluster to handle all the jobs we'll be queuing at any given point in time.

Problem or idea

We should use an auto-scaling group to scale based off of the depth of the work queue. This will dovetail nicely with our planned usage of spot instances (#62) because an auto-scaling group is the recommended way to manage them via terraform anyway.

Solution or next step

Use terraform to create an auto-scaling group based off of the work queue depth metric created by #60. While we will eventually want that to be using spot instances, there may be additional concerns and/or gotchas associated with using them. Therefore to keep the size of the PR addressing this issue down spot instances do not need to be included in it.

alexslemonade / refinebio Goto Github PK

refinebio's Issues

Context

Problem or idea

Solution or next step

Context

Problem or idea

Solution or next step

Context

Problem or idea

Solution or next step

Context

Problem or idea

Solution or next step

Context

Problem or idea

Solution or next step

Context

Problem or idea

Solution or next step

Recommend Projects

Recommend Topics

Recommend Org