stjudecloud / workflows
Bioinformatics workflows developed for and used on the St. Jude Cloud project.
License: MIT License
I think it's appropriate to default to gzipped inputs, and that should be the standard we support. I don't see a need to go out of our way to support uncompressed data. However, many tools are capable of operating on either relatively seamlessly; when that's the case, we should document it as a possibility.
See here: https://github.com/stjudecloud/workflows/blob/main/tools/fastqc.wdl#L59
The above may not work if `prefix` is messed with.
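As a sketch of what a more robust default could look like (the task shape and names below are illustrative, not the actual contents of fastqc.wdl), the output name can be derived from the input file so a user-supplied `prefix` can't break the `output` section:

```wdl
version 1.1

# Hypothetical sketch: tie `prefix` to the input's basename by default.
# FastQC names its results after the input file, so an arbitrary prefix
# can leave the declared outputs pointing at files that don't exist.
task fastqc {
    input {
        File bam
        String prefix = basename(bam, ".bam")  # default derived from the input
    }

    command <<<
        mkdir -p fastqc_results
        fastqc ~{bam} --outdir fastqc_results
    >>>

    output {
        # Glob on the derived name instead of trusting a free-form prefix.
        File results = "fastqc_results/~{prefix}_fastqc.zip"
    }

    runtime {
        container: "fastqc-image:tag"  # placeholder, not a real image
    }
}
```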
In the early days of this repo we tended to only expose parameters we use. We've since gotten much better at exposing parameters as we add tools. But there are still many old tasks with very little parameterization.
Finally, we plan to migrate the WDL tools
to a new repository. There are several tasks required before we can make the change.
The major culprit here is HTSeq: https://github.com/stjudecloud/workflows/blob/main/tools/htseq.wdl
It has a pretty terrible sort algorithm and eats up resources when the input is position-sorted. We've exposed the name-sort option, but we still allocate a large amount of memory and disk; neither is likely needed.
Some of our tasks allocate a large, static amount of RAM that is an over-allocation for many inputs.
One example here: https://github.com/stjudecloud/workflows/blob/main/tools/ngsderive.wdl#L366
The analyses run should be toggle-able for users only interested in a subset of the workflow (and to keep costs down).
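A sketch of both ideas together, with illustrative names (this is not the actual ngsderive.wdl task): derive memory and disk from the input size instead of a static allocation, and let callers toggle individual analyses.

```wdl
version 1.1

# Hypothetical sketch: dynamic resource sizing plus per-analysis toggles.
task ngsderive_analyses {
    input {
        File bam
        Boolean run_encoding = true
        Boolean run_strandedness = true
        Int added_memory_gb = 2
    }

    Int bam_size_gb = ceil(size(bam, "GiB"))
    Int memory_gb = bam_size_gb + added_memory_gb   # scales with the input
    Int disk_size_gb = (bam_size_gb * 2) + 10

    command <<<
        # `~{run_encoding}` interpolates to `true`/`false`, which bash
        # treats as builtins, so each analysis can be skipped cheaply.
        if ~{run_encoding}; then
            ngsderive encoding ~{bam} > encoding.txt
        fi
        if ~{run_strandedness}; then
            # (required flags, e.g. the gene model, elided for brevity)
            ngsderive strandedness ~{bam} > strandedness.txt
        fi
    >>>

    runtime {
        memory: "~{memory_gb} GB"
        disks: "~{disk_size_gb} GB"
    }
}
```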
All of the other steps that use the GTF file seem to properly handle compressed input, but this step requires the file to be plain-text. It should handle compressed input appropriately.
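One way to make the step tolerant of either form is to normalize the GTF inside the command section before use; a sketch, assuming a generic task shape (names are illustrative):

```wdl
version 1.1

# Hypothetical sketch: accept a gzipped or plain-text GTF by
# decompressing on the fly, so the rest of the command always
# sees plain text.
task uses_gtf {
    input {
        File gtf
    }

    command <<<
        set -euo pipefail

        case "~{gtf}" in
            *.gz) gunzip -c "~{gtf}" > annotation.gtf ;;
            *)    cp "~{gtf}" annotation.gtf ;;
        esac

        # ... run the step against annotation.gtf ...
    >>>
}
```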
Line 21 in f19e0d3
Update tasks and workflows to WDL version 1.1:
- `docker` runtime keys to `container`
- `sep` expressions to the `sep()` function
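A minimal before/after of the two 1.1 changes (the task itself is a throwaway example):

```wdl
# WDL 1.0 style:
#   runtime { docker: "ubuntu:22.04" }
#   command { echo ${sep="," numbers} }

version 1.1

task example {
    input {
        Array[Int] numbers
    }

    command <<<
        # the `sep=","` placeholder option becomes the sep() stdlib function
        echo ~{sep(",", numbers)}
    >>>

    runtime {
        # `docker` is renamed to `container` in WDL 1.1
        container: "ubuntu:22.04"
    }
}
```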
I'll single out the 3 worst offenders: `util.wdl`, `samtools.wdl`, and `picard.wdl` have so many tasks in them that it becomes difficult to find the one you're looking for. At least that's my experience. They are each >700 lines long, which in just about any other language would be considered a behemoth for maintenance. Most languages have recommended file lengths (guessing an average consensus would be around 500 lines?); I propose we adopt something similar for WDL/this repo.
Although I don't want to base it on line count. I feel that can encourage some sloppy coding when the file in question is near the limit, e.g. collapsing lines that should be separated in order to stay below the length limit.
The below indented section would be me thinking out loud realizing all my ideas have some fatal flaw. I'm stumped as to how to solve the problem. Feel free to skip the indented section, or read it to see the thoughts I've had.
A saner approach to me is a task limit. The exact number might require some trial and error. Rough gut feeling: 5 seems too strict, 10 seems too lenient. I'd say we start looking in the 6-9 range for our task number limit.
"But we currently organize our files by tool. Are we throwing that scheme out?" No! (At least that's not my initial proposal. I'd hear someone out if they have an alternative.)
I say we start with organizing our files by tool, and then once they grow past 6-9 tasks, we make a split. What that split is will depend on some context.
For example, `picard.wdl` could be split into `picard-qc.wdl` and `picard-manipulation.wdl`:

- `picard-qc` has all the Picard tasks which generate a report of some kind and don't change the BAM file.
- `picard-manipulation` has all the Picard tasks which deal with modifying BAM files.

`samtools.wdl` could be split into... Alright, I don't see a great way to split this file.

Let's try `util.wdl`: it could be split into `util-python-scripts.wdl`, and... gosh, this is proving more difficult than I expected.

Pivot: what about sorting our tasks? That would also accomplish the goal of making it easier to find a specific task in a long file.
Let's start with a file whose order I like: `kraken2.wdl`. It's ordered so well I know it off the top of my head: `download_taxonomy`, `download_library`, `create_library_from_fastas`, `build_db`, `kraken`. It flows in the order that the tasks would be used. It's logical. But this seems to be a special case; I don't see a way to generalize it. Unfortunate...
`ngsderive.wdl` is roughly in the order that each task/subcommand was created. Chronological ordering makes sense, although that knowledge is pretty esoteric; I doubt anyone besides me and Clay could rattle off the order that commands were added to `ngsderive`. So that works for me but is probably not an ordering we should stand by. Now that I think about it, we always (or nearly always) add new tasks to the bottom of the file, so really most of our files are already ordered in this chronological way. That kinda works for helping us regular maintainers find what we're looking for, but it's not helpful for anyone else.

Alphabetize? Are there any other sorting choices? Would alphabetized tasks be an improvement? It wouldn't be of much help to me. My brain is not great at alphabetizing. Not "can't do it" bad, but it wouldn't be trivial for me to locate what I'm looking for in that sort either. I imagine the situation would be roughly the same for me. Maybe a small improvement?
At first I thought this would be a silly suggestion, but I don't hate it: shorter tasks at the top of the file, longer tasks at the bottom.
Con: really, really annoying to establish that sort and maintain it (assuming we don't automate it).
Pro: it's the closest thing to "locate by vibe" that exists.
So I'm stumped. I still think our files are too long and should be broken up, or at least sorted, though I'd prefer a scheme for splitting files into smaller chunks. But I don't like any specific implementation I can come up with.

The best idea I've had is alphabetizing tasks. I don't love it because my brain is wired such that finding things in an alphabetic sort isn't the easiest for me. But it's probably an overall improvement (especially while we lack any viable alternatives).

So, do we want to start alphabetizing our WDL files with many tasks? All of our task files? Is there a threshold under which it's not worth the effort? Would it look strange to have some files sorted and some not?
Opening the floor for proposals!
`black` has a very good rationale for their stance on trailing commas, and I say we follow that example. `black` also says short lists with few items should be collapsed to one line.

The common thread in these 3 rules is that they are all trivial to enforce via an auto-formatter tool (and tedious/annoying to enforce manually). Therefore, I vote we hold off on adopting these officially until that auto-formatter has come to fruition, mainly to keep our codebase compliant with our documents.
Multiple included WDL files have incorrect runtime settings when running on Google Cloud.
For example:
https://github.com/stjudecloud/workflows/blob/92339ee9786e5eb55df80e2cba7b07c4822f2366/tools/star.wdl#L163
The setting should be `disks`, not `disk`. The following is the output from Cromwell when running the `rnaseq-standard-fastq.wdl` workflow:
2021-05-05 18:47:34,785 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk, dx_instance_type] is/are not supported by backend. Unsupported attributes will not be part of job executions.
(the same [disk] warning is repeated once per task in the workflow)
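The fix is a one-key rename, following Cromwell's documented runtime attributes for the Google backend; a sketch of the corrected runtime block (sizes and image are placeholders, not star.wdl's actual values):

```wdl
runtime {
    memory: "~{memory_gb} GB"
    # `disks` (with the Google backend's "local-disk <size> <type>" format)
    # is the recognized key:
    disks: "local-disk ~{disk_size_gb} SSD"
    # disk: "~{disk_size_gb} GB"  # silently ignored by PAPIv2, per the log above
    docker: "ubuntu:22.04"
}
```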
The recently released RNA-Seq Standard 2.1.0 is not backward compatible with 2.0.0, particularly with input and output naming.
In my case, inputs were changed: `rnaseq_standard.strand` was renamed to `rnaseq_standard.strandedness`, and `rnaseq_standard.htseq_count.memory_gb` changed behavior to `rnaseq_standard.htseq_count.added_memory_gb`. The feature counts suffix was also changed from `.counts.txt` to `.feature-counts.txt`.
These changes are unexpected, as the README says that versioned workflows follow semantic versioning.
In our implementation of the Picard sort, we attempt to move the BAM index file, since Picard uses the wrong name. However, the index file is only created if the sort order is `coordinate`; for other settings, this step will fail.
Line 321 in 9982713
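A sketch of how the task could guard the rename, with illustrative task and parameter names rather than the repo's actual ones:

```wdl
version 1.1

# Hypothetical sketch: only move Picard's index when one was actually
# produced, i.e. when sorting by coordinate.
task picard_sort {
    input {
        File bam
        String sort_order = "coordinate"
        String prefix = basename(bam, ".bam") + ".sorted"
    }

    command <<<
        picard SortSam \
            I=~{bam} \
            O=~{prefix}.bam \
            SORT_ORDER=~{sort_order} \
            CREATE_INDEX=true

        # Picard only writes an index for coordinate-sorted output,
        # so guard the rename instead of failing on other orders.
        if [ "~{sort_order}" = "coordinate" ]; then
            mv ~{prefix}.bai ~{prefix}.bam.bai
        fi
    >>>

    output {
        File sorted_bam = "~{prefix}.bam"
        File? bam_index = "~{prefix}.bam.bai"  # absent for non-coordinate sorts
    }
}
```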
From feedback on RFC 0001, we have decided to remove duplicate marking. Since we have no ability to determine whether reads are actually duplicated and considering the results presented in https://www.nature.com/articles/srep25533, the best course of action appears to be no longer marking duplicates to avoid giving downstream tools inaccurate information and allow the tools to make their own determination.
All of our workflows and (almost) all of our tasks assume that data is paired-end. Single-end (SE) support would make our workflows and tools more accessible.
Workflow: RNA-Seq Standard 2.0.0
When the input records have mates, htseq-count keeps an arbitrarily sized buffer to match record pairs. In extreme cases, the default buffer size (`--max-reads-in-buffer 30000000`) is too small, causing the following error:
Error occured when processing SAM input (record #396226907 in file sample.bam):
Maximum alignment buffer size exceeded while pairing SAM alignments.
I propose either adding an input to override the value (`Int max_reads_in_buffer = 30000000`) or fixing the value to an infeasibly high record count, e.g., 2^63-1. The latter is then simply bounded by memory.
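A sketch of the first option; only the buffer input comes from the proposal above, and the rest of the task shape is illustrative:

```wdl
version 1.1

# Hypothetical sketch: expose htseq-count's buffer size as a task input,
# defaulting to the tool's own default.
task htseq_count {
    input {
        File bam
        File gtf
        Int max_reads_in_buffer = 30000000  # override for extreme inputs
    }

    command <<<
        htseq-count \
            --max-reads-in-buffer ~{max_reads_in_buffer} \
            -f bam \
            ~{bam} ~{gtf} > counts.txt
    >>>

    output {
        File counts = "counts.txt"
    }
}
```

The alternative (pinning the value near 2^63-1) would make memory the effective bound without adding an input, at the cost of losing any explicit backstop.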