stjudecloud / workflows
Bioinformatics workflows developed for and used on the St. Jude Cloud project.
License: MIT License
I think it's appropriate to default to gzipped inputs, and that should be the standard we support. I don't see a need to go out of our way to support uncompressed data. However, many tools are capable of operating on either relatively seamlessly; when that's the case, we should document it as a possibility.
See here: https://github.com/stjudecloud/workflows/blob/main/tools/fastqc.wdl#L59
The above may not work if `prefix` is messed with.
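As a sketch of what a more robust default could look like (the task shape and names below are illustrative, not the actual contents of fastqc.wdl), the output name can be derived from the input file so a user-supplied `prefix` can't break the `output` section:

```wdl
version 1.1

# Hypothetical sketch: tie `prefix` to the input's basename by default.
# FastQC names its results after the input file, so an arbitrary prefix
# can leave the declared outputs pointing at files that don't exist.
task fastqc {
    input {
        File bam
        String prefix = basename(bam, ".bam")  # default derived from the input
    }

    command <<<
        mkdir -p fastqc_results
        fastqc ~{bam} --outdir fastqc_results
    >>>

    output {
        # Glob on the derived name instead of trusting a free-form prefix.
        File results = "fastqc_results/~{prefix}_fastqc.zip"
    }

    runtime {
        container: "fastqc-image:tag"  # placeholder, not a real image
    }
}
```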
In the early days of this repo we tended to only expose parameters we use. We've since gotten much better at exposing parameters as we add tools. But there are still many old tasks with very little parameterization.
Finally, we plan to migrate the WDL tools
to a new repository. There are several tasks required before we can make the change.
The major culprit here is HTSeq: https://github.com/stjudecloud/workflows/blob/main/tools/htseq.wdl
It has a pretty terrible sort algorithm and eats up resources when the input is position-sorted. We've exposed the name-sort option, but we still allocate a large amount of memory and disk; neither is likely needed.
Some of our tasks allocate a large, static amount of RAM that is an over-allocation for many inputs.
One example here: https://github.com/stjudecloud/workflows/blob/main/tools/ngsderive.wdl#L366
The analyses run should be toggle-able for users only interested in a subset of the workflow (and to keep costs down).
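A sketch of both ideas together, with illustrative names (this is not the actual ngsderive.wdl task): derive memory and disk from the input size instead of a static allocation, and let callers toggle individual analyses.

```wdl
version 1.1

# Hypothetical sketch: dynamic resource sizing plus per-analysis toggles.
task ngsderive_analyses {
    input {
        File bam
        Boolean run_encoding = true
        Boolean run_strandedness = true
        Int added_memory_gb = 2
    }

    Int bam_size_gb = ceil(size(bam, "GiB"))
    Int memory_gb = bam_size_gb + added_memory_gb   # scales with the input
    Int disk_size_gb = (bam_size_gb * 2) + 10

    command <<<
        # `~{run_encoding}` interpolates to `true`/`false`, which bash
        # treats as builtins, so each analysis can be skipped cheaply.
        if ~{run_encoding}; then
            ngsderive encoding ~{bam} > encoding.txt
        fi
        if ~{run_strandedness}; then
            # (required flags, e.g. the gene model, elided for brevity)
            ngsderive strandedness ~{bam} > strandedness.txt
        fi
    >>>

    runtime {
        memory: "~{memory_gb} GB"
        disks: "~{disk_size_gb} GB"
    }
}
```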
All of the other steps that use the GTF file seem to properly handle compressed input, but this step requires the file to be plain-text. It should handle compressed input appropriately.
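One way to make the step tolerant of either form is to normalize the GTF inside the command section before use; a sketch, assuming a generic task shape (names are illustrative):

```wdl
version 1.1

# Hypothetical sketch: accept a gzipped or plain-text GTF by
# decompressing on the fly, so the rest of the command always
# sees plain text.
task uses_gtf {
    input {
        File gtf
    }

    command <<<
        set -euo pipefail

        case "~{gtf}" in
            *.gz) gunzip -c "~{gtf}" > annotation.gtf ;;
            *)    cp "~{gtf}" annotation.gtf ;;
        esac

        # ... run the step against annotation.gtf ...
    >>>
}
```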
Line 21 in f19e0d3
Update tasks and workflows to WDL version 1.1:
- `docker` runtime keys to `container`
- `sep` expressions to the `sep()` function
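A minimal before/after of the two 1.1 changes (the task itself is a throwaway example):

```wdl
# WDL 1.0 style:
#   runtime { docker: "ubuntu:22.04" }
#   command { echo ${sep="," numbers} }

version 1.1

task example {
    input {
        Array[Int] numbers
    }

    command <<<
        # the `sep=","` placeholder option becomes the sep() stdlib function
        echo ~{sep(",", numbers)}
    >>>

    runtime {
        # `docker` is renamed to `container` in WDL 1.1
        container: "ubuntu:22.04"
    }
}
```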
I'll single out the 3 worst offenders: `util.wdl`, `samtools.wdl`, and `picard.wdl` have so many tasks in them that it becomes difficult to find the one you're looking for. At least that's my experience. They are each >700 lines long, which in just about any other language would be considered a behemoth for maintenance. Most languages have recommended file lengths (guessing an average consensus would be around 500 lines?); I propose we adopt something similar for WDL/this repo.
Although I don't want to base it on line count. I feel that can encourage some sloppy coding when the file in question is near the limit, e.g. collapsing lines that should be separated in order to stay below the length limit.
The below indented section would be me thinking out loud realizing all my ideas have some fatal flaw. I'm stumped as to how to solve the problem. Feel free to skip the indented section, or read it to see the thoughts I've had.
A saner approach to me is a task limit. The exact number might require some trial and error. Rough gut feeling: 5 seems too strict, 10 seems too lenient. I'd say we start looking in the 6-9 range for our task number limit.
"But we currently organize our files by tool. Are we throwing that scheme out?" No! (At least that's not my initial proposal. I'd hear someone out if they have an alternative.)
I say we start with organizing our files by tool, and then once they grow past 6-9 tasks, we make a split. What that split is will depend on some context.
For example, `picard.wdl` could be split into `picard-qc.wdl` and `picard-manipulation.wdl`:

- `picard-qc` has all the Picard tasks which generate a report of some kind and don't change the BAM file.
- `picard-manipulation` has all the Picard tasks which deal with modifying BAM files.

`samtools.wdl` could be split into... Alright, I don't see a great way to split this file.

Let's try `util.wdl`: it could be split into `util-python-scripts.wdl`, and... gosh, this is proving more difficult than I expected.

Pivot: what about sorting our tasks? That would also accomplish the goal of making it easier to find a specific task in a long file.
Let's start with a file whose order I like: `kraken2.wdl`. It's ordered so well I know it off the top of my head: `download_taxonomy`, `download_library`, `create_library_from_fastas`, `build_db`, `kraken`. It flows in the order that the tasks would be used. It's logical. But this seems to be a special case; I don't see a way to generalize it. Unfortunate...
`ngsderive.wdl` is roughly in the order that each task/subcommand was created. Chronological ordering makes sense, although that knowledge is pretty esoteric; I doubt anyone besides me and Clay could rattle off the order that commands were added to `ngsderive`. So that works for me but is probably not an ordering we should stand by. Now that I think about it, we always (or nearly always) add new tasks to the bottom of the file, so really most of our files are already ordered in this chronological way. That kinda works for helping us regular maintainers find what we're looking for, but it's not helpful for anyone else.

Alphabetize? Are there any other sorting choices? Would alphabetized tasks be an improvement? It wouldn't be of much help to me. My brain is not great at alphabetizing. Not "can't do it" bad, but it wouldn't be trivial for me to locate what I'm looking for in that sort either. I imagine the situation would be roughly the same for me. Maybe a small improvement?
At first I thought this would be a silly suggestion, but I don't hate it: shorter tasks at the top of the file, longer tasks at the bottom.
Con: really, really annoying to establish that sort and maintain it (assuming we don't automate it).
Pro: it's the closest thing to "locate by vibe" that exists.
So I'm stumped. I still think our files are too long and should be broken up, or at least sorted, though I'd prefer a scheme for splitting files into smaller chunks. But I don't like any specific implementation I can come up with.

The best idea I've had is alphabetizing tasks. I don't love it because my brain is wired such that finding things in an alphabetic sort isn't the easiest for me. But it's probably an overall improvement (especially while we lack any viable alternatives).

So, do we want to start alphabetizing our WDL files with many tasks? All of our task files? Is there a threshold under which it's not worth the effort? Would it look strange to have some files sorted and some not?
Opening the floor for proposals!
`black` has a very good rationale for their stance on trailing commas, and I say we follow that example. `black` also says short lists with few items should be collapsed to one line.

The common thread in these 3 rules is that they are all trivial to enforce via an auto-formatter tool (and tedious/annoying to enforce manually). Therefore, I vote we hold off on adopting these officially until that auto-formatter has come to fruition, mainly to keep our codebase compliant with our documents.
Multiple included WDL files have incorrect runtime settings when running on Google Cloud.
For example:
https://github.com/stjudecloud/workflows/blob/92339ee9786e5eb55df80e2cba7b07c4822f2366/tools/star.wdl#L163
The setting should be `disks`, not `disk`. The following is the output from Cromwell when running the `rnaseq-standard-fastq.wdl` workflow:
2021-05-05 18:47:34,785 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk, dx_instance_type] is/are not supported by backend. Unsupported attributes will not be part of job executions.
(the same [disk] warning is repeated once per task in the workflow)
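The fix is a one-key rename, following Cromwell's documented runtime attributes for the Google backend; a sketch of the corrected runtime block (sizes and image are placeholders, not star.wdl's actual values):

```wdl
runtime {
    memory: "~{memory_gb} GB"
    # `disks` (with the Google backend's "local-disk <size> <type>" format)
    # is the recognized key:
    disks: "local-disk ~{disk_size_gb} SSD"
    # disk: "~{disk_size_gb} GB"  # silently ignored by PAPIv2, per the log above
    docker: "ubuntu:22.04"
}
```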
The recently released RNA-Seq Standard 2.1.0 is not backward compatible with 2.0.0, particularly with input and output naming.
In my case, inputs were changed: `rnaseq_standard.strand` was renamed to `rnaseq_standard.strandedness`, and `rnaseq_standard.htseq_count.memory_gb` changed behavior to `rnaseq_standard.htseq_count.added_memory_gb`. The feature counts suffix was also changed from `.counts.txt` to `.feature-counts.txt`.
These changes are unexpected, as the README says that versioned workflows follow semantic versioning.
In our implementation of the Picard sort, we attempt to move the BAM index file, since Picard uses the wrong name. However, the index file is only created if the sort order is `coordinate`; for other settings, this step will fail.
Line 321 in 9982713
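A sketch of how the task could guard the rename, with illustrative task and parameter names rather than the repo's actual ones:

```wdl
version 1.1

# Hypothetical sketch: only move Picard's index when one was actually
# produced, i.e. when sorting by coordinate.
task picard_sort {
    input {
        File bam
        String sort_order = "coordinate"
        String prefix = basename(bam, ".bam") + ".sorted"
    }

    command <<<
        picard SortSam \
            I=~{bam} \
            O=~{prefix}.bam \
            SORT_ORDER=~{sort_order} \
            CREATE_INDEX=true

        # Picard only writes an index for coordinate-sorted output,
        # so guard the rename instead of failing on other orders.
        if [ "~{sort_order}" = "coordinate" ]; then
            mv ~{prefix}.bai ~{prefix}.bam.bai
        fi
    >>>

    output {
        File sorted_bam = "~{prefix}.bam"
        File? bam_index = "~{prefix}.bam.bai"  # absent for non-coordinate sorts
    }
}
```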
From feedback on RFC 0001, we have decided to remove duplicate marking. Since we have no ability to determine whether reads are actually duplicated and considering the results presented in https://www.nature.com/articles/srep25533, the best course of action appears to be no longer marking duplicates to avoid giving downstream tools inaccurate information and allow the tools to make their own determination.
All of our workflows and (almost) all of our tasks assume that data is paired-end. Single-end (SE) support would make our workflows and tools more accessible.
Workflow: RNA-Seq Standard 2.0.0
When the input records have mates, htseq-count keeps an arbitrarily sized buffer to match record pairs. In extreme cases, the default buffer size (`--max-reads-in-buffer 30000000`) is too small, causing the following error:
Error occured when processing SAM input (record #396226907 in file sample.bam):
Maximum alignment buffer size exceeded while pairing SAM alignments.
I propose either adding an input to override the value (`Int max_reads_in_buffer = 30000000`) or fixing the value to an infeasibly high record count, e.g., 2^63-1. The latter is then simply bounded by memory.
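A sketch of the first option; only the buffer input comes from the proposal above, and the rest of the task shape is illustrative:

```wdl
version 1.1

# Hypothetical sketch: expose htseq-count's buffer size as a task input,
# defaulting to the tool's own default.
task htseq_count {
    input {
        File bam
        File gtf
        Int max_reads_in_buffer = 30000000  # override for extreme inputs
    }

    command <<<
        htseq-count \
            --max-reads-in-buffer ~{max_reads_in_buffer} \
            -f bam \
            ~{bam} ~{gtf} > counts.txt
    >>>

    output {
        File counts = "counts.txt"
    }
}
```

The alternative (pinning the value near 2^63-1) would make memory the effective bound without adding an input, at the cost of losing any explicit backstop.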