
canu-wdl's Issues

Plan on WDL-izing the trioBinning stage of canu v1.9

On repartitioning parental reads

The repartition step is I/O-bound: it takes approximately 2 hours, uses no more than 4 threads, and needs little memory.

We may lower cost in this step by (see the sketch below):

  1. provisioning a VM with a much lower CPU count and less memory
  2. localizing only the parental reads
  3. saving only the re-partitioned reads
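A minimal WDL sketch of such a task, assuming a hypothetical RepartitionParentalReads task name and Cromwell-style runtime attributes; the actual canu invocation is elided:

```wdl
version 1.0

# Hypothetical task: only the parental reads are localized, and only the
# re-partitioned reads are saved. Resources follow the observations above
# (I/O bound, <= 4 threads, little memory).
task RepartitionParentalReads {
  input {
    Array[File] parental_reads   # localize only these
  }
  command <<<
    # ... invoke canu's repartition step on ~{sep=" " parental_reads} ...
  >>>
  output {
    # Assumption: the step emits gzipped FASTA batches in the work dir.
    Array[File] repartitioned_reads = glob("*.fasta.gz")
  }
  runtime {
    cpu: 4
    memory: "8 GB"   # assumption for "not much memory"; tune as needed
  }
}
```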

On parallelizing meryl-count

This is a perfect candidate for data parallelization. The majority of the observed ~30 minutes/batch is actually spent loading the parental reads into memory with a single thread, and 30 GB of memory has been observed to be enough per batch. Reading the Perl code reveals that the memory requirement depends purely on the number of files per batch (and on the choice of $k$, but we expect that to change very infrequently).

We may lower cost in this step with a two-step break-up of the current implementation.

Step one

Exit gracefully immediately after

  • the parental reads are re-partitioned, and
  • the files meryl-count.sh, meryl-count.memory, meryl-merge.sh & meryl-subtract.sh are generated.
    We do need to make sure that the configured shell scripts, the memory file, the batch-definition files, and a batch count are saved for downstream use.

This requires us to implement something similar to stopAfter("trio-binning scripts are configured").
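A minimal WDL sketch of step one, assuming a hypothetical HaplotypeCountConfigure task, the to-be-implemented stopAfter behavior, and assumed file names for the batch count and batch definitions:

```wdl
version 1.0

# Hypothetical task for step one: run canu until the trio-binning scripts
# are configured, exit gracefully, and save everything downstream steps need.
task HaplotypeCountConfigure {
  input {
    Array[File] repartitioned_reads
  }
  command <<<
    # ... run canu with the to-be-implemented
    #     stopAfter("trio-binning scripts are configured") behavior ...
    # Assumption: the configure step also writes the batch count to a
    # plain-text file, e.g.  echo "$N_BATCHES" > batch_count.txt
  >>>
  output {
    File        count_script      = "meryl-count.sh"
    File        count_memory      = "meryl-count.memory"
    File        merge_script      = "meryl-merge.sh"
    File        subtract_script   = "meryl-subtract.sh"
    Array[File] batch_definitions = glob("batch-*.txt")  # assumed naming
    Int         batch_count       = read_int("batch_count.txt")
  }
}
```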

Step two

A subsequent scatter call to meryl-count tasks, batch-count ways, with each VM taking in one batch-definition file and localizing just that. Launch meryl-count.sh on the given batch with the correct batch ID, then yield. Remember to save the count files and the log ("out") files for each batch; a sketch follows.
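A sketch of the scatter, reusing the hypothetical step-one outputs above and assuming 1-based batch IDs and that meryl-count.sh accepts the batch ID as its first argument:

```wdl
version 1.0

# Hypothetical per-batch task: localize only this batch's definition file,
# run meryl-count.sh with the right batch id, and save the counts plus the
# log ("out") file.
task MerylCount {
  input {
    Int  batch_id
    File count_script      # meryl-count.sh from step one
    File batch_definition  # the single batch-definition file to localize
  }
  command <<<
    bash ~{count_script} ~{batch_id} > batch-~{batch_id}.out 2>&1
  >>>
  output {
    File counts = "batch-~{batch_id}.meryl.tar"  # assumed output name
    File log    = "batch-~{batch_id}.out"
  }
  runtime {
    cpu: 4
    memory: "30 GB"  # observed to be enough per batch
  }
}

workflow MerylCountScatter {
  input {
    Int         batch_count        # from step one
    File        count_script       # meryl-count.sh from step one
    Array[File] batch_definitions  # one file per batch, from step one
  }
  # One task per batch, batch-count ways; batch ids assumed 1-based.
  scatter (i in range(batch_count)) {
    call MerylCount {
      input:
        batch_id         = i + 1,
        count_script     = count_script,
        batch_definition = batch_definitions[i]
    }
  }
  output {
    Array[File] counts = MerylCount.counts
    Array[File] logs   = MerylCount.log
  }
}
```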

meryl-merge & meryl-subtract

This is essentially a gather step after the scatter in meryl-count above. The task just needs to take the outputs of meryl-count and perform the merge and subtract steps.

This is not exactly heavy computation, based on a test run on a VM: 30 GB of memory is enough for the task to finish in 2-3 hours.
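A sketch of the gather task under the same assumptions, sized per the test run:

```wdl
version 1.0

# Hypothetical gather task: take all per-batch meryl-count outputs and run
# the merge and subtract scripts from step one, in order.
task MerylMergeSubtract {
  input {
    Array[File] counts           # gathered from the MerylCount scatter
    File        merge_script     # meryl-merge.sh from step one
    File        subtract_script  # meryl-subtract.sh from step one
  }
  command <<<
    bash ~{merge_script}
    bash ~{subtract_script}
  >>>
  output {
    # Assumption: the subtract step leaves per-haplotype databases behind.
    Array[File] haplotype_dbs = glob("*.meryl.tar")
  }
  runtime {
    cpu: 4            # not exactly heavy computation
    memory: "30 GB"   # enough to finish in 2-3 hours per the test run
  }
}
```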

On child read assignment

It can seemingly be finished in 1 hour on a well-provisioned machine: a test run with 32 threads and 256 GB of memory finished in ~1 hour.
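A correspondingly sized hypothetical task; the task name, script, and inputs are placeholders:

```wdl
version 1.0

# Hypothetical task for child read assignment, sized after the test run
# above (32 threads, 256 GB, ~1 hour).
task AssignChildReads {
  input {
    File        assignment_script  # the configured child-read-assignment script
    Array[File] child_reads
  }
  command <<<
    bash ~{assignment_script}
  >>>
  runtime {
    cpu: 32
    memory: "256 GB"
  }
}
```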

Speed up parental reads re-partition

Right now the two parents' short reads are re-partitioned serially. We can speed this up by parallelizing the work, one task per parent, and expect the runtime to drop by ~1 hour; see the sketch below.

But this is lower priority.
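A sketch of the per-parent scatter, reusing the hypothetical RepartitionParentalReads task from earlier (assumed here to live in a hypothetical repartition.wdl):

```wdl
version 1.0

import "repartition.wdl" as repart

# Hypothetical workflow: re-partition each parent's short reads in
# parallel, one task per parent, instead of serially.
workflow RepartitionParents {
  input {
    Array[File] father_reads
    Array[File] mother_reads
  }
  scatter (one_parent in [father_reads, mother_reads]) {
    call repart.RepartitionParentalReads {
      input: parental_reads = one_parent
    }
  }
}
```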

Understand canu pipeline scripts up to the child read haplotyping stage

The following subroutines and blocks should be understood before meaningful changes to the code can be made.

canu.pl

Relevant blocks:

  • while (scalar(@ARGV)): parses command-line options; see the comment block for canu::Defaults::addSequenceFile for relevant reading
  • if ((scalar(@haplotypes) > 0) && (setOptions($mode, "haplotype") eq "haplotype")): the short block that actually calls out to canu/HaplotypeReads for trio binning.

canu/Defaults.pm

  • addSequenceFile: parses the paths of parental and child reads; DOES NOT actually load the reads
  • getPhysicalMemorySize & getNumberOfCPUs: literally detect the number of CPUs and the amount of memory available on the machine; they are only non-trivially used in Configure::getAllowedResources()
  • setDefaults: see doc for Defaults.pm below
  • setExecDefaults($$): given 1) a specific stage and 2) its description, sets that stage's Memory, Threads, StageSpace and Concurrency to undef (yes, all four)
  • checkParameters: makes sure options and parameters make sense

canu/Configure.pm

  • configureAssembler: see docs for Configure.pm below
  • getAllowedResources: see docs for Configure.pm below
  • findGridMaxMemoryAndThreads and expandRange: see docs for Configure.pm below

canu/Execution.pm

  • submitScript: returns without doing anything if no grid is detected in canu::Defaults, which is our intended use case.
  • getBinDirectoryShellCode, setWorkDirectoryShellCode, & getJobIDShellCode
  • stopAfter: checks whether the given "stop after" stage was requested by the user, and if so exits the program.
  • submitOrRunParallelJob: see docs for Execution.pm below

canu/HaplotypeReads.pm

  • haplotypeReadsExist
  • haplotypeSplitReads
  • haplotypeCountConfigure
  • haplotypeCountCheck
  • haplotypeMergeCheck
  • haplotypeSubtractCheck
  • haplotypeReadsConfigure
  • haplotypeReadsCheck

Logical steps in canu.pl

  1. options parsing, parameters setting, resources detection

    • set defaults
    • parse arguments (cmd line or specs file)
    • set parameters based on parsed arguments
    • detect runtime info (JVM, minimap2, gnuplot)
    • detect & configure computing resources
    • configure assembler
    • check parameters
  2. minimal I/O prep

    • set & move to the user-specified work dir
    • parse input reads
    • set up other output dirs under the work dir
  3. reads haplotyping

    • repartition the parental reads
    • configure/meta-generate the shell scripts for the three meryl steps: count, merge, subtract
    • launch those scripts in batches, in order
    • configure/meta-generate the shell script for child read assignment
    • run the assignment script
