
canu-wdl's Issues

Plan on WDL-izing the trioBinning stage of canu v1.9

On repartitioning parental reads

The repartition step is I/O-bound: it takes approximately 2 hours, uses no more than 4 threads, and needs little memory.

We may lower cost in this step by (see the sketch below):

  1. provisioning a VM with a much lower CPU count and less memory
  2. localizing only the parental reads
  3. saving only the re-partitioned reads
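A minimal WDL sketch of such a task, assuming a hypothetical RepartitionParentalReads task name and Cromwell-style runtime attributes; the actual canu invocation is elided:

```wdl
version 1.0

# Hypothetical task: only the parental reads are localized, and only the
# re-partitioned reads are saved. Resources follow the observations above
# (I/O bound, <= 4 threads, little memory).
task RepartitionParentalReads {
  input {
    Array[File] parental_reads   # localize only these
  }
  command <<<
    # ... invoke canu's repartition step on ~{sep=" " parental_reads} ...
  >>>
  output {
    # Assumption: the step emits gzipped FASTA batches in the work dir.
    Array[File] repartitioned_reads = glob("*.fasta.gz")
  }
  runtime {
    cpu: 4
    memory: "8 GB"   # assumption for "not much memory"; tune as needed
  }
}
```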

On parallelizing meryl-count

This is a perfect candidate for data parallelization. The majority of the observed ~30 minutes/batch is actually spent loading the parental reads into memory with a single thread, and 30 GB of memory has been observed to be enough per batch. Reading the Perl code reveals that the memory requirement depends purely on the number of files per batch (and on the choice of $k$, but we expect that to change very infrequently).

We may lower cost in this step with a two-step break-up of the current implementation.

Step one

Exit gracefully immediately after

  • the parental reads are re-partitioned, and
  • the files meryl-count.sh, meryl-count.memory, meryl-merge.sh & meryl-subtract.sh are generated.
    We do need to make sure that the configured shell scripts, the memory file, the batch-definition files, and a batch count are saved for downstream use.

This requires us to implement something similar to stopAfter("trio-binning scripts are configured").
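A minimal WDL sketch of step one, assuming a hypothetical HaplotypeCountConfigure task, the to-be-implemented stopAfter behavior, and assumed file names for the batch count and batch definitions:

```wdl
version 1.0

# Hypothetical task for step one: run canu until the trio-binning scripts
# are configured, exit gracefully, and save everything downstream steps need.
task HaplotypeCountConfigure {
  input {
    Array[File] repartitioned_reads
  }
  command <<<
    # ... run canu with the to-be-implemented
    #     stopAfter("trio-binning scripts are configured") behavior ...
    # Assumption: the configure step also writes the batch count to a
    # plain-text file, e.g.  echo "$N_BATCHES" > batch_count.txt
  >>>
  output {
    File        count_script      = "meryl-count.sh"
    File        count_memory      = "meryl-count.memory"
    File        merge_script      = "meryl-merge.sh"
    File        subtract_script   = "meryl-subtract.sh"
    Array[File] batch_definitions = glob("batch-*.txt")  # assumed naming
    Int         batch_count       = read_int("batch_count.txt")
  }
}
```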

Step two

A subsequent scatter call to meryl-count tasks, batch-count ways, with each VM taking in one batch-definition file and localizing just that. Launch meryl-count.sh on the given batch with the correct batch ID, then yield. Remember to save the count files and the log ("out") files for each batch; a sketch follows.
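A sketch of the scatter, reusing the hypothetical step-one outputs above and assuming 1-based batch IDs and that meryl-count.sh accepts the batch ID as its first argument:

```wdl
version 1.0

# Hypothetical per-batch task: localize only this batch's definition file,
# run meryl-count.sh with the right batch id, and save the counts plus the
# log ("out") file.
task MerylCount {
  input {
    Int  batch_id
    File count_script      # meryl-count.sh from step one
    File batch_definition  # the single batch-definition file to localize
  }
  command <<<
    bash ~{count_script} ~{batch_id} > batch-~{batch_id}.out 2>&1
  >>>
  output {
    File counts = "batch-~{batch_id}.meryl.tar"  # assumed output name
    File log    = "batch-~{batch_id}.out"
  }
  runtime {
    cpu: 4
    memory: "30 GB"  # observed to be enough per batch
  }
}

workflow MerylCountScatter {
  input {
    Int         batch_count        # from step one
    File        count_script       # meryl-count.sh from step one
    Array[File] batch_definitions  # one file per batch, from step one
  }
  # One task per batch, batch-count ways; batch ids assumed 1-based.
  scatter (i in range(batch_count)) {
    call MerylCount {
      input:
        batch_id         = i + 1,
        count_script     = count_script,
        batch_definition = batch_definitions[i]
    }
  }
  output {
    Array[File] counts = MerylCount.counts
    Array[File] logs   = MerylCount.log
  }
}
```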

meryl-merge & meryl-subtract

This is essentially a gather step after the scatter in meryl-count above. The task just needs to take the outputs of meryl-count and perform the merge and subtract steps.

This is not exactly heavy computation, based on a test run on a VM: 30 GB of memory is enough for the task to finish in 2-3 hours.
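A sketch of the gather task under the same assumptions, sized per the test run:

```wdl
version 1.0

# Hypothetical gather task: take all per-batch meryl-count outputs and run
# the merge and subtract scripts from step one, in order.
task MerylMergeSubtract {
  input {
    Array[File] counts           # gathered from the MerylCount scatter
    File        merge_script     # meryl-merge.sh from step one
    File        subtract_script  # meryl-subtract.sh from step one
  }
  command <<<
    bash ~{merge_script}
    bash ~{subtract_script}
  >>>
  output {
    # Assumption: the subtract step leaves per-haplotype databases behind.
    Array[File] haplotype_dbs = glob("*.meryl.tar")
  }
  runtime {
    cpu: 4            # not exactly heavy computation
    memory: "30 GB"   # enough to finish in 2-3 hours per the test run
  }
}
```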

On child read assignment

It can seemingly be finished in 1 hour on a well-provisioned machine: a test run with 32 threads and 256 GB of memory finished in ~1 hour.
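A correspondingly sized hypothetical task; the task name, script, and inputs are placeholders:

```wdl
version 1.0

# Hypothetical task for child read assignment, sized after the test run
# above (32 threads, 256 GB, ~1 hour).
task AssignChildReads {
  input {
    File        assignment_script  # the configured child-read-assignment script
    Array[File] child_reads
  }
  command <<<
    bash ~{assignment_script}
  >>>
  runtime {
    cpu: 32
    memory: "256 GB"
  }
}
```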

Speed up parental reads re-partition

Right now the two parents' short reads are re-partitioned serially. We can speed this up by parallelizing the work, one task per parent, and expect the runtime to drop by ~1 hour; see the sketch below.

But this is lower priority.
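A sketch of the per-parent scatter, reusing the hypothetical RepartitionParentalReads task from earlier (assumed here to live in a hypothetical repartition.wdl):

```wdl
version 1.0

import "repartition.wdl" as repart

# Hypothetical workflow: re-partition each parent's short reads in
# parallel, one task per parent, instead of serially.
workflow RepartitionParents {
  input {
    Array[File] father_reads
    Array[File] mother_reads
  }
  scatter (one_parent in [father_reads, mother_reads]) {
    call repart.RepartitionParentalReads {
      input: parental_reads = one_parent
    }
  }
}
```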

Understand canu pipeline scripts up to the child read haplotyping stage

The following subroutines and blocks should be understood before meaningful changes to the code can be made.

canu.pl

Relevant blocks:

  • while (scalar(@ARGV)): parses command-line options; see the comment block for canu::Defaults::addSequenceFile for relevant reading
  • if ((scalar(@haplotypes) > 0) && (setOptions($mode, "haplotype") eq "haplotype")): the short block that actually calls out to canu/HaplotypeReads for trio binning.

canu/Defaults.pm

  • addSequenceFile: parses the paths of parental and child reads; DOES NOT actually load the reads
  • getPhysicalMemorySize & getNumberOfCPUs: literally detect the number of CPUs and the amount of memory available on the machine; they are only non-trivially used in Configure::getAllowedResources()
  • setDefaults: see doc for Defaults.pm below
  • setExecDefaults($$): given 1) a specific stage and 2) its description, sets that stage's Memory, Threads, StageSpace and Concurrency to undef (yes, all four)
  • checkParameters: makes sure options and parameters make sense

canu/Configure.pm

  • configureAssembler: see docs for Configure.pm below
  • getAllowedResources: see docs for Configure.pm below
  • findGridMaxMemoryAndThreads and expandRange: see docs for Configure.pm below

canu/Execution.pm

  • submitScript: returns without doing anything if no grid is detected in canu::Defaults, which is our intended use case.
  • getBinDirectoryShellCode, setWorkDirectoryShellCode, & getJobIDShellCode
  • stopAfter: checks whether the given "stop after" stage was requested by the user, and if so exits the program.
  • submitOrRunParallelJob: see docs for Execution.pm below

canu/HaplotypeReads.pm

  • haplotypeReadsExist
  • haplotypeSplitReads
  • haplotypeCountConfigure
  • haplotypeCountCheck
  • haplotypeMergeCheck
  • haplotypeSubtractCheck
  • haplotypeReadsConfigure
  • haplotypeReadsCheck

Logical steps in canu.pl

  1. options parsing, parameters setting, resources detection

    • set defaults
    • parse arguments (cmd line or specs file)
    • set parameters based on parsed arguments
    • detect runtime info (JVM, minimap2, gnuplot)
    • detect & configure computing resources
    • configure assembler
    • check parameters
  2. minimal I/O prep

    • set & move to the user-specified work dir
    • parse input reads
    • set up other output dirs under the work dir
  3. reads haplotyping

    • repartition the parental reads
    • configure/meta-generate the shell scripts for the three meryl steps: count, merge, subtract
    • launch those scripts in batches, in order
    • configure/meta-generate the shell script for child read assignment
    • run the assignment script
