tiny's People

Contributors

hp2048, mud-skip

tiny's Issues

LastZ alignments

LastZ alignments take an extremely long time to run. See the table below.

Chromosome size (nt)    Elapsed walltime (hours)
110772064               318
161160915               219
117486409               182
154487646               315
 79982351               176
110838418               314

The chicken genome (GRCg6a) was used as the reference. CM023912.1, chromosome 1 of the Western jackdaw, was used as the test chromosome; it is 117,486,409 nt long. First, the jackdaw chromosome was aligned to the chicken genome as is, using the following command.

lastz_32 --targetcapsule=GCF_000002315.6_GRCg6a_genomic.capsule CM023912.1.fa K=2400 L=3000 Y=9400 H=2000 --ambiguous=iupac --format=axt --output=CM023912.1.fa.axt

It took 655,336 seconds (~182 hours) of walltime to align this chromosome.

Then, we split the jackdaw chromosome 1 sequence into pieces of 1 Mb each and distributed them across 20 files, each file containing ~6 sequences of 1 Mb, using the following command.

perl splitFasta.pl -i CM023912.1.fa -o CM023912.1.c -c 20
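splitFasta.pl itself is not shown in this issue, so the following is only a minimal sketch of the splitting scheme described above, not the script's actual logic. The `.s<start>.e<end>` piece naming is inferred from the regular expression in the merge command later in this issue; the function names and the 0-based start convention are assumptions.

```python
from typing import Iterator, List, Tuple

Piece = Tuple[str, str]  # (FASTA header without '>', sequence)


def chunk_sequence(name: str, seq: str, piece_len: int = 1_000_000) -> Iterator[Piece]:
    """Yield fixed-size pieces; the header records the assumed 0-based start and the end."""
    for start in range(0, len(seq), piece_len):
        end = min(start + piece_len, len(seq))
        yield f"{name}.s{start}.e{end}", seq[start:end]


def distribute(pieces: List[Piece], n_files: int) -> List[List[Piece]]:
    """Group consecutive pieces into batches, one batch per output file."""
    per_file = -(-len(pieces) // n_files)  # ceiling division
    return [pieces[i:i + per_file] for i in range(0, len(pieces), per_file)]


if __name__ == "__main__":
    # Toy example: a 25 nt sequence cut into 10 nt pieces, spread over 2 files.
    toy = list(chunk_sequence("chrTest", "ACGTA" * 5, piece_len=10))
    for batch in distribute(toy, n_files=2):
        print([header for header, _ in batch])
```

With 1 Mb pieces and 20 files, a 117,486,409 nt chromosome gives 118 pieces, grouped 6 per file, so the last file holds only 3,486,409 nt. Depending on the piece count, grouping by ceiling division can produce fewer batches than requested.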

Each of the 20 files was then aligned against the chicken reference genome in an embarrassingly parallel manner using GNU Parallel. A typical command was as follows:

lastz_32 --targetcapsule=GCF_000002315.6_GRCg6a_genomic.capsule CM023912.1.c.09.fa K=2400 L=3000 Y=9400 H=2000 --ambiguous=iupac --format=axt --output=CM023912.1.c.09.axt
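The runs above were launched with GNU Parallel. As an alternative illustration only, the sketch below builds the same command line for each chunk file and runs them from a Python thread pool. The lastz arguments are copied from the command above; the helper names, the `jobs` default, and the `CM023912.1.c.*.fa` glob are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List

CAPSULE = "GCF_000002315.6_GRCg6a_genomic.capsule"
# Scoring and output options exactly as used in this issue.
LASTZ_ARGS = ["K=2400", "L=3000", "Y=9400", "H=2000", "--ambiguous=iupac", "--format=axt"]


def build_lastz_command(chunk_fasta: str, capsule: str = CAPSULE) -> List[str]:
    """Return the argument list for one chunk; output name swaps .fa for .axt."""
    out = str(Path(chunk_fasta).with_suffix(".axt"))
    return ["lastz_32", f"--targetcapsule={capsule}", chunk_fasta, *LASTZ_ARGS, f"--output={out}"]


def run_all(chunk_fastas: Iterable[str], jobs: int = 4) -> List[int]:
    """Run one lastz process per chunk, at most `jobs` at a time; return exit codes."""
    def run_one(fasta: str) -> int:
        return subprocess.run(build_lastz_command(fasta), check=False).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, chunk_fastas))


if __name__ == "__main__":
    chunks = sorted(str(p) for p in Path(".").glob("CM023912.1.c.*.fa"))
    for fasta in chunks:
        print(" ".join(build_lastz_command(fasta)))
```

Threads are sufficient here because each worker only waits on an external process. The `__main__` block prints the commands rather than running them; calling `run_all(chunks)` would start the alignments.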

Alignment times ranged from 12.25 seconds, for the last chunk with only 3,486,409 nt of sequence, to 12,641 seconds (3.51 hours).
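For reference, the walltime figures reported in this issue convert as follows. The ratio is only the best case, and assumes all 20 chunk jobs run at the same time so that the longest chunk determines the total walltime.

```python
# Walltimes reported in this issue, in seconds.
whole_chromosome_s = 655_336
longest_chunk_s = 12_641

whole_h = whole_chromosome_s / 3600
longest_h = longest_chunk_s / 3600
ratio = whole_chromosome_s / longest_chunk_s

print(f"{whole_h:.1f} h unsplit vs {longest_h:.2f} h for the slowest chunk ({ratio:.1f}x)")
# prints: 182.0 h unsplit vs 3.51 h for the slowest chunk (51.8x)
```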

The .axt output contained 1,894,570 alignment blocks without splitting the chromosome and 1,894,351 alignment blocks after splitting it, a small reduction of 219 blocks. Coverage of the chicken genome was 1,017,962,208 bp without the split and 1,017,817,249 bp after the split, a difference of 144,959 bp.
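To put these differences in proportion, the short calculation below uses only the counts quoted above; the variable names are illustrative.

```python
blocks_whole, blocks_split = 1_894_570, 1_894_351
cov_whole_bp, cov_split_bp = 1_017_962_208, 1_017_817_249

block_loss = blocks_whole - blocks_split   # alignment blocks lost by splitting
cov_loss_bp = cov_whole_bp - cov_split_bp  # target coverage lost by splitting

block_loss_pct = 100 * block_loss / blocks_whole
cov_loss_pct = 100 * cov_loss_bp / cov_whole_bp

print(f"blocks:   -{block_loss} ({block_loss_pct:.4f}%)")
print(f"coverage: -{cov_loss_bp} bp ({cov_loss_pct:.4f}%)")
```

Both losses come out at roughly one hundredth of a percent.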

Further testing of the genomic intervals is required to assess whether all areas of the query and the target are covered in the alignment.
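One possible way to run such a check is to collect the aligned intervals per sequence from the axt header lines, merge them, and list the gaps. The sketch below assumes 1-based inclusive coordinates, as in axt headers; the function names are hypothetical, and the intervals would need to be grouped by sequence name before use.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Number of positions covered by at least one interval."""
    return sum(end - start + 1 for start, end in merge_intervals(intervals))


def uncovered(intervals: Iterable[Interval], seq_len: int) -> List[Interval]:
    """Gaps in 1..seq_len that no interval covers."""
    gaps: List[Interval] = []
    pos = 1
    for start, end in merge_intervals(intervals):
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, end + 1)
    if pos <= seq_len:
        gaps.append((pos, seq_len))
    return gaps
```

Running this on the split and unsplit outputs and comparing the gap lists would show where the two runs differ, rather than only by how much.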

For now, the split alignments can be merged into a single file, with query coordinates shifted back to the full chromosome, for downstream processing using:

grep -h ^'#' *.axt >merged.output
grep -vh ^'#' *.axt | perl -lne 'if ($_=~/^\d+/) { @a = split (" ", $_); $a[4] =~ /(\S+)\.s(\d+)\.e\d+/; $c=$1;$s=$2; $a[4] = $c; $a[5] = $s + $a[5]; $a[6] = $s + $a[6]; $a[0] = $counter ? $counter : 0; $counter++; print join (" ", @a); } else { print $_ }' >> merged.output
axtChain -minScore=3000 -linearGap=medium merged.output GCF_000002315.6_GRCg6a_genomic.2bit query_genomic.2bit output.chain
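The coordinate shift performed by the Perl one-liner can be sketched in Python as below. This is not a drop-in replacement. It assumes piece names of the form `<chrom>.s<start>.e<end>` with a 0-based start, as implied by the regular expression above. One caveat: in the axt format, query coordinates on the minus strand are relative to the reverse-complemented query sequence, so adding the chunk start is only right for plus-strand blocks. The optional `chrom_len` argument (a hypothetical parameter) applies a minus-strand offset of `chrom_len - end` instead; whether this is needed depends on how splitFasta.pl names its pieces, which is not shown here.

```python
import re
from typing import Optional

CHUNK_NAME = re.compile(r"^(\S+)\.s(\d+)\.e(\d+)$")


def shift_axt_header(line: str, number: int, chrom_len: Optional[int] = None) -> str:
    """Renumber an axt header line and map chunk coordinates to the full query sequence."""
    fields = line.split()
    fields[0] = str(number)
    match = CHUNK_NAME.match(fields[4])
    if match is None:
        return " ".join(fields)  # query name is not a chunk name: keep coordinates
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if fields[7] == "-" and chrom_len is not None:
        offset = chrom_len - end  # reverse-complement coordinates (assumption, see above)
    else:
        offset = start            # same behaviour as the Perl one-liner
    fields[4] = chrom
    fields[5] = str(int(fields[5]) + offset)
    fields[6] = str(int(fields[6]) + offset)
    return " ".join(fields)


if __name__ == "__main__":
    header = "0 target_chr 100 200 CM023912.1.s6000000.e7000000 11 111 + 5000"
    print(shift_axt_header(header, number=7))
```

Without `chrom_len`, the function gives the same result as the one-liner for both strands.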

OUTCOME:
Splitting the large input chromosomes into smaller chunks facilitates embarrassingly parallel running of the LastZ alignments. It also makes runtime easier to estimate when using HPC facilities.

Automate lastz command creation

The preparelastz script is less than ideal. Work on it so that the query and target species can be selected from the species.txt file.
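Neither preparelastz nor the layout of species.txt is shown in this issue, so the following is only a rough sketch of what such a selection could look like. It assumes one species name per line in species.txt (with `#` comments allowed) and hypothetical `<species>.fa` and `<species>.capsule` file names; the lastz parameters are the ones used in the LastZ alignments issue.

```python
from typing import List

LASTZ_ARGS = ["K=2400", "L=3000", "Y=9400", "H=2000", "--ambiguous=iupac", "--format=axt"]


def read_species(text: str) -> List[str]:
    """Parse the assumed species.txt layout: one name per line, '#' starts a comment line."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def lastz_command(query: str, target: str, known: List[str]) -> List[str]:
    """Build a lastz command for one query/target pair, rejecting unknown species."""
    for name in (query, target):
        if name not in known:
            raise ValueError(f"{name!r} is not listed in species.txt")
    if query == target:
        raise ValueError("query and target must differ")
    return [
        "lastz_32",
        f"--targetcapsule={target}.capsule",  # hypothetical file naming
        f"{query}.fa",                        # hypothetical file naming
        *LASTZ_ARGS,
        f"--output={query}.vs.{target}.axt",  # hypothetical file naming
    ]


if __name__ == "__main__":
    species = read_species("# example list\nchicken\njackdaw\n")
    print(" ".join(lastz_command("jackdaw", "chicken", species)))
```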

Runtime improvements

  1. Consider seeding options that balance against RAM utilisation. Currently everything is seeded and searched, which is not needed.
  2. RAM is largely unused. Consider increasing the backtracking RAM to avoid alignment truncation.
  3. Include alignment filters to reduce the number of alignments reported and save space.
  4. Get stats on pairs of alignments and design parameter settings for reptiles as a whole, or for birds, turtles, alligators, snakes, and lizards as groups.
  5. Look into job monitoring systems, such as Ensembl eHive or something similar, to manage a large number of small tasks.
  6. Keep a database of pairs aligned and pairs unaligned.
  7. Automate species and chromosome tables for ease of plotting.
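As one possible starting point for item 6, the sketch below keeps the alignment status of each query/target pair in a small SQLite table using Python's standard `sqlite3` module. The table layout, status values, and function names are all assumptions made for illustration.

```python
import sqlite3
from typing import List, Tuple


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the pair-tracking database."""
    con = sqlite3.connect(path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS pairs (
               query  TEXT NOT NULL,
               target TEXT NOT NULL,
               status TEXT NOT NULL CHECK (status IN ('pending', 'done', 'failed')),
               PRIMARY KEY (query, target)
           )"""
    )
    return con


def set_status(con: sqlite3.Connection, query: str, target: str, status: str) -> None:
    """Insert the pair or overwrite its current status."""
    con.execute(
        "INSERT OR REPLACE INTO pairs (query, target, status) VALUES (?, ?, ?)",
        (query, target, status),
    )
    con.commit()


def unaligned(con: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Pairs that have not finished successfully."""
    rows = con.execute(
        "SELECT query, target FROM pairs WHERE status != 'done' ORDER BY query, target"
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_db()
    set_status(db, "jackdaw", "chicken", "done")
    set_status(db, "turtle", "chicken", "pending")
    print(unaligned(db))
```

A single table like this could also feed the species and chromosome tables mentioned in item 7, although a dedicated workflow manager would track job state in more detail.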
