LastZ alignments take extremely long time to run. See the table below.
ChromosomeSize ElapsedWalltimehours
110772064 318
161160915 219
117486409 182
154487646 315
79982351 176
110838418 314
Chicken genome (GRCg6a) was used as the reference. CM023912.1 was used as the test chromosome. It is chromosome 1 of the Western jackdaw bird, which is 117,486,409 nt long sequence. First, the jackdaw chromosome was aligned to the chicken genome as is using the following command.
lastz_32 --targetcapsule=GCF_000002315.6_GRCg6a_genomic.capsule CM023912.1.fa K=2400 L=3000 Y=9400 H=2000 --ambiguous=iupac --format=axt --output=CM023912.1.fa.axt
It took 655,336 seconds (~188 hours) of walltime to align this chromosome.
Then, we split the input jackdaw chromosome 1 sequence into chunks of 1Mb each and creating 20 chunks of the chromosome with each chunk/file containing ~6 sequences of 1Mb each using the following command.
perl splitFasta.pl -i CM023912.1.fa -o CM023912.1.c -c 20
Each of the 20 files were then aligned against the chicken reference genome in an embarrassingly parallel manner using GNU Parallel Util. Typical command was as follows:
lastz_32 --targetcapsule=GCF_000002315.6_GRCg6a_genomic.capsule CM023912.1.c.09.fa K=2400 L=3000 Y=9400 H=2000 --ambiguous=iupac --format=axt --output=CM023912.1.c.09.axt
Alignment times ranged between 12.25 seconds for the last chunk with only 3,486,409 nt sequence. The longest runtime was 12,641 seconds (3.51 hours).
There were 1,894,570 alignment blocks reported in the output .axt
output file without splitting the chromosome and 1,894,351 alignments blocks after splitting the chromosome. This show a small reduction (219) in the number of aligned blocks. The coverage on the chicken genome was 1,017,962,208 bp without the split and 1,017,817,249 bp after the split.
Further testing is required of the genomic intervals to assess if all areas of the query and the target are covered in the alignment
For now, split alignments can be merged into a single file for downstream processing using
grep -h ^'#' *.axt >merged.output
grep -vh ^'#' *.axt | perl -lne 'if ($_=~/^\d+/) { @a = split (" ", $_); $a[4] =~ /(\S+)\.s(\d+)\.e\d+/; $c=$1;$s=$2; $a[4] = $c; $a[5] = $s + $a[5]; $a[6] = $s + $a[6]; $a[0] = $counter ? $counter : 0; $counter++; print join (" ", @a); } else { print $_ }' >> merged.output
axtChain -minScore=3000 -linearGap=medium merged.output GCF_000002315.6_GRCg6a_genomic.2bit query_genomic.2bit output.chain
OUTCOME:
Splitting the large input chromosomes into smaller chunks facilitate embarrassingly parallel running of the LastZ alignments. It also allows for more predictable estimation of runtime when using HPC facilities.