April helps noteworthy genomes spring up and stand out from large sequence datasets.
First off, a disclaimer: April is not built yet and is in its very early stages. That said, I can still provide a little backstory.
For the past year, I've been working on ALPINE, a tool for finding noteworthy SARS-CoV-2 lineages. One part of the pipeline builds a pairwise nucleotide distance matrix of the sequences from each month of the COVID-19 pandemic, with the aim of finding sequences that are highly mutated. However, we kept getting the odd result that hardly any sequences sprang up as highly mutated, and those that did were hardly impressive. Among other possible explanations, I decided to explore whether (a) sequences contaminated with primer sequences and (b) sequences bioinformatically contaminated with the SARS-CoV-2 reference sequence were artificially inflating the pairwise distances for most sequences each month, thereby making it harder for noteworthy sequences to stand out.
So, the idea of April is to allow noteworthy samples to "spring up" from a background of potentially contaminated consensus sequences. The name comes from the fact that, in the Northern Hemisphere, most plants start springing up in April.
April could also be an acronym for one of the following (I just haven't chosen one yet):
- Approximate Permutations of Reference Infiltration eLiminated
- Accuracy Profiler and Reference Influence Limiter
- Accurate Purification of Reference Integration Layers
April is being built in Rust with an emphasis on a rich and configurable command-line interface, parallel processing and filtering of FASTA records, fast k-mer containment computations, and informative report generation.
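
Since April is not built yet, the following is only a minimal sketch of what a k-mer containment computation can look like, not April's actual implementation. It assumes containment is defined as the fraction of a query's distinct k-mers that also occur in a reference, and it skips any k-mer containing a character other than A, C, G, or T. The function names `kmer_set` and `containment` are hypothetical, and the sketch uses only the Rust standard library.

```rust
use std::collections::HashSet;

/// Collect the distinct k-mers of `seq`, uppercased.
/// Windows containing anything other than A, C, G, or T (for example N) are skipped.
fn kmer_set(seq: &str, k: usize) -> HashSet<Vec<u8>> {
    let bytes: Vec<u8> = seq.bytes().map(|b| b.to_ascii_uppercase()).collect();
    // `windows(0)` would panic, and a sequence shorter than k has no k-mers.
    if k == 0 || bytes.len() < k {
        return HashSet::new();
    }
    bytes
        .windows(k)
        .filter(|w| w.iter().all(|b| matches!(b, b'A' | b'C' | b'G' | b'T')))
        .map(|w| w.to_vec())
        .collect()
}

/// Fraction of the query's distinct k-mers that are also present in the reference.
/// Returns `None` when the query has no valid k-mers, so callers can tell
/// "nothing to compare" apart from "no overlap".
fn containment(query: &str, reference: &str, k: usize) -> Option<f64> {
    let query_kmers = kmer_set(query, k);
    if query_kmers.is_empty() {
        return None;
    }
    let reference_kmers = kmer_set(reference, k);
    let shared = query_kmers
        .iter()
        .filter(|kmer| reference_kmers.contains(*kmer))
        .count();
    Some(shared as f64 / query_kmers.len() as f64)
}
```

For example, `containment("ACGTTT", "ACGA", 3)` compares the four query 3-mers (ACG, CGT, GTT, TTT) against the reference and finds only ACG shared, giving `Some(0.25)`. A real tool would likely hash or bit-pack k-mers rather than store them as byte vectors, but the definition stays the same.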