Under active development. This software is a sequence- and gene-aware pangenome construction toolkit. It builds the panreference iteratively:
- The assembly with the most sequence is considered the "current working panreference" at the start.
- Take the longest unprocessed assembly and its GFF3 file, when available.
- Compare it to the current working panreference using nucmer. Add all non-redundant sequences to the current working panreference as separate contigs.
- If a GFF3 file is provided, extend any non-redundant sequences to existing feature (gene) boundaries so that partial features are not carried into the next iteration.
- Add any features, renaming them when appropriate, to the current working panreference GFF3 file.
- If unprocessed assemblies remain, repeat from the step that takes the longest unprocessed assembly.
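The loop above can be sketched in Python as follows. This is a minimal illustration, not the toolkit's actual code: `build_panreference`, `extend_to_features`, the `find_novel` callback (a stand-in for the nucmer comparison), the contig naming scheme, and the use of 0-based half-open coordinates are all assumptions made for the example. Updating the panreference GFF3 file and merging overlapping regions are left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Interval = Tuple[int, int]      # assumed 0-based, half-open [start, end)
Assembly = Dict[str, str]       # contig name -> sequence


def extend_to_features(region: Interval, features: List[Interval]) -> Interval:
    """Grow a region until it no longer cuts through any feature."""
    start, end = region
    changed = True
    while changed:  # repeat, since growing the region can reach new features
        changed = False
        for f_start, f_end in features:
            overlaps = f_start < end and start < f_end
            if overlaps and (f_start < start or f_end > end):
                start, end = min(start, f_start), max(end, f_end)
                changed = True
    return start, end


def build_panreference(
    assemblies: Dict[str, Assembly],
    find_novel: Callable[[Assembly, Assembly], Dict[str, List[Interval]]],
    features: Optional[Dict[str, Dict[str, List[Interval]]]] = None,
) -> Assembly:
    """Iteratively add non-redundant sequence, longest assembly first.

    find_novel(panref, assembly) is a hypothetical hook that returns, per
    contig, the regions of `assembly` not already present in `panref`.
    In practice this would wrap a nucmer alignment.
    """
    def total_length(name: str) -> int:
        return sum(len(seq) for seq in assemblies[name].values())

    order = sorted(assemblies, key=total_length, reverse=True)

    # The assembly with the most sequence starts as the working panreference.
    first = order[0]
    panref = {f"{first}|{contig}": seq for contig, seq in assemblies[first].items()}

    for name in order[1:]:
        assembly = assemblies[name]
        annotated = (features or {}).get(name, {})
        for contig, regions in find_novel(panref, assembly).items():
            for region in regions:
                # Avoid carrying partial features into the next iteration.
                start, end = extend_to_features(region, annotated.get(contig, []))
                # Each non-redundant region becomes its own contig.
                panref[f"{name}|{contig}:{start}-{end}"] = assembly[contig][start:end]
    return panref
```

Passing the comparison step in as a callback keeps the sketch runnable without nucmer installed; a real `find_novel` would need to parse alignment output and convert GFF3's 1-based inclusive coordinates to whatever convention the rest of the code uses.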
Until recently, most pangenomes consisted solely of genes and ignored the rest of the genomic context. Identifying all sequence within a population has advantages, such as:
- Expanded SNP Calling
- Identify large gene families
- Identify sequence of structural variants
- Expanded sequence
- Population modalities of sequences
Work based on a single reference may exclude many genes and sequences from downstream population studies. Constructing a panreference reduces this problem.
TODO:
- Web interface
- Better statistic tracking
- Load into Neo4j database
- Identify sequences that are singletons, and which sequence chunks are found in multiple, but not all, population members
- Identify conserved sequences
- Identify structural variations such as translocations, inversions, etc.
Not yet
Copyright © 2017 Joseph Guhlin
GPLv3