Code Monkey home page Code Monkey logo

pangea's Introduction

Pangea

The Pan-Genome Analysis Pipeline

DOI

Pangea is a pipeline to calculate the entire gene collection of a set of prokaryotic organisms (Pan genome). It is based on an all-against-all algorithm to find shared and non-shared genes between samples.

Pangea was designed to compare high diverse prokaryotic genomes using profile Hidden Markov Model (profile HMM) to make a position-specific scoring. Pangea builds a profile HMM for each gene that constitutes the pan genome. Then each profile is used to search the corresponding gene in each sample. During this search the profiles are iteratively populated with the corresponding homologous sequences. This allows each profile HMM to include all the variations that a gene presents between samples, increasing the robustness and therefore the accuracy of the search.

First, Pangea makes a single blast comparison (blastn for nucleotides or blastp for amino acids) between the features of the first two query samples or between the first query and a reference sample. Homologous sequences, according to an established e-value and a percentage of identity, are considered shared features, while features without any blast hit are considered non-shared. Sequences of shared features are extracted and aligned with muscle to build a profile HMM from the alignment using hmmbuil, while sequences of shared features are extracted and used to build profile HMM from a single sequence. Then, the profiles HMM are used to look for homologous features in the remaining samples according to the same percentage of identity and e-value used in the blast comparison. Only the best hit is considered homolog (shared feature) and its sequence is added to the corresponding profile HMM, while non-homologous features are re-analyzed to discard duplicates or fragmented sequences, which would generate less significant hits. Remaining non-homologous features are considered new elements of the pan genome and their sequence are used to create profiles HMM from a single sequence. All features are reported in a presence/absence table, which represents the composition of the entire pan genome, including the core genome, accessory genome and singletons.

Workflow

Installation

Pangea is a collection of perl scripts and it has been tested on Linux and MacOS. For its correct execution requires the previous installation of some third-party programs.

Dependencies

Third-party:

Perl modules:

R packages:

Input files

For now, Pangea works only with the outputs produced by Prokka. All the annotated samples should be contained in the same directory, one folder per sample. It is highly recommended to use the --locustag option during the annotation process with prokka. For more information on the installation and execution of prokka check here.

You have to create a text file that lists all the samples you want to analyze. The samples will be analyzed according to the order specified in the list.

Additionally you can include a reference sample. This can be declared at the beginning of your sample list or you can use the --trusted option to include it. See the Usage section.

Outputs

Depending on the used options, the possible outputs include the following:

Name Type Description
Blast Dir Contains all the blast databases used by the program
PanGenomeHmmDb Dir Contains a database of the pan genome used during the HMMER comparisons
ORFeomes Dir Contains the ORFeomes or Proteomes of the queries
ORFs Dir Contains all the analyzed features. One folder per feature
CoreSequences Dir Contains the core and the aligned-core fasta files of each query
*UniqueBlastComparison.txt File Contains the blast report of the first comparison
*Progress.csv File Records the progress of the analysis. It includes the total number of features, the size of the pan-genome and the number of new genes added in each sample analyzed
*Summary.txt File Contains a summary of the parameters used for the analysis and the final size of the pan and core genome
*Presence_Absence.csv File Contains the record of the features present in each sample. Each gene is identified by its locus tag
*CoreGenome.csv File Contains the record of the features present in the core genome defined as features present in the 100% of samples. Each gene is identified by its locus tag
*Boolean_PanGenome.csv File Contains the record of the features present in each sample. Presence of a feature is represented by 1 while absence is represented by 0
*Boolean_AccessoryGenes.csv File Contains the record of the dispensable features defined as features present in more than one sample but not present in the 100% of samples. Presence of a feature is represented by 1 while absence is represented by 0
*PanGenome.fasta File Multifasta file with the sequences of features of the final pan genome. Features included were extracted from the first sample that each feature was detected
*ConsensusPanGenome File Multifasta file with the consensus sequence of the pan genome obtained from the profile HMM of each feature
*PanGenome_HeatMap.pdf Plot Representation of the Presence/Absence of fueatures of the pan genome

Usage

   Usage:
      $./Pangea.pl [options] --list Full/Path/Of/Sample/List --annotation /Full/Path/Of/Annotation/Directory 
      --moltype nucl/prot --out Full/Path/For/Outputs

   Options:  
      --help                      Print this help message
      --project      -p   STR     Set a prefix name for the project
      --list         -l   STR     File with the list of samples. Full path is mandatory
      --trusted      -c   STR     Name of the sample that you want to use as reference. 
                                  Annotation of reference shuld to be included in the same 
                                  directory as the rest of the samples
      --evalue       -e   REAL    Expectation value threshold used on blast and HMMER comparisons
      --ident        -i   REAL    Minimum percentage of identity for blast and HMMER
      --cpus         -t   INT     Number of threads used only on some steps of blast and HMMER searches
      --conpan       -P           Create a multifasta file of consensus pan genes
      --alncore      -C           Creaate a multifasta file of aligned core genes
      --booleantbl   -b           Create gene presence/absence table for the pan and accessory genes 
                                  where presence is represented by 1 and abence by 0
      --annotation   -a   STR     Full path of the annotation directory           
      --moltype      -m   STR     Type of molecule you want to analyze, "nucl" for nucleotides 
                                  and "prot" for aminoacids
      --out          -o   STR     Output directory
      --recovery     -r           Recover an interrupted process and/or increase the analysis 
                                  with additional samples. Increase the analysis requires adding 
                                  the new samples to the sample list

Citation

Seemann T.
Prokka: rapid prokaryotic genome annotation
Bioinformatics 2014 Jul 15;30(14):2068-9.
PMID:24642063


Website: https://torresrc.github.io/Pangea/

Also available from:

Download Pangea

pangea's People

Contributors

torresrc avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

hugeandbulk

pangea's Issues

default values for command line arguments

I cannot find default values for the command line arguments. As a result, all the parameters need to be manually set for the program to run. It is particularly disturbing because I do not want to specify --trusted

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.