Pipeline for the comprehensive evaluation and re-annotation of gene and gene families in a genome assembly

This pipeline automates and standardizes gene and gene family member annotations for a number of gene/gene families in genome assemblies. It allows to obtain the most accurate number of gene copies and minimizes methodological biases that would otherwise perturb downstream comparative analyses. BITACORA and GeMoMa are the main tools used for the identification and annotation of gene families in genome assemblies, together with a first step that identifies and curate gene models using Blastp and InterProScan based on an input file with information of the gene families to annotate. This pipeline was optimized to re-annotate more than 400 gene and gene families across the 163 ant genomes generated in the Global Ant Genomics Alliance (GAGA) project. Briefly, the pipeline evaluates the initial generated general annotations, annotate previously missed genes, and label putative erroneus annotations to be manually reviewed and fixed if necessary using an Apollo or other similar tools.

Find here the specific datasets and steps used in the GAGA project: https://github.com/schraderL/GAGA/tree/main/Scripts/04_Gene_re-annotation

Prerequisites
Installation
Computational Requirements
Usage

4.1 Preparing data

4.2 Run pipeline

4.3 Output
Example

1. Prerequisites

The requiered dependencies necessary for running the pipeline are:

Perl: Perl is installed by default in most operating systems. See https://learn.perl.org/installing/ for installation instructions.
Python: Download the newest version available from https://www.python.org/downloads/
Biopython: Download from https://biopython.org/wiki/Download
BLAST: Download blast executables from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
HMMER: The easiest way to install HMMER in your system is to type one of the following commands in your terminal:

brew install hmmer               # OS/X, HomeBrew
port install hmmer               # OS/X, MacPorts
apt install hmmer                # Linux (Ubuntu, Debian...)
dnf install hmmer                # Linux (Fedora)
yum install hmmer                # Linux (older Fedora)
conda install -c bioconda hmmer  # Anaconda

Or compile HMMER binaries from the source code: http://hmmer.org/

GeMoMa: GeMoMa is implemented in Java using Jstacs and can be downloaded from: http://www.jstacs.de/index.php/GeMoMa. The latest version of GeMoMa tested in our pipeline is v1.7.1.
BITACORA: Already inlcluded in this repository, but it can be downloaded from https://github.com/molevol-ub/bitacora HMMER and BLAST binaries require to be added to the PATH environment variable, or specified in the master script "runBITACORA_command_line.sh". It is also necessary to add in the script the path to the bitacora github repository and the path to GeMoMa jar file (i.e.: GeMoMa-1.7.1.jar).
MAFFT: Download the newest version available from https://mafft.cbrc.jp/alignment/software/

Finally, InterProScan output (in tsv format) generated by running this program in the initial annotated proteins is required as input. The script "submit_interpro.sh" must be executed before running the pipeline.

InterProScan: Download it from https://interproscan-docs.readthedocs.io/en/latest/UserDocs.html#obtaining-a-copy-of-interproscan

NOTE: When running InterProScan, it is highly recommended to be do it in a computer cluster or workstation given the increase of RAM memory and time required.

2. Installation

The pipeline is distributed as a multiplatform shell script (run_pipeline.sh) that calls several other Python and Perl scripts, which include all functions responsible for performing all pipeline tasks. The version of BITACORA used in the pipeline is already included in this repository, hence, it does not require any installation or compilation step apart from the ones stated in Prerequisites.

To run the pipeline edit the master script "run_pipeline.sh" variables and also the script of BITACORA "runBITACORA_command_line.sh" path variables: BLAST and HMMER (if needed), BITACORA script folder and GeMoMa jar file.

3. Computational requirements

The pipeline have been tested in UNIX-based platforms (both in Mac OS and Linux operating systems). Multiple threading can be set in blast searches, which is the most time-consuming step, by editing the variable THREADS in "run_pipeline.sh".

For a typical good quality genome (~2Gb in size and ~10,000 scaffolds) and a 40Gb RAM machine, it is able to analyze a whole genome and a total of 94 gene and gene families in 10 hours using 4 cores.

4. Usage

usage: run_analysis.py [-h]
                       pipeline_dir gene_families_info gene_families_db
                       proteome interpro gff genome bitacora name out_dir
                       threads

Pipeline for gene family re-annotation accross hundreds of genomes.

positional arguments:
  pipeline_dir        Main directory of the pipeline that contains the
                      README.md.
  gene_families_info  Excel file containing all the gene families information.
  gene_families_db    Directory containing the query protein databases
                      (GENEFAMILY-NAME_db.fasta), where the “GENEFAMILY-NAME”
                      label is a gene family name from the Excel file. The
                      addition of ”_db” to the database name with its proper
                      extension is mandatory. ej: PATH/TO/DB
  proteome            File with predicted proteins in FASTA format.
  interpro            File with predicted domains from InterPro in TSV format.
  gff                 File with structural annotations in GFF3 format.
  genome              File with genomic sequences in FASTA format.
  bitacora            Path to script runBITACORA_command_line.sh.
  name                Name/ID of the genome ej. GAGA-0001.
  out_dir             Output directory. ej: PATH/TO/OUT_DIR
  threads             Number of threads to be used.

optional arguments:
  -h, --help          show this help message and exit

4.1. Preparing data

The input files required to run a full analysis (update the complete path to these files if needed in the master script "run_pipeline.sh") are the following:

gene_families_info: Excel file called "gene_families.xlsx" that contains all the gene families names to be annotated together with the information requiered by the pipeline about them.

Example of the format of this file:

Function/Classification	Gene family	Blast	InterPro Domain	Pfam domain	Average number of genes	Average protein length	Peptide start length	E-value
Chemosensory receptors	OR	Yes	IPR004117	PF02949	300-400	400	40	1,00E-5
Chemosensory and others	CD36	Yes	IPR002159	PF01130	9	350	40	1,00E-5
...	...	...	...	...	...	...	...	...

NOTE: Not all the fields from the table have to be filled for a gene family to be annotated. "Function/Classification", "InterPro Domain" and "Pfam domain" can be blank cells (specially if there is no domain information about the gene family). "Blast" cell can also be blank if there is no input fasta database of the gene family to run a blastp.

gene_families_db: directory containing the query protein databases (GENEFAMILY-NAME_db.fasta) in FASTA format, where the “GENEFAMILY-NAME” label is a gene family name from the Excel file. The addition of ”_db” to the database name with its proper extension is mandatory.
out_dir: output directory containing all gene families annotation.
File with genomic sequences in FASTA format
File with structural annotations in GFF3 format
File with predicted proteins in FASTA format.
File with predicted domains from InterPro in TSV format

Those files are some of the variables that must be specified in the main script "run_pipeline.sh".

4.2. Run pipeline

Once edited the scripts as indicated in the Installation step and prepared the data, the pipeline can be executed with the following command:

bash run_pipeline.sh

or in the case of running in the computerome, or to submit in a computational cluster:

bash run_pipeline_computerome.sh

4.3. Output

The pipeline generates for each GENOME AND GENE FAMILY, two outputs of interest:

Step 1 outputs:

The output in this folder provide from the identification of genes or gene family members using the BLAST and protein domain searches (HMMER and InterProScan), and the identification and curation or putative annotation errors (such as chimeric or partial proteins) based on the results from these searches. These output files are generated in the first step before running bitacora and are located in the directory "OUTPUT_DIR/gene_families_pipeline/GENE-FAMILY-NAME/Step1". In this directory the GFF, CDS and PEP fasta files can be found as well as some Intermediate files.

Step 2 outputs:

In the second step, we include the proteins identified in the first step and conduct a comprehensive identification and annotation of gene family members using BITACORA. The output are located in the directory "OUTPUT_DIR/gene_families_pipeline/GENE-FAMILY-NAME/Step2_bitacora". The description of this output can be found in the Documentation of BITACORA https://github.com/molevol-ub/bitacora.git

Summary file:

For a general overview of the results, a file called table_results.txt in the output directory is produced and contains the number of genes identified in each step, as well as some reference numbers indicated in the input excel table to validate the retrieved annotations.

4.4. Output selection

After running the pipeline, it's time to check the table_results.txt file for each genome and correct the gene families that have a label "Review". The steps we followed in the GAGA project are described here: https://github.com/schraderL/GAGA/tree/main/Scripts/04_Gene_re-annotation. For more information about this step contact [email protected].

The last step is to generate the final output that consists of selecting the desired output files for each gene family that is indicated in the last column of the table_results.txt. To do so, run the script generate_final_oputputs.sh after filling the following arguments:

usage: gen_final_output.py [-h]
                           pipeline_dir genome gene_families_info
                           pipeline_output out_dir

Script to generate the final outputs after running the re-annotation pipeline.

positional arguments:
  pipeline_dir        Main directory of the pipeline that contains the
                      README.md.
  genome              File with genomic sequences in FASTA format.
  gene_families_info  Excel file containing all the gene families information.
  pipeline_output     Directory with the output of the pipeline. ej:
                      PATH/TO/GAGA-0001
  out_dir             Output directory.

optional arguments:
  -h, --help          show this help message and exit

5. Example

An example to run the pipeline can be found in Example folder. It consists of two chemosensory-related gene families in insects: Odorant receptors (ORs), and the CD36-SNMP gene family; will be searched in the chromosome 2R of Drosophila melanogaster. The GFF3 and protein files are modified from original annotations, deleting some gene models, to allow that BITACORA can identify novel not-annotated genes.

First, unzip the Example_files.zip to obtain the necessary files. Move the folder "Gene_families" with the fasta databases files to "Gene-annotation-pipeline/Data/". Then, move the folder named "Drosophila_melanogaster" (the genome name) in the directory "Gene-annotation-pipeline/Data/Genomes". Finally, move the "gene_families.xlsx" file to "Gene-annotation-pipeline/Data".

To run the example, edit the script "runBITACORA_command_line.sh" from BITACORA to add the paths variables and also edit the "run_pipeline.sh" script to add your path to the pipeline and the genome name (in this example "Drosophila_melanogaster"). Then in the command line:

bash run_pipeline.sh

6. Citation

Joel Vizueta, Alejandro Sánchez-Gracia, Julio Rozas. BITACORA: A comprehensive tool for the identification and annotation of gene families in genome assemblies. Molecular Ecology Resources, 2020. https://doi.org/10.1111/1755-0998.13202

References for GeMoMa:

J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. Using intron position conservation for homology-based gene prediction. Nucleic Acids Research, 2016. https://doi.org/10.1093/nar/gkw092

J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 2018. https://doi.org/10.1186/s12859-018-2203-5

iandreu / gene-annotation-pipeline Goto Github PK

gene-annotation-pipeline's Introduction

Pipeline for the comprehensive evaluation and re-annotation of gene and gene families in a genome assembly

Contents

1. Prerequisites

2. Installation

3. Computational requirements

4. Usage

4.1. Preparing data

4.2. Run pipeline

4.3. Output

4.4. Output selection

5. Example

6. Citation

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent