Pipeline for the comprehensive evaluation and re-annotation of gene and gene families in a genome assembly
This pipeline automates and standardizes gene and gene family member annotations for a number of gene/gene families in genome assemblies. It allows to obtain the most accurate number of gene copies and minimizes methodological biases that would otherwise perturb downstream comparative analyses. BITACORA and GeMoMa are the main tools used for the identification and annotation of gene families in genome assemblies, together with a first step that identifies and curate gene models using Blastp and InterProScan based on an input file with information of the gene families to annotate. This pipeline was optimized to re-annotate more than 400 gene and gene families across the 163 ant genomes generated in the Global Ant Genomics Alliance (GAGA) project. Briefly, the pipeline evaluates the initial generated general annotations, annotate previously missed genes, and label putative erroneus annotations to be manually reviewed and fixed if necessary using an Apollo or other similar tools.
Find here the specific datasets and steps used in the GAGA project: https://github.com/schraderL/GAGA/tree/main/Scripts/04_Gene_re-annotation
-
Prerequisites
-
Installation
-
Computational Requirements
-
Usage
4.1 Preparing data
4.2 Run pipeline
4.3 Output
-
Example
The requiered dependencies necessary for running the pipeline are:
- Perl: Perl is installed by default in most operating systems. See https://learn.perl.org/installing/ for installation instructions.
- Python: Download the newest version available from https://www.python.org/downloads/
- Biopython: Download from https://biopython.org/wiki/Download
- BLAST: Download blast executables from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- HMMER: The easiest way to install HMMER in your system is to type one of the following commands in your terminal:
brew install hmmer # OS/X, HomeBrew
port install hmmer # OS/X, MacPorts
apt install hmmer # Linux (Ubuntu, Debian...)
dnf install hmmer # Linux (Fedora)
yum install hmmer # Linux (older Fedora)
conda install -c bioconda hmmer # Anaconda
Or compile HMMER binaries from the source code: http://hmmer.org/
-
GeMoMa: GeMoMa is implemented in Java using Jstacs and can be downloaded from: http://www.jstacs.de/index.php/GeMoMa. The latest version of GeMoMa tested in our pipeline is v1.7.1.
-
BITACORA: Already inlcluded in this repository, but it can be downloaded from https://github.com/molevol-ub/bitacora HMMER and BLAST binaries require to be added to the PATH environment variable, or specified in the master script "runBITACORA_command_line.sh". It is also necessary to add in the script the path to the bitacora github repository and the path to GeMoMa jar file (i.e.: GeMoMa-1.7.1.jar).
-
MAFFT: Download the newest version available from https://mafft.cbrc.jp/alignment/software/
Finally, InterProScan output (in tsv format) generated by running this program in the initial annotated proteins is required as input. The script "submit_interpro.sh" must be executed before running the pipeline.
- InterProScan: Download it from https://interproscan-docs.readthedocs.io/en/latest/UserDocs.html#obtaining-a-copy-of-interproscan
NOTE: When running InterProScan, it is highly recommended to be do it in a computer cluster or workstation given the increase of RAM memory and time required.
The pipeline is distributed as a multiplatform shell script (run_pipeline.sh) that calls several other Python and Perl scripts, which include all functions responsible for performing all pipeline tasks. The version of BITACORA used in the pipeline is already included in this repository, hence, it does not require any installation or compilation step apart from the ones stated in Prerequisites.
To run the pipeline edit the master script "run_pipeline.sh" variables and also the script of BITACORA "runBITACORA_command_line.sh" path variables: BLAST and HMMER (if needed), BITACORA script folder and GeMoMa jar file.
The pipeline have been tested in UNIX-based platforms (both in Mac OS and Linux operating systems). Multiple threading can be set in blast searches, which is the most time-consuming step, by editing the variable THREADS in "run_pipeline.sh".
For a typical good quality genome (~2Gb in size and ~10,000 scaffolds) and a 40Gb RAM machine, it is able to analyze a whole genome and a total of 94 gene and gene families in 10 hours using 4 cores.
usage: run_analysis.py [-h]
pipeline_dir gene_families_info gene_families_db
proteome interpro gff genome bitacora name out_dir
threads
Pipeline for gene family re-annotation accross hundreds of genomes.
positional arguments:
pipeline_dir Main directory of the pipeline that contains the
README.md.
gene_families_info Excel file containing all the gene families information.
gene_families_db Directory containing the query protein databases
(GENEFAMILY-NAME_db.fasta), where the “GENEFAMILY-NAME”
label is a gene family name from the Excel file. The
addition of ”_db” to the database name with its proper
extension is mandatory. ej: PATH/TO/DB
proteome File with predicted proteins in FASTA format.
interpro File with predicted domains from InterPro in TSV format.
gff File with structural annotations in GFF3 format.
genome File with genomic sequences in FASTA format.
bitacora Path to script runBITACORA_command_line.sh.
name Name/ID of the genome ej. GAGA-0001.
out_dir Output directory. ej: PATH/TO/OUT_DIR
threads Number of threads to be used.
optional arguments:
-h, --help show this help message and exit
The input files required to run a full analysis (update the complete path to these files if needed in the master script "run_pipeline.sh") are the following:
- gene_families_info: Excel file called "gene_families.xlsx" that contains all the gene families names to be annotated together with the information requiered by the pipeline about them.
Example of the format of this file:
Function/Classification | Gene family | Blast | InterPro Domain | Pfam domain | Average number of genes | Average protein length | Peptide start length | E-value |
---|---|---|---|---|---|---|---|---|
Chemosensory receptors | OR | Yes | IPR004117 | PF02949 | 300-400 | 400 | 40 | 1,00E-5 |
Chemosensory and others | CD36 | Yes | IPR002159 | PF01130 | 9 | 350 | 40 | 1,00E-5 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
NOTE: Not all the fields from the table have to be filled for a gene family to be annotated. "Function/Classification", "InterPro Domain" and "Pfam domain" can be blank cells (specially if there is no domain information about the gene family). "Blast" cell can also be blank if there is no input fasta database of the gene family to run a blastp.
-
gene_families_db: directory containing the query protein databases (GENEFAMILY-NAME_db.fasta) in FASTA format, where the “GENEFAMILY-NAME” label is a gene family name from the Excel file. The addition of ”_db” to the database name with its proper extension is mandatory.
-
out_dir: output directory containing all gene families annotation.
-
File with genomic sequences in FASTA format
-
File with structural annotations in GFF3 format
-
File with predicted proteins in FASTA format.
-
File with predicted domains from InterPro in TSV format
Those files are some of the variables that must be specified in the main script "run_pipeline.sh".
Once edited the scripts as indicated in the Installation step and prepared the data, the pipeline can be executed with the following command:
bash run_pipeline.sh
or in the case of running in the computerome, or to submit in a computational cluster:
bash run_pipeline_computerome.sh
The pipeline generates for each GENOME AND GENE FAMILY, two outputs of interest:
- Step 1 outputs:
The output in this folder provide from the identification of genes or gene family members using the BLAST and protein domain searches (HMMER and InterProScan), and the identification and curation or putative annotation errors (such as chimeric or partial proteins) based on the results from these searches. These output files are generated in the first step before running bitacora and are located in the directory "OUTPUT_DIR/gene_families_pipeline/GENE-FAMILY-NAME/Step1". In this directory the GFF, CDS and PEP fasta files can be found as well as some Intermediate files.
- Step 2 outputs:
In the second step, we include the proteins identified in the first step and conduct a comprehensive identification and annotation of gene family members using BITACORA. The output are located in the directory "OUTPUT_DIR/gene_families_pipeline/GENE-FAMILY-NAME/Step2_bitacora". The description of this output can be found in the Documentation of BITACORA https://github.com/molevol-ub/bitacora.git
- Summary file:
For a general overview of the results, a file called table_results.txt in the output directory is produced and contains the number of genes identified in each step, as well as some reference numbers indicated in the input excel table to validate the retrieved annotations.
After running the pipeline, it's time to check the table_results.txt file for each genome and correct the gene families that have a label "Review". The steps we followed in the GAGA project are described here: https://github.com/schraderL/GAGA/tree/main/Scripts/04_Gene_re-annotation. For more information about this step contact [email protected].
The last step is to generate the final output that consists of selecting the desired output files for each gene family that is indicated in the last column of the table_results.txt. To do so, run the script generate_final_oputputs.sh after filling the following arguments:
usage: gen_final_output.py [-h]
pipeline_dir genome gene_families_info
pipeline_output out_dir
Script to generate the final outputs after running the re-annotation pipeline.
positional arguments:
pipeline_dir Main directory of the pipeline that contains the
README.md.
genome File with genomic sequences in FASTA format.
gene_families_info Excel file containing all the gene families information.
pipeline_output Directory with the output of the pipeline. ej:
PATH/TO/GAGA-0001
out_dir Output directory.
optional arguments:
-h, --help show this help message and exit
An example to run the pipeline can be found in Example folder. It consists of two chemosensory-related gene families in insects: Odorant receptors (ORs), and the CD36-SNMP gene family; will be searched in the chromosome 2R of Drosophila melanogaster. The GFF3 and protein files are modified from original annotations, deleting some gene models, to allow that BITACORA can identify novel not-annotated genes.
First, unzip the Example_files.zip to obtain the necessary files. Move the folder "Gene_families" with the fasta databases files to "Gene-annotation-pipeline/Data/". Then, move the folder named "Drosophila_melanogaster" (the genome name) in the directory "Gene-annotation-pipeline/Data/Genomes". Finally, move the "gene_families.xlsx" file to "Gene-annotation-pipeline/Data".
To run the example, edit the script "runBITACORA_command_line.sh" from BITACORA to add the paths variables and also edit the "run_pipeline.sh" script to add your path to the pipeline and the genome name (in this example "Drosophila_melanogaster"). Then in the command line:
bash run_pipeline.sh
Joel Vizueta, Alejandro Sánchez-Gracia, Julio Rozas. BITACORA: A comprehensive tool for the identification and annotation of gene families in genome assemblies. Molecular Ecology Resources, 2020. https://doi.org/10.1111/1755-0998.13202
References for GeMoMa:
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. Using intron position conservation for homology-based gene prediction. Nucleic Acids Research, 2016. https://doi.org/10.1093/nar/gkw092
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 2018. https://doi.org/10.1186/s12859-018-2203-5