viralverify's Introduction

viralVerify: viral contig verification tool

Version: 1.1

viralVerify classifies contigs (output of metaviralSPAdes or other assemblers) as viral, non-viral or uncertain, based on gene content. For non-viral contigs, it can optionally provide a plasmid/non-plasmid classification.

viralVerify predicts genes in each contig using Prodigal in metagenomic mode, runs hmmsearch on the predicted proteins, and classifies the contig as viral or non-viral by applying a Naive Bayes classifier (NBC). Given the set of HMMs that hit the predicted proteins, the trained NBC classifies that set as viral or chromosomal.
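
As a rough illustration of the classification step, the sketch below scores a contig's HMM hits with a naive Bayes log-likelihood ratio. The frequency values, the pseudo-frequency for unseen HMMs, the symmetric use of the threshold, and the function names are all assumptions made for this example; they are not taken from viralVerify's code or its classifier table.

```python
import math

# Hypothetical per-class HMM frequencies (illustrative numbers only).
VIRAL_FREQ = {"Phage_integrase": 0.020, "Terminase_6": 0.030, "ABC_tran": 0.001}
CHROM_FREQ = {"Phage_integrase": 0.004, "Terminase_6": 0.0005, "ABC_tran": 0.020}
UNSEEN = 1e-4  # assumed pseudo-frequency for HMMs absent from a table


def log_likelihood_ratio(hmm_hits):
    """Sum log(P(hmm | viral) / P(hmm | chromosomal)) over all hits."""
    return sum(
        math.log(VIRAL_FREQ.get(h, UNSEEN) / CHROM_FREQ.get(h, UNSEEN))
        for h in hmm_hits
    )


def classify(hmm_hits, threshold=7.0):
    """Label a contig; scores within +/- threshold stay uncertain."""
    if not hmm_hits:
        return "Uncertain"
    score = log_likelihood_ratio(hmm_hits)
    if score > threshold:
        return "Virus"
    if score < -threshold:
        return "Chromosome"
    return "Uncertain"
```

Because the per-HMM terms are simply added, a contig with a single weakly informative hit stays "Uncertain", while several consistent hits can push the score past the threshold in either direction.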

To improve results for metagenomes with possible host contamination, we recommend filtering out reads that align to the host genome prior to assembly. Since viralVerify is based on gene classification, it can be used on contigs of any length, and short viruses can be detected as long as they contain a recognizable virus-specific gene. To help analyze the rapidly growing amount of novel data, we have added a script that allows users to construct their own training database from a set of viral, chromosomal and plasmid contigs, as well as a custom HMM database.

Requirements

viralVerify is a Python script, so installation is not required. However, it has the following dependencies:

or

  • recent release of the Pfam-A database (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/).

To work properly, viralVerify requires Prodigal and hmmsearch to be available in your PATH environment variable.

Optional BLAST verification

You can verify your output with BLAST to check whether you have found novel viruses or plasmids. In this case, you need to have blastn in your $PATH, Biopython installed, and a path to a nucleotide database (e.g. a local copy of the NCBI nt database). For each contig we report information (e-value, query coverage, identity and subject title) about its best BLAST hit in the provided database.

Usage

viralverify 
        -f Input fasta file
        -o output_directory 
        --hmm HMM  Path to HMM database

        Optional arguments:
        -h, --help  Show the help message and exit
        --db DB     Run BLAST on input contigs against provided db
        -t          Number of threads
        -thr THR    Sensitivity threshold (minimal absolute score to classify sequence, default = 7)
        -p          Output predicted plasmidic contigs separately
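
When viralVerify is called from a pipeline, it can help to assemble the command line programmatically. The helper below is a minimal sketch that uses only the flags listed above; the function name and the file names (contigs.fasta, vv_out, Pfam-A.hmm) are hypothetical.

```python
import shlex


def build_viralverify_command(fasta, out_dir, hmm_db, threads=None,
                              threshold=None, blast_db=None, plasmids=False):
    """Assemble an argument list using only the flags listed above."""
    cmd = ["viralverify", "-f", fasta, "-o", out_dir, "--hmm", hmm_db]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if threshold is not None:
        cmd += ["-thr", str(threshold)]
    if blast_db is not None:
        cmd += ["--db", blast_db]
    if plasmids:
        cmd.append("-p")
    return cmd


cmd = build_viralverify_command("contigs.fasta", "vv_out", "Pfam-A.hmm",
                                threads=8, plasmids=True)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with paths that contain spaces.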

Output file: comma-separated table <input_file>_result_table.csv

Output format: contig name, prediction result, log-likelihood ratio, list of predicted HMMs
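
The result table can be summarized with a few lines of Python. This is a minimal sketch that relies only on the first two columns (contig name and prediction result), since rows quoted in user reports suggest the number of remaining columns can differ between versions. The example rows are made up, and the sketch assumes there is no header line; skip it first if your file has one.

```python
import csv
import io
from collections import Counter


def read_predictions(handle):
    """Return {contig name: prediction} from a viralVerify-style CSV."""
    predictions = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        predictions[row[0]] = row[1]
    return predictions


# Two made-up rows in the four-column layout described above.
example = io.StringIO(
    "contig_1,Virus,12.4,Terminase_6 Phage_capsid\n"
    "contig_2,Chromosome,-9.1,ABC_tran HTH_1\n"
)
preds = read_predictions(example)
print(Counter(preds.values()))
```

For a real run, replace the in-memory example with open("<input_file>_result_table.csv", newline="").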

Fasta files with prediction results can be found in the Prediction_results_fasta folder

To decrease the number of false positives (at the expense of potential false negatives), you may increase the detection threshold with the optional -thr argument.

Retraining classifier

You can retrain the classifier on your own data using the provided training_script. It takes viral, chromosomal and (optionally) plasmid training sequences in FASTA format together with a set of HMMs, predicts genes and HMM hits, and returns a frequency table. To use the retrained classifier, replace the "classifier_table.txt" file in the viralVerify directory with the obtained table.
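
To make the idea of a frequency table concrete, here is a small sketch that turns per-contig HMM hits into per-class frequencies. It is not the training_script itself: the choice to count each HMM once per contig, the class labels, and the in-memory output layout are assumptions, and the real format of classifier_table.txt is not described here.

```python
from collections import Counter


def hmm_frequencies(contig_hits):
    """Fraction of contigs in one class with at least one hit to each HMM."""
    counts = Counter()
    for hits in contig_hits:
        counts.update(set(hits))  # count each HMM once per contig
    total = len(contig_hits)
    return {hmm: n / total for hmm, n in counts.items()}


def frequency_table(hits_by_class):
    """Map each HMM name to its per-class frequency (0.0 when unseen)."""
    per_class = {c: hmm_frequencies(h) for c, h in hits_by_class.items()}
    all_hmms = sorted(set().union(*per_class.values()))
    return {
        hmm: {c: per_class[c].get(hmm, 0.0) for c in per_class}
        for hmm in all_hmms
    }


# Made-up training hits: one list of HMM names per training contig.
table = frequency_table({
    "viral": [["Terminase_6", "Phage_capsid"], ["Terminase_6"]],
    "chromosomal": [["ABC_tran"], ["ABC_tran", "HTH_1"]],
})
```

A table like this is what a naive Bayes classifier needs at prediction time: for every HMM, how often it is seen in each class.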

viralverify's People

Contributors

dmitry-antipov, mikeraiko, samnooij

viralverify's Issues

small issue in output format

Hi,

I ran viralVerify with the -p flag, and most lines were fine, but I ran across one like this, where a comma seems to be missing from between the last two columns when there are no PFAMs detected:

cluster361_bin0_k105_8559,Uncertain - viral or bacterial,3643,-,--

Thanks, ben

Can viralVerify be used for metatranscriptomes?

Hi there,

I have learned that metaviralSPAdes is used to recover viruses from metagenomes but is not very suitable for metatranscriptomes.

So can I use tools like Trinity or rnaSPAdes to first assemble the metatranscriptome reads, and then use viralVerify to find the viral contigs? Also, does viralComplete work for RNA viral contigs?

Thank you!

PlasmidVerify cutoff selection

Since the cutoff used for plasmid prediction differs from the one used in plasmidVerify, viralVerify outputs lots of plasmid sequences as unknown. Cutoff selection for plasmid prediction should be improved.

Segmentation fault during hmmsearch

I've been using viralVerify to analyse some long-read derived viral contigs from coastal viromes. Out of 18 viromes, five fail due to a segmentation fault. In one of the failures, I managed to track down the contig that was causing the error (attached, together with the output folder).

The segmentation fault occurs both on our HPC infrastructure and on an Ubuntu 18 desktop. On both systems, viralVerify is run through conda with the following environment:

name: viralVerify
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - ca-certificates=2020.6.20=hecda079_0
  - certifi=2019.11.28=py27h8c360ce_1
  - hmmer=3.3.1=he1b5a44_0
  - ld_impl_linux-64=2.35=h769bd43_9
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.3.0=h5dbcf3e_17
  - libgomp=9.3.0=h5dbcf3e_17
  - libstdcxx-ng=9.3.0=h2ae2ef3_17
  - ncurses=6.2=h58526e2_3
  - openssl=1.1.1h=h516909a_0
  - pip=20.1.1=pyh9f0ad1d_0
  - prodigal=2.6.3=h516909a_2
  - python=2.7.15=h5a48372_1011_cpython
  - python_abi=2.7=1_cp27mu
  - readline=8.0=he28a2e2_2
  - setuptools=44.0.0=py27_0
  - sqlite=3.33.0=h4cf870e_1
  - tk=8.6.10=hed695b0_1
  - wheel=0.35.1=pyh9f0ad1d_0
  - zlib=1.2.11=h516909a_1010

seg_fault.zip
seg.fault.fa.zip

disagreements between plasmidVerify and viralVerify

Hi,

I have contigs assembled by plasmidSPAdes from bacterial isolate PE reads, so they are suspected plasmids. I want to further filter these contigs using plasmidVerify (as suggested in your paper) and viralVerify. I ran both tools and found some disagreements in the predictions. While, as expected, viralVerify predicted some of plasmidVerify's 'Plasmid' contigs as viral, it also predicted some of plasmidVerify's 'Chromosome' contigs as 'Plasmid'.

For example here are predicted 'plasmids' by viralVerify:

NODE_15_length_37998_cov_14.724243_component_0,**Plasmid**,37998,-,6.89,AIRS AIRS_C MCPsignal 4HB_MCP_1 TarH Oxidored_FMN Peripla_BP_4 AP_endonuc_2 GFO_IDH_MocA GFO_IDH_MocA_C Sigma54_activat HTH_8 Radical_SAM Bac_luciferase BPD_transp_2 Acyl-CoA_dh_2 Acyl-CoA_dh_M Acyl-CoA_dh_N BPD_transp_2 ABC_tran ABC_tran ABC_tran LysR_substrate HTH_1 Acetyltransf_1 Sigma54_activat PAS_9 PAS PAS_4 HTH_8 PAS_9 PAS_8 PAS_8 Peripla_BP_4 Abhydrolase_1 CN_hydrolase Molybdopterin Molydop_binding BPD_transp_2 Acyl-CoA_dh_N Acyl-CoA_dh_2 Acyl-CoA_dh_M ABC_tran BCA_ABC_TP_C BPD_transp_2 AMP-binding VirK Pyr_redox_3 Peripla_BP_6
NODE_16_length_25536_cov_16.257905_component_0,**Plasmid**,25536,-,5.07,Fe-ADH adh_short_C2 adh_short adh_short_C2 Epimerase NAD_binding_9 DJ-1_PfpI HTH_18 DJ-1_PfpI Sigma54_activat XylR_N V4R HTH_8 LysR_substrate HTH_1 Phenol_MetA_deg Peripla_BP_6 DAO PALP OCD_Mu_crystall Shikimate_DH K_oxygenase LysE Beta-lactamase tRNA-synt_1g Cupin_2 Aldedh Phage_integrase

and the corresponding prediction by the plasmidVerify:

NODE_15_length_37998_cov_14.724243_component_0,**Chromosome**,-2.497141683569888,AIRS AIRS_C MCPsignal 4HB_MCP_1 TarH Oxidored_FMN Peripla_BP_4 AP_endonuc_2 GFO_IDH_MocA GFO_IDH_MocA_C Sigma54_activat HTH_8 Radical_SAM Bac_luciferase BPD_transp_2 Acyl-CoA_dh_2 Acyl-CoA_dh_M Acyl-CoA_dh_N BPD_transp_2 ABC_tran ABC_tran ABC_tran LysR_substrate HTH_1 Acetyltransf_1 Sigma54_activat PAS_9 PAS PAS_4 HTH_8 PAS_9 PAS_8 PAS_8 Peripla_BP_4 Abhydrolase_1 CN_hydrolase BPD_transp_2 Acyl-CoA_dh_N Acyl-CoA_dh_2 Acyl-CoA_dh_M ABC_tran BCA_ABC_TP_C Molybdopterin Molydop_binding BPD_transp_2 AMP-binding VirK Pyr_redox_3 Peripla_BP_6
NODE_16_length_25536_cov_16.257905_component_0,**Chromosome**,-3.574700563044223,Fe-ADH adh_short_C2 adh_short adh_short_C2 Epimerase NAD_binding_9 DJ-1_PfpI HTH_18 DJ-1_PfpI Sigma54_activat XylR_N V4R HTH_8 LysR_substrate HTH_1 Phenol_MetA_deg Peripla_BP_6 DAO PALP OCD_Mu_crystall Shikimate_DH K_oxygenase LysE Beta-lactamase tRNA-synt_1g Cupin_2 Aldedh Phage_integrase

I would be happy for your advice regarding the best strategy to continue. Should I look only at the consensus prediction of viralVerify and plasmidVerify (i.e. contigs that both tools predict as plasmids), or take some other approach?

Thanks in advance,
Haim
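
The consensus idea raised in this question can be sketched as follows. This only illustrates the comparison the poster describes and is not a recommendation from the tool authors; the function names are hypothetical, and the two example entries reuse the labels quoted above.

```python
def consensus_plasmids(viralverify_preds, plasmidverify_preds):
    """Contigs labelled 'Plasmid' by both tools."""
    return sorted(
        name for name, label in viralverify_preds.items()
        if label == "Plasmid" and plasmidverify_preds.get(name) == "Plasmid"
    )


def disagreements(viralverify_preds, plasmidverify_preds):
    """Contigs present in both results but labelled differently."""
    shared = viralverify_preds.keys() & plasmidverify_preds.keys()
    return {
        name: (viralverify_preds[name], plasmidverify_preds[name])
        for name in sorted(shared)
        if viralverify_preds[name] != plasmidverify_preds[name]
    }


# Labels for the two contigs quoted in the post above.
vv = {
    "NODE_15_length_37998_cov_14.724243_component_0": "Plasmid",
    "NODE_16_length_25536_cov_16.257905_component_0": "Plasmid",
}
pv = {
    "NODE_15_length_37998_cov_14.724243_component_0": "Chromosome",
    "NODE_16_length_25536_cov_16.257905_component_0": "Chromosome",
}
print(consensus_plasmids(vv, pv))
print(disagreements(vv, pv))
```

Listing the disagreements alongside the consensus set makes it easy to inspect the borderline contigs by hand, for example by checking their scores or BLAST hits.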

Conda installable?

I'd like to incorporate this in some conda environments. Right now I have some scripts that download it and add it to the path, but it would be really convenient to have it installable via conda.

Is this possible?

ValueError: invalid literal for int() with base 10

Hi there,

I've been getting this error with one of my datasets:

Gene prediction...
Traceback (most recent call last):
  File "/usr/local/software/viralVerify/viralverify.py", line 424, in <module>
    main()
  File "/usr/local/software/viralVerify/viralverify.py", line 220, in main
    if int(gene_start.strip()) < int((contig_len_circ[contig_name][0])):
ValueError: invalid literal for int() with base 10: '5_plasmid_pHL2708X3,_complete_sequence:0-8689:1_1'

This is what that protein FASTA record looks like:

NZ_CP021331.1_Maritalea_myrionectae_strain_HL2708#5_plasmid_pHL2708X3,_complete_sequence:0-8689:1_1 # 44 # 331 # 1 # ID=22686_1;partial=00;start_type=TTG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.462
MQKLPLDHVKIDKSFVQSADQDPRAHEITLTIVRLCSSFGMGCIAEGVETAAQLEMLSKI
GCHTLQGYYFAKPMSAKQVNQYIAENTPMISIGIG*
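
The value passed to int() in the traceback is the part of the contig name that follows its '#' character, which suggests the header was split on '#' alone. The sketch below shows one way to parse such a Prodigal header so that a '#' inside the sequence name is tolerated. It is an illustration under that assumption, not the fix applied in viralVerify, and the function name is hypothetical.

```python
def parse_prodigal_header(header):
    """Split a Prodigal protein header into (contig, gene index, start, end, strand)."""
    # Split from the right on " # " so a '#' inside the name is left alone.
    protein_id, start, end, strand, _attributes = header.lstrip(">").rsplit(" # ", 4)
    # Prodigal appends "_<gene index>" to the contig name.
    contig_name, gene_index = protein_id.rsplit("_", 1)
    return contig_name, int(gene_index), int(start), int(end), int(strand)


header = (
    "NZ_CP021331.1_Maritalea_myrionectae_strain_HL2708#5_plasmid_pHL2708X3,"
    "_complete_sequence:0-8689:1_1 # 44 # 331 # 1 # "
    "ID=22686_1;partial=00;start_type=TTG;rbs_motif=GGA/GAG/AGG;"
    "rbs_spacer=5-10bp;gc_cont=0.462"
)
print(parse_prodigal_header(header))
```

An alternative workaround on the user side is to rename sequences so that FASTA headers contain no '#' characters before running the tool.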

How to run data

  1. In the verify section, in the data downloaded from the RefSeq dataset, the FASTA file of each virus contains multiple sequences (each starting with >). Are these sequences contigs? During naive Bayes classifier training and sequence prediction, is the whole FASTA file used as the processing unit, or is each record identified by > the basic processing unit?
  2. In the second part, can the input to the 'viralverify' and 'training_script' scripts only be a single sequence? How should whole data sets be handled? Is batch processing with a script under Linux required?

Syntax Error in viralverify.py

Running the viralverify.py script, I found this syntax error:

File "./viralverify.py", line 173
    open (f"{res_path}_virus.fasta", "a").close() 
                                  ^
SyntaxError: invalid syntax

These are the packages in my environment:

packages in environment at ~/miniconda2/envs/viralverify:

Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
ca-certificates           2021.4.13            h06a4308_1  
certifi                   2020.12.5        py39h06a4308_0  
hmmer                     3.3.2                he1b5a44_0    bioconda
ld_impl_linux-64          2.33.1               h53a641e_7  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
ncurses                   6.2                  he6710b0_1  
openssl                   1.1.1k               h27cfd23_0  
pip                       21.1.1           py39h06a4308_0  
prodigal                  2.6.3                h516909a_2    bioconda
python                    3.9.5                hdb3f193_3  
readline                  8.1                  h27cfd23_0  
setuptools                52.0.0           py39h06a4308_0  
sqlite                    3.35.4               hdfb4753_0  
tk                        8.6.10               hbc83047_0  
tzdata                    2020f                h52ac0ba_0  
wheel                     0.36.2             pyhd3eb1b0_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3  
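
A SyntaxError pointing at an f-string usually means the script was executed by an interpreter older than Python 3.6, the release that introduced f-strings. Since the environment above lists Python 3.9.5, one possible explanation (an assumption, not confirmed in this report) is that a different interpreter was picked up when the script was launched. The small check below, written without f-strings so that it also runs on old interpreters, reports which Python is actually in use.

```python
import sys


def supports_f_strings(version_info=sys.version_info):
    """f-strings were added in Python 3.6."""
    return tuple(version_info[:2]) >= (3, 6)


if not supports_f_strings():
    sys.exit("This interpreter is %d.%d; f-strings need Python 3.6 or newer."
             % sys.version_info[:2])
print(sys.executable)  # shows which interpreter actually ran the script
```

If the printed path is not inside the intended conda environment, invoking the script explicitly with that environment's python is a simple way to rule this out.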

"_input_with_circ.fasta" contains all sequences from input contigs.fasta

When running viralverify, an output file "XXX_input_with_circ.fasta" is produced for input XXX.fasta.

However, this file contains the same number of sequences as the input. Exploring the repo, I noticed it should have been produced by check_circular.py.

However, I can't manage to run that script on its own using a fresh install of viralVerify: it first throws an error that the fastaparser module is absent. I installed it with pip install fastaparser, and now the error is: AttributeError: module 'fastaparser' has no attribute 'read_fasta'

My question is: is it intended in the viralVerify workflow that XXX_input_with_circ.fasta is at first a strict copy of the input FASTA, which is then trimmed to only "circs.fasta" based on annotation? (The HMM search is still running for me now.)

viralVerify vs plasmidVerify

Hi ablab,
Thanks for your tools!
I have a couple of questions:

  1. Is plasmidVerify integrated in viralVerify? Will there be any difference if I run the two tools on my contigs, or will that be redundant?
  2. How can I interpret the scores I am getting from viralVerify, and what is the biological threshold behind them?

Thanks!
Ricardo

How to run this tool?

I'm working on an institute-wide pipeline for JCVI and had some trouble running your tool.

Here's my version installed via pip:

 viral_verify --version
viral_verify, version 0.1.1

Here's my command:

viral_verify -i veba_output/binning/47-Drifterexpttime4punches_S40/tmp/unbinned.fasta -o veba_output/binning/47-Drifterexpttime4punches_S40/intermediate/viral_viralverify_output -H /usr/local/scratch/CORE/jespinoz/db/pfam/v33.1/Pfam-A.hmm -t 16

Edit: I had to decompress the PFAM database which was the error in the original post that I've edited since then.

Should I be using the PFAM database or the database from FigShare?

Can you update the Usage on your GitHub?

This is the results output:

veba_output/binning/47-Drifterexpttime4punches_S40/intermediate/viral_viralverify_output/
├── classified-fasta-output
│   ├── unbinned-chromosome.fasta
│   └── unbinned-unclassified.fasta
├── unbinned-circularized.fasta
├── unbinned-genes.fa
├── unbinned-hmmsearch.domtblout
├── unbinned-hmmsearch.output
├── unbinned-proteins-circularized.fa
├── unbinned-proteins.fa
└── unbinned-results.csv

I ran the version from GitHub on a different dataset and got the following output:

testing/viralverify_output/
├── oral_viruses_domtblout
├── oral_viruses_feature_table.txt
├── oral_viruses_genes.fa
├── oral_viruses_input_with_circ.fasta
├── oral_viruses_out_pfam
├── oral_viruses_prodigal.log
├── oral_viruses_proteins_circ.fa
├── oral_viruses_proteins.fa
├── oral_viruses_result_table.csv
├── Prediction_results_fasta
│   ├── oral_viruses_chromosome.fasta
│   ├── oral_viruses_plasmid.fasta
│   ├── oral_viruses_plasmid_uncertain.fasta
│   ├── oral_viruses_virus.fasta
│   └── oral_viruses_virus_uncertain.fasta
└── viralverify.log

1 directory, 15 files

How come the output is so different between the pip and GitHub versions?
