segatalab / viromeqc

ViromeQC is a computational tool to benchmark and quantify non-viral contamination in VLP-enriched viromes. ViromeQC provides an enrichment score for each virome. The score is calculated with respect to the expected abundances of prokaryotic markers in reference metagenomes.

License: MIT License

viromeqc's Introduction

ViromeQC

Description

Provides an enrichment score for VLP viromes with respect to metagenomes
Useful benchmark for the quality of enrichment of a virome
Tested on Linux Ubuntu Server 16.04 LTS and on Linux Mint 19

Requires:

Bowtie2 >= v. 2.3.4
Samtools >= 1.3.1
Biopython >= 1.69
Pysam >= 0.14
Diamond (tested on v.0.9.9 and 0.9.29)
Python3 (tested on 3.6)
pandas >= 0.20

Update: ViromeQC now works with newer versions of diamond (e.g. v0.9.29). Thanks to Ryan Cook (@RyanCookAMR) for the new diamond db.
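Before running the tool, it can help to confirm that the requirements listed above are visible to the Python interpreter and shell you intend to use. The following is a minimal sketch, not part of ViromeQC: the function names are hypothetical, it assumes the executables are named bowtie2, samtools and diamond on the system PATH, and it only checks presence, not version numbers.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above (Biopython imports as "Bio").
PYTHON_MODULES = ["Bio", "pysam", "pandas"]
# Executable names assumed to be reachable through the system PATH.
COMMAND_LINE_TOOLS = ["bowtie2", "samtools", "diamond"]


def missing_modules(names):
    """Return the module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_tools(names):
    """Return the executables that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES) or "none")
    print("Missing command-line tools:", missing_tools(COMMAND_LINE_TOOLS) or "none")
```

Run it with the same interpreter that will run viromeQC.py (for example, inside the same conda environment), since a package installed for one interpreter is not necessarily visible to another.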

Usage

Step 1: clone or download the repository

git clone --recurse-submodules https://github.com/SegataLab/viromeqc.git

or download the repository from the releases page

Step 2: install the database:

This step downloads the database file. It needs to be done only the first time you run ViromeQC, and may take a few minutes depending on your internet connection.

viromeQC.py --install

Alternatively, you can download the database files from Zenodo. Once the files are downloaded, create a folder named index/ in the ViromeQC installation folder and unzip all the files into it.
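The manual route can also be scripted. This is a minimal sketch, assuming the downloaded files are ordinary .zip archives already saved locally; the function name and the archive names in the usage comment are hypothetical, and no download URL is assumed.

```python
import zipfile
from pathlib import Path


def unpack_database(archives, install_dir):
    """Extract each downloaded .zip archive into <install_dir>/index/."""
    index_dir = Path(install_dir) / "index"
    # Create index/ if it does not exist yet.
    index_dir.mkdir(parents=True, exist_ok=True)
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(index_dir)
    return index_dir


# Example call with hypothetical file names:
# unpack_database(["downloaded_db_part1.zip", "downloaded_db_part2.zip"], "viromeqc")
```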

Step 3: Run on your sample

viromeQC.py -i <input_virome_file(s)> -o <report_file.txt>

Please Note: You can pass more than one file as input (e.g. for multiple runs or paired end reads). However, you can process only one sample at a time with this command. If you want to parallelize the execution, this can be easily done with Parallel or equivalent tools.
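As an alternative to Parallel, several samples can be launched from a short Python script. The sketch below is only an illustration: it assumes viromeQC.py is executable and on the PATH, the sample names and file paths are hypothetical, and the function names are not part of ViromeQC. It uses only the -i, -o and --sample_name options documented in this README.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(sample_name, fastq_files, report_dir="reports"):
    """Build the viromeQC.py command line for one sample (one or more FASTQ files)."""
    return ["viromeQC.py", "-i", *fastq_files,
            "-o", f"{report_dir}/{sample_name}.txt",
            "--sample_name", sample_name]


def run_all(samples, report_dir="reports", max_workers=2):
    """Run one viromeQC.py process per sample, at most max_workers at a time."""
    os.makedirs(report_dir, exist_ok=True)
    commands = [build_command(name, files, report_dir) for name, files in samples.items()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker thread simply waits for its external process to finish.
        results = list(pool.map(subprocess.run, commands))
    return [result.returncode for result in results]


if __name__ == "__main__":
    # Hypothetical samples: paired-end files for sampleA, a single file for sampleB.
    samples = {
        "sampleA": ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"],
        "sampleB": ["sampleB.fastq.gz"],
    }
    for name, files in samples.items():
        print(" ".join(build_command(name, files)))
    # To actually launch the runs: run_all(samples)
```

Each ViromeQC process starts its own Bowtie2 and Diamond threads (4 each by default), so max_workers is best chosen with the total thread count in mind.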

You can try the test example (test/test.sh), which analyzes 10,000 reads from the sample SRR829034. This should take about 1 to 2 minutes.

Parameters:

usage: viromeQC.py -i <input_virome_file> -o <report_file.txt>

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Raw Reads in FASTQ format. Supports multiple inputs
                        (plain, gz o bz2) (default: None)
  -o OUTPUT, --output OUTPUT
                        output file (default: None)
  --minlen MINLEN       Minimum Read Length allowed (default: 75)
  --minqual MINQUAL     Minimum Read Average Phred quality (default: 20)
  --bowtie2_threads BOWTIE2_THREADS
                        Number of Threads to use with Bowtie2 (default: 4)
  --diamond_threads DIAMOND_THREADS
                        Number of Threads to use with Diamond (default: 4)
  -w {human,environmental}, --enrichment_preset {human,environmental}
                        Calculate the enrichment basing on human or
                        environmental metagenomes. Defualt: human-microbiome
                        (default: human)
  --bowtie2_path BOWTIE2_PATH
                        Full path to the bowtie2 command to use, deafult
                        assumes that bowtie2 is present in the system path
                        (default: bowtie2)
  --diamond_path DIAMOND_PATH
                        Full path to the diamond command to use, deafult
                        assumes that diamond is present in the system path
                        (default: diamond)
  --version             Prints version informations (default: False)
  --install             Downloads database files (default: False)
  --sample_name SAMPLE_NAME
                        Optional label for the sample to be included in the
                        output file (default: None)
  --tempdir TEMPDIR     Temporary Directory override (default is the system
                        temp. directory) (default: None)
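
To make the --minlen and --minqual thresholds concrete, here is a minimal sketch of a per-read length and quality check. It is not ViromeQC's own filtering code: it assumes Phred+33 quality encoding, takes "average Phred quality" to mean the arithmetic mean of the per-base scores, and treats both thresholds as inclusive. The function names are hypothetical.

```python
def mean_phred(quality_line, offset=33):
    """Average Phred score of one FASTQ quality string (Phred+33 assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def passes_filter(sequence, quality_line, minlen=75, minqual=20):
    """True if the read is at least minlen long and its mean quality is at least minqual."""
    if len(sequence) < minlen:
        return False
    return mean_phred(quality_line) >= minqual


# 'I' encodes Phred 40 and '+' encodes Phred 10 under Phred+33.
print(passes_filter("A" * 80, "I" * 80))  # long enough, high quality
print(passes_filter("A" * 50, "I" * 50))  # shorter than the default minlen of 75
print(passes_filter("A" * 80, "+" * 80))  # mean quality below the default minqual of 20
```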

Pipeline structure

ViromeQC starts from FASTQ files (compressed files are supported), and will:

Eliminate short and low-quality reads
- adjust the minqual and minlen parameters if you want to change the thresholds
Map the reads against a curated collection of rRNAs and single-copy bacterial markers
Filter the reads to remove short and divergent alignments
Compute the enrichment value of the sample, compared to the median observed in human metagenomes
- use -w environmental for environmental reads
- reference medians for un-enriched metagenomes are taken from medians.csv; you can provide your own data to ViromeQC by changing this file accordingly
Produce a report file with the alignment rates and the final enrichment score (which is the minimum enrichment observed across SSU-rRNA, LSU-rRNA and single-copy markers)
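
The last two steps can be sketched as follows. This is an illustration of the description above, not the tool's actual code: it assumes the per-class enrichment is the reference median divided by the sample's alignment rate, and that a class with no aligned reads does not limit the score. The reference medians in the example are made-up numbers, not the values shipped in medians.csv.

```python
def enrichment_score(sample_rates, reference_medians):
    """Return (score, per-class ratios), where score is the minimum ratio.

    Each ratio is reference median / sample alignment rate (assumed formula).
    """
    ratios = {}
    for marker, median in reference_medians.items():
        rate = sample_rates[marker]
        # Assumption: a class with zero aligned reads is treated as unbounded.
        ratios[marker] = median / rate if rate > 0 else float("inf")
    return min(ratios.values()), ratios


# Hypothetical alignment rates (%) for one virome and hypothetical reference medians (%).
sample = {"SSU": 0.125, "LSU": 0.125, "markers": 0.125}
reference = {"SSU": 0.5, "LSU": 0.25, "markers": 1.0}

score, per_class = enrichment_score(sample, reference)
print(score)      # 2.0, limited by the LSU class
print(per_class)
```

Taking the minimum makes the score conservative: a virome only scores high if it is depleted in all three marker classes.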

Output

Output is given as a TSV file with the following structure:

Sample	Reads	Reads_HQ	SSU rRNA alignment (%)	LSU rRNA alignment (%)	Bacterial_Markers alignment (%)	total enrichmnet score
your_sample.fq	40000	39479	0.00759898	0.0227969	0.01266496	5.795329

An enrichment score of 5.8 means that the virome is 5.8 times more enriched than a comparable metagenome.
High scores (e.g. 10-50) reflect high VLP enrichment.
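
For downstream use, the report can be read with Python's standard csv module. This is a minimal sketch assuming the layout shown above (one tab-separated header row followed by one data row per report, with the enrichment score in the last column); the function names and the report path in the usage comment are hypothetical.

```python
import csv


def read_report(path):
    """Read a ViromeQC report (tab-separated, one header row) into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def enrichment_of(row):
    """Return the enrichment score, taken from the last column of a report row."""
    last_column = list(row.keys())[-1]
    return float(row[last_column])


# Example with a hypothetical report path:
# for row in read_report("reports/sampleA.txt"):
#     print(row["Sample"], enrichment_of(row))
```

Reading the score by position rather than by header text avoids depending on the exact spelling of the last column name.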

Citation

If you find this tool useful, please cite:

Zolfo, M., Pinto, F., Asnicar, F., Manghi, P., Tett, A., Segata, N. Detecting contamination in viromes using ViromeQC. Nature Biotechnology 37, 1408–1412 (2019).

viromeqc's Issues

[Errno 8] Exec format error:

Hi,
I am getting the following error when I attempt to run the example file that came with the software.

% bash test.sh
Checking Database Files                                           [ -   OK  - ]
[fastq_len_filter] | 9384 / 10000 (94.0%) reads selected          [ -  DONE - ]
[SILVA_SSU]   | Bowtie2 Aligning                                  [ -  ...  - ]Fatal error running Bowtie2 on SSU rRNA. Error message:
[Errno 8] Exec format error:
'/g/scb2/bork/mocat/software/viromeqc/1.0/cmseq/cmseq/filter.py'  [ -  FAIL - ]

I am using the following versions of the dependencies:

python/3.6 diamond/0.9.24 samtools/1.7 bowtie2/2.3.4.3

Thank you in advance for your help.

biopython error

Dear developers, I've downloaded viromeqc but I have some problems when running it. I created a conda environment for it with python3.8 and installed pandas and biopython, but the following error still occurs:

(viromeqc) pgen@pgen:~/sw/viromeqc$ python3.8 ./viromeQC.py
Failed in importing Biopython. Please check Biopython is
installed properly on your system! [ - FAIL - ]

When I try to reinstall biopython, it seems to be properly installed:

(viromeqc) pgen@pgen:~/sw/viromeqc$ pip install biopython
Requirement already satisfied: biopython in /home/pgen/miniconda3/envs/viromeqc/lib/python3.8/site-packages (1.79)
Requirement already satisfied: numpy in /home/pgen/miniconda3/envs/viromeqc/lib/python3.8/site-packages (from biopython) (1.23.2)

Any help will be great!

Singularity/Docker

Hi,
Do you have any plans to release a Singularity/Docker version of viromeqc?
Thanks

Can I turn off length and quality filtering?

Hi,

Can I turn off length and quality filtering?

I'd like to use viromeqc for data already trimmed by other tools such as Trimmomatic. In this case, I think filtering by viromeqc may not be necessary.

Further, it seems that the filtering step takes a long time compared to the bowtie2 and diamond steps, because filtering uses only one thread.

Thanks.

Ilnam

Error in testing viromeQC

Hi,

I am new to this tool and am using it to evaluate my virome sequence dataset. I always get the following error when I attempt to run the example file that came with the tool:

File "/Users/Ernie/viromeqc/test/../viromeQC.py", line 9, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'

I've installed pandas via conda/bioconda and can see that its version is 1.3.2. Can someone help me resolve this issue, please?

Many thanks in advance for your kind help!

Can't detect LSU and SSU reads

Hey SegataLab,
I've installed viromeqc but I'm experiencing the following problems:

Cloning the git repo doesn't download cmseq by default. One needs to get it manually and put it into the correct folder
LSU and SSU reads are not detected, which skews the enrichment score and makes it look better than it is
Although viromeQC runs to completion, there's the following error in the output:

Traceback (most recent call last):
  File "/home/shiraz/apps/viromeqc/cmseq/cmseq/filter.py", line 5, in <module>
    from cmseq import __version__
ImportError: cannot import name '__version__' from 'cmseq' (/home/shiraz/apps/viromeqc/cmseq/cmseq/cmseq.py)
[main_samview] fail to read the header from "-".
Traceback (most recent call last):
  File "/home/shiraz/apps/viromeqc/cmseq/cmseq/filter.py", line 5, in <module>
    from cmseq import __version__
ImportError: cannot import name '__version__' from 'cmseq' (/home/shiraz/apps/viromeqc/cmseq/cmseq/cmseq.py)
[main_samview] fail to read the header from "-".

Failed to retrieve DB error

Hi,
When I run viromeQC.py --install, I get "Failed to retrieve DB". Could you tell me how to solve this problem? Thank you very much.

Can't get the virome db

I failed to retrieve the DB from Zenodo or Dropbox!

Can you give me another way to solve this problem?

Thanks a lot!

viromeqc does not work with current versions of python and samtools

Dear SegataLab,
I'm trying to run viromeQC, but there are a series of errors that seem to relate to dependency versions.

E.g. cmseq does not install by default. Upon getting the newest version of cmseq and inserting it into the viromeQC folder, LSU and SSU reads go undetected, with a cmseq Python error.

Upon installing an old version of cmseq matching the date of the viromeQC release, the following error is produced upon running:
[main_samview] fail to read the header from "-".

This seems to be related to the way newer versions of samtools process stdin and stdout.

Trying to install an older version of samtools through conda fails because it requires an outdated version of Python (3.1), which is older than the 3.6 you tested with viromeqc.

Is it possible for you to update the viromeQC repo so that it can run with current versions of dependencies?

Support for already-qc'd fna files of reads

Dear Moreno,
It would be so convenient to have support for read fna files that have already been QC'd so they can be mapped directly by bowtie2. The current QC step is slow, and allowing for this would speed up viromeQC a lot!

A lot of us have MDA amplified viromes, where a read deduplication step is necessary after read filtering and trimming to remove redundant reads prior to assembly, mapping, etc. Such deduplication is already done by other tools (such as vsearch or rmdup) and dramatically reduces the size of the read files. Following qc and deduplication, the quality scores are no longer important, so the output is normally fna files of reads instead of fastq files.

Incorporating support for qc'd fna files (bowtie2 also supports this) would dramatically increase the speed of viromeQC because the length filtering step is omitted, plus fewer reads have to be mapped by bowtie2. E.g. many of my MDA amplified virome files are often 10x smaller after passing my own qc and deduplication pipeline.

Please consider this, as it would be extremely easy for you to implement!

How to select the parameter "--enrichment_preset"

Dear author,

Thanks for your excellent work. I have some VLP enrichment samples from the animal gut. When I use the tool, how should I select the "--enrichment_preset" parameter? The default is "human", but I do not know whether I should calculate the enrichment based on the human microbiome. I hope to get your advice.

Best Regards,
Jiaojiao