xunchen85 / vicaller Goto Github PK

View Code? Open in Web Editor NEW

14.0 2.0 5.0 14.84 MB

A software to detect virome-wide integrations

Perl 99.90% Shell 0.10%

virology bioinformatics-tool ngs cancer-genomics

vicaller's Introduction

VIcaller v1.1

A software to detect virome-wide viral integrations

It is important to run "validate" function to remove possible false-positives to obtain the final list of candidate viral integrations.

1 Introduction

Viral Integration caller (VIcaller) is a bioinformatics tool designed for identifying viral integration events using high-throughput sequencing (HTS) data. VIcaller is developed under Linux platform. It uses both FASTQ files or aligned BAM files as input. It also supports both single-end and paired-end reads. VIcaller contains one main Perl script, VIcaller.pl, that include three main functions: 1) detect, which will detect virome-wide candidate viruses and integration events; 2) validate, which will perform the in silico validation on those candidate viral integrations; 3) calculate, which will calculate the integration allele fraction. We also generated a comprehensive viral reference genome library with 411,195 unique whole and partial genomes, covering all six virus taxonomic classes. The viral reference genome library also comes with a taxonomy database in a defined format that give the virus name, etc.

2 Availability

VIcaller is an open-source software. VIcaller.v1.1 source code is also available at http://www.uvm.edu/genomics/software/VIcaller.html. You need to get the virome-wide library and vector database at http://www.uvm.edu/genomics/software/VIcaller.html.

3 VIcaller installation

3.1 Unzip the VIcaller installer

Unzip the installer and change the directory

$ tar vxzf VIcaller.tar.gz
$ cd VIcaller/
$ mkdir Tools

3.2 Install the dependent Perl libraries and tools

a) Currently VIcaller relies on the following dependencies to be compiled (contact Dr. Xun Chen if you need help get those tools or Perl libraries installed).
b) Obtain the installed file from the following links.
c) Follow the instruction to successfully install each tool (contact server manager if there is any compile issues).
d) Check or install the listed Perl libraries using cpan, cpanm or other methods.

Install each of the listed tools

• BWA (default version: v0.7.10): https://github.com/lh3/bwa/tree/master/bwakit
• Bowtie2 (default version: v2.2.7): https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.7/
• TopHat2 (v2.1.1): http://ccb.jhu.edu/software/tophat/index.shtml
• BLAT (default version: v.35): http://genomic-identity.wikidot.com/install-blat
• BLAST+ (default version: v2.2.30): http://mirrors.vbi.vt.edu/mirrors/ftp.ncbi.nih.gov/blast/executables/blast%2B/2.2.30/
• SAMtools (default version: v1.6): https://sourceforge.net/projects/samtools/
• HYDRA (default version: 0.5.3): https://code.google.com/archive/p/hydra-sv/downloads
• NGS QC Toolkit (default version: v2.3.3): the installer can be found under the VIcaller/Tools/ folder
a) Copy the script “TrimmingReads_sanger.pl” under the VIcaller/Scripts/ folder to the installed NGSQCToolkit_v2.3.3/Trimming/ folder
• FastUniq (Default version: v1.1): https://sourceforge.net/projects/fastuniq/
• SE-MEI (modified): https://github.com/dpryan79/SE-MEI (original version), the modified version can be found under the VIcaller/Tools/ folder
a) Copy the modified SE-MEI installer (SE-MEI-master.tar.gz) under the VIcaller/Scripts/ folder to the VIcaller/Tools/ folder
b) Install the modified SE-MEI tool follow the README file
• RepeatMasker (default version: v4.0.5):
a) Install RepeatMasker: http://www.repeatmasker.org/
b) Install RMBlast aligner: http://www.repeatmasker.org/RMBlast.html
c) Compile the Repbase database: https://www.girinst.org/repbase/
• MEME (default version: v4.11.1): http://web.mit.edu/meme_v4.11.4/share/doc/download.html
• TRF (default version: v4.07b): https://tandem.bu.edu/trf/trf.html

Install Perl libraries

$ cpan String::Approx
$ cpan Time::HiRes
$ cpan Test::Most
$ cpan Bio::Seq
$ cpan Bio::SeqIO
$ cpan Bio::DB::GenBank
$ cpan IO::Zlib

3.3 Prepare databases

Obtain and index human reference genome using BWA, Bowtie2, and BLAST+ separately:

$ cd VIcaller/Database/Human/
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
$ gunzip hg38.fa.gz 
$ bwa index -a bwtsw hg38.fa
$ bowtie2-build hg38.fa hg38.fa
$ makeblastdb -in hg38.fa -dbtype nucl

Index viral database using BWA, Bowtie2, and BLAST+ separately:

$ cd VIcaller/Database/Virus/
$ bwa index -a bwtsw virus_db_090217.fa
$ bowtie2-build virus_db_090217.fa virus_db_090217.fa
$ makeblastdb -in virus_db_090217.fa -dbtype nucl

3.4 Prepare the VIcaller config file

Example of VIcaller.config

export PERL5LIB=/users/xchen/.cpan/build/
export PATH=$PATH:/users/xchen/VIcaller/Tools/bowtie2-2.2.7/
# human_genome = /users/xchen/VIcaller/Database/Human/hg38.fa
# human_genome_tophat = /users/xchen/VIcaller/Database/hg38.fa
# virus_genome = /users/xchen/VIcaller/Database/Virus/virus_db_090217.fa          
# virus_taxonomy = /users/xchen/VIcaller/Database/Virus/virus_db_090217.taxonomy
# virus_list = /users/xchen/VIcaller/Database/Virus/virus_db_090217.virus_list
# vector_db = /gpfs2/dli5lab/CAVirus/Database/Vector/Vector.fa
# cell_line = /users/xchen/VIcaller/Database/cell_line.list
# bowtie_d = /users/xchen/VIcaller/Tools/bowtie2-2.2.7/
# tophat_d = /users/xchen/VIcaller/Tools/tophat-2.1.1.Linux_x86_64/
# bwa_d = /users/xchen/VIcaller/Tools/bwa-master/
# samtools_d = /users/xchen/VIcaller/Tools/samtools-1.6/
# repeatmasker_d = /users/xchen/VIcaller/Tools/RepeatMasker/
# meme_d = /users/xchen/VIcaller/Tools/meme_4.11.1/
# NGSQCToolkit_d = /users/xchen/VIcaller/Tools/NGSQCToolkit_v2.3.3/
# fastuniq_d = /users/xchen/VIcaller/Tools/FastUniq/
# SE_MEI_d = /users/xchen/VIcaller/Tools/SE-MEI/
# hydra_d = /users/xchen/VIcaller/Tools/Hydra-Version-0.5.3/
# blat_d = /users/xchen/bin/x86_64/
# blastn_d = /users/xchen/VIcaller/Tools/ncbi-blast-2.2.30+-src/bin/

Check the generated VIcaller.config file

#. Make sure the space between “#” and parameters.
#. Make sure the directory for the Perl library is correct or the libraries are available in the path if you install them locally.
#. Make sure the Bowtie2 directory is correct or it is available in the path (recommended) if you are going to analyze RNA-seq data.
#. Make sure the human and virus databases existed and correctly indexed.

4 VIcaller command line

$ perl VIcaller.pl <functions> [arguments]

4.1 Detect candidate viral integrations

Command line

$ perl VIcaller.pl detect [arguments]

Examples

a) WGS data in single-end fastq format:

$ perl VIcaller.pl detect -d WGS -i seq -f .fastq.gz -s single-end -t 12

b) RNA data in paired-end fastq format (set bowtie2 path before run the following command):

$ perl VIcaller.pl detect -d RNA-seq -i seq -f .fastq.gz -s paired-end -t 12

c) RNA alignment data in bam format (Note: Human reference genome should be the same as the bam file)

$ perl VIcaller.pl detect -d RNA-seq -i seq -f .bam -s paired-end -t 12

4.2 Validate candidate viral integrations (will remove false viral integrations detected at the "detect" step due to reads mis-alignment)

Command line

$ perl VIcaller.pl validate [arguments]

Example

$ perl VIcaller.pl validate -i seq -S seq_1_24020575_24020787_HPV16_218931404 -G 218931404 -V HPV16

4.3 Calculate allele fraction

Command line

$ perl VIcaller.pl calculate [arguments]

Example

$ perl VIcaller.pl calculate -i seq -f .fastq.gz -S -C 1 -P 24020575 -B 2 -N 20

5 Output

Output and file list

The candidate viral integrations detected by VIcaller are kept in the file with suffix of “.output” in Viral integration Format (VIF), with the visualization of the aligned read sequences in the file with suffix of “.visualization”. After in silico validation and allele fraction calculation, the results are also kept in the output file. “seq” is an example sample ID.

Header of the output file

Details are included in the Manual PDF file

6 FAQ

6.1 How to annotate the detected viral integrations?

    The following Linux command can be used to extract the information required to run human genome functional annotation tools. The VIcaller output file is “seq.output”, and for example, if the functional annotation software is SnpEff, the following command line will extract the information required to run SnpEff. The output from using this command will be the input file for SnpEff.

$ awk '{if ($7!="Chr.")print$7"\t"$17"\t.\tA\tT\t."}' seq.output

6.2 What is the difference between “Fast” mode and “Standard” mode?

   “Fast” mode is significantly faster than “Standard” mode. However, the “Fast” mode does not analyze viral reads, which are supporting evidence for distinguishing between viral integrations and viral infections.

6.3 How to use the viral integration data from VIcaller for integration enrichment analysis?

    VIcaller analyzes individual samples and then generates a list of viral integrations for each sample. Viral integration enrichment (bias) analysis, which is a statistical analysis, requires inclusion of a group of samples. The enrichment analysis has to be performed solely. There are multiple statistical models to calculate/determine enrichment hotspots (such as simulation-based Z score test). There are many available tools and R packages that can be selected for enrichment analysis. Users may have different preference on statistical models to fit in their actual samples/data.

6.4 Can I use the published tools that were designed for detecting transposable element insertions to identify virome-wide integrations?

    VIcaller uses the reads that are commonly used in transposable element insertion and other structural variation detection tools. However, because VIcaller is specifically designed to identify virome-wide integrations, it has significant advantages over alignment-based transposable element insertion detection tools for viral integration analysis, which are designed to extract and mainly use (human’s) anomalous reads specifically. For example, 1) VIcaller supports the use of virome-wide library as the reference to detect any characterized viruses, while most transposable element detection tools use transposable element sequences as the reference; and 2) VIcaller implements viral integration-specific quality control procedures and implements additional steps to in silico verify detected viral integrations. We have tried to compare VIcaller with other transposable element insertion detection software, e.g., MELT. MELT failed to run in a virome-wide fashion after we replaced MELT’s default consensus transposable element reference sequences with our virome-wide database. We further tested whether MELT was able to detect simulated candidate viral integrations, and we found that although MELT did run, it was not able to detect any of these integrations.

6.5 Can I use other virome-wide libraries?

    You can use other viral databases as the reference. However, the final output may not include the viral names or other taxonomy information. The reads that multiple-mapped different viral sequences from the same virus may not be efficiently recovered for the detection of viral integrations.

6.6 Which human renference build is used for the test output file?

    We previously used hg19 reference build to generate the test output file and to prepare the input BAM files. Thus, you should use hg19 to run the test data in BAM format or if you want to compare with the example output file.

Citation

Xun Chen, Jason Kost, Arvis Sulovari, Nathalie Wong, Winnie S. Liang, Jian Cao, and Dawei Li. A virome-wide clonal integration analysis platform for discovering cancer viral etiologies. Genome Research. 2019 Mar. DOI: 10.1101/gr.242529.118

Download

http://www.uvm.edu/genomics/software/VIcaller.html

Copyright

VIcaller is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. It may be used for non-commercial use only. For inquiries about a commercial license, please contact the corresponding author at [email protected] or The University of Vermont Innovations at [email protected].

vicaller's People

Contributors

Stargazers

Watchers

Forkers

xflicsu mcmcphe saintxugang wangjie07070910 htejedam

vicaller's Issues

About input data

Dear VIcaller team,

Hi, I'm Oh.

I have a question.

Can I use alignment file derived from paired-end WGS data?
ex) perl VIcaller.pl detect -d WGS -I sample -f .bam -s paired-end

There is no method using alignment file of WGS in user manual.

Thanks.

Oh.

Download RepBase for RepeatMasker

There is error when go through step using RepeatMasker:
"Species 'human' is not known to RepeatMasker."
It looks like I didn't setup the reference database RepBase for RepeatMasker properly.

Therefore, I tried to setup RepeatMasker with reference database RepBase by following RepeatMakser INSTALL file. However, I found out we need to pay for downloading the database file Repbase-derived RepeatMasker libraries:
[RepBaseRepeatMaskerEdition-20181026.tar.gz] (53.48 MB) from https://www.girinst.org/server/RepBase/index.php.

How did you setup the reference database (human) for RepeatMasker?

Tophat with --no-coverage-search

Hi!

When running VIcaller on one stage tophat suggests running itself in --no-coverage-search mode to make the process much faster. The run indeed completes much faster when this option is added to VIcaller.pl, so I wanted to ask if you tested this option before. Does this option (or can it, in principle), in your opinion, prevent VIcaller from finding/validating some viral integrations?

Sergei

Result_visual3-3.pl ERROR

Dear Developer,
I encounter a error while run VIcaller, error message “Modification of non-creatable array value attempted, subscript -1 at Result_visual3-3.pl”

When I looked into the script “Result_visual3-3.pl” , I found that input file “$ARGV[0]_sf.fuq” maybe not exist in any scripts.

Could you help me check it?

Thanks!
JY

viral database

hello Xun,

I encountered a problem when I try to access to the website you described http://www.uvm.edu/genomics/software/VIcaller.html. I cannot find the the virome-wide library under the Database/ folder. I can only access to the http://www.uvm.edu/. Do you know how to solve this?
Thank you very much!

How to filter the list from the "detect" result

Hi Dr. Xun,
I got several candidate virus in my RNA-Seq data sets by the "detect" function of VIcaller.
I read the user manual of VIcaller.
It provide "validate" and "calculate" function of VIcaller to select only one interested virus.
Du to "detect" function detected numerous viruses, how to select those viruses only in one step?

output columns explain

Dear VIcaller developer,

I want to know meaning of some output columns.
I see output from detect function that No._chimeric_reads and No._split_reads both are zero, but No._reads_supporting_VI have value bigger than 0. supporting reads = chimeric reads + split reads, is it right? And how to explain these columns？

Best,
JY

Vicaller.pl 1164 line is strange.

Hi, I have three issue.

I find a strange line at Vicaller.pl
I don't know perl well but line 1164 seems to something wrong.

system ("perl ${directory}Scripts/Extract_fasta.pl $GI $virus_genome >${directory}Database/GI/${GI}.fa");
and file name is incorrect: "Scripts/Extract_read_informationpl "--> "Scripts/Extract_read_information.pl"
Are the two changes correct?
I have trouble running validate function.
I get the following warning:
Warning: [blastn] Query is Empty!
Warning: [blastn] Query is Empty!
Warning: [blastn] Query is Empty!

but my config and blastn probably have no problem.
so, I take a look at Vicaller.pl
blastn run using query "${input_sampleID2}_aligned_both.fas"
That means "${input_sampleID2}_aligned_both.fas" is not created.

${input_sampleID2}_aligned_both.fas is made this function.

validate function

sub v_obtain_seq {
system ("perl ${directory}Scripts/Extract_specific_loci_final_reads.pl ${input_sampleID}_f2 ${input_sampleID2}.virus_f2 >${input_sampleID2}.information");
my $cmd=q(awk '{print$8}');
system ("$cmd ${input_sampleID2}.information |sort |uniq >${input_sampleID2}.id");
system ("perl ${directory}Scripts/Extract_fastq.pl -f ${input_sampleID}_1.1fuq -b ${input_sampleID2}.id -o ${input_sampleID2}_aligned_1.fuq");
system ("perl ${directory}Scripts/Extract_fastq.pl -f ${input_sampleID}_2.1fuq -b ${input_sampleID2}.id -o ${input_sampleID2}_aligned_2.fuq");
system ("perl ${directory}Scripts/Extract_fuq_split.pl -f ${input_sampleID}_1sf.fuq -b ${input_sampleID2}.id -o ${input_sampleID2}_aligned_sf.fuq");
system ("perl ${directory}Scripts/Convert_fastq_to_fasta.pl ${input_sampleID2}");
system ("perl ${directory}Scripts/Convert_fastq_to_fasta_for_split_read.pl ${input_sampleID2}");
system ("perl ${directory}Scripts/Convert_fastq_to_fasta.pl ${input_sampleID2}");
system ("perl ${directory}Scripts/Convert_fastq_to_fasta_for_split_read.pl $input_sampleID2");
system ("cat ${input_sampleID2}_aligned.fas ${input_sampleID2}_aligned_sf.fas >${input_sampleID2}_aligned_both.fas");
}

In conclusion, i wonder ${input_sampleID2}.virus_f2 is a previously generated file.
Or is there something else wrong?

Among the parameter values in the validation function, Is the virus name only an abbreviation?
Should not the full name in the output file 'Candidate_virus' column is possible?

where to download

Hi,

We would like to try VIcaller. However, there is no virus files for download from website "www.uvm.edu/genomics/software/VIcaller.html"

This is the statement from page 4 of the VIcaller User Manual :
"Obtain and index the virome-wide library using BWA, Bowtie2, and BLAST+ separately:
a) Download the virus_db_090217.fa, virus_db_090217.taxonomy, virus_db_090217.virus_list
and Vector.fa files from the website: www.uvm.edu/genomics/software/VIcaller.html"

Could you let us know where to download those ?

Thanks,
Sharon

Extract spanning reads and junction reads around integration site

Hi Dr. Xun

Thanks for your works.

Could you give some suggestions to extract spanning reads and junction reads around integration site?

According to the manual, the file seq_1_24020575_24020787_hum an_papillomavirus_type_2189314 04.CS3 might be the one I am looking for. But It was not produced.

Here the codes to run example data:
VIcaller.pl detect -d WGS -i seq -f .fastq.gz -m standard -t 12 -Q 20 -a -r -c /VIcaller_v1.1/VIcaller.config

The seq.output looks fine:
Sample_ID VIcaller_mode QC Reciprocal_alignment Candidate_virus GI Chr. Start End No._chimeric_reads No._split_reads Upstream_breakpoint_on_human Downstream_breakpoint_on_human Upstream_breakpoint_on_virus Downstream_breakpoint_on_virus Information_of_both_upstream_and_downstream_breakpoints Integration_site_in_the_human_genome Integration_allele_fraction No._reads_supporting_nonVI No._reads_supporting_VI Average_alignment_score Is_cell_line_contamination Is_vector Validation_chimeric_confident Validation_chimeric_weak Validation_chimeric_false Validation_split_confident Validation_split_weak Validation_split_false seq standard yes yes human_papillomavirus_type 218931404 1 23694041 23694297 0 0 23694257 23694211 7876 7905 D(+-);D(-+) 23694234 - - 38 64.1578947368421 - - - - - - - -

Best Regards,

Yang

The seq.output of test differ depending on input.

ss.xlsx

This xlsx file consist of three seq.output.
Row 2 is your's seq.output.
Row 3 is output received seq_WGS.bam file as input
Row 4 is output received seq_1.fastq.gz and seq_2.fastq.gz file as input.
Why are the three output different?

my cmd: perl VIcaller.pl detect -d WGS -i test/seq_WGS -f .bam -s paired-end
my config
export PERL5LIB=/etc/perl:/usr/local/lib/x86_64-linux-gnu/perl/5.22.1:/usr/local/share/perl/5.22.1:/usr/lib/x86_64-linux-gnu/perl5/5.22:/usr/share/perl5:/usr/lib/x86_64-linux-gnu/perl/5.22:/usr/share/perl/5.22:/usr/local/lib/site_perl:/usr/lib/x86_64-linux-gnu/perl-base
export PATH=$PATH:/data/program/bowtie2/bowtie2-2.2.9/
human_genome = /data/program/VIcaller_v1.1/Database/Human/hg38.fa
human_genome_tophat = /data/program/VIcaller_v1.1/Database/Human/hg38.fa
virus_genome = /data/program/VIcaller_v1.1/Database/Virus/virus_db_090217.fa
virus_taxonomy = /data/program/VIcaller_v1.1/Database/Virus/virus_db_090217.taxonomy
virus_list = /data/program/VIcaller_v1.1/Database/Virus/virus_db_090217.virus_list
vector_db = /gpfs2/dli5lab/CAVirus/Database/Vector/Vector.fa
cell_line = /data/program/VIcaller_v1.1/Database/cell_line.list
# bowtie_d = /data/program/bowtie2/bowtie2-2.2.9/
# tophat_d = /data/program/tophat/tophat-2.1.0.Linux_x86_64/
# bwa_d = /data/program/bwa/bwa/
# samtools_d = /data/program/samtools/samtools-1.9/
# repeatmasker_d = /data/program/RepeatMasker/
# meme_d = /data/program/meme/
# NGSQCToolkit_d = /data/program/NGS_QC_Toolkit/
# fastuniq_d = /data/program/FastUniq/
# SE_MEI_d = /data/program/VIcaller_v1.1/Tools/SE-MEI/
# hydra_d = /data/program/Hydra-Version-0.5.3/
# blat_d = /home/hikim/bin/x86_64/
# blastn_d = /data/program/blast/ncbi-blast-2.5.0+

What is input file of calculate?

calculate parameter

What is input file of calculate?
I don't know what to do with parameter -F
inputID_h.bam?
inputID_h1.sort.bam?

and,
Is the parameter -N the column No._reads_supporting_VI?

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.