Code Monkey home page Code Monkey logo

vsnp3's Introduction

vSNP3

This is an introduction to vSNP version 3. A comprehensive overview of vSNP version 2 is here with version 2 GitHub repo here.

vSNP3 generates BAM, VCF and annotated SNP tables and corresponding phylogenetic trees to achieve a high resolution SNP analysis.

Whole genome sequencing for disease tracing and outbreak investigations is routinely required for high consequence diseases. vSNP is an accreditation-friendly and robust tool, designed for easy error correction and SNP validation... vSNP. vSNP generates annotated SNP tables and corresponding phylogenetic trees that can be scaled for reporting purposes. It is able to process large scale datasets, and can accommodate multiple references.

Conda installation

Tested with Python 3.8 - 3.9.

Anaconda setup

conda create -c conda-forge -c bioconda -n vsnp3 vsnp3=3.21

Installation test

which vsnp3_step1.py
vsnp3_step1.py -h
vsnp3_step2.py -h

Test workflow

Download test files

cd ~; git clone https://github.com/USDA-VS/vsnp3_test_dataset.git

Add reference:

cd ~/vsnp3_test_dataset/vsnp_dependencies
vsnp3_path_adder.py -d `pwd`

Run test with AF2122 (Mycobacterium bovis)

Test step 1:

cd ~/vsnp3_test_dataset/AF2122_test_files/step1
vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t Mycobacterium_AF2122

Test step 2:

cd ~/vsnp3_test_dataset/AF2122_test_files/step2
vsnp3_step2.py -wd . -a -t Mycobacterium_AF2122

Run test with NC_045512 (SARS-CoV-2)

Test step 1:

cd ~/vsnp3_test_dataset/NC_045512_test_files/step1
for i in *.fastq.gz; do n=`echo $i | sed 's/[_.].*//'`; echo "$i in directory $n"; mkdir -p $n; mv $i $n/; done
cdir=`pwd`; for f in *; do echo $f; cd ./$f; vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t NC_045512_wuhan-hu-1; cd $cdir; done
mkdir stats; cp **/*stats.xlsx stats; cd stats; vsnp3_excel_merge_files.py #see combined_excelworksheets-*.xlsx stat summary

Test step 2:

cd ~/vsnp3_test_dataset/NC_045512_test_files/step2
vsnp3_step2.py -a -t NC_045512_wuhan-hu-1

Description

See -h option for detail and usage

Step 1

  • vsnp3_step1.py # main entry for step 1
  • vsnp3_alignment_vcf.py
  • vsnp3_assembly.py
  • vsnp3_best_reference_sourmash.py
  • vsnp3_fastq_stats_seqkit.py
  • vsnp3_group_reporter.py
  • vsnp3_vcf_annotation.py
  • vsnp3_zero_coverage.py

Step 2

  • vsnp3_step2.py, # main entry for step 2
  • vsnp3_fasta_to_snps_table.py
  • vsnp3_group_on_defining_snps.py
  • vsnp3_html_step2_summary.py
  • vsnp3_remove_from_analysis.py

Step 1 and 2

  • vsnp3_annotation.py
  • vsnp3_file_setup.py
  • vsnp3_reference_options.py

Utility scripts

  • vsnp3_path_adder.py
  • vsnp3_bruc_mlst.py
  • vsnp3_download_fasta_gbk_gff_by_acc.py
  • vsnp3_excel_merge_files.py
  • vsnp3_filter_finder.py
  • vsnp3_spoligotype.py

Additional Tools

Version Enhancements

  • Python 3.9 support.
  • Modularity improvements allowing for better usage of functions.
  • Sourmash to select best reference.
  • All dependency files can be provided as individual options, ie. file options can be explicit or based on reference directory.
  • All thresholds added as options.
  • FASTAs and GBKs can remain separate.
  • GBK download utility script to get full versus partial file.
  • Updated annotation descriptions.
  • Enhanced annotation when "no annotation provided".
  • Indels called as N at group SNP positions.
  • VCF files in set with "wrong" chromosome will halt script.
  • Defining snps and filters zipped with starting VCF files for better repeatability.
  • Ability to test defining SNPs individually.
  • With only a FASTA alignment a cascading SNP table can be generated.
  • Version and option capturing enhancements.
  • Enhanced file input options.
  • Oxford Nanopore read compatible (beta).
  • all_vcf named as defining SNP column label.
  • Removed Picard which drops Java requirements thus providing easier conda installation.
  • Strict sample/VCF file naming requirements when matching metadata names. Using the two column metadata Excel worksheet first column must match VCF file exactly, else minus ".vcf", else minus "_zc.vcf", else no match.
  • Improved organization of cascading SNP tables.
  • Unmapped reads not assembled by default. New flag to request assembly of unmapped reads.
  • Detailed steps recorded in run_log.txt.
  • Spoligotype for TB complex isolates not found by default. -s option required with step 1 or find using vsnp3_spoligotype.py.
  • Brucella MLST removed from step 1. Find Brucella MLST using vsnp3_bruc_mlst.py.

vsnp3's People

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

duceppemo

vsnp3's Issues

Error while running vSNP 3.20 "step1" with nanopore data

Command used:
vsnp3_step1.py \ -r1 "$r1" \ --spoligo \ --nanopore

Error message:
Traceback (most recent call last): File "/home/bioinfo/miniconda3/envs/vsnp3.20/bin/vsnp3_step1.py", line 273, in <module> vsnp.run() File "/home/bioinfo/miniconda3/envs/vsnp3.20/bin/vsnp3_step1.py", line 158, in run if alignment.zero_coverage.ave_coverage < 40 or alignment.DUPLICATION_RATIO > .80 or float(fastq_stats.R1.passQ20) < 50.0: ValueError: could not convert string to float: '11,015'
Looks like it's caused by the use of a comma (",") as thousands separator. It might be link to the host system language preferences if you don't get that error on your dev / production machine.

vSNP3

I tried to download vSNP3 in MacOS Monterey, ver 12.6.7 and it does not work. Do you have any suggestions? I followed the installation instructions that are provided for Mac users. Thanks!

mpilup takes 5h per sample with nanopore data

Hi Tod,

I've been trying the nanopore support of vSNP3 and I think it still needs to be optimized.

First, when installing vSNP3 with conda, it lacks 2 dependencies: vcftools and bcftools. To get a fully working pipeline (I only tested step 1 so far), I had to run:

conda create -y -n vsnp3 -c bioconda vsnp3=3.06
conda install -c bioconda vcftools bcftools
# I had a problem with some vcftools library that could be solved by creating a symbolic link
ln -s /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.1 /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.0.0

There are a few Warnings printed in the terminal while the step1 runs. The command I used:

$ vsnp3_step1.py -n -r1 '/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz' -f /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta         -b /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk

The terminal output:

vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)



Best Reference Finding with Sourmash 
2022-05-19 14:51:17

== This is sourmash version 4.4.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: /home/bioinfo/analyses/mbovis_... (k=31, DNA)
loaded 1 databases.

WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
11 matches; showing first 3:
similarity   match
----------   -----
  6.3%       NC_002945.4 Mycobacterium bovis AF2122/97 genome assembly...
  6.2%       NZ_CP041790.1 Mycobacterium tuberculosis strain SEA170200...
  6.2%       CP016401.1 Mycobacterium caprae strain Allgaeu genome

Sample: MBWGS036
Top Sourmash Finding: NC_002945.4 
Reference Set: Mycobacterium_AF2122 
Top reference that is automatically available: /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta

#############


Spoligotype 
2022-05-19 14:51:23

Align and make VCF file 
2022-05-19 14:52:36
[M::mm_idx_gen::0.136*1.01] collected minimizers
[M::mm_idx_gen::0.160*1.95] sorted minimizers
[M::main::0.160*1.95] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.184*1.83] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.194*1.79] distinct minimizers: 770441 (96.15% are singletons); average occurrences: 1.053; average spacing: 5.362; total length: 4349904
[M::worker_pipeline::3.132*5.40] mapped 9607 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-ont -R @RG\tID:MBWGS036\tSM:MBWGS036\tPL:ILLUMINA\tPI:250 -t 8 -o MBWGS036.sam /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/MBWGS036.fastq.gz
[M::main] Real time: 3.149 sec; CPU: 16.941 sec; Peak RSS: 0.794 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[markdup] warning: unable to calculate estimated library size. Read pairs 0 should be greater than duplicate pairs 0, which should both be non zero.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf temp1.vcf
	--recode-INFO-all
	--out temp2
	--recode
	--remove-indels

Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 516 out of a possible 611 Sites
Run Time = 0.00 seconds

Zero Coverage 
2022-05-19 17:12:07
	Positions with no coverage: 12,953, 0.297777% of reference

MBWGS036 Poor FASTQ Usability
MBWGS036 Acceptable Reference Usability

As you can notice, the top reference has a very low % value. It still picks the right one, but this part of the pipeline is not optimized for Nanopore. Also, why is it still looking for the best reference is we already told which one to use?

The log file looks like this:


vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='MBWGS009.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)

Call Summary:
SYSTEM CALL: minimap2 -a -x map-ont -R "@RG\tID:MBWGS009\tSM:MBWGS009\tPL:ILLUMINA\tPI:250" -t 8 /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/MBWGS009.fastq.gz -o MBWGS009.sam -- 2022-05-19_17:45:11
SYSTEM CALL: samtools fixmate -O bam,level=1 -m MBWGS009.sam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools sort -l 1 -@8 -o MBWGS009_pos_srt.bam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools markdup -f markduplicate_stats.txt -r -O bam,level=1 MBWGS009_pos_srt.bam MBWGS009_nodup.bam -- 2022-05-19_17:45:24
NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24
NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23
SYSTEM CALL: bcftools mpileup --threads 16 -Ou -f /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta MBWGS009_nodup.bam | bcftools call --threads 16 -mv -v -Ov -o MBWGS009_unfiltered_hapall.vcf -- 2022-05-20_00:17:23
SYSTEM CALL: vcffilter -f "QUAL > 20" MBWGS009_unfiltered_hapall.vcf > temp1.vcf -- 2022-05-20_00:17:23
NOTE: Nanopore QUAL values increased by 100 to obtain closer values seen with Illumina reads, and allowing VCF files from both platforms to be ran together. -- 2022-05-20_00:17:23
NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23
IMPORT: VCF_Annotation(gbk_list=self.gbk, vcf_file=filtered_hapall) -- 2022-05-20_00:17:25
IMPORT: Zero_Coverage(FASTA=reference, bam=nodup_bamfile, vcf=filtered_hapall,) -- 2022-05-20_00:17:41
NOTE: Files moved to temp_dir and removed: *_unmapped*.fastq.gz, *_all.bam, *_fixmate.bam, *_pos_srt.bam, markduplicate_stats.txt, *.bai, *_filtered_hapall.vcf, *_mapfix_hapall.vcf, *_unfiltered_hapall.vcf, *_filtered_hapall_nanopore.vcf, *.sam, *.amb, *.ann, *.bwt, *.pac, *.fasta.sa, *_sorted.bam, *.dict, chrom_ranges.txt, *.fai, dup_metrics.csv -- 2022-05-20_00:17:41

Versions:
vSNP3: 3.06
Bio, 1.79
numpy, 1.22.3
pandas, 1.4.2
Minimap2: 2.24-r1122
Freebayes: v1.3.6
samtools 1.15
Using htslib 1.14

The main issue right now is that the mpileup step (using bcftools) takes about 5h per sample. I just can rerun all my samples with vSNP3 if it takes that long!

Here's the content of the Excel stats file:

sample	date	FASTA/s	Sourmash Sequence Similarity	Found_Reference_Set	FASTQ_R1	R1 File Size	R1 Read Count	R1 Length Sum	R1 Min Length	R1 Ave Length	R1 Max Length	R1 Passing Q20	R1 Passing Q30	R1 Read Quality Ave	Spoligotype Spacer Counts	Spoligotype Binary Code	Spoligotype Octal Code	Spoligotype SB Number	Groups	Aligner	Mapped Paired Reads	Mapped Single Reads	Unmapped Reads	Unmapped Percent	Unmapped Assembled Contigs	Duplicate Paired Reads	Duplicate Single Reads	Duplicate Percent of Mapped Reads	BAM/Reference File	Reference Length	Genome with Coverage	Average Depth	No Coverage Bases	Percent Ref with Zero Coverage	Quality SNPs
MBWGS009	2022-05-19_17-40-28	NC_002945v4.fasta	3.9%:3b48a55512e8dedc2b8d6e33699893bd	Mycobacterium_AF2122	MBWGS009.fastq.gz	248.4 MB	74,874	262,465,587	1	3,505.4	36,224	65.27%	36.07%	13.8717	20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0	binary-1101011101011110110111111111100000111100000	octal-656573377603600	SB1071	group file not provided	Minimap2	0	74,847	2,725	3.5%	skipped assembly	0	442	0.6%	MBWGS009_nodup.bam made with NC_002945v4	4,349,904	99.81%	59.1X	8,295	0.190694%	596

So any plans on improving support for Nanopore? I actually haven't tested vSNP3 on paired end data yet, so I don't know if the speed problem is only Nanopore related or not. Let me know if you need more info.

Thanks!
Marco

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.