Code Monkey home page Code Monkey logo

vsnp3's Introduction

vSNP3

This is an introduction to vSNP version 3. A comprehensive overview of vSNP version 2 is here with version 2 GitHub repo here.

vSNP3 generates BAM, VCF and annotated SNP tables and corresponding phylogenetic trees to achieve a high resolution SNP analysis.

Whole genome sequencing for disease tracing and outbreak investigations is routinely required for high consequence diseases. vSNP is an accreditation-friendly and robust tool, designed for easy error correction and SNP validation... vSNP. vSNP generates annotated SNP tables and corresponding phylogenetic trees that can be scaled for reporting purposes. It is able to process large scale datasets, and can accommodate multiple references.

Conda installation

Tested with Python 3.8 - 3.9.

Anaconda setup

conda create -c conda-forge -c bioconda -n vsnp3 vsnp3=3.21

Installation test

which vsnp3_step1.py
vsnp3_step1.py -h
vsnp3_step2.py -h

Test workflow

Download test files

cd ~; git clone https://github.com/USDA-VS/vsnp3_test_dataset.git

Add reference:

cd ~/vsnp3_test_dataset/vsnp_dependencies
vsnp3_path_adder.py -d `pwd`

Run test with AF2122 (Mycobacterium bovis)

Test step 1:

cd ~/vsnp3_test_dataset/AF2122_test_files/step1
vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t Mycobacterium_AF2122

Test step 2:

cd ~/vsnp3_test_dataset/AF2122_test_files/step2
vsnp3_step2.py -wd . -a -t Mycobacterium_AF2122

Run test with NC_045512 (SARS-CoV-2)

Test step 1:

cd ~/vsnp3_test_dataset/NC_045512_test_files/step1
for i in *.fastq.gz; do n=`echo $i | sed 's/[_.].*//'`; echo "$i in directory $n"; mkdir -p $n; mv $i $n/; done
cdir=`pwd`; for f in *; do echo $f; cd ./$f; vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t NC_045512_wuhan-hu-1; cd $cdir; done
mkdir stats; cp **/*stats.xlsx stats; cd stats; vsnp3_excel_merge_files.py #see combined_excelworksheets-*.xlsx stat summary

Test step 2:

cd ~/vsnp3_test_dataset/NC_045512_test_files/step2
vsnp3_step2.py -a -t NC_045512_wuhan-hu-1

Description

See -h option for detail and usage

Step 1

  • vsnp3_step1.py # main entry for step 1
  • vsnp3_alignment_vcf.py
  • vsnp3_assembly.py
  • vsnp3_best_reference_sourmash.py
  • vsnp3_fastq_stats_seqkit.py
  • vsnp3_group_reporter.py
  • vsnp3_vcf_annotation.py
  • vsnp3_zero_coverage.py

Step 2

  • vsnp3_step2.py, # main entry for step 2
  • vsnp3_fasta_to_snps_table.py
  • vsnp3_group_on_defining_snps.py
  • vsnp3_html_step2_summary.py
  • vsnp3_remove_from_analysis.py

Step 1 and 2

  • vsnp3_annotation.py
  • vsnp3_file_setup.py
  • vsnp3_reference_options.py

Utility scripts

  • vsnp3_path_adder.py
  • vsnp3_bruc_mlst.py
  • vsnp3_download_fasta_gbk_gff_by_acc.py
  • vsnp3_excel_merge_files.py
  • vsnp3_filter_finder.py
  • vsnp3_spoligotype.py

Additional Tools

Version Enhancements

  • Python 3.9 support.
  • Modularity improvements allowing for better usage of functions.
  • Sourmash to select best reference.
  • All dependency files can be provided as individual options, ie. file options can be explicit or based on reference directory.
  • All thresholds added as options.
  • FASTAs and GBKs can remain separate.
  • GBK download utility script to get full versus partial file.
  • Updated annotation descriptions.
  • Enhanced annotation when "no annotation provided".
  • Indels called as N at group SNP positions.
  • VCF files in set with "wrong" chromosome will halt script.
  • Defining snps and filters zipped with starting VCF files for better repeatability.
  • Ability to test defining SNPs individually.
  • With only a FASTA alignment a cascading SNP table can be generated.
  • Version and option capturing enhancements.
  • Enhanced file input options.
  • Oxford Nanopore read compatible (beta).
  • all_vcf named as defining SNP column label.
  • Removed Picard which drops Java requirements thus providing easier conda installation.
  • Strict sample/VCF file naming requirements when matching metadata names. Using the two column metadata Excel worksheet first column must match VCF file exactly, else minus ".vcf", else minus "_zc.vcf", else no match.
  • Improved organization of cascading SNP tables.
  • Unmapped reads not assembled by default. New flag to request assembly of unmapped reads.
  • Detailed steps recorded in run_log.txt.
  • Spoligotype for TB complex isolates not found by default. -s option required with step 1 or find using vsnp3_spoligotype.py.
  • Brucella MLST removed from step 1. Find Brucella MLST using vsnp3_bruc_mlst.py.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.