usda-vs / vsnp

vSNP -- validate SNPs

License: GNU General Public License v3.0

Python 100.00%

snp pipeline genomics tuberculosis brucella analysis-pipeline mycobacterium-tuberculosis-complex snps whole-genome-sequencing high

vsnp's Introduction

vSNP3 - Latest Version - Recommended

For the latest version of vSNP, see the vSNP3 repository.

Overview

vSNP -- validate SNPs

Whole genome sequencing for disease tracing and outbreak investigations is routinely required for high-consequence diseases. vSNP is an accreditation-friendly and robust tool designed for easy error correction and SNP validation. vSNP rapidly generates annotated SNP tables and corresponding phylogenetic trees that can be easily scaled for reporting purposes. It can process large-scale datasets and efficiently accommodates multiple references.

Features:

Allows for the creation of customized groups utilizing user specified SNPs.
Functionality to filter desired SNP positions by group or for all groups.
Two types of output:
1. A spreadsheet, or SNP table containing SNP calls sorted in evolutionary order with annotation and genome position information.
2. Corresponding phylogenetic trees
Ability for run-detail to be provided and optimized by multiple users.

vSNP is a 2-step process for efficiency.

vSNP_step1.py takes as input single or paired FASTQ files and either a reference FASTA or a directory name containing the reference FASTA (see vsnp_path_adder.py below). For Mycobacterium tuberculosis complex and Brucella species, if no FASTA file or directory name is provided, a "best reference" is automatically selected.

NOTE: It is only necessary to process each set of raw data once for each reference. The VCF file from Step 1 is saved for future analyses with additional samples in Step 2. It is recommended that VCF files generated with a single reference be placed in a single directory to facilitate future analysis. Only VCF files generated from the same reference can be compared in Step 2.

vSNP_step2.py builds SNP tables and corresponding phylogenetic trees from a directory containing a collection of zc.vcf files output from step 1. Step 2 handles large datasets with thousands of VCF files, outputting detailed comparisons in minutes.

Setup

Installing vSNP:

The person setting up vSNP is expected to be familiar with the command line and able to install conda.

Do not work in the base environment. If needed, make a new environment.

conda create --name myenv

Installation:

conda install vsnp -c conda-forge -c bioconda

Run vSNP_step1.py -h to see usage details.

macOS users may need to follow these special instructions for Samtools.

Reference Options:

Except for a reference FASTA file, no other dependency file is required to run vSNP. However, other files are recommended because they add functionality to the script. For example, an Excel file containing defining SNPs and filters allows the creation of custom subgroups. A template is provided at dependencies/template_defining_filter.xlsx.

Download and add reference options. It is recommended that reference options be placed on storage accessible to both computing resources and subject matter experts analyzing the output data.

git clone https://github.com/USDA-VS/vSNP_reference_options.git

Use vsnp_path_adder.py to add options. See vsnp_path_adder.py -h for help.

For example, after running the following all subdirectories are accessible using the -r option.

vsnp_path_adder.py -d /full/path/to/vSNP_reference_options

vSNP_step1.py -r1 *_R1*gz -r2 *_R2*gz -r Mycobacterium_AF2122

Test Files:

Download test files:

git clone https://github.com/USDA-VS/fastq_data_set-tb_complex

Place each sample in its own directory, and on each directory run the following command:

vSNP_step1.py -r1 *_R1*gz -r2 *_R2*gz

See the help for running multiple samples at once.
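The instruction to place each sample in its own directory can be scripted. The following is a minimal sketch, not part of vSNP: the function name `sort_into_sample_dirs` is hypothetical, and it assumes read files are named `<sample>_R1...gz` and `<sample>_R2...gz`, consistent with the `*_R1*gz` and `*_R2*gz` globs used above.

```python
import re
import shutil
from pathlib import Path

# Assumption: everything before the first "_R1"/"_R2" is the sample name.
READ_PATTERN = re.compile(r"^(?P<sample>.+?)_R[12].*\.gz$")


def sort_into_sample_dirs(fastq_dir):
    """Move each R1/R2 FASTQ file into a subdirectory named after its sample."""
    fastq_dir = Path(fastq_dir)
    moved = {}
    for path in sorted(fastq_dir.glob("*.gz")):
        match = READ_PATTERN.match(path.name)
        if match is None or not path.is_file():
            continue  # not a recognisable read file; leave it where it is
        sample = match.group("sample")
        sample_dir = fastq_dir / sample
        sample_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(sample_dir / path.name))
        moved.setdefault(sample, []).append(path.name)
    return moved
```

After this runs, vSNP_step1.py can be started inside each sample directory as described above. Files that do not match the assumed naming pattern are left untouched.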

Copy *zc.vcf output from step 1 into a directory for step 2. Only samples compared to the same reference can be analyzed together in step 2.

Run vSNP_step2.py on this directory.
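Gathering the step 1 output for step 2 can be sketched as below. This is an illustration only: `collect_zc_vcfs` is a hypothetical helper, and it does not check which reference each VCF was called against, so it should only be pointed at step 1 output produced with a single reference.

```python
import shutil
from pathlib import Path


def collect_zc_vcfs(step1_root, step2_dir):
    """Copy every *zc.vcf found under step1_root into step2_dir."""
    step1_root = Path(step1_root).resolve()
    step2_dir = Path(step2_dir)
    step2_dir.mkdir(parents=True, exist_ok=True)
    step2_dir = step2_dir.resolve()
    copied = []
    for vcf in sorted(step1_root.rglob("*zc.vcf")):
        if step2_dir in vcf.parents:
            continue  # already in the destination directory
        shutil.copy2(vcf, step2_dir / vcf.name)
        copied.append(vcf.name)
    return copied
```

The files are copied rather than moved, so the step 1 directories stay intact for later analyses.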

Output Structure:

As with reference options, it is recommended to place output from step 1 and 2 on storage accessible to subject matter experts analyzing the data. It may be necessary to use data from all three sources (reference options, step 1 and step 2) to fully understand the relationships of the data.

Recommended directory structure for step 1 and 2 output:

Procedure Detail:

Procedure detail is here.

Utility Scripts:

vSNP utility scripts.


vsnp's Issues

Pipeline Inspiration

Hello Tod,

I'm a grad student looking to study how to best genotype Mycobacterium bovis, and the vSNP code provides an excellent way to study this workflow. I was wondering if there were any scientific papers that helped inform your decision making when creating the vSNP program, because I would like to have a paper to complement my study of the program!

Test data output?

Hi, I have installed vSNP using your instructions.

Due to an error, I had to uninstall samtools using conda, since it suggested the set-up was not correct. In the process it removed a few extra dependencies, including pysam. I then installed both pysam and samtools and tried vSNP.py out on the mycobacterium test dataset.

That ended with this output:

runtime: 0:17:34.115335:

average_coverage: 13.8
time_stamp: 2018-12-03_20-33-19
sample_name: 13-1941
species: af
reference_sequence_name: NC_002945.4
R1size: 31.0MB
R2size: 38.8MB
allbam_mapped_reads: 286,205
genome_coverage: 99.02%
ave_coverage: 13.8
ave_read_length: 227.2
unmapped_reads: 1871
unmapped_assembled_contigs: 745
good_snp_count: 675
mlst_type: N/A
octalcode: 640013777377600
sbcode: N/A
hexadecimal_code: 68-0-5F-7E-FF-60
binarycode: 1101000000000010111111111110111111111100000
Q_ave_R1: 34.7
Q30_R1: 89.9%
Q_ave_R2: 27.4
Q30_R2: 41.4%
Path to cumulative stat summary file not found


runtime: 0:19:52.687740:

See files, vSNP has finished alignments

What does it mean when the path to the stat summary is not found? Is that bad?

Thomas

Processing large number of VCF files

When using the "all" option on a large number of VCF files (4460), I get the following error.
`All_VCFs table dimensions: (4298, 63111)
All_VCFs RAxML running...
ERROR: missing ')' at line 0 near '5MIDNRdeerMontm_zc'
cat: write error: Broken pipe
All_VCFs Getting map quality...
All_VCFs annotating from annotation dictionary... D20181218_1709
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/process.py", line 153, in _process_chunk
return [fn(*args) for args in chunk]
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/process.py", line 153, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/bioinfo/vSNP/functions.py", line 2720, in get_snps
excelwriter(out_sort) #***FUNCTION CALL #sort
File "/home/bioinfo/vSNP/functions.py", line 2889, in excelwriter
wb.close()
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/site-packages/xlsxwriter/workbook.py", line 306, in close
self._store_workbook()
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/site-packages/xlsxwriter/workbook.py", line 677, in _store_workbook
xlsx_file.write(os_filename, xml_filename)
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/zipfile.py", line 1645, in write
with open(filename, "rb") as src, self.open(zinfo, 'w') as dest:
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/zipfile.py", line 1378, in open
return self._open_to_write(zinfo, force_zip64=force_zip64)
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/zipfile.py", line 1488, in _open_to_write
self._writecheck(zinfo)
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/zipfile.py", line 1604, in _writecheck
" would require ZIP64 extensions")
zipfile.LargeZipFile: Filesize would require ZIP64 extensions
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/bioinfo/vSNP/vSNP.py", line 227, in <module>
functions.run_script2(arg_options)
File "/home/bioinfo/vSNP/functions.py", line 1696, in run_script2
for samples_in_fasta in pool.map(get_snps, directory_list, itertools_repeat(arg_options), chunksize=5):
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
for element in iterable:
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/home/bioinfo/miniconda3/envs/vsnp/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
zipfile.LargeZipFile: Filesize would require ZIP64 extensions`
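The last frames of this traceback are in Python's standard `zipfile` module, which raises `LargeZipFile` when an archive needs ZIP64 extensions but they are disabled. The sketch below reproduces the same exception class in miniature. As an assumption made to keep the example small, it trips the entry-count limit rather than the file-size limit reported above. XlsxWriter documents a `Workbook.use_zip64()` method for large workbooks; whether that is the appropriate change for vSNP is not established here.

```python
import io
import zipfile

# One more than the 65,535 entries an archive can hold without ZIP64.
TOO_MANY = 65_536


def write_entries(count, allow_zip64):
    """Write `count` empty entries to an in-memory archive; return True on success."""
    buffer = io.BytesIO()
    try:
        with zipfile.ZipFile(buffer, "w", allowZip64=allow_zip64) as archive:
            for index in range(count):
                archive.writestr(f"entry_{index}.txt", b"")
    except zipfile.LargeZipFile:
        # Raised by zipfile when the archive would need ZIP64 but it is disabled.
        return False
    return True
```

Calling `write_entries(TOO_MANY, allow_zip64=False)` fails, while the same call with `allow_zip64=True` succeeds, which is the behavior a ZIP64-enabled writer relies on.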

Update gatk system calls

The system calls to "gatk" need to be changed to "gatk3" now that GATK4 has been released.

An NCBI change to the Genbank file format breaks table annotations

The annotation issue is caused by the join notation used in the gbk files. For example, the line:
complement(join(4727880..4728107,1..738))
Annotation works if the join is removed. Making the above:
complement(4727880..4728107)

When fixing this, search the gbk on the keyword join to also find lines such as this:
join(4727880..4728107,1..738)
This line will also break table annotation.
Simply deleting the line from the gbk fixes it.

Be aware that these "fixes" will cause regions of the genome to not be identified. However, this should affect only a very small part of the genome.
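The search for `join` lines described above can be sketched in a few lines. This is a plain-text scan under stated assumptions: `find_join_locations` is a hypothetical helper, and the feature keys (`gene`, `CDS`) in the example text are illustrative rather than copied from a real gbk file.

```python
import re

# Matches location lines that use join(...), with or without complement(...).
JOIN_LOCATION = re.compile(r"\bjoin\(")


def find_join_locations(gbk_text):
    """Return (line_number, stripped_line) for every line containing 'join('."""
    hits = []
    for number, line in enumerate(gbk_text.splitlines(), start=1):
        if JOIN_LOCATION.search(line):
            hits.append((number, line.strip()))
    return hits


example = """\
     gene            complement(join(4727880..4728107,1..738))
     CDS             join(4727880..4728107,1..738)
     gene            complement(1200..2400)
"""

for number, line in find_join_locations(example):
    print(number, line)
```

The script only reports the lines; editing or deleting them is left as a manual step so that each change can be reviewed.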

samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

Hello,

I am having trouble installing/using this software.

OS: CentOS Linux release 7.9.2009 (Core)
Conda version: 4.11.0

Steps taken to install:

conda create -n vsnp
source activate vsnp
conda install -c defaults -c bioconda -c conda-forge vsnp
cd
git clone https://github.com/USDA-VS/vSNP_reference_options.git
vsnp_path_adder.py -d ~/vSNP_reference_options

This installs the software, but upon using vSNP_step1.py, I get:

### SRR Making indexes...
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

conda list output:

# packages in environment at /home/bc06026/.conda/envs/vsnp:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
abyss                     2.0.2                h51208dd_5    bioconda
asciitree                 0.3.3                      py_2  
bc                        1.07.1               h7f98852_0    conda-forge
biopython                 1.78             py39h7f8727e_0  
blas                      1.0                         mkl  
bokeh                     2.4.2            py39h06a4308_0  
bottleneck                1.3.4            py39hce1f21e_0  
brotli                    1.0.9                he6710b0_2  
bwa                       0.7.17               h7132678_9    bioconda
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.18.1               h7f8727e_0  
ca-certificates           2022.3.29            h06a4308_0  
certifi                   2021.10.8        py39h06a4308_2  
click                     8.0.4            py39h06a4308_0  
cloudpickle               2.0.0              pyhd3eb1b0_0  
cycler                    0.11.0             pyhd3eb1b0_0  
cytoolz                   0.11.0           py39h27cfd23_0  
dask                      2022.2.1           pyhd3eb1b0_0  
dask-core                 2022.2.1           pyhd3eb1b0_0  
dbus                      1.13.18              hb2f20db_0  
distributed               2022.2.1           pyhd3eb1b0_0  
expat                     2.4.4                h295c915_0  
fasteners                 0.16.3             pyhd3eb1b0_0  
fontconfig                2.13.1               h6c09931_0  
fonttools                 4.25.0             pyhd3eb1b0_0  
freebayes                 0.9.21.7                      0    bioconda
freetype                  2.11.0               h70c0345_0  
fsspec                    2022.2.0           pyhd3eb1b0_0  
giflib                    5.2.1                h7b6447c_0  
glib                      2.69.1               h4ff587b_1  
gst-plugins-base          1.14.0               h8213a91_2  
gstreamer                 1.14.0               h28cd5cc_2  
h5py                      3.6.0            py39ha0f2276_0  
hdf5                      1.10.6               hb1b8bf9_0  
heapdict                  1.0.1              pyhd3eb1b0_0  
htslib                    1.14                 h9093b5e_0    bioconda
humanize                  3.10.0             pyhd3eb1b0_0  
icu                       58.2                 he6710b0_3  
intel-openmp              2021.4.0          h06a4308_3561  
jinja2                    3.0.3              pyhd3eb1b0_0  
joblib                    1.1.0              pyhd3eb1b0_0  
jpeg                      9d                   h7f8727e_0  
kiwisolver                1.3.2            py39h295c915_0  
krb5                      1.19.2               hac12032_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.35.1               h7274673_9  
libcurl                   7.80.0               h0b77cf5_0  
libdeflate                1.7                  h7f98852_5    conda-forge
libedit                   3.1.20210910         h7f8727e_0  
libev                     4.33                 h7f8727e_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0              h1d223b6_14    conda-forge
libgfortran-ng            7.5.0               ha8ba4b0_17  
libgfortran4              7.5.0               ha8ba4b0_17  
libgomp                   11.2.0              h1d223b6_14    conda-forge
libnghttp2                1.46.0               hce63b2e_0  
libpng                    1.6.37               hbc83047_0  
libssh2                   1.9.0                h1ba5d50_1  
libstdcxx-ng              11.2.0              he4da1e4_14    conda-forge
libtiff                   4.2.0                h85742a9_0  
libuuid                   1.0.3                h7f8727e_2  
libwebp                   1.2.2                h55f646e_0  
libwebp-base              1.2.2                h7f8727e_0  
libxcb                    1.14                 h7b6447c_0  
libxml2                   2.9.12               h03d6c58_0  
libzlib                   1.2.11            h166bdaf_1014    conda-forge
locket                    0.2.1            py39h06a4308_2  
lz4-c                     1.9.3                h295c915_1  
make                      4.2.1                h1bed415_1  
markupsafe                2.0.1            py39h27cfd23_0  
matplotlib                3.5.1            py39h06a4308_1  
matplotlib-base           3.5.1            py39ha18d171_1  
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py39h7f8727e_0  
mkl_fft                   1.3.1            py39hd3c417c_0  
mkl_random                1.2.2            py39h51133e4_0  
mpi                       1.0                     openmpi  
msgpack-python            1.0.2            py39hff7bd54_1  
munkres                   1.1.4                      py_0  
ncurses                   6.3                  h7f8727e_2  
networkx                  2.7.1              pyhd3eb1b0_0  
numcodecs                 0.9.1            py39h295c915_0  
numexpr                   2.8.1            py39h6abb31d_0  
numpy                     1.21.2           py39h20f2e39_0  
numpy-base                1.21.2           py39h79a1101_0  
openjdk                   11.0.13              h87a67e3_0  
openmpi                   4.0.2                hb1b8bf9_1  
openssl                   1.1.1n               h7f8727e_0  
packaging                 21.3               pyhd3eb1b0_0  
pandas                    1.4.1            py39h295c915_1  
pandoc                    2.12                 h06a4308_0  
partd                     1.2.0              pyhd3eb1b0_1  
pcre                      8.45                 h295c915_0  
perl                      5.26.2               h14c3975_0  
picard                    2.18.29                       0    bioconda
pillow                    9.0.1            py39h22f2fdc_0  
pip                       21.2.4           py39h06a4308_0  
pomegranate               0.14.4           py39h9a67853_0  
psutil                    5.8.0            py39h27cfd23_1  
py-cpuinfo                8.0.0              pyhd3eb1b0_1  
pyparsing                 3.0.4              pyhd3eb1b0_0  
pyqt                      5.9.2            py39h2531618_6  
pysam                     0.16.0.1         py39h051187c_3    bioconda
python                    3.9.11               h12debd9_2  
python-dateutil           2.8.2              pyhd3eb1b0_0  
python_abi                3.9                      2_cp39    conda-forge
pytz                      2021.3             pyhd3eb1b0_0  
pyvcf                     0.6.8           py39hde42818_1002    conda-forge
pyyaml                    6.0              py39h7f8727e_1  
qt                        5.9.7                h5867ecd_1  
raxml                     8.2.12               hec16e2b_4    bioconda
readline                  8.1.2                h7f8727e_1  
regex                     2022.3.15        py39h7f8727e_0  
samtools                  1.15                 h3843a85_0    bioconda
scikit-allel              1.3.5            py39hde0f152_1    conda-forge
scikit-learn              1.0.2            py39h51133e4_1  
scipy                     1.7.3            py39hc147768_0  
seaborn                   0.11.2             pyhd3eb1b0_0  
setuptools                58.0.4           py39h06a4308_0  
sip                       4.19.13          py39h295c915_0  
six                       1.16.0             pyhd3eb1b0_1  
sortedcontainers          2.4.0              pyhd3eb1b0_0  
sqlite                    3.38.2               hc218d9a_0  
tabixpp                   1.1.0                hb264ae4_8    bioconda
tblib                     1.7.0              pyhd3eb1b0_0  
threadpoolctl             2.2.0              pyh0d69192_0  
tk                        8.6.11               h1ccaba5_0  
toolz                     0.11.2             pyhd3eb1b0_0  
tornado                   6.1              py39h27cfd23_0  
typing_extensions         4.1.1              pyh06a4308_0  
tzdata                    2022a                hda174b7_0  
vcflib                    1.0.3                hecb563c_1    bioconda
vsnp                      2.03                 hdfd78af_2    bioconda
wheel                     0.37.1             pyhd3eb1b0_0  
xlrd                      2.0.1              pyhd3eb1b0_0  
xlsxwriter                3.0.2              pyhd3eb1b0_0  
xz                        5.2.5                h7b6447c_0  
yaml                      0.2.5                h7b6447c_0  
zarr                      2.8.1              pyhd3eb1b0_0  
zict                      2.0.0              pyhd3eb1b0_0  
zlib                      1.2.11            h166bdaf_1014    conda-forge
zstd                      1.4.9                haebb681_0

Attempts to resolve this:

I found a similar Github issue here: merenlab/anvio#1479 and tried the suggested solution of conda install -c bioconda samtools=1.9 --force-reinstall, but I'm getting

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package libgcc-ng conflicts for:
python=3.9 -> zlib[version='>=1.2.11,<1.3.0a0'] -> libgcc-ng[version='>=10.3.0|>=7.2.0']
python=3.9 -> libgcc-ng[version='>=7.3.0|>=7.5.0']

Package ncurses conflicts for:
python=3.9 -> ncurses[version='>=6.2,<7.0a0|>=6.3,<7.0a0']
python=3.9 -> readline[version='>=8.0,<9.0a0'] -> ncurses[version='>=6.1,<7.0a0']

Package _libgcc_mutex conflicts for:
samtools=1.9 -> libgcc-ng[version='>=7.3.0'] -> _libgcc_mutex[version='*|0.1',build='main|conda_forge|main']
python=3.9 -> libgcc-ng[version='>=7.5.0'] -> _libgcc_mutex[version='*|0.1',build='main|conda_forge|main']The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.17=0
  - feature:|@/linux-64::__glibc==2.17=0
  - samtools=1.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.17

I tried making a new environment and installing vsnp with conda install -c defaults -c bioconda -c conda-forge vsnp python=3.7 and after a very long time it did give me a y/n proceed to install prompt, but samtools was at 1.7 and I still got the same libcrypto.so.1.0.0 error.
I tried downloading the most recent .tar.bz2 from here: https://anaconda.org/bioconda/vsnp/files, editing the meta.yaml changing python >=3.7 to python=3.7, conda-build ., but got:

conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {'vcflib', 'pysam', 'freebayes', 'scikit-allel', 'picard', 'raxml', 'samtools', 'abyss', 'bwa', 'pyvcf'}

I tried the same thing but removing the version requirement from Python and setting samtools to require version 1.9, but got a similar conda_build.exceptions.DependencyNeedsBuildingError error.

Please advise the best way to resolve this and let me know if you need any other information. Thank you for your help.

AttributeError

Hello,
I'm facing a problem running the first part of the vSNP pipeline.
I have Mycobacterium bovis raw reads obtained from an Illumina MiniSeq paired-end run, and I want to use Mycobacterium bovis AF2122 as the reference. I wrote the following command:

vSNP_step1.py -r1 47104-2018_S10_L001_R1_001.fastq.gz -r2 47104-2018_S10_L001_R2_001.fastq.gz -r Mycobacterium_AF2122
and the analysis starts properly until I get this message:

Traceback (most recent call last):
File "/home/[email protected]/anaconda3/envs/vSNP/bin/vSNP_step1.py", line 122, in <module>
align_reads.align()
File "/home/[email protected]/anaconda3/envs/vSNP/bin/vsnp_alignment_vcf.py", line 376, in align
group_reporter = GroupReporter(zero_coverage_vcf, ref_option)
File "/home/[email protected]/anaconda3/envs/vSNP/bin/vsnp_group_reporter.py", line 30, in __init__
defsnp_iterator = iter(defining_snps.iteritems())
File "/home/[email protected]/anaconda3/envs/vSNP/lib/python3.9/site-packages/pandas/core/generic.py", line 5989, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'iteritems'

After that nothing happens and my folder looks like this:

Sometimes the command line then prints this by itself: Minimum k-mer coverage is 7
and the analysis keeps going (?), but not always.

How can I solve this problem?

Hope I wrote clearly,
thank you in advance
Valentina
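For context on the traceback above: `Series.iteritems()` was removed in pandas 2.0, and `Series.items()` is the replacement. The sketch below shows a version-tolerant way to iterate (label, value) pairs. `iter_pairs` is a hypothetical helper, not vSNP's own fix. Pinning pandas below 2.0 in the conda environment would likely be another option.

```python
def iter_pairs(series_like):
    """Iterate (label, value) pairs on objects offering items() or only iteritems()."""
    items = getattr(series_like, "items", None)
    if items is None:
        # Legacy objects that only provide iteritems().
        items = series_like.iteritems
    return iter(items())


# A dict stands in for a pandas Series here so the example runs without pandas.
defining_snps = {"NC_002945.4:1057": "group_a", "NC_002945.4:2626": "group_b"}
for position, group in iter_pairs(defining_snps):
    print(position, group)
```

In the failing line, the equivalent change would be to call `defining_snps.items()` in place of `defining_snps.iteritems()`.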

Missing dependency

The script vcftofasta.sh calls for a file "ALL_WGS.xlsx" at line 244, but it's not present in the dependencies. Can you please add it?

"script2" has changed a bit since the last time I used it. I'm trying to implement all the new changes in my local installation.

Features request

Hi Tod,

I think that adding these 2 options would make the usage of the pipeline more flexible:

Add an option to run only the "all" tree from the VCF files. All cores would be used for the RAxML job.
Add an option to control the number of samples processed in parallel. This would be useful when the number of samples to process is small. Say you have only 2 samples and one of them has much bigger fastq files than the other; then you could set it to process only 1 sample at a time. That would assign all the cores to 1 sample, run the samples in series and cut down the run time considerably.

I would actually use those 2 options often. We rarely have lots of new strains to process here...
Thanks,
Marco

Is step1 complete? got message: "reading *_unaligned_R2.fastq not such file or directory"

Hello, I've run vSNP step 1, and the program stopped after outputting this message "reading *_unaligned_R2.fastq not such file or directory". Why does this happen? Did step 1 finish correctly anyway?

Thanks in advance,

Genbank short version is downloaded

vsnp_fasta_gbk_gff_by_acc.py -b -a <ncbi accession> downloads the short version of the genbank file and not the long version. The long version is needed by vSNP_step2.py to annotate tables. The current workaround is to manually download the long version from a web browser and place it in the dependency folder. Remember to check the file encoding and line endings (UTF-8, LF).
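The encoding and line-ending check mentioned above can be done with a short script. This is a minimal sketch using only the standard library; `check_text_file` is a hypothetical helper, and flagging a byte-order mark is an extra check that the note above does not ask for.

```python
from pathlib import Path


def check_text_file(path):
    """Report whether a file decodes as UTF-8 and uses LF-only line endings."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return {
        "utf8": is_utf8,
        "bom": data.startswith(b"\xef\xbb\xbf"),  # UTF-8 byte-order mark present
        "lf_only": b"\r" not in data,  # no CR or CRLF line endings
    }
```

A manually downloaded gbk file passes when the result is `{"utf8": True, "bom": False, "lf_only": True}`.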
