Code Monkey home page Code Monkey logo

vsnp3's Issues

vSNP3

I tried to download vSNP3 in MacOS Monterey, ver 12.6.7 and it does not work. Do you have any suggestions? I followed the installation instructions that are provided for Mac users. Thanks!

mpilup takes 5h per sample with nanopore data

Hi Tod,

I've been trying the nanopore support of vSNP3 and I think it still needs to be optimized.

First, when installing vSNP3 with conda, it lacks 2 dependencies: vcftools and bcftools. To get a fully working pipeline (I only tested step 1 so far), I had to run:

conda create -y -n vsnp3 -c bioconda vsnp3=3.06
conda install -c bioconda vcftools bcftools
# I had a problem with some vcftools library that could be solved by creating a symbolic link
ln -s /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.1 /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.0.0

There are a few Warnings printed in the terminal while the step1 runs. The command I used:

$ vsnp3_step1.py -n -r1 '/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz' -f /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta         -b /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk

The terminal output:

vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)



Best Reference Finding with Sourmash 
2022-05-19 14:51:17

== This is sourmash version 4.4.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: /home/bioinfo/analyses/mbovis_... (k=31, DNA)
loaded 1 databases.

WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
11 matches; showing first 3:
similarity   match
----------   -----
  6.3%       NC_002945.4 Mycobacterium bovis AF2122/97 genome assembly...
  6.2%       NZ_CP041790.1 Mycobacterium tuberculosis strain SEA170200...
  6.2%       CP016401.1 Mycobacterium caprae strain Allgaeu genome

Sample: MBWGS036
Top Sourmash Finding: NC_002945.4 
Reference Set: Mycobacterium_AF2122 
Top reference that is automatically available: /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta

#############


Spoligotype 
2022-05-19 14:51:23

Align and make VCF file 
2022-05-19 14:52:36
[M::mm_idx_gen::0.136*1.01] collected minimizers
[M::mm_idx_gen::0.160*1.95] sorted minimizers
[M::main::0.160*1.95] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.184*1.83] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.194*1.79] distinct minimizers: 770441 (96.15% are singletons); average occurrences: 1.053; average spacing: 5.362; total length: 4349904
[M::worker_pipeline::3.132*5.40] mapped 9607 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-ont -R @RG\tID:MBWGS036\tSM:MBWGS036\tPL:ILLUMINA\tPI:250 -t 8 -o MBWGS036.sam /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/MBWGS036.fastq.gz
[M::main] Real time: 3.149 sec; CPU: 16.941 sec; Peak RSS: 0.794 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[markdup] warning: unable to calculate estimated library size. Read pairs 0 should be greater than duplicate pairs 0, which should both be non zero.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf temp1.vcf
	--recode-INFO-all
	--out temp2
	--recode
	--remove-indels

Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 516 out of a possible 611 Sites
Run Time = 0.00 seconds

Zero Coverage 
2022-05-19 17:12:07
	Positions with no coverage: 12,953, 0.297777% of reference

MBWGS036 Poor FASTQ Usability
MBWGS036 Acceptable Reference Usability

As you can notice, the top reference has a very low % value. It still picks the right one, but this part of the pipeline is not optimized for Nanopore. Also, why is it still looking for the best reference is we already told which one to use?

The log file looks like this:


vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='MBWGS009.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)

Call Summary:
SYSTEM CALL: minimap2 -a -x map-ont -R "@RG\tID:MBWGS009\tSM:MBWGS009\tPL:ILLUMINA\tPI:250" -t 8 /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/MBWGS009.fastq.gz -o MBWGS009.sam -- 2022-05-19_17:45:11
SYSTEM CALL: samtools fixmate -O bam,level=1 -m MBWGS009.sam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools sort -l 1 -@8 -o MBWGS009_pos_srt.bam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools markdup -f markduplicate_stats.txt -r -O bam,level=1 MBWGS009_pos_srt.bam MBWGS009_nodup.bam -- 2022-05-19_17:45:24
NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24
NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23
SYSTEM CALL: bcftools mpileup --threads 16 -Ou -f /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta MBWGS009_nodup.bam | bcftools call --threads 16 -mv -v -Ov -o MBWGS009_unfiltered_hapall.vcf -- 2022-05-20_00:17:23
SYSTEM CALL: vcffilter -f "QUAL > 20" MBWGS009_unfiltered_hapall.vcf > temp1.vcf -- 2022-05-20_00:17:23
NOTE: Nanopore QUAL values increased by 100 to obtain closer values seen with Illumina reads, and allowing VCF files from both platforms to be ran together. -- 2022-05-20_00:17:23
NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23
IMPORT: VCF_Annotation(gbk_list=self.gbk, vcf_file=filtered_hapall) -- 2022-05-20_00:17:25
IMPORT: Zero_Coverage(FASTA=reference, bam=nodup_bamfile, vcf=filtered_hapall,) -- 2022-05-20_00:17:41
NOTE: Files moved to temp_dir and removed: *_unmapped*.fastq.gz, *_all.bam, *_fixmate.bam, *_pos_srt.bam, markduplicate_stats.txt, *.bai, *_filtered_hapall.vcf, *_mapfix_hapall.vcf, *_unfiltered_hapall.vcf, *_filtered_hapall_nanopore.vcf, *.sam, *.amb, *.ann, *.bwt, *.pac, *.fasta.sa, *_sorted.bam, *.dict, chrom_ranges.txt, *.fai, dup_metrics.csv -- 2022-05-20_00:17:41

Versions:
vSNP3: 3.06
Bio, 1.79
numpy, 1.22.3
pandas, 1.4.2
Minimap2: 2.24-r1122
Freebayes: v1.3.6
samtools 1.15
Using htslib 1.14

The main issue right now is that the mpileup step (using bcftools) takes about 5h per sample. I just can rerun all my samples with vSNP3 if it takes that long!

Here's the content of the Excel stats file:

sample	date	FASTA/s	Sourmash Sequence Similarity	Found_Reference_Set	FASTQ_R1	R1 File Size	R1 Read Count	R1 Length Sum	R1 Min Length	R1 Ave Length	R1 Max Length	R1 Passing Q20	R1 Passing Q30	R1 Read Quality Ave	Spoligotype Spacer Counts	Spoligotype Binary Code	Spoligotype Octal Code	Spoligotype SB Number	Groups	Aligner	Mapped Paired Reads	Mapped Single Reads	Unmapped Reads	Unmapped Percent	Unmapped Assembled Contigs	Duplicate Paired Reads	Duplicate Single Reads	Duplicate Percent of Mapped Reads	BAM/Reference File	Reference Length	Genome with Coverage	Average Depth	No Coverage Bases	Percent Ref with Zero Coverage	Quality SNPs
MBWGS009	2022-05-19_17-40-28	NC_002945v4.fasta	3.9%:3b48a55512e8dedc2b8d6e33699893bd	Mycobacterium_AF2122	MBWGS009.fastq.gz	248.4 MB	74,874	262,465,587	1	3,505.4	36,224	65.27%	36.07%	13.8717	20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0	binary-1101011101011110110111111111100000111100000	octal-656573377603600	SB1071	group file not provided	Minimap2	0	74,847	2,725	3.5%	skipped assembly	0	442	0.6%	MBWGS009_nodup.bam made with NC_002945v4	4,349,904	99.81%	59.1X	8,295	0.190694%	596

So any plans on improving support for Nanopore? I actually haven't tested vSNP3 on paired end data yet, so I don't know if the speed problem is only Nanopore related or not. Let me know if you need more info.

Thanks!
Marco

Error while running vSNP 3.20 "step1" with nanopore data

Command used:
vsnp3_step1.py \ -r1 "$r1" \ --spoligo \ --nanopore

Error message:
Traceback (most recent call last): File "/home/bioinfo/miniconda3/envs/vsnp3.20/bin/vsnp3_step1.py", line 273, in <module> vsnp.run() File "/home/bioinfo/miniconda3/envs/vsnp3.20/bin/vsnp3_step1.py", line 158, in run if alignment.zero_coverage.ave_coverage < 40 or alignment.DUPLICATION_RATIO > .80 or float(fastq_stats.R1.passQ20) < 50.0: ValueError: could not convert string to float: '11,015'
Looks like it's caused by the use of a comma (",") as thousands separator. It might be link to the host system language preferences if you don't get that error on your dev / production machine.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.