fritzsedlazeck / sniffles Goto Github PK

Structural variation caller using third generation sequencing

License: Other

Python 100.00%

bioinformatics structural-variations ngm-lr pacbio nanopore structural-variation bioconda

sniffles's Introduction

Sniffles2

A fast structural variant caller for long-read sequencing, Sniffles2 accurately detect SVs on germline, somatic and population-level for PacBio and Oxford Nanopore read data.

Quick Start: Germline SV calling using Sniffles2

To call SVs from long read alignments (PacBio / ONT), you can use:

sniffles -i mapped_input.bam -v output.vcf

For improved calling in repetitive regions, Sniffles2 accepts a tandem repeat annotations file using the option --tandem-repeats annotations.bed. Sniffles2 compatible tandem repeat annotations for human references can be downloaded from the annotations/ folder.

(see sniffles --help or below for full usage information).

Installation

You can install Sniffles2 using pip or conda using:

pip install sniffles

conda install sniffles=2.4

If you previously installed Sniffles1 using conda and want to upgrade to Sniffles2, you can use:

conda update sniffles=2.4

Requirements

Python >= 3.10
pysam >= 0.21.0
edlib >=1.3.9
psutil>=5.9.4

Tested on:

python==3.10.12
pysam==0.21.0

Citation

Please cite our paper at: Sniffles v2: https://www.nature.com/articles/s41587-023-02024-y

and Sniffles v1: https://www.nature.com/articles/s41592-018-0001-7

Use-Cases / Modes

A. General (all Modes)

To output deletion (DEL SV) sequences, the reference genome (.fasta) must be specified using e.g. --reference reference.fasta.
Sniffles2 supports optionally specifying tandem repeat region annotations (.bed), which can improve calling in these regions --tandem-repeats annotations.bed. Sniffles2 compatible tandem repeat annotations for human references can be found in the annotations/ folder.
Sniffles2 is fully parallelized and uses 4 threads by default. This value can be adapted using e.g. --threads 4 as option. Memory requirements will increase with the number of threads used.
To output read names in SNF and VCF files, the --output-rnames option is required.

B. Multi-Sample SV Calling (Trios, Populations)

Multi-sample SV calling using Sniffles2 population mode works in two steps:

Call SV candidates and create an associated .snf file for each sample: sniffles --input sample1.bam --snf sample1.snf
Combined calling using multiple .snf files into a single .vcf: sniffles --input sample1.snf sample2.snf ... sampleN.snf --vcf multisample.vcf

Alternatively, for step 2. you can supply a .tsv file, containing a list of .snf files, and custom sample ids in an optional second column (one sample per line), .e.g.: 2. Combined calling using a .tsv as sample list: sniffles --input snf_files_list.tsv --vcf multisample.vcf

C. Mosaic SV Calling (Non-germline or somatic SVs)

To call mosaic SVs, the --mosaic option should be added, i.e.:

sniffles --input mapped_input.bam --vcf output.vcf --mosaic

D. Genotyping a known set of SVs (Force Calling)

Example command, to determine the genotype of each SV in input_known_svs.vcf for sample.bam and write the re-genotyped SVs to output_genotypes.vcf:

sniffles --input sample.bam --genotype-vcf input_known_svs.vcf --vcf output_genotypes.vcf

Quick Tips

Input / Output

.bam or .cram files containing long read alignments (i.e. from minimap2 or ngmlr) are supported as input
.vcf.gz (bgzipped+tabix indexed) output is supported
Simultaneous output of both .vcf and .snf file (for multi-sample calling) is supported

Companion apps

We have developed a plotting tools for Sniffles2: https://github.com/farhangus/sniffle2_plot
We also provide VCF and scripts used for the manuscript https://github.com/smolkmo/Sniffles2-Supplement

Supplementary tables

https://github.com/smolkmo/Sniffles2-Supplement/blob/main/Supplemetary%20tables.xlsx

sniffles's People

Contributors

Stargazers

Watchers

Forkers

jinfengchen zhuochenbioinfo genomicsnx debian alenzhao dasmoocher healthvivo scchess eharr ydliu-hit res2677 xxz19900 tjianghit yqwu1983 hgsc-ngsi wangdi2014 dychangfeng jdetras tw7649116 xuelei-dai thomasfaraut hzaumsq nextomics hmyh1202 mengchengyao yechao94 qzhqzh leaveyeah chen318liang polojacky vipmenon sysu-sinlee musketeer-d evansmathebula linzhi2013 alisaei arun-sub smoe yedomon rkrishnasanka pythseq quanrd liangminliu gaworj xxllgg dolittle007 vilperte novapyth ddesvillechabrol mcao0404 meltpinkg lfpaulin wook2014 nyagam lzx325 smolkmo mealser wdecoster zjmanmu nisharajagopal conanyangqun yaningyang guliba asleonard zhengxinchang burgshrimps monera68 bio-ontology-research-group schaudge yuruihong zhipengliux cxy-26 yuanzehu88 qinqian dx-lzu dineshravindraraju jlanej khjia samstudio8 christopher-schroeder bmw397 jmencius fulcrumgenomics benjaminvyshedskiy acenglish thesequencingcenter

sniffles's Issues

Genotyping for homozygous sample

Hi Fritz,

I work with Arabidopsis thaliana which we dont expect to see so much heterozygosity. But Half of my SV calls basically heterozygous and I dont want to filter them out.

Is there any parameter that I can use for genotyping considering this problem?

I have found these but I dont know which I am supposed to use:

-f , --allelefreq
--min_het_af
--min_homo_af

Best Regards,
Mehmet

unclear point about strands

Hi,

thank you for fixing bugs. i sincerely appreciate.
i have some questions about "strands" in INFO column.
there is "STRANDS=xx" records.
i extracted SV type and STRANDS pairs from Sniffles output.
i used cancer reads generated by MinION.
result is as below:

DEL	+	-
DEL/INV	+	-
DEL/INV/INVDUP	+	+
DUP	-	+
DUP/INS	+	-
INS	+	-
INV	+	+
INV	+	-
INV	-	-
INV/INVDUP	+	+
INV/INVDUP	-	-
INVDUP	+	+
INVDUP	-	-
TRA	+	+
TRA	+	-
TRA	-	+
TRA	-	-

i read this ##INFO line:
##INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">

i think INV will have only STRANDS=-- or STRANDS=++ but in my .vcf, INV sometimes have STRANDS=+-.

i would like to know what +/- means.

extremely long SVs detected (>20Mb)

Hi Fritz,

I applied Sniffles (1.0.6) on one simulated dataset of mine (100 deletions simulated in hg19, chr21+chr22). Sniffles detected many deletions correctly but also outputted several super-large (and in my opinion incorrect) SVs. See the IGV screenshot below (top row: implanted deletions, bottom row: detected SVs from Sniffles)

Two examples from the VCF output:
chr21 11004164 7 N <DEL> . PASS PRECISE;SVMETHOD=Snifflesv1.0.6;CHR2=chr21;END=32799278;STD_quant_start=0.935414;STD_quant_stop=0.707107;Kurtosis_quant_start=0.102041;Kurtosis_quant_stop=-1.000000;SVTYPE=DEL;SUPTYPE=SR;SVLEN=21795114;STRANDS=+-;RE=8 GT:DR:DV ./.:.:8
chr21 11142297 11 N <DUP> . PASS PRECISE;SVMETHOD=Snifflesv1.0.6;CHR2=chr21;END=31923593;STD_quant_start=5.196152;STD_quant_stop=1.154701;Kurtosis_quant_start=0.000000;Kurtosis_quant_stop=0.000000;SVTYPE=DUP;SUPTYPE=SR;SVLEN=20781296;STRANDS=-+;RE=3 GT:DR:DV ./.:.:3

I probably made some mistake here but cannot find it. Have you seen something similar before?

This is my pipeline:

mapping reads with ngm-lr onto hg19
samtools sort (query-sort) on bam
sniffles -s 5 -t 5 -v VCF (-s 2 gives similar result)

[Question] : SVTYPE = TRA

Hi,

Thanks a lot for sniffles, this is a really efficient, easy to install and to run tool... I'm so glad such a tool exist.

I've recently used sniffles after ngmlr alignement in order to detect SV in my higly diploid assembly.
I'm not really used to VCF format, but I think I can easily infer what SVTYPE means : INS = insertion, DEL = deletion, INV = inversion...
But I've got a lot of theses huge SV (like several Millon bases long) that are written as TRA, which I have no idea what it stands for.
I couldn't find any information regarding this in VCF specifications format . Do you have any idea about that subject ?

Cheers,

Roxane

Genotype sequences?

Hi - are there any plans to include sequences in the VCF output? For example, if it's an insertion, what sequence (from the reads) has been inserted?

If this isn't planned, can you suggest other tools to extract sequence from a VCF, reference, and reads?

Thanks,
~Joe

Use cs tag

Expand Sniffles to use the CS tag proposed by @lh3
https://github.com/lh3/minimap2#the-cs-optional-tag

'Too few reads' even if '--min_support 1'

Hi.

I tested Sniffles with 3 contigs that have 100/1000/2000 bp INDEL.
but Sniffles reported 'Too few reads' even if '--min_support 1'.

# sniffles -m sort.bam --min_support 1 -v output.vcf
Estimating parameter...
Too few reads detected in sort.bam

The contig are mapped by ngmlr and minimap2.

sniffles version:(git source)

# sniffles --version

sniffles  version: 1.0.8

sort.bam.zip

-l parameter : minimum sv length to be reported.

Hi Fritz,

I would like to get even one single nucleotide deletions or insertions. When I set -l parameter to 1. It gets so slow. 15 hours was not enough even to process couple of thousands reads.

Normally if -l is higher than 5 it takes 20 min.

Is it normal?

Best,
Mehmet

Analysis performed only on 13 first contig

Hi again, sorry for the second and maybe dumb question.

I just noticed that sniffles only worked on my first 13 contigs, but I have 500+ contigs in total. I did not saw any option to restrict the analysis only on the first contigs... But maybe I did something wrong ? Any idea ?

Thanks again,

Roxane


	Switch Chr Contig_11
		# Processed reads: 2680000
		# Processed reads: 2690000
		# Processed reads: 2700000
		# Processed reads: 2710000
		# Processed reads: 2720000
		# Processed reads: 2730000
		# Processed reads: 2740000
		# Processed reads: 2750000
		# Processed reads: 2760000
	Switch Chr Contig_12
		# Processed reads: 2770000
		# Processed reads: 2780000
		# Processed reads: 2790000
	Switch Chr Contig_13
		# Processed reads: 2800000
		# Processed reads: 2810000
		# Processed reads: 2820000
		# Processed reads: 2830000
Finalizing  ..

change 'Number=.' to 'Number=A' or 'Number=R' in VCF

Hi.

please change 'Number=.' to 'Number=A' or 'Number=R' in VCF for more strict VCF.

then 'bcftools norm --multiallelics' can be used to split/merge VCF record.

This will be a minor enhancement.

Best Regards

predicted_length - what is it?

Hi,

As usual - thanks for the great tool.

In the BEDPE output, the header specifies the columns as:

#Chrom	start	stop	chrom2	start2	stop2	variant_name/ID	score (smaller is better)	strand1	strand2	type	number_of_split_reads	best_chrom	best_pos	best_chrom2	best_pos2	predicted_length

The last column is called predicted_length. I naively assumed this was the predicted length of the SV -- e.g. of a deletion. However, I recently noticed that the predicted_length column always has the same value as the number_of_split_reads column. For example:

So my question is -- is that intentional (that both columns have the same value)? If so, is it safe to say that predicted_length is meaningless for now?

Best,

John

Do I need to mark duplicates before feeding sorted bam file to sniflle?

Hello, I am working on a human genome, and I want to call SV using Sniffle. I have just mapped pacbio data to human reference genome using ngmlr and then sort the bam file using samtools. Before feeding the sorted bam file to sniffle, I wonder whether I need to mark duplicates and select only a single subread as non-duplicate?

What is "forced vcf calling"

Hi,

I am wondering what exactly sniffles is going to do when I specify the option --Ivcf?
In the wiki, it says that this 'enables forced calling'. Does this mean sniffles will limit variant calling to regions of previously seen variants? If so, are the regions exact positions (variants in vcf file) or do they include flanking regions for variability?

Sorry if I misunderstood.

--report_seq core dump

When I run sniffles-1.0.6 with the --report_seq option I get:

terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr
Aborted (core dumped)

This does not happen when I don't use the --report_seq option. Without that option the program gives interesting results, thanks.

The problem seems to occur around the time the first vcf record it being output, 1-5 minutes into the run of the program. The error occurs for 4 out of 5 bam files I was working with, while the fifth one worked OK.

Error: Segmentation fault (core dumped) (sorting with picardtools)

Hi Fritz,

I am working a bacterial genome, which is a circular genome. I would like to call SV using Sniffle.

I have mapped my PacBio read to ref genome using NG-MLR.
ngmlr -r $REF -q $FILE -o $OUTPATH/$SAMNAME -t 30

Then I sort the sam file and convert it to sorted bam format using picard tool.
java -Djava.io.tmpdir=pwd/tmp -jar $PICARD/picard.jar SortSam \ I=$OUTPATH/$SAMNAME O=$OUTPATH/${SAMNAME%%.sam}_sorted.bam \ SO=coordinate TMP_DIR=pwd/tmp

The sorted bam file was fed to Sniffles:

sniffles -m file_sorted.bam -v file.vcf

After running this command, I got an error message:
Segmentation fault (core dumped)

Could you help me with this problem?

Jie

create release tags

Hi, can you create a release tag for Sniffles? I'd just like this for reproducibility as my users seem quite enamored with your software!

Question on bwa options

I am trying to use sniffle to detect sv on human genomes. The guide says "requires output from BWA-MEM with the optional SAM attributes enabled". What sam attributes are required? Thanks!

How to filter the variants called by sniffles

Hi, I've just run sniffles to call structral variants with pacbio data. Now I want to filter some variants which look like impossilbe. For example,

1       724782  0       N       <DUP>   .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=1;END=224202044;ZMW=19;STD_quant_start=59.084034;STD_quant_stop=93.640885;Kurtosis_quant_start=-1.482464;Kurtosis_quant_stop=-1.658414;SVTYPE=DUP;SUPTYPE=SR;SVLEN=223477262;STRANDS=-+;RE=26    GT:DR:DV        ./.:.:26

The length of this SV is 223477262, almost equal to the length of chromesome 1. And in the INFO field it is marked as IMPRECISE. I wonder whether I can use this mark to exclude it ? If not , is there any other way to filter raw variants?

Detection on Pooled Samples

Hi Fritz,

I was wondering if Sniffles is suitable for calling variants on pooled sample data, i.e. where there might be quite low frequency SVs? Or is it best suited to data from individuals?

Thanks very much in advance, and best wishes!

Sam

svlen is missing from insertion calls

Dear Fritz,

I was wondering if the following issue is an error in sniffles. I have aligned 5X worth of PacBio sequel data using NGMLR and called structural variations using SNIFFLES. I have found cases where insertions are called and insertion sequences are provided, but the SVLEN is missing as the example below shows.

I can approximate the SVLEN from the given SEQ, but I wanted to know what was causing this issue.

5 64184964 7714 N . PASS PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=5;END=64184964;ZMW=1;STD_quant_start=0.707107;STD_quant_stop=0.707107;Kurtosis_quant_start=-1.000000;Kurtosis_quant_stop=-1.000000;SVTYPE=INS;SUPTYPE=AL;SVLEN=NA;STRANDS=+-;SEQ=TGATCACGTTGGAAACACTCTTTTGTAGTGTCTGGAAGGGAAAGATTTGGAGCGCTTTGATGCCCTTTGGTGAACAAGGGCATTTCTTCCCATAAAAATTAGACAGAAGCATTTCTAGAAACTTGTTGTGATGGTGACCCAGCTAAGGAGGACAGTTGAACATTCGTATGATAGAGCAGTGTGAAACACTATTTTTGTGGAACAATGCAAGCAATTGGATATTTGGATAGCTGGAGGGGATTTCGTTGAAAGCGGGATATTCAAATAAAAAGGTAGACAGGAGCATTCTCAGAAATTTTCTTTCTGATGTACTGATATTCAACTCATAGAGTGAAGATTCCCGTTCATAGAGGCAGTTTTGAAACACCTTCTGGAGTATCTGGATGTGGACATGTGGAGCGCTTTGATTGCCTACGGTGAATGAAAAAGGGCAAATATCTTCCCATTAAAACGAGACAGGAAGATTCGGAGAGACAGTGGATGTGTTGTACATCAAGATAGCTAACAGAGTGCGAACCCTTTCTTTGTTACAGAGCAGCTTTGAAACGCTATTTTTGTTGGATCTGCAAATGGATATTTAGAGTTGCTTTTTACAGATATCGTTGGAATAGGGAAAATCGTCAACAAAATCTGGACAGAGCATTCTCACCAAAACTTCTTGTTGATGTGTGTCCTCAACGTAACAGAGTGAAAACTTTCTTTTCGATGCCTAGCAGTTTGGAAAAACCTTGTTGTAGAACACTGTAAGGGGATATTGGGATTGCTCTAAGATTTCGTTGGAAACGGGAATATCAATCATCTAAAATCTAGACAGAAGCACTTATTTAGAAAAATCTTGGTGATATCTGCATTCAAGTCAGAAGAGTTGAACATTCCTTACTTGAGATCGTTTGAAACACCCTTTGGAAAGAATCTGGAATGTGGAATTTGGAGCGCTTTATGCCTTTGGTTGAAAAGGAAACGTCTTCCAATAAAAGCTAGACAGAAAGCATTCTCAGAAATTGTTTCCTTGATGTGTGTACTCAACTAAAAGAGTTGAACCTTTCTATTGATTGTAGAGCAGTTTAACACACTCTTTTTGTGGATTCTGCAAGTGGATATTTGGATTGCTTTGAGGAGGTTCGTTGGAAGCGGGAATTCGTTAAACACACTAGAAGCACATTTCCCAGAAATGTTCTGGATATTTCAATTCAACTCATAGAGATGAACATGGCCTTCATAGAGCAGGTTGAAACACTCTGTTTTGTAGTTGTGGAAGTTGGACATTGTCGATCGCCTGGACGCCTACGGTTGAATGTGAAAAAAGGAGGAAAATTATATTCCAGAAAAAGGACAGAAGCATTCTCAGGAAACTGTGGTTGATTTGGTCCTCAACTAACAGAGTTGACTTTGCCTATGATAGGAGAGCAGTTTTTGAAACCCACTCTATTTTCATGATCTTGCAAGGGGATATTTGGATAGCTTGGAGATTTAGTTGGAAGCGGGCATTCAAAGTAAAGTAGACAGCAGCATTCGCAGATAAATTATTTTCTGATCGATGCATCAAATCATTAGAGTTGAACATTTCCTTTGTCATAGGGCAGGTTTGAAAGACCTCTTTTCTCGTAGTATCTGGCTGTGGAATTTGTGAGCGTTGAGCCTACGGTGAAAAAGTAAATATCTTCCCATTTAAAAACGAGACAGAAGGATTTGAGAAAACAAGTTGTGAATGGTGTACTCAGGCTAACAGAGTGAACCTTCTCTTTGATGCCAGTAGTTTGGAAACACTCTTTTTTGTAGAATAGGGAAACTGTAAGGGTGATTGGGATAGCTCTAATGATTTCGTTGGAAACGGAATATAATACACTAAAAATCGAGACAGAAGCCCTTTCAGAAAATACTTGTGATATCTGCATCAAGTCAACAGCGGTGTTCGAACATCAGCTTTCTTAGACACGTGGAAAACGTCTTGTTGTAGGGGGTCTGGAAGTGGTGGACATTTGGAAGCGCTTTGATGCCTT;RE=2;AF=0.4 GT:DR:DV 0/1:3:2

Best,
Sangjin Lee

double free or corruption (out)

I am running the command as follow:

sniffles-core-1.0.6/sniffles -n 5 --genotype -s 2 -m ../pacbio_align_sorted.bam -v ./sniffle_mem_min2.vcf > ./sniffle_out_mem_min2.log 2>&1

after some time of running it raise up this error:


*** Error in `/home/user/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles': double free or corruption (out): 0x00007ffe70fb9a70 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f44dadd47e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f44daddd37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f44dade153c]
/home/medhat/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles(_ZN9Genotyper11compute_covER15Breakpoint_TreeRP15breakpoint_node+0x20f)[0x42d67f]
/home/medhat/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles(_ZN9Genotyper10update_SVsEv+0x3e)[0x43141e]
/home/user/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles(main+0x1ff)[0x426f5f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f44dad7d830]
/home/user/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles(_start+0x29)[0x42aad9]
======= Memory map: ========
00400000-0052a000 r-xp 00000000 08:22 2009114                            /home/medhat/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles
0072a000-0072b000 r--p 0012a000 08:22 2009114                            /home/medhat/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles
0072b000-0072c000 rw-p 0012b000 08:22 2009114                            /home/medhat/source/Sniffles_1.0.6/bin/sniffles-core-1.0.6/sniffles
0072c000-0072f000 rw-p 00000000 00:00 0 
0232d000-11371000 rw-p 00000000 00:00 0                                  [heap]
7f44c4000000-7f44c5be4000 rw-p 00000000 00:00 0 
 ....
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
[1]    15054 abort (core dumped) sniffles-core-1.0.6/sniffles -n 5 --genotype -s 2

any suggestion?

How to group CNV

Hi,
I have got many structral variants called by Sniffles. Now I want to extract copy number variants from them, but I don't know what to do . Is it proper to just treat DUP and DEL as copy number variants ?
I'll appreciate your help!

the recommended pipeline for SV by pacbio reads

Hi.

Which is the recommended pipeline for SV by PacBio reads

raw PacBio read -> nglmr -> sniffer
raw pacbio read --(canu) -->corrected reads -> nglmr -> sniffer

for small var, 1) seems better. pacbio reads are more close to reference genome.

for big SV, 2) seems better. pacbio reads are more close to each other?

Is there some public test result?

Segfault with forced VCF calling

The bug I was referring to in #30

After some debugging I don't think that the genotyping is the problem.
The program segfaults because the variant is defined as DUP/INS (sv_type = 6), causing the following line (Breakpoint.cpp:452) to be false (variant is treated as an insertion):

		if (!(this->get_SVtype() & INS)) { //all types but Insertions:

and thus moving into the else clause which tries to compute the median of the lengths2 vector (Breakpoint.cpp:468), but this segfaults because lengths2 is empty.

When I change line 452 to

		if (!(this->get_SVtype() == INS)) { //all types but Insertions:

the program terminates successfully.

STD_quant_start and STD_quant_stop

Hi,
First, how much value of STD_quant_start and STD_quant_stop is precise and how much value of STD_quant_start and STD_quant_stop is imprecise.
Second: use bam(ngm-lr) to caculate depth(samtools),depth is very low,less than 2x, raw data 60x, how difference the pacbio bam and the illumina bam?how do you caculate depth?
Thanks.

--use_MD_Cigar

Hi,

Thanks for developing this tool. It seems to do what you says it does. :)

I was curious about the "--use_MD_Cigar" flag. Could you give me any insight of when I would or would not want to use it?

Many thanks.

--John

Trouble to understand some coordinates

Hello
I am curious regarding the way the results are represented.
I have for example a position which was confirmed through another independent analysis:

000000F 11430646        11430740        000000F 11580771        11580779        8       -1      +       -       DEL     15      000000F 11430739        000000F 11580772        150033

So I understand the format of the *.bede file here the following way:
At position 11430646 there is a potential deletion on one of the alleles which ranges till 11580772.
This is similarly as well graphically represented in the VCF file in IGV.

Now I have another candidate:

 006538F 3752    3780    006956F 783     1387    7718    -1      +       +       TRA     16      006538F 3779    006956F 1222    1406222

This is now a translocation which supposedly starts at 3752 on 006538F. But where does it no stop ?
I guess if I understood correctly is stops on 006956F @ 1222 ?

These are not visualized in the VCF if visualized by IGV .

bug report

Hi,

I am using Sniffles for my research.
I added --report_BND option but there may be bug.
In Sniffles/src/print/VCFPrinter.cpp line 95,

if (strands[0] == '-' && strands[0] == '+')

I think this if statement is always false.

citation for Sniffles

Hi,
I am curious how can we cite Sniffles if we used it?
Do you have a publication for Sniffles?
Thanks!

Should I be sorting the BAM file?

Hi,

I have not been sorting the alignments (e.g. via samtools sort) that I am giving sniffles. I just do this:

bwa index lambdaGenomeSequence.fa
bwa mem -M -x pacbio lambdaGenomeSequence.fa simreads.fasta | samtools view -bSh - > simreads.bam
sniffles -m simreads.bam -v simreads.vcf

Sniffles does not complain about the file not being sorted and it seems to give expected results on simulations. However, I was just curious if there will be unexpected effects of not sorting when I move on to real data on a much larger genome.

best,

John

DUP in Sniffles?

Hi Sniffles developers,

What does the DUPcalls in Sniffles result file refer to, interspersed duplication or tandem duplication?

Thanks.

George

Parallel output in vcf and bede ?

Hello
It seems that the parallel output in VCF and BEDE is not working . E.g. if I try:

sniffles -m file.sorted.bam -t 10 -v file.sniffles.vcf -b file.sniffles.bede

Only the VCF is generated. If I try on the other hand:

sniffles -m file.sorted.bam -t 10  -b file.sniffles.bede &

Everything works well. Is is possible to generate both outputs the same time?

Cheers

IGV throws error on sniffles vcf

IGV 2.3.68 throws an error on a vcf file I created with sniffles:

Unable to parse header with error: Your input file has a malformed header: Unexpected tag "SUPTYPE" in line /path/to/file.vcf

At first I thought it was a typo, but changing SUPTYPE to SUBTYPE gave a similar error.

multiple samples as input bam

Hi, thanks for developing such a nice tool. It's also very easy to use.

Just wondering whether it is possible to call SV using multiple samples?

For example
./sniffles -m sample1.bam sample2.bam -v output.vcf

Thanks again!

redundant translocation call

Hi,
I have just run Snifflesv1.0.7 to call structural variants with pacbio data. However, when I checked my results, I found many translocation calls are redundant. For example , there are 8 records located on chromosome 9 that seems to represent the same variant.

9       110248031       28615   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23388487;ZMW=15;STD_quant_start=0.000000;STD_quant_stop=0.632456;Kurtosis_quant_start=6.944904;Kurtosis_quant_stop=-1.664143;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1203722464;STRANDS=--;RE=18        GT:DR:DV        ./.:.:18
9       110248031       28616   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23524163;ZMW=15;STD_quant_start=0.000000;STD_quant_stop=0.632456;Kurtosis_quant_start=1.716204;Kurtosis_quant_stop=-0.498174;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1203858140;STRANDS=-+;RE=18        GT:DR:DV        ./.:.:18
9       110249307       28617   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23388486;ZMW=28;STD_quant_start=0.000000;STD_quant_stop=0.516398;Kurtosis_quant_start=10.688393;Kurtosis_quant_stop=2.387808;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1203721187;STRANDS=--;RE=30        GT:DR:DV        ./.:.:30
9       110249307       28619   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23524163;ZMW=19;STD_quant_start=0.288675;STD_quant_stop=0.645497;Kurtosis_quant_start=1.002878;Kurtosis_quant_stop=2.246256;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1203856864;STRANDS=-+;RE=24 GT:DR:DV        ./.:.:24
9       110250549       28621   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23388492;ZMW=34;STD_quant_start=0.000000;STD_quant_stop=0.677003;Kurtosis_quant_start=-0.195078;Kurtosis_quant_stop=-1.416803;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1203719951;STRANDS=++;RE=48       GT:DR:DV        ./.:.:48
9       110250549       28622   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23524158;ZMW=30;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=0.167256;Kurtosis_quant_stop=5.765222;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1203855617;STRANDS=+-;RE=38 GT:DR:DV        ./.:.:38
9       133681711       28629   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23632361;ZMW=40;STD_quant_start=0.935414;STD_quant_stop=0.500000;Kurtosis_quant_start=4.041601;Kurtosis_quant_stop=6.036755;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1180532658;STRANDS=-+;RE=49 GT:DR:DV        ./.:.:49
9       133681337       29767   N       <TRA>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.7;CHR2=22;END=23632359;ZMW=37;STD_quant_start=0.759555;STD_quant_stop=0.960769;Kurtosis_quant_start=2.699066;Kurtosis_quant_stop=7.087048;SVTYPE=TRA;SUPTYPE=SR;SVLEN=1180533030;STRANDS=+-;RE=53 GT:DR:DV        ./.:.:53

According to other research, there are only one reciprocal translocation between chromosome 9 and 22, so I wonder why this happened, and what I can do to get a better result.

--genotype undefined INFO

using a current build adding --genotype flag produces AF entries in INFO but they aren't defined in the header. I assume this is allele frequency so am adding
##INFO=<ID=AF,Number=.,Type=Integer,Description="Allele Frequency.">

is that right?

vcf file columns

Dear Fritz

When I try to upload the vcf file into IGV I'm getting an error complaining about missing columns. In the example below the 3rd line is missing the GT:DR:DV field and the field behind that. I'm not too worried about these fields missing but having a couple of tab-separated dots in the lines before the line break where these fields would be would make IGV's problem go away, I reckon.

ENA|CM001196|CM001196.1 83605 0 N <DUP> . PASS PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=ENA|CM001196|CM001196.1;END=4960077;STD_quant_start=7.925792;STD_quant_stop=0.797724;Kurtosis_quant_start=-1.159180;Kurtosis_quant_stop=-1.284778;SVTYPE=DUP;SUPTYPE=SR;SVLEN=4876472;STRANDS=-+;RE=22;AF=0.956522 GT:DR:DV 1/1:1:22 ENA|CM001196|CM001196.1 85847 1 N <DEL> . PASS PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=ENA|CM001196|CM001196.1;END=4960071;STD_quant_start=2.102630;STD_quant_stop=1.076055;Kurtosis_quant_start=5.490807;Kurtosis_quant_stop=-1.172381;SVTYPE=DEL;SUPTYPE=SR;SVLEN=4874224;STRANDS=+-;RE=39;AF=0.975 GT:DR:DV 1/1:1:39 ENA|CM001196|CM001196.1 116744 2 N <INS> . PASS IMPRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=ENA|CM001196|CM001196.1;END=116744;STD_quant_start=5.024938;STD_quant_stop=14.515509;Kurtosis_quant_start=-1.800901;Kurtosis_quant_stop=-1.888449;SVTYPE=INS;SUPTYPE=AL;SVLEN=71;STRANDS=+-;RE=40 ENA|CM001196|CM001196.1 117804 3 N <INS> . PASS IMPRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=ENA|CM001196|CM001196.1;END=117804;STD_quant_start=4.156655;STD_quant_stop=12.062338;Kurtosis_quant_start=-1.102856;Kurtosis_quant_stop=-1.052519;SVTYPE=INS;SUPTYPE=AL,SR;SVLEN=610;STRANDS=+-;RE=37;AF=0.902439 GT:DR:DV 1/1:4:37 ENA|CM001196|CM001196.1 146939 4 N <DEL> . PASS PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=ENA|CM001196|CM001196.1;END=150338;STD_quant_stop=-1.052519;SVTYPE=INS;SUPTYPE=AL,SR;SVLEN=610;STRANDS=+-;RE=37;AF=0.902439 GT:DR:DV 1/1:4:37
Regards,

A. Webb

No precise variant

Hello,

I tested Sniffles on a small sample. I used NGM-LR for mapping pacbio long reads to the reference.
Sinffles reports about 13,000 variants, but they are all imprecise.
Is this expected? Any chance that I get precise calls by changing parameters of the mapper or Sniffles?

ngmlr and sniffles applied to de novo assembly

Dear Fritz,

We have assembled a large mammalian genome using 90X of SMRT platform data. The base-accuracy of the assembly is often assessed using alignment of HiseqX paired-end reads to the assembly and calculating the number of homozygous snps and homozygous indels.

I wanted to use long-reads to assess the assembly in a similar manner. I am using ngmlr and sniffles to align the long-reads back to the assembly and to genotype homozygous structural variations against the assembly. I have used the following cmd for this purpose.

${SNIFFLES_PATH}/sniffles -m input.bam -v output.vcf -t 10 -l 50 --report_seq --cluster --genotype

I have found approximately ~20,000 structural variations against the de novo assembly. I am quite surprised by the fact that there the genotype is 0/0 for the structural variation. I was expecting the genotype to be 0/1 or 1/1. Could you provide some insights into this?

I look forward to your response.

Best,
Sangjin Lee

Segmentation fault when giving invalid output vcf file

Hi Fritz,

on my machine, Sniffles crashes with a segfault when given an invalid path to the output vcf file:
sniffles -m input.bam -v non_exisiting_directory/output.vcf
Segmentation fault
I noticed this when I forgot to create the output directory ("non_exisiting_directory") before running Sniffles. Although this was my fault of course, it would be great if Sniffles would give a proper error message ;-)

Cheers
David

automatically generated files are committed to the repository

You have object files and autogenerated makefiles committed to the repository here, which isn't an efficient use of git. Please consider removing them from the repository and adding lines to .gitignore to prevent them getting added back in inadvertently.

Positions of SVs (insertions)

Hi Fritz,

I m willing to compare SVs which were called by sniffles with pacbio and short read read caller. (I use pindel for that).

Everything worked pretty well, there is one thing that I do not understand. The way that you give positions for indels.

For example for pindel it is always one nucleotide difference between start and stop of sv. Even if it is a SNP or 100 nucleotide insertions
.
But for Sniffles it varies a lot. I didnt understand the logic behind it. That is why currently I can not compare these two tools' output for indels.

Is it possible to tell me why insertion positions varies that much? Are those complex indels or what?

For example this one;
NC_003075.7 7231255 7968 N . PASS PRECISE;SVMETHOD=Snifflesv1.0.6;CHR2=NC_003075.7;END=7231260;STD_quant_start=0.000000;STD_quant_stop=1.140175;Kurtosis_quant_start=17.144694;Kurtosis_quant_stop=12.218618;SVTYPE=INS;SUPTYPE=AL;SVLEN=11;STRANDS=+-;RE=21 GT:DR:DV ./.:.:21

There is 5 nucleotides distance between start and stop position. I believe it should be something like that start:7231255 and stop 7231256

Best,
Mehmet

minimap2 output bam is not working with sniffles

Hello,

I am using nanopore (1D) human data, I have aligned using ngmlr and minimap2. Sniffles for SV calling, but ngmlr bam of the same sample is giving straucrtual variant vcf but minimap2 bam is not giving any variant output.
Do yo have any idea about it?
One more question, can I use Illumina bam to extract structural variants using Sniffles.

Which ngmlr paramters should be enabled to be compatible with sniffles?

Hi Fritz,

For BWA it is obvious -x and -M but I am not 100% sure which parameters for ngmlr that I should enable to use for sniffles?

Best,
Mehmet

TRA calls on a de novo assembly

Hi again,

Yesterday, I git cloned the latest version (1.0.3) -- I had been using 1.0.0.

I noticed it called more SVs given the same data. Most of the additional SV calls were of the type: TRA. I assume this means "translocation". ((edit: were there TRA calls in 1.0.0 -- or is this a newer feature since then?))

I am running this on a de novo assembly and also observed that the majority of the TRA calls (if not all) strictly map to the ends of separate contigs (different contig pairs for each TRA call) -- i.e. potentially scaffolding them. Have you explored using this information for scaffolding de novo assemblies? If so, any thoughts?

Best,

John

--genotype does not output genotypes

I aligned Oxford Nanopore (ONP) samples using ngmlr with this command

ngmlr -t 4 -r <FASTA> -q <ONP_FASTQ> -o <out.sam> -x ont

I ran Sniffles with the following command

sniffles --genotype -m <sorted.bam> -s 5 -v <sniffles_sv.vcf>

The output VCF I get does not have genotypes. The sites appear as such

./.:.:13

I notice get only . for the DR format field as well.

To iterate, every site gets a ./. genotype and . DR value.

Running sniffles --version returns

sniffles  version: 1.0.3

Is genotyping currently supported for this version? Also can you run sniffles jointly with multiple samples? I tried sniffles -m <1.bam> <2.bam> .. and sniffles -m <1.bam> -m <2.bam> .. and got error messages.

Thanks for any help!

VCF issues

Hi,

thanks for sniffles, it appears to be nice, quick and flexible on my very low coverage ONT data.

I am having an issue with the VCF. Tabix claims it is not sorted. However vt sort will not read it since the header is lacking many fields. I am sure I can find a workaround to sort with, but this might be helpful.

bwa_sniffles$ vt sort BRCA2_dup_sniffles.vcf
[W::vcf_parse] contig 'chrM' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] INFO 'STD_quant_start' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'STD_quant_stop' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Kurtosis_quant_start' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Kurtosis_quant_stop' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SUPTYPE' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'STRANDS' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'STRANDS2' is not defined in the header, assuming Type=String
[W::vcf_parse] contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr3' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr4' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr5' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr6' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr7' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr8' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr9' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr10' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr11' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr12' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr13' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr14' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr15' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr16' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr17' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr18' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr19' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr20' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr21' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr22' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrX' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrUn_gl000220' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrUn_gl000225' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrY' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr4_gl000193_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr7_gl000195_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr9_gl000198_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr17_gl000204_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr17_gl000205_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr19_gl000208_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrUn_gl000219' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrUn_gl000224' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrUn_gl000230' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chrUn_gl000232' is not defined in the header. (Quick workaround: index the file with tabix.)
[vcf.c:1220 bcf_write] Unchecked error (2), exiting.

Colin

Run the same data twice?

I am just wondering if I run Sniffles twice with exactly the same data and settings, such as

${SNIFFLES} -m ${inputfile} -v ${outputfile} -t 20 --report_BND --genotype -s 5

Should the results be exactly the same?

Thanks.

Problem installing Sniffles on Mac OSX

Hi,
I tried installing Sniffles on my Mac, without success. I was wondering whether you could help me, please?

I have followed these steps:

wget https://github.com/fritzsedlazeck/Sniffles/archive/master.tar.gz -O Sniffles.tar.gz
tar xzvf Sniffles.tar.gz
cd Sniffles-master/
mkdir -p build/
cd build/

Then, for Mac, you suggest this: cmake -D CMAKE_C_COMPILER=/opt/local/bin/gcc-mp-4.7 -D CMAKE_CXX_COMPILER=/opt/local/bin/g++-mp-4.7 ..

However, I don't have a 'local' folder in 'opt', only a 'X11' folder. However, in my '/usr/bin
', I have 'gcc' and 'g++'. So I tried the following command:

cmake -D CMAKE_C_COMPILER=/usr/bin/gcc -D CMAKE_CXX_COMPILER=/usr/bin/g++ ..

Compiling the C compiler identification source file "CMakeCCompilerId.c" failed.
Compiler: /opt/local/bin/gcc-mp-4.7
Build flags:
Id flags: -c

The output was:
No such file or directory

Compiling the C compiler identification source file "CMakeCCompilerId.c" failed.
Compiler: /opt/local/bin/gcc-mp-4.7
Build flags:
Id flags: -Aa

The output was:
No such file or directory

Compiling the C compiler identification source file "CMakeCCompilerId.c" failed.
Compiler: /opt/local/bin/gcc-mp-4.7
Build flags:
Id flags: -D__CLASSIC_C__

The output was:
No such file or directory

etc.

It seems that I do need to get '/opt/local/bin/gcc-mp-4.7' and '/opt/local/bin/g++-mp-4.7'. Do you know how I could do this?

Thanks.

Best wishes,

Cristian.

Separate calling and re-genotyping of SVs

Hi, first of all, thanks for this great tool!

I wonder whether it would be possible to add a feature for allowing regenotyping of SVs?

One of the great features of the new Delly2 (for Illumina data SV detection) is that you can first call variants independently in all individuals of a family/cohort, then merge the site-list and regenotype in all individuals.

Also, this would make it easier to decide whether an SV is present in a child's parent beyond interval overlaps...

fritzsedlazeck / sniffles Goto Github PK

sniffles's Introduction

Sniffles2

Quick Start: Germline SV calling using Sniffles2

Installation

Requirements

Tested on:

Citation

Use-Cases / Modes

A. General (all Modes)

B. Multi-Sample SV Calling (Trios, Populations)

C. Mosaic SV Calling (Non-germline or somatic SVs)

D. Genotyping a known set of SVs (Force Calling)

Quick Tips

Input / Output

Companion apps

Supplementary tables

sniffles's People

Contributors

Stargazers

Watchers

Forkers

sniffles's Issues

Recommend Projects

Recommend Topics

Recommend Org