Code Monkey home page Code Monkey logo

dipcall's Introduction

Getting Started

wget https://github.com/lh3/dipcall/releases/download/v0.3/dipcall-0.3_x64-linux.tar.bz2
tar -jxf dipcall-0.3_x64-linux.tar.bz2
# for female
dipcall.kit/run-dipcall prefix hs38.fa pat.fa.gz mat.fa.gz > prefix.mak
# or for male, requiring PAR regions in BED
# dipcall.kit/run-dipcall -x dipcall.kit/hs38.PAR.bed prefix hs38.fa pat.fa.gz mat.fa.gz > prefix.mak
make -j2 -f prefix.mak

Introduction

Dipcall is a reference-based variant calling pipeline for a pair of phased haplotype assemblies. It was originally developed for constructing the syndip benchmark dataset and has been applied to other phased assemblies, too. Dipcall can call small variants and long INDELs as long as they are contained in minimap2 alignment.

If you use dipcall, please cite

Li H, Bloom JM, Farjoun Y, Fleharty M, Gauthier L, Neale B, MacArthur D (2018) A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods, 15:595-597. [PMID:30013044]

Output

The final output of dipcall includes two files: prefix.dip.vcf.gz and prefix.dip.bed. A raw variant call is made in the VCF if a non-reference allele is observed in any alignments >=50kb and mapped with mapping quality >=5. In the VCF, dipcall sets a HET1 filter if parent1 is a heterozygous at the raw variant; dipcall sets a GAP1 filter if no >=50kb alignment from parent1 covers the variant. Note that if a call is covered by multiple >=50kb alignments in the same parent but the alignments all have the same allele, the call is not filtered in the VCF.

The BED file gives the confident regions. A base is included in the BED if 1) it is covered by one >=50kb alignment with mapQ>=5 from each parent and 2) it is not covered by other >=10kb alignments in each parent. Nearly all calls filtered in the VCF are excluded in the BED, except very rare edge cases. However, a fraction of unfiltered calls in the VCF may also be excluded by the BED. The BED file applies more stringent filter.

The above is applied to autosomes and female chrX. For a male sample, parent1 is assumed to be the father and parent2 the mother. Dipcall treats PARs the same way as autosomes. However, outside PARs, dipcall filters out chrX regions covered by father contigs and filters out chrY regions covered by mother contigs. To make proper calls on sex chromosomes, users should hard mask PARs on the reference chrY.

dipcall's People

Contributors

hannespetur avatar lh3 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

dipcall's Issues

Interpretation of bed files

$ cat prefix.dip.bed | awk 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
2823519412
$ cat prefix.hap1.bed | awk 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
2690214366
$ cat prefix.hap2.bed | awk 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
2817991873

As per the documentation: The prefix.dip.bed file gives the confident regions. A base is included in the BED if 1) it is covered by one >=50kb alignment with mapQ>=5 from each parent and 2) it is not covered by other >=10kb alignments in each parent. Based on this, shouldn't the length of intervals in prefix.dip.bed be lower than prefix.hap1.bed and prefix.hap2.bed, i.e., should prefix.dip.bed have been intersection of the two haplotype-specific bed files?

Please suggest what is the relationship between these three bed files.

How to extract the inversions from the dipcall results?

Dear @lh3 ,

After running dipcall, it will generate a file named ${pre}.dip.vcf.gz. All the genetic variants including SNVs, Indels, and SVs are all deposited in this VCF file. I know i can separate these variants according the length of the reference allele and alternative allele. However, i do not know to extract the inversions from this VCF file?

Looking forward to your reply, thanks.

Sincerely yours,
Zheng zhuqing

Using winnowmap

Attempted to run dipcall using winnowmap (by passing -W repetitive_k15.txt) and saw the following error in dipcall.hap1.paf.gz.log:

[M::mm_idx_gen::0.003*1.15] reading downweighted kmers
ERROR: input list of k-mers and winnowmap parameter k are inconsistent

My repetitive_kmer file is of the form:

CGGAATTGAATGGAA	18274
CTCAAAAACAAAAAA	4552
CGGAATTGAATGGAA	12469
...

Unless there is a way to pass a value for k to dipcall, what value of k does dipcall anticipate? I did not see it listed in the documentation, or in the error message. I know k=15 is the default for winnowmap, which is why I tried that first.

Can dipcall extend to the polyploidy plant genome ?

Hi, @lh3

Dipcall is designed for the diploidy, can I use the rule in the makefile for handling a plant tetraploidy genome?

# mapping
for hap in `seq 1 4`
do
    dipcall/dipcall.kit/minimap2 -c --paf-no-hit -xasm20 --cs -t16 DM4.chr.fasta query.${hap}.fa 2> query.${hap}.paf.gz.log | gzip > query.${hap}.paf.gz
    dipcall/dipcall.kit/minimap2 -a -xasm20 --cs -t16 ref.fasta query.${hap}.fa 2> query.${hap}.sam.gz.log | gzip > query.${hap}.sam.gz
    gzip -dc query.${hap}.paf.gz | sort -k6,6 -k8,8n | dipcall/dipcall.kit/k8 dipcall/dipcall.kit/paftools.js call - 2> ${hap}.${hap}.var.gz.vst | gzip > query.${hap}.
var.gz
    dipcall/dipcall.kit/k8 dipcall/dipcall.kit/dipcall-aux.js samflt query.${hap}.sam.gz | dipcall/dipcall.kit/samtools sort -m4G --threads 4 -o query.${hap}.bam -
    gzip -dc query.${hap}.var.gz | grep ^R | cut -f2- > query.${hap}.bed
done

# joint calling
dipcall/dipcall.kit/bedtk isec -m query.hap1.bed query.hap2.bed query.hap3.bed query.hap4.bed > query.dip.bed
dipcall/dipcall.kit/htsbox pileup -q5 -evcf ref.fasta query.hap1.bam query.hap2.bam query.hap3.bam query.hap4.bam | dipcall/dipcall.kit/htsbox bgzip > query.pair.vcf.gz

I do get a vcf with four haplotype variation. Would you mind give some advice for this usage? Is it invalid in any step?

image

Genotype in the vcf output

Hi, I have two calls at pos1 and pos2. The last columns are the following

pos1: . . GT:AD 1|0:1,1
pos2: . . GT:AD 1|0:1,1

The VAF of the call at pos1 from the read-based variant calling is 0.9; at pos2 -- 0.4.

Could you explain how to read the last column? I saw the header but was confused about why they had the same genotype but different zygosity.

Interpretation of dipcall vcf file

Dear @lh3,
Thank you for great tools.

I have phased human genome (using trio data) assembly generated from hifiasm. I have generated dipcall results by aligning hap1 and hap2 assemblies to CHM13 genome. Below are the results of filter column-

5134326 .
   5318 DIPY
 144669 GAP1
    548 GAP1;DIPY
  14048 GAP1;HET2
 159554 GAP2
  48065 HET1
  87672 HET1;GAP2
  17434 HET1;HET2
  34374 HET2

May I know how to interpret these results in terms of assembly quality?

Running dipcall for partially-phased assemblies?

Hi, I have a pair of partially-phased assemblies for a sample, generated with minimap2, and wonder if these can be used as input for dipcall.

I have run this program with these two assembly files, assuming randomly one as paternal and the other as maternal (since they are not really phased). Wonder if this is the reason why the resultant VCF file is empty.

run-dipcall  prefix  reference.fa   assembly.bp.hap1.c_tg.fa assembly.bp.hap2.c_tg.fa  > prefix.mak
make -j2 -f prefix.mak

I am assuming here that the samples is females as well, but it is not. Not sure if that could affect as well. What is the PAR.bed input file for males? cannot fin this in the documentation.

Thanks

Reducing alignment size

Hi

From the README:

"A raw variant call is made in the VCF if a non-reference allele is observed in any alignments >=50kb and mapped with mapping quality >=5."

Is it possible to have an option to reduce the alignment length like in paftools call -L?

Thanks

Interpretation of N's in the VCF

I'm seeing N's in my dipcall v0.3 vcf where there are not Ns in the reference.

The variants lie within the high confidence bedfile.

Any idea what could be going wrong?

tabix prefix.dip.vcf.gz chr1:62784-64408
chr1    62784   .       G       A       30      .       .       GT:AD   1|1:0,2
chr1    62915   .       G       A       30      .       .       GT:AD   1|1:0,2
chr1    63003   .       C       T       30      .       .       GT:AD   1|1:0,2
chr1    63015   .       A       AAAT    30      .       .       GT:AD   1|1:0,2
chr1    64401   .       N       A       30      .       .       GT:AD   1|1:0,2
chr1    64402   .       N       C       30      .       .       GT:AD   1|1:0,2
chr1    64403   .       N       C       30      .       .       GT:AD   1|1:0,2
chr1    64404   .       N       A       30      .       .       GT:AD   1|1:0,2
chr1    64405   .       N       A       30      .       .       GT:AD   1|1:0,2
chr1    64406   .       N       T       30      .       .       GT:AD   1|1:0,2
chr1    64407   .       N       G       30      .       .       GT:AD   1|1:0,2
chr1    64408   .       N       C       30      .       .       GT:AD   1|1:0,2

samtools faidx ref.fa.gz chr1:62784-64408
>chr1:62784-64408
GACTGTTACAGGTTCCAGCAGGATAACTGGGTGGAAATGAGTTTGGTTTCACTTAGTCTC
TCTAAAGAGAAAGCAAGTTGGTAGACTAATACCTAATAAAAGCAAAGCTGCCAACAATTG
AAATTGCCTGGGCTGCTCTGTGTGTCCCACATGCATGGGTGTGGGTGCCAGTGTGTGTGC
GTGTGTGCATGCATGTGCATGTGTGTTGGGATAGAGTGGCAAGAAAATGGGAAATAAGAA
TGTTCAGTCCATAGCCCTTCATTATAAAAAGGTGAGCTGTAATAAATACTAGTGCCACAT
TTAGCCAAAACTTTACTCCAGCCAAAGGTGATATTTTCATGATAACATCCTGTGATTGCT
TTGTTCTTCGTCTTTTATGTTCTTCCTAGATGGGCTCAGAACATACAAGAATTAAGTACA
CATCTTATTTTCCAGTGATAATGCTACCGGCAAATTCTGTTGTTTGTATAAACATCAGCC
ATGTTTATATAACTAAACTAGTGTTTTGTTTTGTCAATTCAGCAAGAAATTAGACCACAT
GGTGGCTTAATGCTGCATTGATTTGGCTATCAATTTGTTTTCACTTTTCTGCAAAATATT
TAATACATTATTAAATTGAATTATGCTGATGCCACAGTTGTTCTTATCTCAATTGTCTTA
AAATTCATTTAATTTTTTTTTCCTTTGGTTTCATTATTCAAATTTTAACTTCAGTTCTCA
ACATTTTATCTGATGGAAGAGATGGAGTCCATTACTAAGGACTCCATTGTGCTCCATCAT
GCCAGAGTTGTAAAATAGATCTTTTAAAGGAAATTTACTGTGATTTTTTTCTATTTAAGA
GCTTCCTCTCCAGTTGAGCATGTAAGAAAATTATACCAGGAGAATACAGTAAACTCTATG
AGGCAAGCTATAAACATGTAGCATTGTGATTAGGGCTGGTTCTCCTTCTAGAGACATGGT
AGGATTGCAATTTCATACCATCCTTGAAGTTAGAGAGAGCCACGTGACTCATTTAGCCAA
TGAACTGTGAGCAGAATGACATGTCACTTCCAGCAGAAGCTTTAAGAATCTGAGAGACAT
TCATACGTTTTCCATGTGCTGTAGCCTTATACCCAAAGCCTGGGTCCCAAGTGACCATGA
CAGGCAGAGCTCCCTGTTGAGCCACAGAGATTTAGAGAATGGCTGTTAACACAGCATAAT
CCAGCCCATCCTGACTAATCTGATATTAACATGTATAATAAAGAATTCTATCAATGCTGA
GGGAAGATGATTAGTTAAGGTCCTAGGTTGCAAGTCTCAAAACCTCTTCTAAGGATTGTA
GACAGGAAATTAAATGACTTCTAGTCCCTAGAGTTCCCAATCTCCTACCATCCCATCCTA
ATATGACAGAAGTAATTCCTGAGTTGCTTCTGAAACCAGAGCTTCCCTCAGAACCCTTAG
CCTGCCAGATGGCTTCTTGGAGAGCCCTCACTCACTTTTCTCCTTCTGCTATTGCTGCTC
ATTCATTCCAGCTTTTAAAAATTCATCTTTATCCAGGAACCTCGCTTCTAGAAAAGTCAT
ACAGGTGCTTCCAGGAGGCTACATGGGCACCCATATTTTTCTAGCCACTTTCATTAGACC
AATGC

printf "chr1\t62784\t64408\n" | bedtools intersect -a prefix.dip.bed -wa -b -
chr1    2810    216558

Looking in IGV, the transition to the dense block of variants with ref Ns is not due to the start of a new alignment - and the variants with N as reference are not supported by the haplotype specific assembly to ref alignments.

igv_screenshot_dipcall

Interpretation of empty prefix.dip.vcf.gz

The final output of dipcall includes two files: prefix.dip.vcf.gz and prefix.dip.bed. A raw variant call is made in the VCF if a non-reference allele is observed in any alignments >=50kb and mapped with mapping quality >=5.

So if prefix.dip.vcf.gz is empty (has no variants), then given the example below,

/genetics/elbers/dipcall.kit/run-dipcall -t 75 prefix miniasm.corr.ratatosk.035.fasta \
GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz \
GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype1.fasta.gz > miniasm.corr.ratatosk.035.mak

miniasm.corr.ratatosk.035.fasta does not have non-reference alleles in the high confidences areas of prefix.dip.bed?

Possible error outputting complex variants

As we've been working towards assembly-based benchmarks for GIAB, I think we've encountered an issue with dipcall a number of times now. In particular, when the alignment in the bam file correctly has an insertion immediately before a deletion (representing a complex variant), the dip.vcf file will sometimes have only the insertion but not the deletion. An example of this is at chrX:69,487,386-69,487,425 on GRCh38 in this image from the HPRC assembly of HG002 with dipcallv0.3 (assembly, bams, and vcf at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/HPRC-HG002.cur.20211005/). There are similar issues at chrX:27,361,826-27,361,865, chrX:40,208,433-40,208,472, and chrX:69,404,760-69,404,799. Let me know if you have any questions, and thank you for developing this really useful tool!
Screen Shot 2022-03-02 at 12 18 50 PM

License information missing

Hi again, @lh3!

In the same vein as lh3/unimap#5 :
I'd like to package up dipcall for Bioconda, but I couldn't find any license information in the repository.
Would it be possible for you to add a license file here?

Cheers,
Marcel

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.