Code Monkey home page Code Monkey logo

vcfmerger2's Introduction

vcfMerger2

vcfMerger2 is a tool merger for 2 to N somatic variants vcf files.

What is vcfMerger2?

This tool allows merging from 2 to N somatic vcfs obtained after running different variant callers using the same bam file(s).

vcfMerger2 will generate one vcf.

flowchart

How vcfMerger2 operates?

vcfMerger2 takes RAW VCFs as inputs and outputs a Merged VCF. click here for the RAW VCF Definition

Somatic or Germline?

VcfMerger2 has been developped to merge SOMATIC vcfs into one vcf. The current release can merge from 2 to N vcfs.
To see the list variant callers compatible with vcfMerger2 go to "Variant_Callers_compatible" wiki page

Merging GERMLINE vcfs will be available in the next version; some features are already there but are still under development and should be consider as in Alpha version.

How to Install

See wiki Installation page

How to run vcfMerger2?

The core utility and main purpose of that tool is to MERGE VCFs information into one VCF without losing any data from any of the input vcfs files See QuickStart wiki page.

License

vcfMerger2 is under MIT licence.

vcfmerger2's People

Contributors

bryce-turner avatar christophelegendre avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

sfacista

vcfmerger2's Issues

TYPE is a key word for bcftools variant type and should be renamed VTYPE

It has been found that TYPE flag found in INFO field is a reserved word for bcftools which does not recognize the value "ins" as a valid value;

Should we rename the TYPE flag to VTYPE?

Note: renaming the header and the records with VTYPE works; It has been tested on UTJMM merged vcf file.

Same Phase but Non Consecutive Variants got collapsed

Some variants with the same phase (or PS value) got collapsed altogether even though the variants were not consecutives in positions;
This leads to the creation of wrong block subsitutions and break downstream analysis.

Wrong Terminology

--dict to provide .fai file

The .fai is a TABIX index, there is a .dict file, these are two different things. We should use the correct term so if you expect a .fai file as input the variable should be --tabix_index

Suggestion: allow users to add phASER as a path

Requiring phASER to be part of the $PATH makes it harder than it needs to be to integrate vcfMerger2 into snakemake (for example). An option to provide a path to the built local version pf phASER would be appreciated. Thanks for all your work.

Duplicate contig lines in merged header

We need to fix the multiple headers, maybe add to the prep to correct the one to match the others

##fileformat=VCFv4.2
##fileDate=20190502
##contig=<ID=HLA-A01:01:01:01,length=3503,assembly=GRCh38>
##contig=<ID=HLA-A
01:01:01:01,length=3503>
##contig=<ID=HLA-A01:01:01:02N,length=3291,assembly=GRCh38>
##contig=<ID=HLA-A
01:01:01:02N,length=3291>
##contig=<ID=HLA-A01:01:38L,length=3374,assembly=GRCh38>
##contig=<ID=HLA-A
01:01:38L,length=3374>
##contig=<ID=HLA-A01:02,length=3374,assembly=GRCh38>
##contig=<ID=HLA-A
01:02,length=3374>
##contig=<ID=HLA-A01:03,length=3503,assembly=GRCh38>
##contig=<ID=HLA-A
01:03,length=3503>
##contig=<ID=HLA-A01:04N,length=3136,assembly=GRCh38>
##contig=<ID=HLA-A
01:04N,length=3136>
##contig=<ID=HLA-A01:09,length=3105,assembly=GRCh38>
##contig=<ID=HLA-A
01:09,length=3105>
##contig=<ID=HLA-A01:11N,length=3374,assembly=GRCh38>
##contig=<ID=HLA-A
01:11N,length=3374>
##contig=<ID=HLA-A*01:14,length=3095,assembly=GRCh38>

Number of Genotype exceeds 3 in octopus

The number of Genotypes captured by the class Genotype in prep step for Octopus has been set to 3 and must be updated 9 in order to anticipate any genotype value up to 9 or more.

Introduce an ELSE statement to deal with edge cases

if [[ ${C} -eq 0 ]]
then
echo -e "No Hets in VCF, so No Phasing to perform; Skipping phASER" 1>&2
echo "Making the expected VCF filename as if phASER had run:"
echo "zcat \"${VCF_ORIGINAL_INPUT}\" \"${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}\" " 1>&2
zcat "${VCF_ORIGINAL_INPUT}" > "${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}"
echo "${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}" 2>&1
echo "Exiting $0 script without having phASER ran. ev = 0" 1>&2
exit 0
fi

Example of edge case:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NORMAL	TUMOR
chr14	50978766	.	G	T	.	PASS	SOMATIC;QSS=60;TQSS=1;NT=ref;QSS_NT=60;TQSS_NT=1;SGT=GG->GT;DP=115;MQ=59.97;MQ0=0;ReadPosRankSum=1.22;SNVSB=0;SomaticEVS=7.38	GT:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:AR:AD	0/0:21:0:0:0:0,0:0,0:21,21:0,0:0:21,0	0/1:94:6:0:0:0,0:0,0:81,87:7,7:0.0795:81,7
chr14	50978767	.	T	TG	.	PASS	SOMATIC;QSI=47;TQSI=1;NT=ref;QSI_NT=47;TQSI_NT=1;SGT=ref->het;MQ=59.96;MQ0=0;RU=G;RC=0;IC=1;IHP=18;SomaticEVS=6.74	GT:DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50:BCN50:AR:AD	0/0:19:19:20,20:0,0:0,0:25.1:0.58:0:0:0:20,0	0/1:90:90:71,73:14,14:9,7:100.19:4.93:0:0.04:0.1647:71,14

This is a SNV followed by an indel;
phASER excludes the indel and only the snv remains which is not part of a block anymore --> Therefore phASER fails.

Let's think about any other potential edge case that the phASER tool could be grumpy about and fails because it does not handle single lines after removing or not dealing with the indels anymore.

lancet issue when vcf NOT filtered by PASS

When VCF is not Filtered by PASS, the genotype for variants contains dots and not zeros or one such as:
GT:AD:SR:SA:DP .:0,0:0,0:0,0:0 1/1:0,5:0,0:4,1:5

Having a dot will raise an error while running script addFieldsForVcfMerger2:

Traceback (most recent call last):
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 391, in <module>
   v = process_records(tot_number_samples, v, col_tumor, col_normal)
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 366, in process_records
   return process_GTs(tot_number_samples, v, col_tumor, col_normal)
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 300, in process_GTs
   GTs[idxN] = get_GT_value_from_GT_value(GTOs[idxN])  ## we do not modify the GT field for the Normal sample
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 265, in get_GT_value_from_GT_value
   x = GenotypeInv(list(GT_value))
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 65, in __init__
   self.phased = bool(0) if li[1] == "/" else bool(1)
IndexError: list index out of range
ERROR: ${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py  FAILED ;
exit_value:1 ; Aborting!

The Genotype ./. or only . is not taken into account by the Genotype Class.

Filtering Occuring?

Is vcfMerger doing filtering in the pipeline?

UTJMM merged VCF header
##STRELKA2_SnpSiftCmd="SnpSift Filter '(GEN[ALL].DP > 10) & (GEN[0].AR < 0.02 & GEN[1].AR > 0.05)' temp/vcfmerger2/UTJMM_0008_3_PB_Whole_C1_KHSTL-UTJMM_0008_2_BM_CD138pos_T1_KHSTL/UTJMM_0008_3
_PB_Whole_C1_KHSTL-UTJMM_0008_2_BM_CD138pos_T1_KHSTL_bwa_strelka2.prepz.vcf"

Octopus AR precision

AR_tumor = round(float(AD_tumor/ADP_tumor), 2)
## Reformmating AD to expected VCF specs for that Reserved AD field, using the original AD and ADP values
AD_tumor = [ADP_tumor - AD_tumor, ADP_tumor]
else:
AR_tumor = int(-2)
AD_tumor = [0, ADP_tumor]
except ZeroDivisionError:
log.debug("division by zero!")
AD_tumor = [0, 0]
AR_tumor = float(0.00)
try:
if AD_normal != "." and AD_normal >=0:
AR_normal = round(float(AD_normal / ADP_normal), 2)

I would round to 3rd or 4th precision, right now a 2/44 = 0.04545 becomes 0.05 and then gets into merged vcf after filtering AR >= 0.05. But in this example it didn't meet the actual threshold but did because we rounded up to the 2nd precision level

VCF validation issues with VarDict, Mutect2, and Octopus fields

vcf-validator from VCFtools v0.1.13 found issues with three info fields within vcfMerger2 VCFs.

Issue 1:
##INFO=<ID=VARDICT_SOR,Number=1,Type=Float,Description="Odds ratio">
The issue with the line above is that the values for VARDICT_SOR seem to be all "inf" and not an actual float.
Possible Solution:
##INFO=<ID=VARDICT_SOR,Number=1,Type=String,Description="Odds ratio">.
Instead of a string type we could keep it as a float and convert all of the "inf" values to "1.0" or "0.99"

Issue 2:
##INFO=<ID=MUTECT2_AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
The issue with this line is that there are 3 string values and not just one stated by "Number=1". For example, one of the values is "79,126|29,50." Since commas are used as a delimiter to create a list of values, the total number of strings shown in the example is 3.
Possible Solution:
##INFO=<ID=MUTECT2_AS_SB_TABLE,Number=.,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">

Issue 3:
##INFO=<ID=OCTOPUS_AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele in the same order as listed">
The issue with this one is that the allele count sometimes has values for a secondary alt allele, however, there are no additional alt alleles shown in the ALT field.
Possible Solution:
##INFO=<ID=OCTOPUS_AC,Number=.,Type=Integer,Description="Allele count in genotypes for each ALT allele in the same order as listed">

Strelka2 prep BlockSub creation failing in complex tandem blocksub

This looks like an edge case that may be related to phaser itself or underlying code in this repository.
t
Scenario: Five consecutive calls from two independent but consecutive block substitutions followed by a single SNV

Sample in question: MMRF_2787_1_PB_WBC_C3_KHS5U-MMRF_2787_1_BM_CD138pos_T1_KHS5U.bwa.strelka2.pass.vcf.gz

PASS VCF CALLS:
chr2 88860919 . T A
chr2 88860920 . G A
chr2 88860921 . A G
chr2 88860922 . G C
chr2 88860923 . C T

Post PREP OUTPUT:
chr2 88860919 . TGC AAT (## The "C" is not a ref allele, this is the main issue)
chr2 88860921 . AG GC

Based on the image below the correct output should be:
chr2 88860919 . TG AA
chr2 88860921 . AG GC
chr2 88860923 . C T

It Actually looks like the resulting call that is causing the issue is trying to phase the blocksub and SNV, as they look phased which should have provided a call like
chr 88860919 . TGagC AAagT

its like we stripped out the two ref bases shown in lower case

complex_blocksub_issue

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.