tgen / vcfmerger2 Goto Github PK

View Code? Open in Web Editor NEW

5.0 7.0 1.0 15.34 MB

Dynamic vcfMerger for 2 to N somatic variants vcf files

License: Other

Python 80.10% Shell 19.90%

somatic-vcfs variant-callers vcfs

vcfmerger2's Introduction

vcfMerger2

vcfMerger2 is a tool merger for 2 to N somatic variants vcf files.

What is vcfMerger2?

This tool allows merging from 2 to N somatic vcfs obtained after running different variant callers using the same bam file(s).

vcfMerger2 will generate one vcf.

How vcfMerger2 operates?

vcfMerger2 takes RAW VCFs as inputs and outputs a Merged VCF. click here for the RAW VCF Definition

Somatic or Germline?

VcfMerger2 has been developped to merge SOMATIC vcfs into one vcf. The current release can merge from 2 to N vcfs.
To see the list variant callers compatible with vcfMerger2 go to "Variant_Callers_compatible" wiki page

Merging GERMLINE vcfs will be available in the next version; some features are already there but are still under development and should be consider as in Alpha version.

How to Install

See wiki Installation page

How to run vcfMerger2?

The core utility and main purpose of that tool is to MERGE VCFs information into one VCF without losing any data from any of the input vcfs files See QuickStart wiki page.

License

vcfMerger2 is under MIT licence.

vcfmerger2's People

Contributors

Stargazers

Watchers

Forkers

sfacista

vcfmerger2's Issues

Main script cannot be executable without a shebang

The main script here https://github.com/tgen/vcfMerger2/blob/master/bin/vcfMerger2.py will not work unless there is a shebang at the top. It's already executable, but without the right shebang, shells will just try to execute it as a shell script. Add this line to the top of the file:

#!/usr/bin/env python3

TYPE is a key word for bcftools variant type and should be renamed VTYPE

It has been found that TYPE flag found in INFO field is a reserved word for bcftools which does not recognize the value "ins" as a valid value;

Should we rename the TYPE flag to VTYPE?

Note: renaming the header and the records with VTYPE works; It has been tested on UTJMM merged vcf file.

Same Phase but Non Consecutive Variants got collapsed

Some variants with the same phase (or PS value) got collapsed altogether even though the variants were not consecutives in positions;
This leads to the creation of wrong block subsitutions and break downstream analysis.

Wrong Terminology

--dict to provide .fai file

The .fai is a TABIX index, there is a .dict file, these are two different things. We should use the correct term so if you expect a .fai file as input the variable should be --tabix_index

Octopus AD value are single number separated by commas.

0|1:61:0:6,8,6,8 0|0:65:0.12:7,7,8,8 is the issue; should be: 0|1:61:0:68,68 0|0:65:0.12:77,88

Suggestion: allow users to add phASER as a path

Requiring phASER to be part of the $PATH makes it harder than it needs to be to integrate vcfMerger2 into snakemake (for example). An option to provide a path to the built local version pf phASER would be appreciated. Thanks for all your work.

vcfMerger could use Pisces module

We are using http://uswest.ensembl.org/Canis_familiaris/Info/Index for a reference genome. Unfortunately, all we have left is normalized VCF. I have to go back in a re-run to generate raw VCF's. All I have left right now is the compressed/ normalized VCF's. I will attach it as soon as I acquire it.

Duplicate contig lines in merged header

We need to fix the multiple headers, maybe add to the prep to correct the one to match the others

##fileformat=VCFv4.2
##fileDate=20190502
##contig=<ID=HLA-A01:01:01:01,length=3503,assembly=GRCh38>
##contig=<ID=HLA-A01:01:01:01,length=3503>
##contig=<ID=HLA-A01:01:01:02N,length=3291,assembly=GRCh38>
##contig=<ID=HLA-A01:01:01:02N,length=3291>
##contig=<ID=HLA-A01:01:38L,length=3374,assembly=GRCh38>
##contig=<ID=HLA-A01:01:38L,length=3374>
##contig=<ID=HLA-A01:02,length=3374,assembly=GRCh38>
##contig=<ID=HLA-A01:02,length=3374>
##contig=<ID=HLA-A01:03,length=3503,assembly=GRCh38>
##contig=<ID=HLA-A01:03,length=3503>
##contig=<ID=HLA-A01:04N,length=3136,assembly=GRCh38>
##contig=<ID=HLA-A01:04N,length=3136>
##contig=<ID=HLA-A01:09,length=3105,assembly=GRCh38>
##contig=<ID=HLA-A01:09,length=3105>
##contig=<ID=HLA-A01:11N,length=3374,assembly=GRCh38>
##contig=<ID=HLA-A01:11N,length=3374>
##contig=<ID=HLA-A*01:14,length=3095,assembly=GRCh38>

ERROR: You must provide the Tumor BAM files to the option --lbams ; Aborting.

I am getting the following error, however there is no option "--lbams" based on the options list.
ERROR: You must provide the Tumor BAM files to the option --lbams ; Aborting.

modify title in Venn image

Need to add a way of customizing the Title in the Venn image

changing directory within your script might limit the ability of using relative paths as input.

vcfMerger2/prep_vcfs_somatic/prep_vcf_somatic_functions.sh

Line 611 in 1b2f4d4

cd ${DIR_OUTPUT}

Do we need cd in the vcfmerger_prep?

sorting contigs same way in header and data

The contigs must be ordered the same way in the header of the merged vcf and in the data section.

Number of Genotype exceeds 3 in octopus

The number of Genotypes captured by the class Genotype in prep step for Octopus has been set to 3 and must be updated 9 in order to anticipate any genotype value up to 9 or more.

Introduce an ELSE statement to deal with edge cases

vcfMerger2/prep_vcfs_somatic/strelka2/strelka2.phasing_consecutives_variants_as_blocs.sh

Lines 197 to 206 in 8ed9a05

    
           if [[ ${C} -eq 0 ]] 
        
           then 
        
             echo -e "No Hets in VCF, so No Phasing to perform; Skipping phASER" 1>&2 
        
             echo "Making the expected VCF filename as if phASER had run:" 
        
             echo "zcat \"${VCF_ORIGINAL_INPUT}\"  \"${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}\" " 1>&2 
        
             zcat "${VCF_ORIGINAL_INPUT}" > "${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}" 
        
             echo "${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}" 2>&1 
        
             echo "Exiting $0 script without having phASER ran. ev = 0" 1>&2 
        
             exit 0 
        
           fi

Example of edge case:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NORMAL	TUMOR
chr14	50978766	.	G	T	.	PASS	SOMATIC;QSS=60;TQSS=1;NT=ref;QSS_NT=60;TQSS_NT=1;SGT=GG->GT;DP=115;MQ=59.97;MQ0=0;ReadPosRankSum=1.22;SNVSB=0;SomaticEVS=7.38	GT:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:AR:AD	0/0:21:0:0:0:0,0:0,0:21,21:0,0:0:21,0	0/1:94:6:0:0:0,0:0,0:81,87:7,7:0.0795:81,7
chr14	50978767	.	T	TG	.	PASS	SOMATIC;QSI=47;TQSI=1;NT=ref;QSI_NT=47;TQSI_NT=1;SGT=ref->het;MQ=59.96;MQ0=0;RU=G;RC=0;IC=1;IHP=18;SomaticEVS=6.74	GT:DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50:BCN50:AR:AD	0/0:19:19:20,20:0,0:0,0:25.1:0.58:0:0:0:20,0	0/1:90:90:71,73:14,14:9,7:100.19:4.93:0:0.04:0.1647:71,14

This is a SNV followed by an indel;
phASER excludes the indel and only the snv remains which is not part of a block anymore --> Therefore phASER fails.

Let's think about any other potential edge case that the phASER tool could be grumpy about and fails because it does not handle single lines after removing or not dealing with the indels anymore.

lancet issue when vcf NOT filtered by PASS

When VCF is not Filtered by PASS, the genotype for variants contains dots and not zeros or one such as:
GT:AD:SR:SA:DP .:0,0:0,0:0,0:0 1/1:0,5:0,0:4,1:5

Having a dot will raise an error while running script addFieldsForVcfMerger2:

Traceback (most recent call last):
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 391, in <module>
   v = process_records(tot_number_samples, v, col_tumor, col_normal)
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 366, in process_records
   return process_GTs(tot_number_samples, v, col_tumor, col_normal)
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 300, in process_GTs
   GTs[idxN] = get_GT_value_from_GT_value(GTOs[idxN])  ## we do not modify the GT field for the Normal sample
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 265, in get_GT_value_from_GT_value
   x = GenotypeInv(list(GT_value))
 File "${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py", line 65, in __init__
   self.phased = bool(0) if li[1] == "/" else bool(1)
IndexError: list index out of range
ERROR: ${HOME}/gits/vm2_0.7.7/prep_vcfs_somatic/lancet/lancet.somatic.addFieldsForVcfMerger.py  FAILED ;
exit_value:1 ; Aborting!

The Genotype ./. or only . is not taken into account by the Genotype Class.

Filtering Occuring?

Is vcfMerger doing filtering in the pipeline?

UTJMM merged VCF header
##STRELKA2_SnpSiftCmd="SnpSift Filter '(GEN[ALL].DP > 10) & (GEN[0].AR < 0.02 & GEN[1].AR > 0.05)' temp/vcfmerger2/UTJMM_0008_3_PB_Whole_C1_KHSTL-UTJMM_0008_2_BM_CD138pos_T1_KHSTL/UTJMM_0008_3
_PB_Whole_C1_KHSTL-UTJMM_0008_2_BM_CD138pos_T1_KHSTL_bwa_strelka2.prepz.vcf"

typo or wrong annotation

vcfMerger2/prep_vcfs_somatic/strelka2/strelka2.phasing_consecutives_variants_as_blocs.sh

Line 137 in 8ed9a05

    
           echo -e "removing ALT-ALT homozygous from subByConsPos VCF ...  using bcftools ..."

replace homozygous with heterozygous on that line.

Octopus AR precision

vcfMerger2/prep_vcfs_somatic/octopus/octopus.somatic.addFieldsForVcfMerger.py

Lines 398 to 410 in 20676d9

    
           		AR_tumor = round(float(AD_tumor/ADP_tumor), 2) 
        
           		## Reformmating AD to expected VCF specs for that Reserved AD field, using the original AD and ADP values 
        
           		AD_tumor = [ADP_tumor - AD_tumor, ADP_tumor] 
        
           	else: 
        
           		AR_tumor = int(-2) 
        
           		AD_tumor = [0, ADP_tumor] 
        
           except ZeroDivisionError: 
        
           	log.debug("division by zero!") 
        
           	AD_tumor = [0, 0] 
        
           	AR_tumor = float(0.00) 
        
           try: 
        
           	if AD_normal != "." and AD_normal >=0: 
        
           		AR_normal = round(float(AD_normal / ADP_normal), 2)

I would round to 3rd or 4th precision, right now a 2/44 = 0.04545 becomes 0.05 and then gets into merged vcf after filtering AR >= 0.05. But in this example it didn't meet the actual threshold but did because we rounded up to the 2nd precision level

VCF validation issues with VarDict, Mutect2, and Octopus fields

vcf-validator from VCFtools v0.1.13 found issues with three info fields within vcfMerger2 VCFs.

Issue 1:
##INFO=<ID=VARDICT_SOR,Number=1,Type=Float,Description="Odds ratio">
The issue with the line above is that the values for VARDICT_SOR seem to be all "inf" and not an actual float.
Possible Solution:
##INFO=<ID=VARDICT_SOR,Number=1,Type=String,Description="Odds ratio">.
Instead of a string type we could keep it as a float and convert all of the "inf" values to "1.0" or "0.99"

Issue 2:
##INFO=<ID=MUTECT2_AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
The issue with this line is that there are 3 string values and not just one stated by "Number=1". For example, one of the values is "79,126|29,50." Since commas are used as a delimiter to create a list of values, the total number of strings shown in the example is 3.
Possible Solution:
##INFO=<ID=MUTECT2_AS_SB_TABLE,Number=.,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">

Issue 3:
##INFO=<ID=OCTOPUS_AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele in the same order as listed">
The issue with this one is that the allele count sometimes has values for a secondary alt allele, however, there are no additional alt alleles shown in the ALT field.
Possible Solution:
##INFO=<ID=OCTOPUS_AC,Number=.,Type=Integer,Description="Allele count in genotypes for each ALT allele in the same order as listed">

Strelka2 prep BlockSub creation failing in complex tandem blocksub

This looks like an edge case that may be related to phaser itself or underlying code in this repository.
t
Scenario: Five consecutive calls from two independent but consecutive block substitutions followed by a single SNV

Sample in question: MMRF_2787_1_PB_WBC_C3_KHS5U-MMRF_2787_1_BM_CD138pos_T1_KHS5U.bwa.strelka2.pass.vcf.gz

PASS VCF CALLS:
chr2 88860919 . T A
chr2 88860920 . G A
chr2 88860921 . A G
chr2 88860922 . G C
chr2 88860923 . C T

Post PREP OUTPUT:
chr2 88860919 . TGC AAT (## The "C" is not a ref allele, this is the main issue)
chr2 88860921 . AG GC

Based on the image below the correct output should be:
chr2 88860919 . TG AA
chr2 88860921 . AG GC
chr2 88860923 . C T

It Actually looks like the resulting call that is causing the issue is trying to phase the blocksub and SNV, as they look phased which should have provided a call like
chr 88860919 . TGagC AAagT

its like we stripped out the two ref bases shown in lower case

	if [[ ${C} -eq 0 ]]
	then
	echo -e "No Hets in VCF, so No Phasing to perform; Skipping phASER" 1>&2
	echo "Making the expected VCF filename as if phASER had run:"
	echo "zcat \"${VCF_ORIGINAL_INPUT}\" \"${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}\" " 1>&2
	zcat "${VCF_ORIGINAL_INPUT}" > "${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}"
	echo "${VCF_ORIGINAL_INPUT/.vcf.gz/.blocs.vcf}" 2>&1
	echo "Exiting $0 script without having phASER ran. ev = 0" 1>&2
	exit 0
	fi

	AR_tumor = round(float(AD_tumor/ADP_tumor), 2)
	## Reformmating AD to expected VCF specs for that Reserved AD field, using the original AD and ADP values
	AD_tumor = [ADP_tumor - AD_tumor, ADP_tumor]
	else:
	AR_tumor = int(-2)
	AD_tumor = [0, ADP_tumor]
	except ZeroDivisionError:
	log.debug("division by zero!")
	AD_tumor = [0, 0]
	AR_tumor = float(0.00)
	try:
	if AD_normal != "." and AD_normal >=0:
	AR_normal = round(float(AD_normal / ADP_normal), 2)