<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clip

Parse pandora VCF to detect minor alleles,about mbhall88/drprg

iqbal-lab commented on July 20, 2024 1

Yes, and just to keep our minds clean, we detect a minor allele (the genotype is major-REF plus minor-ALT), but this leads to a phenotype of R

from drprg.

iqbal-lab commented on July 20, 2024 1

"This position seems to always have depth across the various alleles."

Could also be a paralogous kmer from elsewhere?

from drprg.

mbhall88 commented on July 20, 2024

First eyeball of the results of this feature are that it is way too permissive - there's a lot of minor alleles in each VCF. I'm working on adding in a second layer whereby the the minor allele must also have a GAPS value less than some threshold. Eyeballing some of the "true" minor alleles it seems they all have GAPS < 0.4 (most are 0.0)

from drprg.

iqbal-lab commented on July 20, 2024

Are you looking at illumina or nanopore data?

from drprg.

mbhall88 commented on July 20, 2024

Illumina. I'm not going to both trying to do minor alleles on nanopore (I don't think there's any minor alleles in the H2H dataset anyway).

from drprg.

iqbal-lab commented on July 20, 2024

I'm surprised you're getting too many calls with illumina and a 10% threshold

from drprg.

mbhall88 commented on July 20, 2024

Here are the Illumina results after running with the new implementation of minor allele calling (I chaged the markers so that you can see the error bars better). This is detecting a fraction of 0.2 with a maximum GAPS value of 0.4

The first result that sticks out here is the ETO specificity. We basically call all but 4 samples as resistant (out of ~7000 isolates with phenotype for ETO!!)

I obviously didn't look at all 7000 samples, but all of the samples I did look at have a minor resistance call for mutation ethA ACGGTTTAGCGC1479A. The interesting thing is that the proportion of depth on the ref and alt allele is almost exactly the same for all the samples I looked at (e.g, 0.59/0.41 ref/alt). However, the GAPS for all of the samples I looked at was 0.0/0.333 - we have the max. GAPS threshold set at 0.4. I will try lowering it to 0.3 and see what that looks like.

from drprg.

iqbal-lab commented on July 20, 2024

Tb-profiler does manage to have slightly higher sensitivity for nearly all drugs also

from drprg.

iqbal-lab commented on July 20, 2024

Oh that might be because they use threshold 0.1 and we use 0.2

from drprg.

mbhall88 commented on July 20, 2024

Yeah possibly. I started with 0.2 and was thinking to finetune and then test out 0.1

from drprg.

mbhall88 commented on July 20, 2024

Here are the results of changing the max. GAPS to 0.3

It fixes the ETO specificity issue above.

from drprg.

mbhall88 commented on July 20, 2024

I did some digging into the fluoroquinolone (FQ) FNs as minor alleles are super important for these drugs.

There are two issues that have popped up in most of the cases.

When the minor allele's coverage is close to the major allele - i.e. fractions >0.4 for both alleles - pandora calls a null genotype. I saw this quite a bit (and also saw it a little for INH). When I switch to using local genotyping in pandora though it fixes these null calls. Given that, what is happening here is the maximum likelihood path is choosing on allele but genotyping is picking another, so we nullify the call. Does anyone forsee issues with switching to local genotyping?
There is a silent mutation gyrA S95T (which is susceptible) that kills an assumption I make when doing minor allele calling. My assumption is that if we have an ALT call, then skip checking for a minor allele. Which is fine in nearly all cases as ALTs generally mean resistance; but this allele doesn't mean resistance. So we call an alt, which is just the reference allele with this silent mutation and therefore don't look for minor alleles (which are there and in the same VCF position/site). Here is an example of this site

gyrA    380     a68d4d07        GACAG   AACAC,CACAC,GACAC,GCCAC,GGCAC,TACAC,TACAG       .       PASS    VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0,0,0,0.82392,0.17608,0,0,0;VARID=gyrA_D94G,gyrA_D94C,gyrA_D94A,gyrA_D94N,gyrA_D94H,gyrA_D94Y,gyrA_S95T;PREDICT=.,.,.,.,.,.,U GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      3:0,0,0,124,27,0,0,0:0,0,0,124,26,0,0,0:0,0,0,124,27,0,0,0:0,0,0,124,26,0,0,0:0,0,0,248,55,0,0,0:0,0,0,249,53,0,0,0:1,1,1,0,0,1,1,1:-1862.16,-1862.16,-1862.16,-247.957,-1250.38,-1862.16,-1862.16,-1862.16:1002.43

The silent mutation is allele 3.

So I think I will need to change some of my assumptions as this pops up a fair bit

from drprg.

iqbal-lab commented on July 20, 2024

When Pandora makes a null call, does it still report depths? If yes, you can just look at the depth and make your own decisioj/call? Local genotyping....I'm not sure.
DrPrg could just ignore Pandora genotyping entirely, and do it's own using depth on alleles.

Find the R allele with highest coverage
Find the S with highest coverage
If the R allele has >0.1 * the sum of these, you have an R call

from drprg.

mbhall88 commented on July 20, 2024

When Pandora makes a null call, does it still report depths? If yes, you can just look at the depth and make your own decisioj/call? Local genotyping....I'm not sure.

Yes, it does.

When I switch to using local genotyping, here is a table fo the changes in FNs and FPs

Tool	Drug	ΔFN	ΔFP
drprg	Amikacin	0	0
drprg	Capreomycin	0	0
drprg	Delamanid	0	0
drprg	Ethambutol	0	-6
drprg	Ethionamide	-5	2
drprg	Isoniazid	-4	0
drprg	Kanamycin	0	0
drprg	Levofloxacin	-3	1
drprg	Linezolid	0	0
drprg	Moxifloxacin	-5	0
drprg	Ofloxacin	-1	0
drprg	Pyrazinamide	-2	0
drprg	Rifampicin	-2	0
drprg	Streptomycin	1	0

e.g. A value of -2 means 2 less FN/FP

This seems like a strong case for local genotyping

DrPrg could just ignore Pandora genotyping entirely, and do it's own using depth on alleles.

This seems unnecessary given the above I think?

Find the R allele with highest coverage Find the S with highest coverage If the R allele has >0.1 * the sum of these, you have an R call

This is almost what I do currently, but I just make some naive assumptions about REF and ALT alleles which I will fix and that wiill make it effectively the same as you suggest

from drprg.

iqbal-lab commented on July 20, 2024

Wow those are great results!

from drprg.

leoisl commented on July 20, 2024

When the minor allele's coverage is close to the major allele - i.e. fractions >0.4 for both alleles - pandora calls a null genotype. I saw this quite a bit (and also saw it a little for INH). When I switch to using local genotyping in pandora though it fixes these null calls. Given that, what is happening here is the maximum likelihood path is choosing on allele but genotyping is picking another, so we nullify the call. Does anyone forsee issues with switching to local genotyping?

Not really, we just switched away from local genotyping because we got worst results in the 20-way dataset. I remember checking a few calls manually, but not as much and not as deep as you are doing here, we were mostly driven by the final results. Would it be possible to have both VCFs from a sample (one with global and the other with local genotyping) and some calls where the local genotyping is better? I'd need that to be able to understand better what is happening. And if it is not an issue, the pandora compare output dir and command would be great too

from drprg.

iqbal-lab commented on July 20, 2024

my guess is michael is just going to go with local genotyping. I think there are two kinds of thing that can happen with local genotyping

if you have a complicated graph (which happens in e coli but not in tb) you can have two diverged arms of a bubble, both with nested snps, and you can end up with global genotyping (=dynamic prog) chooses differently to local. You need to have a process to look at the local genotype calls and make sure they are actually compatible with each other. I think this just wont happen in michael's use case.
Michael is dealiin with explicitly non-haploid things (minor alleles), whereas global genotyping is very vey haploid, and will N things.

from drprg.

mbhall88 commented on July 20, 2024

Would it be possible to have both VCFs from a sample (one with global and the other with local genotyping) and some calls where the local genotyping is better? I'd need that to be able to understand better what is happening.

I don't have both on hand as they have been overridden with the local genotyping one. But could easily regenerate a pre-local one if you want. But what happened was when I look at the pandora consensus VCF the genotype was 0 and then genotyping VCF was trying to call GT 1, so pandora nullifies the call as a result.

And if it is not an issue, the pandora compare output dir and command would be great too

I don't use compare, just discover and map.

I suspect Zam's first point above is the likely situation

from drprg.

leoisl commented on July 20, 2024

I don't use compare, just discover and map.

Fine, can be discover/map output dirs

This example means it was not a correct decision to deprecate the local genotyping for users, and it has its uses.

But what happened was when I look at the pandora consensus VCF the genotype was 0 and then genotyping VCF was trying to call GT 1, so pandora nullifies the call as a result.

This is one example I want to understand better why we have such a divergence, and if we have a better solution than simply nullifying the call. We need to revisit pandora genotyping and improve...

from drprg.

leoisl commented on July 20, 2024

Could you rerun drprg using the most recent make_prg (0.4.0) and pandora (0.10.0-alpha.0) versions to check if results improve in any way?

from drprg.

mbhall88 commented on July 20, 2024

Could you rerun drprg using the most recent make_prg (0.4.0) and pandora (0.10.0-alpha.0) versions to check if results improve in any way?

Yep, I will do this one at a time and check the change once this PR is complete

from drprg.

mbhall88 commented on July 20, 2024

Alright, so after implementing the fix for point 2 here, the diff between the results before minor allele feature is

Tool	Drug	ΔFN	ΔFP
drprg	Amikacin	-8	7
drprg	Capreomycin	-5	3
drprg	Delamanid	0	0
drprg	Ethambutol	-9	6
drprg	Ethionamide	-12	22
drprg	Isoniazid	-35	6
drprg	Kanamycin	-13	6
drprg	Levofloxacin	-21	4
drprg	Linezolid	0	0
drprg	Moxifloxacin	-16	13
drprg	Ofloxacin	-5	1
drprg	Pyrazinamide	-2	3
drprg	Rifampicin	-4	0
drprg	Streptomycin	-9	2

The next two things to check before finalising this feature are using a minor allele fraction of 0.1 (results above were for 0.2) and removing the FRS filter

from drprg.

iqbal-lab commented on July 20, 2024

But wait , why suddenly so many more FPs?? Will only get worse if you lower to 0.1 and remove FRS?

from drprg.

mbhall88 commented on July 20, 2024

More FPs because we are doing minor allele calling now.... This diff is with the results before I implemented minor allele calling. The diff before this was an earlier minor allele implementation before/after using local genotyping

from drprg.

iqbal-lab commented on July 20, 2024

Right but tb-profiler is also doing this without those FPs ? My expectation was that it would work well for fluoroquinolones at least (some precedent in mykrone paper and the new biorxiv from Alice brankin and Phil Fowler). Obviously those two aren't on this dataset tho

from drprg.

mbhall88 commented on July 20, 2024

Right but tb-profiler is also doing this without those FPs ?

Not necessarily... These results are purely drprg FNs and FPs. We actually have less FPs than tb-profiler for all FQs at the moment

Drug	Tool	FN(R)	FP(S)	Sensitivity (95% CI)	Specificity (95% CI)
Amikacin	drprg	69(485)	57(6958)	85.8% (82.4-88.6%)	99.2% (98.9-99.4%)
Amikacin	mykrobe	93(485)	51(6958)	80.8% (77.1-84.1%)	99.3% (99.0-99.4%)
Amikacin	tbprofiler	62(485)	59(6958)	87.2% (83.9-89.9%)	99.2% (98.9-99.3%)
Capreomycin	drprg	57(235)	95(2449)	75.7% (69.9-80.8%)	96.1% (95.3-96.8%)
Capreomycin	mykrobe	72(235)	88(2449)	69.4% (63.2-74.9%)	96.4% (95.6-97.1%)
Capreomycin	tbprofiler	54(235)	96(2449)	77.0% (71.2-81.9%)	96.1% (95.2-96.8%)
Delamanid	drprg	111(116)	1(8152)	4.3% (1.9-9.7%)	100.0% (99.9-100.0%)
Delamanid	mykrobe	111(116)	2(8152)	4.3% (1.9-9.7%)	100.0% (99.9-100.0%)
Delamanid	tbprofiler	111(116)	2(8152)	4.3% (1.9-9.7%)	100.0% (99.9-100.0%)
Ethambutol	drprg	137(1538)	742(4936)	91.1% (89.6-92.4%)	85.0% (83.9-85.9%)
Ethambutol	mykrobe	133(1538)	747(4936)	91.4% (89.8-92.7%)	84.9% (83.8-85.8%)
Ethambutol	tbprofiler	118(1538)	765(4936)	92.3% (90.9-93.6%)	84.5% (83.5-85.5%)
Ethionamide	drprg	329(1104)	394(6105)	70.2% (67.4-72.8%)	93.5% (92.9-94.1%)
Ethionamide	mykrobe	265(1104)	413(6105)	76.0% (73.4-78.4%)	93.2% (92.6-93.8%)
Ethionamide	tbprofiler	272(1104)	414(6105)	75.4% (72.7-77.8%)	93.2% (92.6-93.8%)
Isoniazid	drprg	327(3900)	170(4194)	91.6% (90.7-92.4%)	95.9% (95.3-96.5%)
Isoniazid	mykrobe	333(3900)	170(4194)	91.5% (90.5-92.3%)	95.9% (95.3-96.5%)
Isoniazid	tbprofiler	297(3900)	181(4194)	92.4% (91.5-93.2%)	95.7% (95.0-96.3%)
Kanamycin	drprg	129(670)	107(6975)	80.7% (77.6-83.6%)	98.5% (98.1-98.7%)
Kanamycin	mykrobe	152(670)	98(6975)	77.3% (74.0-80.3%)	98.6% (98.3-98.8%)
Kanamycin	tbprofiler	122(670)	107(6975)	81.8% (78.7-84.5%)	98.5% (98.1-98.7%)
Levofloxacin	drprg	84(1040)	101(5454)	91.9% (90.1-93.4%)	98.1% (97.8-98.5%)
Levofloxacin	mykrobe	88(1040)	102(5454)	91.5% (89.7-93.1%)	98.1% (97.7-98.5%)
Levofloxacin	tbprofiler	85(1040)	109(5454)	91.8% (90.0-93.3%)	98.0% (97.6-98.3%)
Linezolid	drprg	49(65)	4(6110)	24.6% (15.8-36.3%)	99.9% (99.8-100.0%)
Linezolid	mykrobe	48(65)	4(6110)	26.2% (17.0-38.0%)	99.9% (99.8-100.0%)
Linezolid	tbprofiler	48(65)	5(6110)	26.2% (17.0-38.0%)	99.9% (99.8-100.0%)
Moxifloxacin	drprg	44(603)	477(5431)	92.7% (90.3-94.5%)	91.2% (90.4-91.9%)
Moxifloxacin	mykrobe	44(603)	473(5431)	92.7% (90.3-94.5%)	91.3% (90.5-92.0%)
Moxifloxacin	tbprofiler	42(603)	482(5431)	93.0% (90.7-94.8%)	91.1% (90.3-91.9%)
Ofloxacin	drprg	26(105)	5(424)	75.2% (66.2-82.5%)	98.8% (97.3-99.5%)
Ofloxacin	mykrobe	26(105)	5(424)	75.2% (66.2-82.5%)	98.8% (97.3-99.5%)
Ofloxacin	tbprofiler	26(105)	6(424)	75.2% (66.2-82.5%)	98.6% (96.9-99.3%)
Pyrazinamide	drprg	73(341)	50(822)	78.6% (73.9-82.6%)	93.9% (92.1-95.4%)
Pyrazinamide	mykrobe	55(341)	56(822)	83.9% (79.6-87.4%)	93.2% (91.3-94.7%)
Pyrazinamide	tbprofiler	45(341)	62(822)	86.8% (82.8-90.0%)	92.5% (90.4-94.1%)
Rifampicin	drprg	138(3222)	166(4586)	95.7% (95.0-96.4%)	96.4% (95.8-96.9%)
Rifampicin	mykrobe	164(3222)	169(4586)	94.9% (94.1-95.6%)	96.3% (95.7-96.8%)
Rifampicin	tbprofiler	102(3222)	177(4586)	96.8% (96.2-97.4%)	96.1% (95.5-96.7%)
Streptomycin	drprg	269(1042)	132(1205)	74.2% (71.4-76.7%)	89.0% (87.2-90.7%)
Streptomycin	mykrobe	282(1042)	135(1205)	72.9% (70.2-75.5%)	88.8% (86.9-90.5%)
Streptomycin	tbprofiler	257(1042)	136(1205)	75.3% (72.6-77.9%)	88.7% (86.8-90.4%)

These are the results from the most recent run.

from drprg.

iqbal-lab commented on July 20, 2024

Wow that comparison looks great. Moxi and ethionamide stand out for everyone tho, weird

from drprg.

iqbal-lab commented on July 20, 2024

No I did mean moxi , standing out for many FPs, so specificity is low

from drprg.

iqbal-lab commented on July 20, 2024

Frameshit eh

from drprg.

mbhall88 commented on July 20, 2024

No I did mean moxi , standing out for many FPs, so specificity is low

Ah, right. Well the minor allele calling hasn't really increased the FPs much for it (see table here), but the sensitivity has come up a lot. I guess this is to be expected given Phil's recent paper

I'll take a look at those FPs once I've finalised the minor allele stuff

from drprg.

iqbal-lab commented on July 20, 2024

You are right, and anyway same for all tools

from drprg.

mbhall88 commented on July 20, 2024

Alright, here is the diff when reducing minor allele calling from depth fraction 0.2 to 0.1

Tool	Drug	ΔFN	ΔFP
drprg	Amikacin	-1	0
drprg	Capreomycin	0	0
drprg	Delamanid	0	0
drprg	Ethambutol	-15	8
drprg	Ethionamide	0	1
drprg	Isoniazid	-10	3
drprg	Kanamycin	-1	0
drprg	Levofloxacin	-3	2
drprg	Linezolid	0	0
drprg	Moxifloxacin	-3	3
drprg	Ofloxacin	-2	0
drprg	Pyrazinamide	-1	1
drprg	Rifampicin	0	1
drprg	Streptomycin	-7	2

I'm quite impressed with this. I was expecting a lot more FPs. So I think 0.1 is the winner!

Here is the latest Illumina results with the 0.1 fraction

Drug	Tool	FN(R)	FP(S)	Sensitivity (95% CI)	Specificity (95% CI)	MCC
Amikacin	drprg	68(485)	57(6958)	86.0% (82.6-88.8%)	99.2% (98.9-99.4%)	0.861
Amikacin	mykrobe	93(485)	51(6958)	80.8% (77.1-84.1%)	99.3% (99.0-99.4%)	0.836
Amikacin	tbprofiler	62(485)	59(6958)	87.2% (83.9-89.9%)	99.2% (98.9-99.3%)	0.866
Capreomycin	drprg	57(235)	95(2449)	75.7% (69.9-80.8%)	96.1% (95.3-96.8%)	0.672
Capreomycin	mykrobe	72(235)	88(2449)	69.4% (63.2-74.9%)	96.4% (95.6-97.1%)	0.638
Capreomycin	tbprofiler	54(235)	96(2449)	77.0% (71.2-81.9%)	96.1% (95.2-96.8%)	0.679
Delamanid	drprg	111(116)	1(8152)	4.3% (1.9-9.7%)	100.0% (99.9-100.0%)	0.188
Delamanid	mykrobe	111(116)	2(8152)	4.3% (1.9-9.7%)	100.0% (99.9-100.0%)	0.173
Delamanid	tbprofiler	111(116)	2(8152)	4.3% (1.9-9.7%)	100.0% (99.9-100.0%)	0.173
Ethambutol	drprg	122(1538)	750(4936)	92.1% (90.6-93.3%)	84.8% (83.8-85.8%)	0.693
Ethambutol	mykrobe	133(1538)	747(4936)	91.4% (89.8-92.7%)	84.9% (83.8-85.8%)	0.689
Ethambutol	tbprofiler	118(1538)	765(4936)	92.3% (90.9-93.6%)	84.5% (83.5-85.5%)	0.691
Ethionamide	drprg	329(1104)	395(6105)	70.2% (67.4-72.8%)	93.5% (92.9-94.1%)	0.622
Ethionamide	mykrobe	265(1104)	413(6105)	76.0% (73.4-78.4%)	93.2% (92.6-93.8%)	0.658
Ethionamide	tbprofiler	272(1104)	414(6105)	75.4% (72.7-77.8%)	93.2% (92.6-93.8%)	0.653
Isoniazid	drprg	317(3900)	173(4194)	91.9% (91.0-92.7%)	95.9% (95.2-96.4%)	0.879
Isoniazid	mykrobe	333(3900)	170(4194)	91.5% (90.5-92.3%)	95.9% (95.3-96.5%)	0.876
Isoniazid	tbprofiler	297(3900)	181(4194)	92.4% (91.5-93.2%)	95.7% (95.0-96.3%)	0.882
Kanamycin	drprg	128(670)	107(6975)	80.9% (77.7-83.7%)	98.5% (98.1-98.7%)	0.805
Kanamycin	mykrobe	152(670)	98(6975)	77.3% (74.0-80.3%)	98.6% (98.3-98.8%)	0.789
Kanamycin	tbprofiler	122(670)	107(6975)	81.8% (78.7-84.5%)	98.5% (98.1-98.7%)	0.811
Levofloxacin	drprg	81(1040)	103(5454)	92.2% (90.4-93.7%)	98.1% (97.7-98.4%)	0.896
Levofloxacin	mykrobe	88(1040)	102(5454)	91.5% (89.7-93.1%)	98.1% (97.7-98.5%)	0.892
Levofloxacin	tbprofiler	85(1040)	109(5454)	91.8% (90.0-93.3%)	98.0% (97.6-98.3%)	0.89
Linezolid	drprg	49(65)	4(6110)	24.6% (15.8-36.3%)	99.9% (99.8-100.0%)	0.441
Linezolid	mykrobe	48(65)	4(6110)	26.2% (17.0-38.0%)	99.9% (99.8-100.0%)	0.457
Linezolid	tbprofiler	48(65)	5(6110)	26.2% (17.0-38.0%)	99.9% (99.8-100.0%)	0.447
Moxifloxacin	drprg	41(603)	480(5431)	93.2% (90.9-94.9%)	91.2% (90.4-91.9%)	0.669
Moxifloxacin	mykrobe	44(603)	473(5431)	92.7% (90.3-94.5%)	91.3% (90.5-92.0%)	0.669
Moxifloxacin	tbprofiler	42(603)	482(5431)	93.0% (90.7-94.8%)	91.1% (90.3-91.9%)	0.668
Ofloxacin	drprg	24(105)	5(424)	77.1% (68.2-84.1%)	98.8% (97.3-99.5%)	0.821
Ofloxacin	mykrobe	26(105)	5(424)	75.2% (66.2-82.5%)	98.8% (97.3-99.5%)	0.808
Ofloxacin	tbprofiler	26(105)	6(424)	75.2% (66.2-82.5%)	98.6% (96.9-99.3%)	0.802
Pyrazinamide	drprg	72(341)	51(822)	78.9% (74.2-82.9%)	93.8% (91.9-95.2%)	0.741
Pyrazinamide	mykrobe	55(341)	56(822)	83.9% (79.6-87.4%)	93.2% (91.3-94.7%)	0.77
Pyrazinamide	tbprofiler	45(341)	62(822)	86.8% (82.8-90.0%)	92.5% (90.4-94.1%)	0.782
Rifampicin	drprg	138(3222)	167(4586)	95.7% (95.0-96.4%)	96.4% (95.8-96.9%)	0.92
Rifampicin	mykrobe	164(3222)	169(4586)	94.9% (94.1-95.6%)	96.3% (95.7-96.8%)	0.912
Rifampicin	tbprofiler	102(3222)	177(4586)	96.8% (96.2-97.4%)	96.1% (95.5-96.7%)	0.927
Streptomycin	drprg	262(1042)	134(1205)	74.9% (72.1-77.4%)	88.9% (87.0-90.5%)	0.647
Streptomycin	mykrobe	282(1042)	135(1205)	72.9% (70.2-75.5%)	88.8% (86.9-90.5%)	0.629
Streptomycin	tbprofiler	257(1042)	136(1205)	75.3% (72.6-77.9%)	88.7% (86.8-90.4%)	0.649

from drprg.

iqbal-lab commented on July 20, 2024

Thus is much better isn't it!
Questions

just checking, is Mykrobe pinned to haploid here, or can it call minors?
tbprofiler still manages to call 36 rif TPs and 20 isoniazid TPs and 27 pyrazinamids TPs that we don't (although drprg beats Mykrobe which is cool!) . So, double checking: this is the same catalogue right? However tbprofiler takes vcf. Maybe some small issue in probe generation, especially where we have dense variation? But racon ought to plug those gaps. Weird. Are they all in the same samples?

from drprg.

mbhall88 commented on July 20, 2024

just checking, is Mykrobe pinned to haploid here, or can it call minors?

it can call minor alleles

So, double checking: this is the same catalogue right? However tbprofiler takes vcf. Maybe some small issue in probe generation, especially where we have dense variation? But racon ought to plug those gaps. Weird. Are they all in the same samples?

100% the same catalogue for all tools. I'm going to dig into the discrepancies this week

from drprg.

mbhall88 commented on July 20, 2024

It seems there are a decent number of cases where we don't call a minor allele because we fail the GAPS filer (min 0.3). However, in many of these cases, even the ref fails the GAPS filter. A lot of examples I am seeing have a small difference in GAPS between the ref and alt. (rpoB S450L has a lot of minor allele calls where both the ref and alt have GAPS 0.333). I'll try increasing the GAPS filter to 0.5, but also adding in a GAPS diff filter that sets an allowed max. difference between the ref and alt GAPS value.

from drprg.

iqbal-lab commented on July 20, 2024

The GAPS filter threshold is set in the expectation were looking at a major/haploid allele. I'd ignore it for minor allele detection, or as you're doing, change the threshold

from drprg.

iqbal-lab commented on July 20, 2024

Although ref failing gaps also is interesting

from drprg.

mbhall88 commented on July 20, 2024

I added in a GAPS diff parameter. So we have a max GAPS value and a max GAPS diff, which is the difference between the major and minor allele's GAPS value (this can actually be negative - I found a case of the minor allele GAPS being lower than the called allele!).

I ran with max GAPS 0.5 and max diff 0.2 and I get the following diff

Tool	Drug	ΔFN	ΔFP
drprg	Amikacin	0	0
drprg	Capreomycin	0	0
drprg	Delamanid	0	0
drprg	Ethambutol	-101	4135
drprg	Ethionamide	-1	3
drprg	Isoniazid	0	0
drprg	Kanamycin	0	0
drprg	Levofloxacin	-2	2
drprg	Linezolid	0	1
drprg	Moxifloxacin	-1	0
drprg	Ofloxacin	0	0
drprg	Pyrazinamide	-1	1
drprg	Rifampicin	-21	2
drprg	Streptomycin	1	1

This is nearly great. We save a lot of RIF FNs. BUT, we end up calling everything EMB resistant by the looks of it. I'll dig into this today and see if I can get to the bottom of that

from drprg.

mbhall88 commented on July 20, 2024

All of the EMB calls are due to the same position

embA    85      4b6d566c        CCTACC  CCTACA,CCTATC,TCTACC    .       PASS    VC=PH_SNPs;GRAPHTYPE=SIMPLE;PDP=0.423729,0.169492,0.0847458,0.322034;OGT=0;VARID=embA_C-12T,embA_C-16G,embA_C-16T;PREDICT=.,.,r       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      3:18,8,4,14:7,2,1,5:25,0,0,12:7,0,0,3:90,25,25,59:37,7,7,22:0.4,0.666667,0.833333,0.5:-207.412,-316.464,-362.32,-249.056:41.6434

The minor allele has GAPS 0.5 and the ref has 0.4. This position seems to always have depth across the various alleles. My assumption is there is a shared minimizer between these alleles. For instance, the ref is CCTACC and the alt that always gets called is TCTACC. The last 5 bases of these two alleles are the same, so it is conceiveable that they share a minimizer that covers some/all of those 5 bases and not the first, given the small difference in GAPS.

I'll have a think about the best way to address this without losing the RIF TPs we gained.

from drprg.

mbhall88 commented on July 20, 2024

If I break this variant into two sites (there is 3bp that match in between the first base and last two bases) this GAPS issue goes away.

So I see two options here

Reduce the minimum match length to 3 when building the index
Add in another GAPS filter 'max called gaps' which sets a maximum value for the called allele's GAPS value. If it is above this value, we don't consider minor alleles for that site. This would get rid of all of the bad EMB calls

I'm kind of tempted to try both and see what happens separately for each...

from drprg.

mbhall88 commented on July 20, 2024

Reducing the lin. match length to 3 gives the following diff (replaces the last diff)

Tool	Drug	ΔFN	ΔFP
drprg	Amikacin	0	0
drprg	Capreomycin	0	0
drprg	Delamanid	0	0
drprg	Ethambutol	-8	24
drprg	Ethionamide	3	2
drprg	Isoniazid	12	-1
drprg	Kanamycin	0	0
drprg	Levofloxacin	-1	2
drprg	Linezolid	0	1
drprg	Moxifloxacin	-1	0
drprg	Ofloxacin	0	0
drprg	Pyrazinamide	0	3
drprg	Rifampicin	-19	2
drprg	Streptomycin	1	2

Not super happy with this one. Currently trying option 2 above

from drprg.

iqbal-lab commented on July 20, 2024

Bit weird, just this one position always having coverage on both alleles. Your idea of a shared minimizer would effectively mean those 2 alleles collapse to one path in the kmer graph instead of a bubble, or potentially the bubble becomes a bit shorter I suppose. Is there always the same coverage on both?

from drprg.

iqbal-lab commented on July 20, 2024

No, this is only an issue when we pay attention to minor alleles? So we always get a tiny bit of coverage on one allele? So this can't be a shared minimizer can it?

from drprg.

iqbal-lab commented on July 20, 2024

Maybe this is a mad idea, but what happens if you change your hash function?

from drprg.

mbhall88 commented on July 20, 2024

No, this is only an issue when we pay attention to minor alleles? So we always get a tiny bit of coverage on one allele? So this can't be a shared minimizer can it?

All alleles get some coverage, but not the same amount on each. You're right, it can't be shared minimizers...

When I split the allele into two, I still got some coverage on the minor allele, but the GAPS increased on the minor and decreased on the called allele.

Maybe this is a mad idea, but what happens if you change your hash function?

As in the pandora hash function? I think this is what we currently use https://github.com/rmcolq/pandora/blob/945c0255e99ab32fd5b056dc1c16ffdf611ef478/thirdparty/src/inthash.cpp#L106-L117. Do you think there might be a hash collision?

from drprg.

mbhall88 commented on July 20, 2024

This is the diff for using the max called GAPS filter (option 2 from above)

Tool	Drug	ΔFN	ΔFP
drprg	Amikacin	0	0
drprg	Capreomycin	0	0
drprg	Delamanid	0	0
drprg	Ethambutol	0	0
drprg	Ethionamide	-1	3
drprg	Isoniazid	0	0
drprg	Kanamycin	0	0
drprg	Levofloxacin	-2	2
drprg	Linezolid	0	0
drprg	Moxifloxacin	-1	0
drprg	Ofloxacin	0	0
drprg	Pyrazinamide	-1	1
drprg	Rifampicin	-19	2
drprg	Streptomycin	1	1

This looks pretty good.

from drprg.

Parse pandora VCF to detect minor alleles about drprg HOT 46 CLOSED

Comments (46)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent