Comments (8)
The way this analysis was setup is I took 183 isolates from the head to head dataset which have a minimum of 3x coverage for both Illumina and Nanopore.
I then run drprg
on those isolates with the following window and kmer sizes (no combinations where w>k though)
w
- 10 # default (nanopore) in minimap2
- 11 # short read default in minimap2
- 14 # current default
- 19 # best illumina found in Rachel's thesis
k
- 15 # current default
- 17
- 19
- 21 # value used for short reads by minimap2
- 25
- 31 # best illumina value found in Rachel's thesis
I then aggregate the number of false positive and negative resistance calls across all drugs and plot them.
In the above plot, the dashed lines with the triangle points are FP counts and the solid lines with circle points are FN counts. The different colours indicate the window size, with kmer size on the x-axis.
Seems like something weird has happened in both technologies at k-17 w=14 as the FPs randomly sky-rocket. I will look into what has caused this because based on the trends, this looks like it might have been the best combination for both technologies. As it stands k=17 and w=11 looks to be the best.
from drprg.
The jump in FNs are effectively all in Rifampicin and specifically in the mutation rpoB_S450X
. What is happening (in both technolopgies) is that for some reason at that precise (w, k) pairing, we start getting hits on the reference allele, which leads to the FRS (fraction of read support) filter being applied. There are no hits on the reference allele for k=15 and k=19 at that w=14 - for the isolates I looked at.
Example VCF entires (I have removed excess VARID
and PREDICT
values)
sample mada_1-5
nanopore w=14 k=15
rpoB 1449 28f3a195 C T . PASS SVTYPE=SNP;GRAPHTYPE=SIMPLE;VARID=rpoB_S450X;PREDICT=R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,25:0,27:0,35:0,33:0,77:0,81:1,0.333333:-421.469,-43.1295:378.339
sample mada_1-5
nanopore w=14 k=17
rpoB 1449 f431d8b5 C T . frs SVTYPE=SNP;GRAPHTYPE=SIMPLE;VARID=rpoB_S450X;PREDICT=R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:18,38:18,42:18,38:18,42:37,77:37,84:0.5,0:-434.949,-169.272:265.677
sample mada_1-5
nanopore w=14 k=19
rpoB 1449 a0e01ee5 C T . PASS SVTYPE=SNP;GRAPHTYPE=SIMPLE;VARID=rpoB_S450X;PREDICT=R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,38:0,42:0,38:0,42:0,77:0,85:1,0:-538.414,-3.26102:535.153
For @lachlancoin's understanding, what you want to focus on here for the coverage is the MEAN_FWD_COVG
and MEAN_REV_COVG
fields
I don't really get why this particular window size (14) at that particular kmer size (17) would cause this. But as this allele is inside the RRDR there could be something funky happening in the graph surrounding this position.
I'm going to add in w=13 and w=15 to get a better idea of exactly where this breaks down.
Any other ideas?
from drprg.
Waiting on resolution of iqbal-lab-org/pandora#293 as a few samples failed for that reason
from drprg.
How is the graph built? Are you starting with a list of variants/vcf and implicitly allowing loads of recombination, or starting with an MSA of known haplotypes and using make-prg? And does drprg do de novo via de bruin or racon, or neither?
from drprg.
How is the graph built? Are you starting with a list of variants/vcf and implicitly allowing loads of recombination, or starting with an MSA of known haplotypes and using make-prg?
I started with the cryptic VCF, subsampling 50 isolates from each lineage. Then apply the haplotype for each isolate to H37Rv. Then MSA that, then make_prg. I then run de novo (dBG) on that graph, update the graph with make prg, and then I hit the error when genotyping with pandora map. I only get it on 4 w-k combination, and, interestingly, all are when w=k. I don't hit this for any other w-k combination. In total, with the 181 isolates across the two sequencing modalities there are ~7-8k jobs in the pipeline and only 4 hit this error. I'll skip them for now, but it seems important to understand what is causing this.
from drprg.
I'm going to add in w=13 and w=15 to get a better idea of exactly where this breaks down.
Here is the plot with the extra window and kmer sizes.
tl;dr k=17 and w=11 seems best for both technologies
from drprg.
After the work to-date on #11 these results have changed quite noticeably - for FPs mainly.
After ranking the F1 scores for all of these combinations (and eyeballing) it seems the original default of w=14 and k=15 gives the best results. I'll revert back to that.
from drprg.
After all the recent work I thought I would revisit this.
Looking at the raw values I thought I'd try out w=11 and k=15 (currently the default is w=14 k=15) and I get the following diff for Illumina (no change for nanopore)
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | -2 | 0 |
drprg | Capreomycin | -1 | 0 |
drprg | Delamanid | 0 | -1 |
drprg | Ethambutol | -1 | 2 |
drprg | Ethionamide | -3 | -7 |
drprg | Isoniazid | 0 | 0 |
drprg | Kanamycin | -2 | 0 |
drprg | Levofloxacin | 1 | 2 |
drprg | Linezolid | 0 | 0 |
drprg | Moxifloxacin | -1 | -1 |
drprg | Ofloxacin | 1 | 0 |
drprg | Pyrazinamide | 0 | -1 |
drprg | Rifampicin | -1 | 0 |
drprg | Streptomycin | -3 | 0 |
So I think I'll stick with w=11 and k=15
from drprg.
Related Issues (20)
- False positive argmatch results HOT 1
- Run drprg HOT 4
- Add grammar for specifying variant "expert" rules
- Benchmark figures HOT 7
- Parse pandora VCF to detect minor alleles HOT 46
- Update pandora and make_prg HOT 2
- Deal with gene absence HOT 9
- Add some common resistance-conferring mutations that do no exist in population graph HOT 8
- Notice partial gene deletion that spans start codon HOT 8
- Targeted sequencing mode HOT 1
- Disruptive in-frame indels HOT 3
- Lineage calling HOT 1
- Add install instructions to README.md
- How do i pronounce this tool name HOT 1
- Installation_error HOT 2
- Collate_drprg_results HOT 1
- error: unrecognized subcommand 'index' HOT 4
- Paired-end fastq files HOT 1
- Missing expected output file /test/outdir/discover/denovo_sequences.fa HOT 5
- Index is not valid, missing files error HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from drprg.