mbhall88 / drprg Goto Github PK
View Code? Open in Web Editor NEWDrug Resistance Prediction with Reference Graphs
Home Page: https://mbh.sh/drprg/
License: MIT License
Drug Resistance Prediction with Reference Graphs
Home Page: https://mbh.sh/drprg/
License: MIT License
See mbhall88/drprg-paper#2 and mbhall88/drprg-paper#2 for the motivation.
There are some common mutations which do not exist in the reference graph. As such, when there is a minor allele in a sample for one of these mutations, we do not detect it. This is because discover
only finds major alleles, and to detect minor alleles, the allele must exist in the graph in the first place.
Run mykrobe on the isolates we select from mbhall88/drprg-paper#1. Use v0.12.1 - i.e., the WHO catalogue.
I think detecting minors could easily be done directly in drprg, no need to implement in Pandora. You get coverage info on the S and R alleles right? Just ask if the coverage on any R allele is >0.1 of the total
_Originally posted by @iqbal-lab in mbhall88/drprg-paper#2
From mbhall88/drprg-paper#2 we see that the bulk of the gap in sensitivity between drprg and tbprofiler for INH is to do with minor allele calls.
We can implement a variant of minor allele detectiong in drprg by looking at the coverage reported in the pandora VCF. If a variant is called reference allele, but we detect more than f (fraction) of the reads are from an allele that corresponds to a resistance mutation, then we call the resistance allele.
Run TBProfiler on the isolates we select from mbhall88/drprg-paper#1
The first part of this will be making a TBProfiler panel from our panel so that we are using the same.
Could you rerun drprg using the most recent make_prg (0.4.0) and pandora (0.10.0-alpha.0) versions to check if results improve in any way?
Originally posted by @leoisl in #19 (comment)
Am interested in a script that collates all json file outputs into one text file - that shows the drug and predicted genotypic outcome for all samples. Thx
Dear Developers, I am trying to use and run your software, however nothing helps to fix the errors, which appear:
#!/bin/bash
sudo docker run --rm=True -v $PWD:/home/mat29/indata/ -u $(id -u):$(id -g) staphb/drprg drprg index --download -o/home/mat29/indata/
sudo docker run --rm=True -v $PWD:/home/mat29/indata/ -u $(id -u):$(id -g) staphb/drprg drprg predict -x/home/mat29/indata/mtb -i/home/mat29/indata/reads.fastq --illumina -o/home/mat29/indata/
System shows:
#[2024-03-27T17:25:47Z INFO ] mtb index version 20230308 already downloaded. Skipping...
#[2024-03-27T17:25:47Z INFO ] Done!
#Error: Index is not valid due to missing file "/home/mat29/indata/mtb/dr.prg.kX.wY.idx"
If I create the new file dr.prg.kX.wY.idx, system throws the other similar error that there is another file missing. And it never ends
I need to run your software on the M. Tuberculosis FASTQ Illumina genomic files.
Please advise what Shell code to run and what to change, how to fix the problem?
Thank you.
Dr. Matvey
My apologies if this is in the documentation somewhere, but I am unable to figure out how to run drprg on two fastq files at the same time.
drprg predict -x mtb -i reads.fq --illumina -o outdir/
^ only has one fastq file.
I have now been through all of the FNs where one of the other callers has a TP.
There are 5 (ethionamide), 1 (isoniazid), 1 (pyrazinamide), and 1 (streptomycin) FN where tbprofiler calls a disruptive in-frame indel. The genes associated with these four drugs have the rule "any frameshift causes mutation", but these aren't frameshifts. The nomenclature used by tbprofiler is from snpEff I believe. By disruptive it means it deletes a multiple of 3 bases, but that these don't fall evenly into codon boundaries, so they create a new amino acid.
Is this something we also want to call? I can't find much info in the literature about the impact of disruptive in-frame indels...
Create benchmark figures for time and memory usage.
How do we want to do time? CPU time, wall time? I guess this gets complicated by threads, although I am using 2 for each tool at the moment.
I got the error below when trying to install in a conda env;
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages. failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versionsThe following specifications were found to be incompatible with your system:
- feature:/linux-64::__glibc==2.31=0
- feature:|@/linux-64::__glibc==2.31=0
Your installed version is: 2.31
What might be the issue? Thanks
I see it writen as "Dr PRG"
So it is Doctor Prag, Preg, Prog, Prig, or Prug?
Or "Doctor Pee Ahh Jee" ?
Arnold provided me with a MTB sample that has a 32kbp deletion which include katG. This seems to be quite rare in clinical samples? I see https://doi.org/10.1016/j.ijmm.2021.151506 (2021) claims to be the first report of this...
When I run drprg on this sample, we produce an S call for INH, but katG indeed is missing from the VCF - i.e. pandora has noted it's absence. Both mykrobe and tb-profiler call S too. I think mykrobe's panel can be altered to detect gene absence from looking at the wiki, but I'm not clear on how you link this to a drug...
So, this means we should be able to quite easily detect gene deletions. The next question becomes should we do about this. I did a literature search for gene deletions of some other resistance-associated genes but I'm struggling to find any papers that have assessed gene deletion impact on resistance, except for katG/INH. Based on the intro to that paper above maybe people assume the fitness cost is too high for the bug to survive?
A last question is whether this goes in the paper with some case studie(s) of samples with gene deletions?
Hello, we've been trying to use dr-prg to our research but somehow after installing the program with docker and singularity separately, the index subcommand is not there.. Do you have any idea what went wrong?
Do we want to do this?
If we do, what catalogue(s) of markers do we use, and are there any "benchmark" datasets we can test on?
Run drprg
on all samples.
This is nearly sorted, but there is one sample in the initial testing set that fails due to iqbal-lab-org/pandora#294.
I fear I may be missing something.
This is the set of commands that I used:
# downloading some TB reads (works fine)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR230/005/SRR23086705/SRR23086705_1.fastq.gz
# getting the index (works fine)
drprg index --download mtb
# running predict (fails)
drprg predict -x mtb -i SRR23086705_1.fastq.gz --illumina -o outdir/
This is the error from drprg predict
11.12 [2023-11-14T23:44:10Z INFO ] Outdir doesn't exist...creating...
11.12 [2023-11-14T23:44:10Z INFO ] Discovering variants...
36.82 Error: Failed to update discover PRG
36.82
36.82 Caused by:
36.82 Missing expected output file /test/outdir/discover/denovo_sequences.fa
Is there something I can do to get drprg predict
to run to completion? I fear I have missed something in the documentation somewhere.
The argmatch
function currentlty does what it is supposed to; it returns whether a panel variant matches with a variant in the pandora VCF. But there can be false positives which lead to FP resistance calls. For example
gid 715 cd8b83c0 ACGACG ACGACA,GCGACG . PASS SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;VARID=gid_A205E,gid_A205*;PREDICT=R,R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 2:13,10,25:14,9,27:12,0,24:14,0,28:55,30,102:57,28,110:0.5,0.666667,0:-360.9,-412.153,-214.773:146.127
(I've removed some VARIDs for brevity)
This variant simplifies to
gid 715 cd8b83c0 A G . PASS SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;VARID=gid_A205E,gid_A205*;PREDICT=R,R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 2:13,10,25:14,9,27:12,0,24:14,0,28:55,30,102:57,28,110:0.5,0.666667,0:-360.9,-412.153,-214.773:146.127
The two variants gid_A205E
and gid_A205*
do overlap this VCF record, but they're not technically a match.
This position, 715, maps to codon 205, which in the reference is GCA
. So this VCF position is the last base in the codon.
Changing the last base to G
, as this record calls, would make the codon GCA->GCG
which in amino space is A->A
- i.e., synonymous. However, what argmatch
does (by design) is look in the panel VCF and see that a G
at this position matches with those other two variants, which it does, but it's complicated....
Here are the panel VCF records for those two variants
gid 713 gid_A205E GCA GAA,GAG 0 . PAD=100;GENE=gid;VAR=A205E;RES=PROT;DRUGS=Streptomycin;ST=-
gid 713 gid_A205* GCA TGA,TAA,TAG 0 . PAD=100;GENE=gid;VAR=A205*;RES=PROT;DRUGS=Streptomycin;ST=-
Extracting just position 715 from these variants does indeed provide A->G
as an option. But it ignores the fact that you would also need to change positions 713 and/or 714, which our running example at the top does not.
I need to think about the cleanest way to handle this... Any thoughts are most welcome though.
Implementation of this feature will rely on the successful resolution of iqbal-lab-org/pandora#263 for starters.
In addition, we will probably need to require a BED file of the regions targeted by the user so that we don't trigger things like whole/partial gene deletions in samples where the whole gene was no amplified. This will also alllow us to set the correct genome size for the sample in pandora.
One last question is whether we should include this in the paper or delay it until after the paper (and maybe do a follow-up paper describing the results)? cc @lachlancoin @iqbal-lab
At the moment, for an unknown mutation, we just output the nucleotide sequence and coordinates. It might be more useful to know the amino acid consequence. This will also be useful for being able to ignore synonymous mutations, which account for a lot of the unknown mutations.
Instead of listing all indels in katG in the panel for instance, allow passing "expert" rules into the index build step. This will also help clean up the VCFs and make the inference of variant consequences much easier.
I have been thinking a lot about the best way to implement/handle this. Do we require the rules in the panel, or do we require them in a separate file? My thoughts are a separate file would be best to avoid confusing the panel format with the expert rule format.
The other question is around nomenclature of the expert rules. Phil Fowler created a nomenclature (GARC), but it clashes with existing conventions such as *
meaning stop codon (Phil uses !
). Additionally, HGVS does not support these types of rules. I don't think it needs to be too fancy, so I think we could use something straightforward like
type,gene,start,end,drugs
missense,rpoB,426,452,Rifampicin
nonsense,rpoB,426,452,Rifampicin
frameshift,rpoB,426,452,Rifampicin
nonsense,katG,,,Isoniazid
frameshift,katG,,,Isoniazid
nonsense,ethA,,,Ethionamide
frameshift,ethA,,,Ethionamide
nonsense,gid,,,Streptomycin
frameshift,gid,,,Streptomycin
nonsense,pncA,,,Pyrazinamide
frameshift,pncA,,,Pyrazinamide
where if start and end are empty, the entire gene is inferred.
I've been through all of the drprg PZA FNs that are called by at least one other tool.
There are two overarching problems drprg has
pncA 1 . GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT G,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCGGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATATCTT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCCGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACCTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATATCTT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACGACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACTACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCCGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT . . VC=INDEL;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:1,1,1,1,1,1,1,1,1,1:-488,-488,-488,-488,-488,-488,-488,-488,-488,-488:0
One way around this could be to notice when we have more than n consecutive VCF entries with a failed/null call and just call resistant? Or, to be more precise, notice when we have a failed position(s) that spans the start codon and then call resistant if it is one of the genes where gene deletion causes resistance.
_Originally posted by @mbhall88 in mbhall88/drprg-paper#2
Quick start wasn't helpful
as i had to dig into the long
docs to figure out how to installl
maybe just add the conda install command
to the README ?
Do a sweep through pandora
window and kmer sizes and see which result in the best drprg performance for Illumina and Nanopore
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.