mbhall88 / drprg Goto Github PK

View Code? Open in Web Editor NEW

19.0 8.0 1.0 18.49 MB

Drug Resistance Prediction with Reference Graphs

Home Page: https://mbh.sh/drprg/

License: MIT License

Rust 92.71% Dockerfile 0.22% xBase 0.39% Just 0.44% Python 4.14% Shell 2.09%

bioinformatics drug-resistance-prediction genome-graphs reference-graphs

drprg's People

Contributors

Stargazers

Watchers

Forkers

jianshu93

drprg's Issues

Add grammar for specifying variant "expert" rules

Instead of listing all indels in katG in the panel for instance, allow passing "expert" rules into the index build step. This will also help clean up the VCFs and make the inference of variant consequences much easier.

I have been thinking a lot about the best way to implement/handle this. Do we require the rules in the panel, or do we require them in a separate file? My thoughts are a separate file would be best to avoid confusing the panel format with the expert rule format.

The other question is around nomenclature of the expert rules. Phil Fowler created a nomenclature (GARC), but it clashes with existing conventions such as * meaning stop codon (Phil uses !). Additionally, HGVS does not support these types of rules. I don't think it needs to be too fancy, so I think we could use something straightforward like

type,gene,start,end,drugs
missense,rpoB,426,452,Rifampicin
nonsense,rpoB,426,452,Rifampicin
frameshift,rpoB,426,452,Rifampicin
nonsense,katG,,,Isoniazid
frameshift,katG,,,Isoniazid
nonsense,ethA,,,Ethionamide
frameshift,ethA,,,Ethionamide
nonsense,gid,,,Streptomycin
frameshift,gid,,,Streptomycin
nonsense,pncA,,,Pyrazinamide
frameshift,pncA,,,Pyrazinamide

where if start and end are empty, the entire gene is inferred.

False positive argmatch results

The argmatch function currentlty does what it is supposed to; it returns whether a panel variant matches with a variant in the pandora VCF. But there can be false positives which lead to FP resistance calls. For example

gid     715     cd8b83c0        ACGACG  ACGACA,GCGACG   .       PASS    SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;VARID=gid_A205E,gid_A205*;PREDICT=R,R        GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    2:13,10,25:14,9,27:12,0,24:14,0,28:55,30,102:57,28,110:0.5,0.666667,0:-360.9,-412.153,-214.773:146.127

(I've removed some VARIDs for brevity)

This variant simplifies to

gid     715     cd8b83c0        A  G   .       PASS    SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;VARID=gid_A205E,gid_A205*;PREDICT=R,R        GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    2:13,10,25:14,9,27:12,0,24:14,0,28:55,30,102:57,28,110:0.5,0.666667,0:-360.9,-412.153,-214.773:146.127

The two variants gid_A205E and gid_A205* do overlap this VCF record, but they're not technically a match.

This position, 715, maps to codon 205, which in the reference is GCA. So this VCF position is the last base in the codon.

Changing the last base to G, as this record calls, would make the codon GCA->GCG which in amino space is A->A - i.e., synonymous. However, what argmatch does (by design) is look in the panel VCF and see that a G at this position matches with those other two variants, which it does, but it's complicated....

Here are the panel VCF records for those two variants

gid     713     gid_A205E       GCA     GAA,GAG 0       .       PAD=100;GENE=gid;VAR=A205E;RES=PROT;DRUGS=Streptomycin;ST=-
gid     713     gid_A205*       GCA     TGA,TAA,TAG     0       .       PAD=100;GENE=gid;VAR=A205*;RES=PROT;DRUGS=Streptomycin;ST=-

Extracting just position 715 from these variants does indeed provide A->G as an option. But it ignores the fact that you would also need to change positions 713 and/or 714, which our running example at the top does not.

I need to think about the cleanest way to handle this... Any thoughts are most welcome though.

Determine the optimum w and k sizes

Do a sweep through pandora window and kmer sizes and see which result in the best drprg performance for Illumina and Nanopore

Index is not valid, missing files error

Dear Developers, I am trying to use and run your software, however nothing helps to fix the errors, which appear:

#!/bin/bash 

sudo docker run --rm=True -v $PWD:/home/mat29/indata/ -u $(id -u):$(id -g) staphb/drprg drprg index --download  -o/home/mat29/indata/   

sudo docker run --rm=True -v $PWD:/home/mat29/indata/ -u $(id -u):$(id -g) staphb/drprg drprg predict -x/home/mat29/indata/mtb -i/home/mat29/indata/reads.fastq  --illumina -o/home/mat29/indata/ 

System shows:

#[2024-03-27T17:25:47Z INFO ] mtb index version 20230308 already downloaded. Skipping...
#[2024-03-27T17:25:47Z INFO ] Done!
#Error: Index is not valid due to missing file "/home/mat29/indata/mtb/dr.prg.kX.wY.idx"

If I create the new file dr.prg.kX.wY.idx, system throws the other similar error that there is another file missing. And it never ends
I need to run your software on the M. Tuberculosis FASTQ Illumina genomic files.

Please advise what Shell code to run and what to change, how to fix the problem?
Thank you.

Dr. Matvey

Benchmark figures

Create benchmark figures for time and memory usage.

How do we want to do time? CPU time, wall time? I guess this gets complicated by threads, although I am using 2 for each tool at the moment.

Deal with gene absence

Arnold provided me with a MTB sample that has a 32kbp deletion which include katG. This seems to be quite rare in clinical samples? I see https://doi.org/10.1016/j.ijmm.2021.151506 (2021) claims to be the first report of this...

When I run drprg on this sample, we produce an S call for INH, but katG indeed is missing from the VCF - i.e. pandora has noted it's absence. Both mykrobe and tb-profiler call S too. I think mykrobe's panel can be altered to detect gene absence from looking at the wiki, but I'm not clear on how you link this to a drug...

So, this means we should be able to quite easily detect gene deletions. The next question becomes should we do about this. I did a literature search for gene deletions of some other resistance-associated genes but I'm struggling to find any papers that have assessed gene deletion impact on resistance, except for katG/INH. Based on the intro to that paper above maybe people assume the fitness cost is too high for the bug to survive?

A last question is whether this goes in the paper with some case studie(s) of samples with gene deletions?

Parse pandora VCF to detect minor alleles

    I think detecting minors  could easily be done directly in drprg, no need to implement in Pandora. You get coverage info on the S and R alleles right? Just ask if the coverage on any R allele is >0.1 of the total

_Originally posted by @iqbal-lab in mbhall88/drprg-paper#2

From mbhall88/drprg-paper#2 we see that the bulk of the gap in sensitivity between drprg and tbprofiler for INH is to do with minor allele calls.

We can implement a variant of minor allele detectiong in drprg by looking at the coverage reported in the pandora VCF. If a variant is called reference allele, but we detect more than f (fraction) of the reads are from an allele that corresponds to a resistance mutation, then we call the resistance allele.

Installation_error

I got the error below when trying to install in a conda env;

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.                                                                                                                           failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versionsThe following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.31=0
  - feature:|@/linux-64::__glibc==2.31=0

Your installed version is: 2.31

What might be the issue? Thanks

Run TBProfiler

Run TBProfiler on the isolates we select from mbhall88/drprg-paper#1

The first part of this will be making a TBProfiler panel from our panel so that we are using the same.

Collate_drprg_results

Am interested in a script that collates all json file outputs into one text file - that shows the drug and predicted genotypic outcome for all samples. Thx

Update pandora and make_prg

    Could you rerun drprg using the most recent make_prg (0.4.0) and pandora (0.10.0-alpha.0) versions to check if results improve in any way?

Originally posted by @leoisl in #19 (comment)

Add install instructions to README.md

Quick start wasn't helpful
as i had to dig into the long
docs to figure out how to installl

maybe just add the conda install command
to the README ?

Run mykrobe

Run mykrobe on the isolates we select from mbhall88/drprg-paper#1. Use v0.12.1 - i.e., the WHO catalogue.

Paired-end fastq files

My apologies if this is in the documentation somewhere, but I am unable to figure out how to run drprg on two fastq files at the same time.

drprg predict -x mtb -i reads.fq --illumina -o outdir/

^ only has one fastq file.

Targeted sequencing mode

Implementation of this feature will rely on the successful resolution of iqbal-lab-org/pandora#263 for starters.

In addition, we will probably need to require a BED file of the regions targeted by the user so that we don't trigger things like whole/partial gene deletions in samples where the whole gene was no amplified. This will also alllow us to set the correct genome size for the sample in pandora.

One last question is whether we should include this in the paper or delay it until after the paper (and maybe do a follow-up paper describing the results)? cc @lachlancoin @iqbal-lab

How do i pronounce this tool name

I see it writen as "Dr PRG"
So it is Doctor Prag, Preg, Prog, Prig, or Prug?
Or "Doctor Pee Ahh Jee" ?

Lineage calling

Do we want to do this?

If we do, what catalogue(s) of markers do we use, and are there any "benchmark" datasets we can test on?

Missing expected output file /test/outdir/discover/denovo_sequences.fa

I fear I may be missing something.

This is the set of commands that I used:

# downloading some TB reads (works fine)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR230/005/SRR23086705/SRR23086705_1.fastq.gz

# getting the index (works fine)
drprg index --download mtb

# running predict (fails)
drprg predict -x mtb -i SRR23086705_1.fastq.gz --illumina -o outdir/

This is the error from drprg predict

11.12 [2023-11-14T23:44:10Z INFO ] Outdir doesn't exist...creating...
11.12 [2023-11-14T23:44:10Z INFO ] Discovering variants...
36.82 Error: Failed to update discover PRG
36.82 
36.82 Caused by:
36.82     Missing expected output file /test/outdir/discover/denovo_sequences.fa

Is there something I can do to get drprg predict to run to completion? I fear I have missed something in the documentation somewhere.

Consequence of unknown mutations

At the moment, for an unknown mutation, we just output the nucleotide sequence and coordinates. It might be more useful to know the amino acid consequence. This will also be useful for being able to ignore synonymous mutations, which account for a lot of the unknown mutations.

Disruptive in-frame indels

I have now been through all of the FNs where one of the other callers has a TP.

There are 5 (ethionamide), 1 (isoniazid), 1 (pyrazinamide), and 1 (streptomycin) FN where tbprofiler calls a disruptive in-frame indel. The genes associated with these four drugs have the rule "any frameshift causes mutation", but these aren't frameshifts. The nomenclature used by tbprofiler is from snpEff I believe. By disruptive it means it deletes a multiple of 3 bases, but that these don't fall evenly into codon boundaries, so they create a new amino acid.

Is this something we also want to call? I can't find much info in the literature about the impact of disruptive in-frame indels...

error: unrecognized subcommand 'index'

Hello, we've been trying to use dr-prg to our research but somehow after installing the program with docker and singularity separately, the index subcommand is not there.. Do you have any idea what went wrong?

Add some common resistance-conferring mutations that do no exist in population graph

See mbhall88/drprg-paper#2 and mbhall88/drprg-paper#2 for the motivation.

There are some common mutations which do not exist in the reference graph. As such, when there is a minor allele in a sample for one of these mutations, we do not detect it. This is because discover only finds major alleles, and to detect minor alleles, the allele must exist in the graph in the first place.

Run drprg

Run drprg on all samples.

This is nearly sorted, but there is one sample in the initial testing set that fails due to iqbal-lab-org/pandora#294.

Notice partial gene deletion that spans start codon

    I've been through all of the drprg PZA FNs that are called by at least one other tool.

There are two overarching problems drprg has

A lot of the missed calls are minor allele calls for variants not covered by anything in the PRG. So, because they're minor alleles, they don't get discovered as novel. THe pncA PRG is quite sparse so it might be worth us adding some more PZA-resistant isolates to the reference PRG to try and capture more of the popn. variation. And where the minor alleles are covered by the PRG they seem to fail the GAPS threshold of 0.3
There are some big deletions that knock out the start codon, and some. We (surprisingly) discover the deletion, but get no coverage on it (or the ref)

pncA    1       .       GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT        G,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCGGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATATCTT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCCGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACCTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATATCTT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACGACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCAGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACTACTTCTCCGGCACACCGGACTATTCCT,GTCATGTTCGCGATCGTCGCGGCGTCATGGACCCTATATCTGTGGCTGCCGCGTCGGTAGGCAAACTGCCCGGGCAGTCGCCCGAACGTATGGTGGACGTATGCGGGCGTTGATCATCGTCGACGTGCCGAACGACTTCTGCGAGGGTGGCTCGCTGGCGGTAACCGGTGGCGCCGCGCTGGCCCGCGCCATCAGCGACTACCTGGCCGAAGCGGCGGACTACCATCACGTCGTGGCAACCAAGGACTTCCACATCGACCCGGGTGACCACTTCTCCGGCACACCGGACTATTCCT .       .       VC=INDEL;GRAPHTYPE=SIMPLE       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:1,1,1,1,1,1,1,1,1,1:-488,-488,-488,-488,-488,-488,-488,-488,-488,-488:0

One way around this could be to notice when we have more than n consecutive VCF entries with a failed/null call and just call resistant? Or, to be more precise, notice when we have a failed position(s) that spans the start codon and then call resistant if it is one of the genes where gene deletion causes resistance.

_Originally posted by @mbhall88 in mbhall88/drprg-paper#2