xavierdidelot / transphylo

Reconstruction of transmission trees using genomic data

Home Page: http://xavierdidelot.github.io/TransPhylo/

License: GNU General Public License v2.0

R 85.52% C++ 14.48%

transphylo's Issues

Please create conda package

Hi there! I could not find TransPhylo in bioconda. Please create a conda package for this software.

Consensus and medoid trees on mult_ttree

Hi,

I am using TransPhylo to infer transmission trees starting from transmission clusters of M. tuberculosis defined by a given number of SNPs. TransPhylo is an awesome piece of software and so far I have had no trouble going through the tutorials using my own data. I have, however, a conceptual question, as I am not sure whether I am using TransPhylo correctly.

I would like to take into account as much phylogenetic uncertainty as possible, so I am using the new function infer_multittree_share_param to infer transmission trees for a given cluster, starting from many trees subsampled from the BEAST posterior as in (Xu et al. 2020). For each input phylogeny I obtain as many transmission trees as there are MCMC iterations and, following the paper mentioned above, I could just take the MAP transmission tree for downstream analysis. My question is whether, instead of taking the MAP, it would be correct to merge all transmission trees (given that they come from different posterior phylogenies of the same transmission cluster) and calculate the consensus/medoid transmission tree. Before merging, I "manually" remove the burn-in from the transmission trees of each phylogeny. This is what I have done so far and at least it works!

Also, I could not find in the documentation what the difference is between the consensus transmission tree obtained with consTTree and the medoid obtained with medTTree, so I cannot figure out which one most likely represents the true transmission links.

Any help or suggestions on this topic are very welcome.

Thank you very much,

Galo A. Goig

summary.resTransPhylo

Hi,

I'm wondering what the visual output/values for summary.resTransPhylo are supposed to be. Does it provide information about the average number of hosts identified (unsampled and sampled) across the MCMC iterations and so on? I can get that for an individual iteration using extractCTree, but it would be good to have an overall summary, if that makes sense.

I am also having trouble getting it working...
When I do just summary(res) I get the output Result from TransPhylo analysis and nothing else, when I do print.phylo(phy) by itself it gives me some info about the input tree.
I have both transphylo and ape loaded but the function summary.resTransphylo isn't being recognised.

Any help would be appreciated!
Thanks,
Abbi

Trouble importing ML tree into TransPhylo

I have been trying to run TransPhylo on an ML tree, the output from IQTree. I've then transformed the tree file from IQtree into a newick format. For some reason when I try to run:

t <- read.tree('zika_103_ML.nwk')
t <- multi2di(t) # remove multifurcations
t$edge.length <- pmax(t$edge.length,1/365) # use a day as minimum branch length
ptree <- ptreeFromPhylo(t, dateLastSample=2017)
plot(ptree)

I then get the error:

Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' is a list, but does not have components 'x' and 'y'

I am unable to find out what this error means in terms of how to fix my tree for input into TransPhylo. Is there anything I can do to fix this?

Errors when using infer_multittree_share_param

Dear author,

I am using infer_multittree_share_param (TransPhylo v1.4.10) to infer the parameters of the transmission chain from my data. I found that this function performs well when the dataset is not big. However, an error occurred when I used a big dataset (>3000 tips). I have no idea how to solve this problem. Do you have any suggestions? The script and error information were as follows:

library(ape)
library(TransPhylo)
library(coda)
packageVersion("TransPhylo")
[1] ‘1.4.10’
tree1<-multi2di(read.nexus("timetree_21.nexus"))
tree1$edge.length[tree1$edge.length<=0]<-0.1/365
tree1$edge.length<-tree1$edge.length*365
ptree1<-ptreeFromPhylo(tree1,dateLastSample=60)

tree2<-multi2di(read.nexus("timetree_36.nexus"))
tree2$edge.length[tree2$edge.length<=0]<-0.1/365
tree2$edge.length<-tree2$edge.length*365
ptree2<-ptreeFromPhylo(tree2,dateLastSample=60)

ptrees<-list(ptree1,ptree2)
w.shape<-2.892313
w.scale<-2.938824
dateT<-61
record<-infer_multittree_share_param(ptrees,mcmcIterations=1000000,thinning=100,w.shape=w.shape,w.scale=w.scale,updateNeg=T,updateOff.p=T,dateT=dateT,share=c("off.r","off.p","pi"),delta_t=1)

  |                                                  |   0%
Error in if (log(runif(1)) < sum(pTTree2) - sum(pTTree)) { :
  missing value where TRUE/FALSE needed
Calls: infer_multittree_share_param ... with -> with.default -> eval -> eval -> one_update_share
Execution halted
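
For readers who hit the same message: in R, `if` stops with "missing value where TRUE/FALSE needed" whenever its condition evaluates to NA. The base-R sketch below shows one way a comparison like the one in the error above could become NA. It assumes (this is not confirmed anywhere in the thread) that a log-probability has underflowed to -Inf. The variable names mirror the error message, but the values are made up.

```r
# Hypothetical log-probabilities; -Inf stands for a probability that underflowed to 0
pTTree  <- c(-12.3, -Inf)
pTTree2 <- c(-11.8, -Inf)

# -Inf - (-Inf) is NaN, so the acceptance test cannot be evaluated
diff_log <- sum(pTTree2) - sum(pTTree)
is.nan(diff_log)

# Reproduce the error message safely with tryCatch
msg <- tryCatch(
  if (log(runif(1)) < diff_log) "accept" else "reject",
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this is what happens on large trees, checking the intermediate log-probabilities for -Inf or NaN would be a natural first diagnostic.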

Run time differences depending on tree time scale

Hi there. I've been using TransPhylo on some simulations for a while and noticed something quite curious. In my experience, TransPhylo runs much faster if the input tree is on the scale of years rather than months. In my own work, trees on the scale of months can take 10+ hours to run, while the same trees on the scale of years can take less than an hour. My current understanding of the method is that the time scale shouldn't particularly affect the analysis, and this discrepancy has me worried about my understanding of the methodology. Any insights are appreciated. A code example based on the TransPhylo tutorial is below: the original tree in years takes around 1.5 seconds to run, while a rescaled version of the tree in months takes around 23 seconds on my computer.

# simulating outbreak in transphylo, and checking if changing time scale matters
library(TransPhylo)
library(ape)
set.seed(0)


neg=100/365
off.r=5
w.shape=10
w.scale=0.1
pi=0.25


simu <- simulateOutbreak(neg=neg,pi=pi,off.r=off.r,w.shape=w.shape,
                         w.scale=w.scale,dateStartOutbreak=2005,dateT=2008)

ptree<-extractPTree(simu)
plot(ptree)
p<-phyloFromPTree(ptree)
plot(p)
axisPhylo(backward = F)

w.shape=10
w.scale=0.1
dateT=2008
ptm <- proc.time()
res<-inferTTree(ptree,mcmcIterations=1000,w.shape=w.shape,w.scale=w.scale,dateT=dateT)

year_time <- proc.time() - ptm
# year time is 1.5

# repeating the analysis using the same tree scaled to months not years
month_p <- p
month_p$edge.length <- month_p$edge.length*12
plot(month_p)
axisPhylo(backward = F)

month_DTL <- (max(ptree$ptree[,1]) - 2005)*12
month_ptree<-ptreeFromPhylo(month_p,dateLastSample=month_DTL)

month_w.shape=10
month_w.scale=12/10

ptm_month <- proc.time()
month_res<-inferTTree(month_ptree,
                      mcmcIterations=1000,
                      w.shape=month_w.shape,
                      w.scale=month_w.scale,
                      dateT= 36)

month_time <- proc.time() - ptm_month
# month time is 23.13

Thank you for your time,
Isaac Goldstein
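
As a sanity check on the rescaling used in the script above: when time is converted from years to months, a gamma distribution keeps its shape and its scale is multiplied by 12. The short base-R sketch below verifies this numerically for the generation time parameters of the example. It does not explain the difference in run time; the time points are arbitrary.

```r
w.shape <- 10
w.scale_years  <- 0.1
w.scale_months <- w.scale_years * 12   # 1.2, matching month_w.scale above

t_years <- c(0.5, 1, 1.5, 2)           # arbitrary time points, in years

# Cumulative probabilities agree once both time and scale are multiplied by 12
p_years  <- pgamma(t_years,      shape = w.shape, scale = w.scale_years)
p_months <- pgamma(t_years * 12, shape = w.shape, scale = w.scale_months)
all.equal(p_years, p_months)

# Mean generation time is shape * scale
w.shape * w.scale_years    # in years
w.shape * w.scale_months   # in months
```

Any other quantity expressed in time units would need the same factor of 12 for the two runs to describe the same model.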

Too many iterations error

Hi,
I'm coming across an error and was wondering how to suppress this so I can run long MCMC chains. My chain length is 10000. Thank you in advance!
Error in probTTree(ttree$ttree, off.r2, off.p, pi, w.shape, w.scale, ws.shape, :
too many iterations, giving up!

Adding priors and/or postprocessing

Hello,
Do you have any suggestions on how to pass priors to TransPhylo? Something that can capture other information such as the geographic locations of the cases and penalize creating links between geographically separated cases, etc. Or is this something we can do in postprocessing e.g. take multiple possible transmission trees and choose ones that minimize our other constraints?

Thanks!

Generation time first to Sampling time?

Hello,

I am working with the TransPhylo program to infer transmission in some real TB clusters. When running the inference, it is necessary to set up a gamma distribution for the generation time (the delay from infection to transmission) and for the sampling time (the delay from infection to sampling). I have been wondering whether the sampling time is added to the generation time or not.

From the definition of both parameters, I understand that both start from the same time (the time at which an individual is infected) and, therefore, it might be better to establish a sampling distribution that is shifted X time to the right with respect to the generation time. I assume that one individual is more likely to infect another before they are detected and sampled. However, once an individual is identified, they are either treated or isolated, which greatly reduces their probability of infecting another individual.

It would be very helpful if you could clarify this question.

Thank you very much in advance and sorry for any inconvenience.
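
To make the comparison concrete, the two gamma distributions can be summarised side by side, since both delays are measured from the time of infection. The base-R sketch below uses parameter values borrowed from another issue on this page purely for illustration. `gamma_summary` is a hypothetical helper, and treating the two delays as independent draws is an assumption, not a statement about the TransPhylo model.

```r
# Mean, standard deviation and median of a gamma distribution
gamma_summary <- function(shape, scale) {
  c(mean   = shape * scale,
    sd     = sqrt(shape) * scale,
    median = qgamma(0.5, shape = shape, scale = scale))
}

gen  <- gamma_summary(shape = 1.3, scale = 10/3)  # generation time (w.shape, w.scale)
samp <- gamma_summary(shape = 1.1, scale = 2.5)   # sampling time (ws.shape, ws.scale)
print(rbind(generation = gen, sampling = samp))

# Monte Carlo estimate of how often a generation time draw falls before
# a sampling time draw, assuming independent draws
set.seed(1)
n <- 1e5
p_before <- mean(rgamma(n, shape = 1.3, scale = 10/3) <
                 rgamma(n, shape = 1.1, scale = 2.5))
p_before
```

Shifting the sampling distribution to the right, as proposed above, would show up here as a larger mean for the sampling time and a larger `p_before`.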

Error message when running inferTTree

Xavier,

When running inferTTree on the maximum clade credibility tree you supplied from Genomic Infectious Disease Epi..., I encounter the following error:

> tb_ttree <- inferTTree(tb_ptree, ws.shape = 1.1, ws.scale = 2.5, w.shape = 1.3, w.scale = 10/3)
[1] "errorThere"
[1] "errorThere"
[1] "errorThere"
Error in if (ex[cur] == 1) { : argument is of length zero

Do you know what is causing this? I tried diving into the code for inferTTree and the other functions it calls, but so far I've been unable to figure out where it's breaking. I've included the tree and the R code below to reproduce the error.

TransPhylo inferTTree Error.zip

Thanks,
Shane

Manipulating/edit plots

Hi Xavier,

Is there a way to manipulate or edit plots created by Transphylo?
For example this consTTree (link below), the scale bar is presumably just plotting for the dates I have sampled but the transmission tree clearly goes back beyond the sampled sequences. Secondly my tip labels are overlapping or cut off - this I can adjust by exporting as an svg and changing manually in Inkscape but is time consuming if there's a quicker way to do so in R.
The one I'd most like to be done in R is the time scale bar to make sure that's accurate.
Any help would be appreciated, is it a case of using ggplot (or similar) or can something be done within the Transphylo package?

Best wishes,
Abbi

(https://user-images.githubusercontent.com/108280328/226549891-e1dd79eb-170c-4668-9aec-df855b3deab1.png)

Samples dated earlier than actual sampling date in t-tree

Dear All,

I am experiencing an issue with the transmission tree I generated. It appears to place certain samples earlier than their actual sampling date (e.g. 2018 instead of 2021). The phylogenetic tree was dated using BactDating and all samples are dated by their sampling date. Is this normal behavior?

best regards,

Loïc

bactdating followed by transphylo

Hi,
It's me again; thanks for this lovely program. I have a few questions, if you please.

I'm wondering if it is correct to use the output of BactDating as the input for TransPhylo, by using "write.tree(result$tree,'tree.nwk')", where "result" is the output of BactDating.
In the example, w.shape=10 and w.scale=0.1 were used; however, in your paper, for the TB case, w.shape=1.3 and w.scale=0.3 were used instead. These two sets of parameters differ substantially, so I'm wondering how I can determine the generation time parameters for my own dataset, a group of very closely related (pairwise SNP < 20) Klebsiella pneumoniae strains. Which example should be used as the starting point?
In an outbreak analysis, multiple isolates were sampled from the same patient at different time points. Can I use them all in the TransPhylo analysis, or should I keep just one per patient on the tree?

Sorry for asking these questions, and many thanks to you.

Question about inferTTree parameters

Hi,

Quick question on some of the parameters that can be used within the inferTTree command, as I want to include startOff.r, startOff.p, startNeg and startPi.
I would just like to check I've understood what each of those parameters are so I can adjust accordingly, I'm very new to modelling and very much in at the deep end!

Off.r is the R0 number for the pathogen of interest and to clarify, is Off.p the probability of transmission from one individual to another?

Neg represents the average time of coalescence of two lineages - could I use the clock rate from my input dated BEAST tree for this?

Finally, pi = sample fraction which represents the proportion of population units that are selected in the sample - so the estimated number of individuals sampled across the phylogeny being investigated?

Thanks in advance for any help you can offer,
Abbi

underflow on large trees

Hi Xavier,

I've been running into some underflow issues when running TransPhylo on largish trees. I'm attaching an example along with a go I had at fixing it.

It seems that the alpha functions in the C++ code cause some issues with very low probabilities. Adding a few lines to keep things in log space seemed to fix the issue.

I've also noticed that the program spends a very large amount of time optimising the MCMC start point for some trees. Is it appropriate to set the optiStart option to FALSE for these?

Thanks for making a very useful package!

Best

Gerry

test_data.zip
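
As an illustration of what keeping things in log space means here, the standard log-sum-exp trick adds probabilities that are stored as logarithms without ever exponentiating the tiny values directly. The sketch below is base R, not the patch attached to this issue. `log_sum_exp` is a hypothetical helper and the log-probabilities are made up.

```r
# Sum probabilities that are stored as logs, without leaving log space
log_sum_exp <- function(log_x) {
  m <- max(log_x)
  if (!is.finite(m)) return(m)         # all -Inf (or an Inf): nothing to rescale
  m + log(sum(exp(log_x - m)))
}

log_p <- c(-1000, -1001, -1002)        # hypothetical tiny log-probabilities

naive  <- log(sum(exp(log_p)))         # exp() underflows to 0, so this is -Inf
stable <- log_sum_exp(log_p)           # finite, close to -999.59
c(naive = naive, stable = stable)
```

Subtracting the maximum before exponentiating guarantees that the largest term becomes exp(0) = 1, so the sum can never underflow to zero.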

error?

Last date format

I am trying to use your package with a dated tree built with IQ-TREE, and I cannot figure out how to format the last sample date correctly. On top of that, if I input the date in Y-M-D format, I get the following error:

Error in tr$edge.length[iedge] <- ptree[ptree[i, 3], 1] - ptree[i, 1] : 
  replacement has length zero

best regards,

Loïc
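
Elsewhere on this page dateLastSample is always given as a number (2017, 2013.3, or 60 for a tree measured in days), which suggests a numeric date on the same scale as the branch lengths rather than a Y-M-D string. Assuming branch lengths in years, a calendar date can be converted to a decimal year with base R. `decimal_year` is a hypothetical helper, not a TransPhylo function.

```r
# Convert a "YYYY-MM-DD" string to a decimal year, accounting for leap years
decimal_year <- function(date_string) {
  d     <- as.Date(date_string)
  year  <- as.integer(format(d, "%Y"))
  start <- as.Date(paste0(year, "-01-01"))
  end   <- as.Date(paste0(year + 1, "-01-01"))
  year + as.numeric(d - start) / as.numeric(end - start)
}

decimal_year("2017-01-01")   # start of the year, i.e. 2017
decimal_year("2017-07-02")   # roughly mid-2017
```

The resulting number could then be passed as dateLastSample, provided the tree's branch lengths are also in years.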

installation problem

My R version is 3.4.3, and the g++ version is 5.4.0. I tried to reinstall transphylo after updating R, but it wasn't successful. Here are the messages:

> devtools::install_github('xavierdidelot/TransPhylo')
Downloading GitHub repo xavierdidelot/TransPhylo@master
from URL https://api.github.com/repos/xavierdidelot/TransPhylo/zipball/master
Installing TransPhylo
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save  \
  --no-restore --quiet CMD INSTALL  \
  '/tmp/RtmpxG6MNN/devtools1676630f312f/xavierdidelot-TransPhylo-9c9a54d'  \
  --library='/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4'  \
  --install-tests 

* installing *source* package ‘TransPhylo’ ...
** libs
g++ -std=gnu++11 -I/usr/share/R/include -DNDEBUG  -I"/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include" -I"/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include"    -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -I/usr/share/R/include -DNDEBUG  -I"/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include" -I"/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include"    -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c probTTree.cpp -o probTTree.o
probTTree.cpp: In function ‘Rcpp::NumericVector wbar(double, double, double, double, double, double, double, double, double, double)’:
probTTree.cpp:152:20: error: ‘isnan’ was not declared in this scope
     if(isnan(out[i])) throw(Rcpp::exception("error!! NA value in calulating wbar."));
                    ^
probTTree.cpp:152:20: note: suggested alternatives:
In file included from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include/Rcpp/platform/compiler.h:100:0,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include/Rcpp/r/headers.h:48,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include/RcppCommon.h:29,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include/Rcpp.h:27,
                 from probTTree.cpp:17:
/usr/include/c++/5/cmath:641:5: note:   ‘std::isnan’
     isnan(_Tp __x)
     ^
In file included from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/lexical_cast/detail/inf_nan.hpp:35:0,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:63,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/lexical_cast/detail/converter_lexical.hpp:54,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/lexical_cast/try_lexical_convert.hpp:42,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/lexical_cast.hpp:32,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/math/tools/convert_from_string.hpp:15,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/math/constants/constants.hpp:13,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/math/special_functions/gamma.hpp:24,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/math/special_functions/beta.hpp:15,
                 from /home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/math/distributions/negative_binomial.hpp:48,
                 from probTTree.cpp:19:
/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/math/special_functions/fpclassify.hpp:606:14: note:   ‘boost::math::isnan’
 inline bool (isnan)(T x)
              ^
/usr/lib/R/etc/Makeconf:168: recipe for target 'probTTree.o' failed
make: *** [probTTree.o] Error 1
ERROR: compilation failed for package ‘TransPhylo’
* removing ‘/home/thkuo/R/x86_64-pc-linux-gnu-library/3.4/TransPhylo’
Installation failed: Command failed (1)

The same command (i.e. devtools) works on another machine of mine (R version 3.4.1; g++ version 4.9.2). Can you please suggest a solution?

Approach to select appropriate priors for generation and sampling time distributions

Hi Didelot,

I am using the infer_multittree function in the TransPhylo package in R to infer transmission in a group of Mycobacterium tuberculosis clusters from a specific region. However, I am not sure which distribution to use for the generation and sampling time distributions for all the clusters. To address this, I plan to run the inference individually for each cluster, varying the prior distributions for the generation and sampling times. Based on the literature, I have observed that the generation and sampling time distribution ranges from 1 to 2.5. Therefore, I will try different shape values within that range and distributions with different tails (scale parameter). I will examine how well the posterior distributions fit the prior distributions and use this information to choose the best-fitting generation and sampling time parameters for all the clusters.

My question is whether this approach is reasonable for estimating the generation and sampling time parameters. Once I have obtained the best values for all clusters, I plan to perform multi-inference with those values.

Thank you very much in advance for your help, and I apologize for any inconvenience.

Error "negative branch lengths" w/o negative branch lengths

Hi,

This is my input tree in newick format:

(VA13414_h:0.15657,(BK17487_h:0.09975,(VA2520_h:0.97287,VA15976_h:0.38081,(VA21335_h:0.19916,(TP1148_h:0.38815,VA24749_h:0.09725):1.29904):0.35961,(VA22695_h:0.48789,((VA25100_h:0.00000,VA27811_h:0.10130):0.48282,((UR8729_h:0.51740,(TP3995_h:0.28445,VA18593_h:0.26528):0.36710):0.52579,(VA22360_h:0.23612,(VA22374_h:0.00000,(VA6218_h:0.14273,VA8292_h:0.20844):0.35076):0.23612):0.00967):0.14668):0.08447):0.12016):0.31647):0.43191):0.10000;

Then I run

library('TransPhylo')
library(ape)

ptree <- ptreeFromPhylo(read.tree('tree.nwk'), dateLastSample=2013.3)
record <- inferTTree(ptree, dateT=2014)

Error in if (ptree$ptree[ptree$ptree[i, j], 1] - ptree$ptree[i, 1] < 0) stop("The phylogenetic tree contains negative branch lengths!") :
  argument is of length zero

p<-phyloFromPTree(ptree)

Error in tr$edge.length[iedge] <- ptree[ptree[i, 2], 1] - ptree[i, 1] :
  replacement has length zero

However, the tree does not contain negative branch lengths and the format seems to be right compared to the simulated data in the tutorial.

I tried the output from FastTree, RAxML and TreeTime, all w/ the same error. What could be the cause?

Thank you!
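
One thing worth checking in the tree above: the node that contains VA2520_h and VA15976_h has four children, and two tip branches have length 0.00000. Other scripts on this page call multi2di and enforce a minimum branch length before ptreeFromPhylo. Whether this is the cause of the error is not confirmed in the thread. The sketch below uses only ape, on a small made-up tree with the same two features, and the one-day minimum assumes branch lengths in years.

```r
library(ape)

# Hypothetical tree with a three-way split and a zero-length branch
tr <- read.tree(text = "(A:0.1,(B:0.2,C:0.3,(D:0.1,E:0):0.2):0.1);")
is.binary(tr)                 # FALSE: one node has more than two children
any(tr$edge.length <= 0)      # TRUE: tip E sits on a zero-length branch

# Resolve multifurcations, then enforce a minimum branch length of one day
tr2 <- multi2di(tr)
tr2$edge.length <- pmax(tr2$edge.length, 1/365)
is.binary(tr2)                # TRUE
min(tr2$edge.length)
```

Running the same two checks on the real tree before calling ptreeFromPhylo would show whether it needs this preprocessing.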

Choosing a plausible Gamma prior

Hi,

We have run TransPhylo w/ a couple of different Gamma params and looked at convergence, the predicted transmission events, their dates as well as the sampling density. Of course, choosing different priors yields very different results.

I wonder whether you have any advice on how to perform some sort of sensitivity analysis, or how to choose appropriate Gamma params (shape, scale)?

For example, we have epidemiological data for a hospital outbreak on which patient was on which ward w/ dates. This guides us somewhat on parameter choice, because we can easily spot when the model generates predictions that deviate a lot from this data (like patients being infected before being admitted to the hospital).

Also, we tried fitting a Gamma distribution via maximum likelihood to the minimum pairwise distance between tips of the dated tree. But this does not seem to be the right thing either.

Any advice would be greatly appreciated.

Temporal signal detection

Thank you very much for your excellent transmission analysis tool!

I have collected 23 isolates from a hospital infection outbreak and want to know the transmission relationship between the different patients.

A phylogenetic tree was built from the recombination-free whole-genome alignment using IQ-TREE, and BactDating was used to assess the temporal structure. However, no temporal signal was detected; the R-squared was very small (<0.1).

My question is: given this situation, can I continue to use these 23 isolate genomes to build a tip-dated tree with BEAST and then build a transmission tree using TransPhylo?

Thanks very much!
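
For context, an R-squared like the one mentioned above usually comes from a root-to-tip regression, in which the distance from the root to each tip is regressed on the sampling date. The rough sketch below uses ape and base R on simulated inputs. The tree and dates are random placeholders, so the resulting value is meaningless, and this is not BactDating's own implementation.

```r
library(ape)
set.seed(1)

n     <- 23
tree  <- rtree(n)                  # placeholder rooted tree with random branch lengths
dates <- runif(n, 2015, 2020)      # placeholder sampling dates, in tip.label order

# node.depth.edgelength() returns depths for all nodes; the first n are the tips
root_to_tip <- node.depth.edgelength(tree)[seq_len(Ntip(tree))]

fit   <- lm(root_to_tip ~ dates)
r2    <- summary(fit)$r.squared
slope <- unname(coef(fit)["dates"])  # crude rate estimate: distance per unit time
c(r.squared = r2, slope = slope)
```

The result depends on where the tree is rooted, so a low value is worth re-checking under alternative rootings before drawing conclusions.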