Code Monkey home page Code Monkey logo

Comments (8)

matsen avatar matsen commented on June 3, 2024

My first thought was to throw geographic information in a second data partition. IQTree allows this: http://www.iqtree.org/doc/Complex-Models#partition-models . There aren't currently any completely custom models, but you can supply a custom amino acid transition matrix, and so as a proof of concept we could use 20 states.

If it gives reasonable results, I very much suspect that the IQTree folks would accept a PR to implement more general custom models, given the importance of your application.

@afmagee points out, correctly, that this has an inherent difference to the types of discrete-trait migration models we are using here, versus those that BEAST does. Namely, in BEAST they are using a clock proportional to calendar time on the time tree, whereas here migration would happen in units of mutation-time.

My response to that issue is that you're re-doing all the branch lengths in TreeTime anyway, and this would just be a way to get a branch for cases where we know things should cluster together.

We had a short meeting with @trvrb about this. He thought it was not-crazy, but I'd like to hear your thoughts.
I acknowledge that this doesn't solve your larger problem of polytomy resolution, but we're thinking about that too.

from treetime.

afmagee avatar afmagee commented on June 3, 2024

As @matsen pointed out to me, one possible way to resolve polytomies is to simply run neighbor joining. I came up with a back of the envelope calculation that might allow one to combine neighbor joining distances between two different data sources, such as DNA and geography, and then tried it out on some simulated datasets.

The attached PDF has some more details on the experiments. I think the key takeaways are:

  1. Plain NJ might be able to resolve polytomies.
  2. Combining nucleotide and geography data at the level of unrooted trees could work (that is, the suggestion to add geography at the IQTree level could work despite my previous objections).

One could also consider using bootstrapping to generate a set of possible resolutions to the polytomy and considering those as well.

unrooted_trees_with_nucleotides_and_mugration.pdf

from treetime.

rneher avatar rneher commented on June 3, 2024

The main puzzle I always faced here is how to combine temporal and geographical constraints. With genetic constraints, we assume a strict hierarchy (genetics first, temporal constraints affect topology only in absence of mutations). But geography and time don't have a clear hierarchy as far as I can tell. So somehow I think we need an objective function that listens to time and geo constraints...

from treetime.

afmagee avatar afmagee commented on June 3, 2024

@matsen and I were thinking about geography as data in the same way that the genetic sequences are data. In a full BEAST analysis, with nucleotide data and a geographic character, and not using a structured coalescent model, the geographic character is treated another site with its own substitution model and clock rate. The clock rate helps determine its weight in the likelihood against the nucleotide sites.

If geography incorporated in the initial tree inference, before running TreeTime, then it might reduce the number/size of polytomies that exist. Otherwise, it would have to be incorporated later, probably at the polytomy resolution stage.

If geography evolves enough faster than the nucleotides, then given both a single mutation and a single change in geography, the mutation should be weighted more highly by the likelihood to resolve a relationship. In this case, using geography as data only the polytomy-breaking stage, to favor some relationships over others, might be close to equivalent to having used it to infer the tree as well as during a TreeTime analysis. I think in general the rate of geographic change (the mugration rate) should be higher, so this might be viable as an approximation.

from treetime.

matsen avatar matsen commented on June 3, 2024

We have

  • time
  • number of mutations that happened on branches (these are, practically-speaking, terminal mutations)
  • sampling location

from treetime.

afmagee avatar afmagee commented on June 3, 2024

I think that there may be a relatively fast way to pull out a subtree from a polytomy, and in doing so increase the likelihood. This is a partial solution to the overall problem, but it may be a useful starting point.

I'm imagining that we have a polytomy somewhere in the tree with n children. If we can identify that m of these children are identical to the parent node (which is an O(n) comparison), then what we essentially want to do is minimize the sum of the branch lengths involving these nodes. The likelihood for a single branch with identical mother and daughter sequences is maximized when the branch is length 0, and should decrease monotonically. There are some number of identical sequences with currently non-zero branch lengths between them, so if we shorten those branches we should have a better likelihood.

I think that an agglomerative approach would work to make a new subtree for the m identical sequences. One would need the time t_i of each node. The algorithm proceeds as follows, where node p is the parent of the polytomy, time increases backwards from the present, and l_i is the length of the branch subtending node i:

Initialize a set of active nodes with the m nodes.

while ( size(active) > 1 ) { // this would be if( size(active) > 2 ) if m == n
    choose the pair of nodes i,j with the most similar ages
    define d = t_p - max(t_i,t_j)
    remove i,j as children of p
    add a new node k that is a child of p with branch length l_p = d
    give node k children i,j with branch lengths l_i - d and l_k - d
    remove i,j from the active set
    add k to the active set
}

Each iteration of the algorithm reduces the sum of branch lengths with no mutations (by an amount d), and in doing so should increase the likelihood. When done, it is possible that one would need to account for the possible placements of the n-m non-identical nodes into the subtree. I don't know that there is a guarantee that they would not best be placed somewhere in the new subtree.

Here's a cartoon of the process with m=4 and n=6, where the branches with tick-marks indicate that the child node is not identical to the parent (there are mutations) and the bare branches indicate that the child node is identical (there are no notations).
Image from iOS(1)

from treetime.

matsen avatar matsen commented on June 3, 2024

Thanks, @afmagee ! I might also suggest two things:

  • this could merely supply a ranking of pairs of nodes to try, and how far down that list would be up to TreeTime
  • such a list could also incorporate geography in a very simple-minded way, just biasing us towards parsimonious geographic transitions in the absence of any signal to the contrary.

from treetime.

rneher avatar rneher commented on June 3, 2024

Thanks! I am just trying to think under what conditions we might actually want to group the nodes with >0 mutations with some that have none. But the case of subset of identical nodes is definitely a super useful one to explore.

from treetime.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.