Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break tsinfer polytomies using mutational density #363

Open
hyanwong opened this issue Jan 11, 2024 · 4 comments
Open

Break tsinfer polytomies using mutational density #363

hyanwong opened this issue Jan 11, 2024 · 4 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Jan 11, 2024

If we had a single dated tree with a polytomy and mutations on the branches, we could probably use the relative rate of mutation along the branches to break the polytomy. E.g. in a polytomy like the following, it would probably be reasonable to split the polytomy by grouping together 2 and 3
Screenshot 2024-01-11 at 22 17 16. Or more formally. we could probably calculate the mutational likelihood of the 15 possible bifurcating topologies and pick the most likely.

Another more heuristic approach would be to calculate the average mutational rate between each pair of tips (i.e. the number of mutations between them divided by the branch-length distance), and then do something like UPGMA on the pairwise rates.

It's unclear to me how to extend this to a polytomy in a tree sequence, which could have different numbers of child edges coming and going. But generally, you could imagine the children of a polytomy "pulling" or "pushing" the parent node to differing strengths, measured by the difference between the parent & child posterior distributions and the edge mutational area separating them. It should be possible to work out the bets way to resolve this tension by creating new internal nodes that break the polytomy.

@nspope
Copy link
Contributor

nspope commented Jan 12, 2024

I can think of a simple model-based way to approach this, using EP and the existing variational approximation:

  1. Consider a node $k$. If $k$ has two children, skip it. Otherwise, introduce a hypothetical node $j$ that is an "unobserved" child of $k$.
  2. Assume that the only parent of $j$ is $k$. This makes the problem local.
  3. Use a 2-state mixture model, where "observations" are the edges beneath $k$, and hidden states represent (A) edges where $k$ is the parent; (B) edges where $j$ is the parent.
  4. This could be solved by EP: iterate over edges that are beneath $k$. For each edge, the EP update needs to sum over the two possibilities, while marginalizing out parent/child node ages. This is pretty similar to the current scheme, and requires an additional parameter per edge, call it $\rho_i$ for the $i$th edge, that is the probability that $j$ is the parent.
  5. On convergence, use some threshold $p$ to sort edges. Edges where $\rho_i < p$ are kept with $k$ as the parent; edges where $\rho_i > p$ are assigned a new parent which is $j$. A new edge with 0 mutations is added between $k$ and $j$, with a span that is the union of the span of all $j$'s children.
  6. Repeat steps 1-5 on $j$ rather than $k$ (e.g. with the subset of edges that were assigned to $j$ on the last go-round). And so on, until the polytomy is "resolved" and the remaining edges all have $\rho_i < p$.
  7. Because the problem is local, we don't have to rebuild/resort the tree sequence after each node is visited: we can do a single pass through the tree sequence to find polytomies, then resolve each polytomy separately (which could be parallelized), then modify the tree sequence to introduce the new daughter nodes/edges.

This wouldn't resolve polytomies into binary trees -- instead, it'd resolve polytomies into a set of multiple, smaller polytomies. In the pic you included, we'd probably end up with something like ((2, 3), 0, 1) so that node 4 has three children instead of four. But this could potentially help with reducing inconsistencies that impact dating (and it'd be scalable).

Do you have an algorithmic method for introducing polytomies, @hyanwong? If we have a small test problem (a mini-tree-sequence with a single non-trivial polytomy that involves 10 nodes or so), then I can try to implement the above and see if it'd work

@hyanwong
Copy link
Member Author

Really neat! I'm unclear how this would deal with edges that span a single parent node but where that node has different numbers of children in different regions of the genome. E.g. what happens if we have a node with 3 children at one end of the genome, 2 in the middle, and 3 again at the end?

Re algorithmic methods for generating polytomies: no, but I guess it would be quite easy to write one. I'll have a ply. We do have a lot of inferred ones with polytomies, of course, but you are right that knowing the ground truth would be much better.

@nspope
Copy link
Contributor

nspope commented Jan 12, 2024

I'm unclear how this would deal with edges that span a single parent node but where that node has different numbers of children in different regions of the genome. E.g. what happens if we have a node with 3 children at one end of the genome, 2 in the middle, and 3 again at the end?

I'm probably underestimating how complicated this is, but I was thinking of ignoring position entirely. For example, the node you describe could have sequence of children (A, B, C) -> (B, D) -> (A, C, D). The algorithm I describe would "cluster" A,B,C,D into two groups, and assign the latter group a new node as a parent. Call the new node E, and say B,D get assigned to it. Then the sequence of children for the original node would be (A, E, C) -> (E) -> (A, C, E). And the sequence of children for E (over the same spans) would be B -> (B, D) -> D. I think this always works out, because if we assume that E has a single parent (the original node) then the transmission paths are the same.

@hyanwong
Copy link
Member Author

I've just put a naive method for collapsing bifurcations into polytomies at tskit-dev/tskit#2885

I guess this could be used to test if we can recover extra topology from mutation density.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants