Nodes with larger polytomies in the Viridian ARG #215
Comments
I can't for the moment think of any easy tweaks that could result in a better breaking-up of polytomies, other than re-running the matching over the time periods of interest with fewer excluded samples, but perhaps there are other things we could do? As there are only a few of these huge 1,000+ child polytomies, it may be worth zooming in on one or two and seeing what could have broken them up? |
We could try widening the retrospective window to 60 days? It would be good to identify a specific example here so I can look back through the log and see if we could have done better. |
I updated the plot to show node numbers. The most egregious examples are these nodes in the long ARG: 70. Is that enough to identify the examples? I'm assuming you can check that these node IDs all have >3000 unique children in the new long ARG? |
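A minimal sketch of how those child counts could be verified, assuming `ts` is the long ARG loaded with tskit; the file name and the node list (truncated to just 70 in this transcript) are placeholders:

```python
import numpy as np
import tskit

# Hypothetical check of the child counts: the file name and node list are
# placeholders, not taken from the thread.
ts = tskit.load("viridian_long.trees")
edges = ts.tables.edges
for u in [70]:
    n_children = len(np.unique(edges.child[edges.parent == u]))
    print(f"node {u}: {n_children} unique children")
```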
That could make a difference, although I expect that the nodes which could potentially break the large polytomies are quite close in time to the polytomies themselves, so I wouldn't be surprised if we still ended up with substantial polytomies. |
I think we're getting bigger polytomies because the data quality is better. Using this code I've had a look at the distributions of branch lengths and numbers of mutations on those nodes (I'll add it to the TreeInfo at some point):

```python
import matplotlib.pyplot as plt
import sc2ts

# Assumes `ts` is the ARG tree sequence and `ti` is the corresponding
# sc2ts TreeInfo (used for the mutations_is_reversion array).
def examine_polytomy(focal):
    tree = ts.first()
    node_mutations = {}
    for v in tree.children(focal):
        edge = ts.edge(tree.edge(v))
        # assert edge.left == 0 and edge.right == ts.sequence_length
        node_mutations[v] = sc2ts.node_mutation_descriptors(ts, v)
    mutations_above_focal = len(sc2ts.node_mutation_descriptors(ts, focal))
    num_reversions = 0
    num_mutations = 0
    for mut_map in node_mutations.values():
        for mut_id in mut_map.values():
            num_mutations += 1
            num_reversions += ti.mutations_is_reversion[mut_id]
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    fig.suptitle(
        f"Polytomy details for node {focal}. Has {mutations_above_focal} mutations above, "
        f"{num_mutations} below ({num_reversions} reversions)"
    )
    ax[0].hist(
        [len(muts) for muts in node_mutations.values()],
        range(7), rwidth=0.7, align="left",
    )
    ax[0].set_xlabel("Number of mutations on child branches")
    ax[1].hist([tree.branch_length(v) for v in tree.children(focal)], range(250), label="all")
    ax[1].hist(
        [tree.branch_length(v) for v in tree.children(focal) if len(node_mutations[v]) == 0],
        range(250),
        label="exact matches",
    )
    ax[1].set_xlabel("Length of child branches")
    ax[1].legend()
```

Here's what we get for node 70. Notice that for 2500 of the children there are zero mutations. There are 2500 samples that were added into the dataset that are exact matches of that node. That's not an "egregious" polytomy, it's just a reflection of the actual thing that happened. Of the remaining children, over 1000 have one mutation, all of which are unique. There's possibly some more mutation sharing in the other children, but not a great deal I would guess, because the mutation overlap heuristics will have had a lot of goes at fixing this up. So, in my opinion node 70 is correctly reflecting biological reality by having a large number of children and there's nothing to be done. |
Yes, thanks for looking into that. That does seem very conclusive in that case. |
I'm not sure what to make of these three: In all these cases anyway, it's the number of exact matches and unique one-mutation differences that dominates, so I don't think the size of a polytomy tells us very much. Large polytomies just flag large outbreaks. Probably what's more interesting is how we're doing on branches with lots of mutations leading to major clades. That's a likely place where a lack of resolution could have an impact. |
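As a rough illustration of that last point (not code from the thread), one could rank branches by the number of mutations immediately above each node and report how many samples each node subtends, again assuming `ts` is the ARG tree sequence:

```python
import collections

# Rank nodes by the number of mutations on the branches above them, and
# report the number of samples each subtends in the first tree.
tree = ts.first()
muts_above = collections.Counter(mut.node for mut in ts.mutations())
for u, k in muts_above.most_common(20):
    print(f"node {u}: {k} mutations above, {tree.num_samples(u)} samples beneath")
```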
Digging a bit more into the interesting one above, it does look suspiciously like a bunch of correlated time travellers have tripped us up.
So the strains are
What looks most suspicious about these is that we have a very different set of Pango lineages associated with the 20 strains that form the retrospective group:
So, I'm inclined to think we have an unlucky combination of time travellers that happen to occur within the time window. I think it's important that we try to filter these out a bit better, so any thoughts on how we can improve the retrospective heuristics would be much appreciated. Currently we add a batch of samples retrospectively if:
The last clause was added to catch the case where a bunch of samples are all misdated together, so you get a large bunch of time travellers on the same day. Should we also add something about the different Pango lineages we observe as a filter? |
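One way such a Pango-based filter might look, purely as a sketch: the function name, threshold, and inputs below are hypothetical, not part of sc2ts:

```python
import collections

def lineage_diversity_ok(group_lineages, max_minor_fraction=0.2):
    """Hypothetical filter: reject a retrospective group whose members carry
    too many different Pango lineages. `group_lineages` is the list of
    Viridian Pango calls for the candidate group; the threshold is purely
    illustrative."""
    counts = collections.Counter(group_lineages)
    majority = counts.most_common(1)[0][1]
    return 1 - majority / len(group_lineages) <= max_minor_fraction
```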
What if we just see whether the sample sequences cluster by mutations directly, instead of relying on the Viridian Pangolin labels (which kind of means we are cheating a bit, as we are using signals in the Viridian UShER tree)? Time travellers bundled together in a retrospective group should likely have different mutational signatures, which is what you are saying above @jeromekelleher (I think?). I tried to look at the mutations in the node metadata for the strains, but could only find eight of them by iterating through the nodes. (We are looking at

I suppose if we were to examine a "good" retrospective group, then we should expect to see that there is a set of mutations that are shared. If this is the case, then I guess we could think about thresholds. |
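A sketch of the direct mutation-clustering idea, under the assumption that we can obtain each candidate strain's set of differences from the reference; the helper and its inputs are hypothetical:

```python
import itertools

def pairwise_sharing(sample_mutations):
    """Hypothetical sketch: `sample_mutations` maps each strain in a candidate
    retrospective group to its set of (site, derived_state) differences from
    the reference. Genuine clusters should show substantial overlap; a bundle
    of unrelated time travellers should not."""
    for a, b in itertools.combinations(sample_mutations, 2):
        shared = sample_mutations[a] & sample_mutations[b]
        union = sample_mutations[a] | sample_mutations[b]
        jaccard = len(shared) / len(union) if union else 1.0
        print(a, b, f"shared={len(shared)}", f"jaccard={jaccard:.2f}")
```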
That is a good point - we needed 209 mutations to add the group of 20 here, with a tree of depth 2. Tree building did not go well in this case. So we could try setting some threshold on tree 'quality' before insertion? |
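One possible form of such a threshold, purely illustrative (the cutoff is made up, and this is not the sc2ts implementation): compare the number of mutations needed to attach the group against the group size, which would flag the 209-mutations-for-20-samples case above:

```python
def group_quality_ok(num_mutations, group_size, max_mutations_per_sample=3):
    """Illustrative check: if attaching a retrospective group needs far more
    mutations than it has samples (209 mutations for 20 samples is ~10.5 per
    sample), flag or reject it. The cutoff is a placeholder."""
    return num_mutations / group_size <= max_mutations_per_sample
```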
We could also just try making the required group size larger. We were unlucky here in that we got about 20 time travellers on the same day, within a few days of a bunch of others. I think I'll try setting the required group size to 100 and the window to something like 20 days, and see if this smooths out the branch length distribution. I'd much rather lose a bit of resolution at the start of a clade than have it appear too early, muddled in with much later outbreaks (which is what's happening here, I think). |
From Slack:
The code for producing it is very simple, just involving two calls to np.unique:
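A guess at what those two calls might look like, assuming the plot is the distribution of the number of children per node and `ts` is the ARG tree sequence:

```python
import numpy as np

# Count the number of child edges per parent node from the edge table,
# then tabulate how many nodes have each child count.
parents, children_per_node = np.unique(ts.tables.edges.parent, return_counts=True)
polytomy_size, num_nodes = np.unique(children_per_node, return_counts=True)
```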