Visualisation of simplified backbone phylogenies #222

Open
szhan opened this issue Sep 25, 2024 · 4 comments

Comments

Collaborator

szhan commented Sep 25, 2024

A simple way to visually compare the backbones of the Viridian UShER tree and our pandemic-scale ARG is probably to use the Pango lineage roots, excluding the Pango recombinants. The Pango lineage roots are already labelled in the UShER tree, but it is trickier to find the corresponding nodes in our ARGs. Once those nodes are identified, we could simplify down to only those nodes (n = 2,131). For a cleaner view, we could also exclude the Pango lineages that are less relevant evolutionarily or epidemiologically.
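As a rough sketch of the filtering step described above: recombinant lineages can be dropped by name (Pango recombinants start with "X"), and a sample-count threshold is one possible stand-in for "relevance". The `lineage_counts` dictionary, the `select_backbone_lineages` helper, and the threshold are all hypothetical and only illustrate the idea.

```python
# Hypothetical input: Pango lineage name -> number of samples with that label.
lineage_counts = {
    "B.1.1.7": 1200,
    "BA.2": 950,
    "XBB.1.5": 400,
    "B.1.36.9": 3,
    "unknown": 17,
}


def select_backbone_lineages(counts, min_samples=1):
    """Return non-recombinant lineages with at least `min_samples` samples."""
    return sorted(
        name
        for name, n in counts.items()
        # Pango recombinant lineages are named with a leading "X".
        if not name.startswith("X") and name != "unknown" and n >= min_samples
    )


all_roots = select_backbone_lineages(lineage_counts)
# Assumption: sample count is used here as a crude proxy for relevance.
major_roots = select_backbone_lineages(lineage_counts, min_samples=100)
print(all_roots)
print(major_roots)
```

Any other relevance criterion (for example a curated list of lineages of interest) could replace the count threshold without changing the rest.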

Collaborator Author

szhan commented Sep 25, 2024

This can similarly be done for the global, all-time Nextstrain tree. Instead of using the Pango lineage labels, we would use the Nextstrain clade definitions. I think the easiest way, at least for a first pass, may be to use the nucleotide definitions of each Nextstrain clade to identify the corresponding node in our ARGs. We should see the same clade hierarchy.

Some useful files for this analysis are here:
https://github.com/nextstrain/ncov/blob/master/defaults/clade_hierarchy.tsv
https://github.com/nextstrain/ncov/blob/master/defaults/clades_who.tsv
https://github.com/nextstrain/ncov/blob/master/defaults/clades.tsv
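A minimal sketch of the first-pass idea, using only toy data: read clade-defining nucleotide changes from a TSV, then pick the shallowest node whose accumulated mutations include all of them. The column layout (`clade`, `gene`, `site`, `alt`) is an assumption about `clades.tsv`, and the clade names, sites, tree, and helper functions below are invented for illustration. Reversions and repeated mutations at a site are ignored.

```python
import csv
import io

# Toy excerpt in the assumed layout; these are not real clade definitions.
CLADES_TSV = (
    "clade\tgene\tsite\talt\n"
    "cladeA\tnuc\t100\tT\n"
    "cladeB\tnuc\t100\tT\n"
    "cladeB\tnuc\t200\tG\n"
)


def read_clade_definitions(text):
    """Map each clade to its set of (site, derived base) nucleotide changes."""
    defs = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["gene"] == "nuc":
            defs.setdefault(row["clade"], set()).add((int(row["site"]), row["alt"]))
    return defs


# Toy tree: child -> parent, plus the changes on the branch above each node.
parent = {1: 0, 2: 1, 3: 2, 4: 1}
branch_mutations = {0: set(), 1: {(100, "T")}, 2: {(200, "G")}, 3: set(), 4: {(300, "A")}}


def accumulated_mutations(node):
    """Union of the changes on every branch from `node` up to the root."""
    muts = set()
    while node is not None:
        muts |= branch_mutations[node]
        node = parent.get(node)
    return muts


def depth(node):
    """Number of edges between `node` and the root."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d


def find_clade_node(definition):
    """Shallowest node carrying every clade-defining change, or None."""
    candidates = [u for u in branch_mutations if definition <= accumulated_mutations(u)]
    return min(candidates, key=depth) if candidates else None


definitions = read_clade_definitions(CLADES_TSV)
clade_nodes = {clade: find_clade_node(d) for clade, d in definitions.items()}
print(clade_nodes)
```

In this toy case the node for `cladeB` descends from the node for `cladeA`, which is the kind of nesting the clade hierarchy file could be used to check.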

Collaborator Author

szhan commented Sep 26, 2024

Also just encountered this list of mutations in the founder sequences of the Nextstrain clades, assembled by Richard Neher. Linking the list here in case it comes in handy.

https://raw.githubusercontent.com/neherlab/SC2_variant_rates/cd6e016a511098123b6ce9ed874f58a7b789b34c/data/clade_gts.json

Collaborator

hyanwong commented Oct 11, 2024

I did that in a simplistic way, by taking the MRCA of all the samples labelled as PangoNNN as the origin of that lineage, and collapsing those nodes. This is highly sensitive to errors in lineage designation, but easy to do. It does lead to a tree in which many Pango lineages share the same origin node, and in which many large polytomies are retained.

It's interesting that simplifying to these lineages means that we only have 5 trees, so presumably 4 recombination nodes. This is few enough that we could probably look at them by hand to check how believable they are (presumably, not very), and what might be triggering a recombination at those points.

```python
import collections

# Assumes `ts` (the ARG as a tskit tree sequence) and `ti` (an object exposing
# `pango_lineage_samples`: lineage name -> sample node IDs) are already loaded.
tree = ts.at(21563)  # start of spike
pango_mrcas = {}
node_labels = collections.defaultdict(list)
for p, samples in ti.pango_lineage_samples.items():
    # Skip recombinant ("X...") lineages and unlabelled samples.
    if not p.startswith("X") and p != "unknown":
        if len(samples) == 1:
            pango_mrcas[p] = samples[0]
        else:
            pango_mrcas[p] = tree.mrca(*samples)
        node_labels[pango_mrcas[p]].append(p)
# Several lineages can share one MRCA node, so join their names into one label.
node_labels = {k: "/".join(v) for k, v in node_labels.items()}
# filter_nodes=False keeps the original node IDs, so node_labels stays valid.
sts = ts.simplify(list(set(pango_mrcas.values())), filter_nodes=False, keep_unary=True)
print("ARG simplified to pango non-X lineages has", sts.num_trees, "trees")
sts.at(21563).draw_svg(
    size=(2500, 1000),
    node_labels=node_labels,
    style=".leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)} .node text {font-size: 9px}",
    omit_sites=True,
)
```
[Image: Screenshot 2024-10-11 at 14 11 38]
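To look at the few breakpoints by hand, one option is to list which nodes change parent between adjacent local trees. The sketch below uses a hypothetical stand-in for the local trees (plain `(left, right, {child: parent})` tuples with invented coordinates); with tskit the same information could be gathered by iterating over `sts.trees()` and reading `tree.interval` and `tree.parent(u)`.

```python
# Hypothetical stand-in for the local trees: (left, right, {child: parent}).
local_trees = [
    (0, 1000, {1: 0, 2: 1, 3: 1}),
    (1000, 5000, {1: 0, 2: 1, 3: 2}),
    (5000, 9000, {1: 0, 2: 0, 3: 2}),
]


def parent_changes(trees):
    """List (breakpoint, node, old_parent, new_parent) for adjacent tree pairs."""
    changes = []
    for (_, _, prev), (left, _, cur) in zip(trees, trees[1:]):
        for child in sorted(prev.keys() | cur.keys()):
            if prev.get(child) != cur.get(child):
                changes.append((left, child, prev.get(child), cur.get(child)))
    return changes


changes = parent_changes(local_trees)
for breakpoint, node, old, new in changes:
    print(f"at {breakpoint}: node {node} moves from parent {old} to parent {new}")
```

Each reported node is a candidate to inspect: the lineages labelling it and its two parents indicate which part of the backbone the putative recombination connects.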

@hyanwong
Collaborator

Jerome and I thought of another way, of intermediate hackiness: we could use Ana's lineage imputation method, and simply find the earliest node for each imputed Pango lineage. We could test this by looking at the proportion of times that the standard lineage-defining mutations occur above this node (NB: if it is a unary node, we should include the nodes below it too).
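One possible reading of that test, sketched on a toy tree: for a candidate origin node, collect the mutations on the branches from that node up to the root, optionally add those on the chain of unary nodes directly below it, and report the fraction of the lineage-defining mutations found. The tree, mutation names, and helper functions are hypothetical; this is not Ana's imputation method, only the check applied to its output.

```python
# Toy tree: child -> parent. Nodes 1 and 2 are unary; node 3 has two children.
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 3}
children = {}
for child, par in parent.items():
    children.setdefault(par, []).append(child)

# Mutations on the branch above each node (invented names).
branch_mutations = {1: {"A100T"}, 2: {"C200G"}, 3: {"G300A"}, 4: set(), 5: set()}


def mutations_above(node):
    """Mutations on the branch above `node` and above all of its ancestors."""
    muts = set()
    while node is not None:
        muts |= branch_mutations.get(node, set())
        node = parent.get(node)
    return muts


def mutations_below_unary(node):
    """Mutations along the chain of single-child nodes descending from `node`."""
    muts = set()
    while len(children.get(node, [])) == 1:
        node = children[node][0]
        muts |= branch_mutations.get(node, set())
    return muts


def proportion_defining_found(node, defining, include_unary=True):
    """Fraction of lineage-defining mutations found at or around `node`."""
    found = mutations_above(node)
    if include_unary:
        found |= mutations_below_unary(node)
    return len(defining & found) / len(defining)


defining = {"A100T", "C200G", "G300A", "T400C"}
strict = proportion_defining_found(2, defining, include_unary=False)
relaxed = proportion_defining_found(2, defining)
print(strict, relaxed)
```

Here node 2 is unary, so including the branch below it picks up one more defining mutation; averaging this proportion over lineages would give a single summary of how well the chosen nodes match the standard definitions.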
