Origins of Alpha #225

Open
hyanwong opened this issue Oct 10, 2024 · 7 comments

@hyanwong
Collaborator

hyanwong commented Oct 10, 2024

This would be worth a line or two in the main text, and some digging for the supplementary. In the "maskreg-psv2-v1-mm_4-f500-clustloc-mrm_2-rw_10-mgs_10-2021-01-28" tree, it appears that the first of the Alpha mutations can be resolved to a match against some samples sequenced in Cambridge (suspiciously close, geographically, to Kent, the origin of the Alpha outbreak). In particular, these samples, which are all B.1.1:

{4598: 'ERR4413600', 6226: 'ERR4460507', 7812: 'ERR4460993', 1149: 'ERR4458709', 1150: 'ERR4458827', 8457: 'ERR4461558'}
[Image: Screenshot 2024-10-10 at 16 39 21]

This is so early in the pandemic that we can easily re-run the matching up to that point. @jeromekelleher points out that node 4725 here is likely to be caused by a reversion push, so it is a result of the tree-building part of the sc2ts algorithm. It has 3 exact matches:

Node(id=4725, flags=4194304, time=318.250002625, population=-1, individual=-1, metadata={'sc2ts': {'date_added': '2020-03-31', 'num_exact_matches': 3, 'sites': [22909]}})
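As a quick check on the flags value printed for node 4725, here is a minimal sketch that lists which bits are set. The helper name `set_bits` is hypothetical. The arithmetic is certain (4194304 is a single bit, bit 22, and tskit's sample flag is bit 0, so this node is not a sample); which meaning sc2ts assigns to bit 22 is not stated in this thread, so reading it as the reversion-push marker is an assumption to check against the sc2ts source.

```python
def set_bits(flags: int) -> list[int]:
    """Return the positions of the bits that are set in a node's flags value."""
    return [i for i in range(flags.bit_length()) if (flags >> i) & 1]


flags = 4194304  # flags value reported for node 4725 above
print(set_bits(flags))

# tskit marks sample nodes with bit 0; it is not set here.
print("is sample:", bool(flags & 1))
```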

@szhan dug out a paper which also finds a Cambridge sample near the root of the alpha-defining lineages: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9752794/. We exclude their sample (CAMC-946506 / ERR4638413) because it has too much missing data (log message: "sc2ts.inference Filter ERR4638413: missing=2393 > 500"), but it could be interesting to match it in afterwards.
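For reference, a missing-data filter of the kind that log message describes could look roughly like the sketch below. This is not the sc2ts implementation: the function names are hypothetical, and counting every character other than A, C, G or T as missing is an assumption. Only the threshold of 500 and the "missing > 500 means excluded" direction come from the log line.

```python
def count_missing(sequence: str) -> int:
    """Count bases that are not an unambiguous A, C, G or T (assumed definition)."""
    return sum(base.upper() not in "ACGT" for base in sequence)


def passes_missing_filter(sequence: str, max_missing: int = 500) -> bool:
    """Keep a sequence only if its missing-base count does not exceed the limit."""
    return count_missing(sequence) <= max_missing


# Synthetic example: 2393 Ns, the count reported for ERR4638413.
toy = "ACGT" * 100 + "N" * 2393
print(count_missing(toy), passes_missing_filter(toy))
```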

Note that the taxonium tree below matches these 6 samples (circled) quite a long way from the root of alpha (alpha in cyan here). Shing suspects this is because of treating deletions as ancestral states.

[Image: tmp]

Code to replicate
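# Assumes `ts` (the tree sequence) and `ti` (the object providing draw_subtree) are already loaded.
# Will need to sub in any changed node IDs here.
tree = ts.at(21563)  # the tree covering genome position 21563
extra_nodes = [u for u in tree.children(4725) if ts.node(u).is_sample()] + list(tree.samples(1151))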
svg = ti.draw_subtree(
    tracked_samples=[81660, 43835, 51309],  # Pick a few selected early samples from "B.1.1.7" / alpha
    size=(1100, 2000),
    canvas_size=(1100, 2000),
    time_scale="time",
    extra_tracked_samples=extra_nodes,
    style=".plotbox {transform: translateX(20px)}.leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)}",
)
svg
@hyanwong
Collaborator Author

That paper also mentions the later MILK-B154B6, GISAID ID: EPI_ISL_2735517 (which I think corresponds to ERR4869224), concluding that it could be a recombinant or the result of lab contamination. That sample also isn't in our dataset (and was presumably filtered out). This would be another interesting sequence to match in, after the fact.

@jeromekelleher
Owner

jeromekelleher commented Oct 10, 2024

2024-10-10 15:02:22 DEBUG sc2ts.inference Final HMM pass hmm_cost=12.0 ERR4869224 2020-10-23 B.1.1 path=(0:29904, 4462) mutations(12)=[445T>C, 3264C>T, 5986C>T, 6286C>T, 6808T>C, 12247T>C, 15279C>T, 15775A>T, 23604C>A, 23709C>T, 25455G>T, 28977C>T]

It's from 2020-10-23, and I think it was just a bit too early for any other Alpha samples to be within the window for inclusion. It would be interesting to see what happens if we match it back in later and add it anyway.
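To compare the 12 HMM mutations in that log line against other lists (for example the Alpha-defining mutations), they can be parsed into tuples. A minimal sketch, assuming each entry reads as position, inherited state, ">" and derived state; the regex and the `parse_mutations` name are hypothetical and not part of sc2ts.

```python
import re

# One entry looks like "445T>C": position, inherited base, ">", derived base.
MUTATION_RE = re.compile(r"(\d+)([ACGT-])>([ACGT-])")


def parse_mutations(text: str) -> list[tuple[int, str, str]]:
    """Extract (position, inherited, derived) tuples from a log fragment."""
    return [(int(pos), old, new) for pos, old, new in MUTATION_RE.findall(text)]


log_fragment = (
    "mutations(12)=[445T>C, 3264C>T, 5986C>T, 6286C>T, 6808T>C, 12247T>C, "
    "15279C>T, 15775A>T, 23604C>A, 23709C>T, 25455G>T, 28977C>T]"
)
mutations = parse_mutations(log_fragment)
print(len(mutations), mutations[0], mutations[-1])
```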

@szhan
Collaborator

szhan commented Oct 11, 2024

Looking at the input sequence of ERR4869224, it has a decent number of Ns in these regions:
ORF1a:6866-7054 (which contains defining mutation T6954C:I2230T)
ORF1ab:21428-21458
S:22339-22523

I wonder how many imputed bases at those sites (based on the best match) would be different from the ref. If all those Ns were supposed to be the same as the ref., then we could have quite different placement in the trees.
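One way to answer that question is to count, per N region, how many imputed bases differ from the reference. The sketch below is only an illustration: the function names are hypothetical, the coordinates are taken from the list above and assumed to be 1-based and inclusive, and `reference` and `imputed` stand for full-length aligned sequences that would have to be obtained separately.

```python
# N-rich regions of ERR4869224 listed above (assumed 1-based, inclusive).
N_REGIONS = {
    "ORF1a": (6866, 7054),
    "ORF1ab": (21428, 21458),
    "S": (22339, 22523),
}


def region_containing(position: int, regions=N_REGIONS):
    """Return the name of the region covering a position, or None."""
    for name, (start, end) in regions.items():
        if start <= position <= end:
            return name
    return None


def count_imputed_differences(reference: str, imputed: str, regions=N_REGIONS):
    """Per region, count sites where the imputed base differs from the reference."""
    return {
        name: sum(reference[pos - 1] != imputed[pos - 1] for pos in range(start, end + 1))
        for name, (start, end) in regions.items()
    }


# The defining mutation T6954C lies inside the first N region.
print(region_containing(6954))
```

With real data, `count_imputed_differences(reference, imputed)` would give one count per region; any non-zero count marks sites where the best match pulled the sample away from the reference.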

@hyanwong
Collaborator Author

The all-samples ARG is showing much the same patterns for Alpha, with a GAT->CTA triplet at the base of B.1.1.7 (cyan samples). We have removed the deletions here, but there are still lots of alpha-specific mutations, most of which are unique.

I'm suspicious about the C3267T recurrent mutation just below the alpha root (node 83297).

[Image: Screenshot 2024-10-21 at 15 34 08]

@jeromekelleher
Owner

> I'm suspicious about the C3267T recurrent mutation just below the alpha root (node 83297).

Yes, this is odd, good spot. Let's keep an eye out for this on the next (final??) iteration. I suspect that the current HMM settings will lead to slightly suboptimal HMM matches for really unlikely matches, and it may look something like this.

@hyanwong
Collaborator Author

It would be nice to be able to show, graphically, where the two "intermediate" samples which we excluded (ERR4869224 and ERR4638413) fit onto the tree, and how they would shuffle the order of mutations on that long branch. The routine that @jeromekelleher coded at jeromekelleher/sc2ts#360 should do most of it, but maybe not the mutation pushing etc. required to change the mutation order.

I assume we'd either match these sequences to the top or bottom node of that long branch, and then change the mutation ordering using reversion pushes, etc.

@jeromekelleher
Owner

If you run the match and post the results we can figure out what inserting would do.
