Position of 11288-11296 deletion #256

Open
hyanwong opened this issue Dec 4, 2024 · 7 comments


@hyanwong
Collaborator

hyanwong commented Dec 4, 2024

Opening up a new issue for this, following on from @jeromekelleher's suggestion in #249 (comment), and to avoid bloating that issue thread.

I tried running the sc2ts matching on a "synthetic" BA.1 haplotype (ERR7602255), which had the deletion shifted to the right by 5bp, to align with the Alpha deletion. To make this match against an Alpha strain in the deletion region, you need to treat the deletion as being 5 times more costly than a single mutation: i.e. you treat it as 5 separate SNPs, and match using the following alignments.

match ERR7602255 TAGTTTT-----.....AAG
Alpha strains    TAGTTTG-----TTTTAAG

In this case, BA.1 becomes a recombinant with the deleted section coming from an Alpha strain:

Matched against [PathSegment(left=0, right=11288, parent=63688), PathSegment(left=11288, right=14676, parent=93235), PathSegment(left=14676, right=29904, parent=11410)] (likelihood=1.0223206643739531e-93, cost=49)
Node(id=93235, flags=4194304, time=742.8365547602211, population=-1, individual=-1, metadata={'Imputed_Viridian_pangolin': 'B.1.1.7'...)

If you treat the deletion as 4 separate SNPs, the match reverts to BA.1 being a direct descendant rather than a recombinant, with the deletion(s) having arisen independently.

Matched against [PathSegment(left=0, right=29904, parent=11410)] (likelihood=4.484155085839427e-92, cost=48)
Node(id=11410, flags=2097152, time=1022.3997839692491, population=-1, individual=-1, metadata={'Imputed_Viridian_pangolin': 'B.1.1',...)
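
A minimal sketch of the weighting used in the two runs above, assuming the first k bases of the deletion are kept as gap characters and the remaining bases are masked as missing data (as in the alignment shown earlier). The function name `weight_deletion`, the `N` missing-data symbol, and the toy sequence are hypothetical and are not part of the sc2ts API.

```python
GAP = "-"      # deletion character, as in the alignment above
MISSING = "N"  # assumed symbol for missing data (shown as "." above)


def weight_deletion(seq, start, length, k):
    """Encode a deletion so that it costs the same as k SNPs.

    seq    : aligned haplotype as a string
    start  : 1-based position of the first deleted base
    length : number of deleted bases
    k      : number of bases left as gap characters; the remaining
             (length - k) bases are masked as missing so that they
             cannot contribute mismatches
    """
    if not 0 <= k <= length:
        raise ValueError("k must be between 0 and the deletion length")
    i = start - 1  # convert to a 0-based string index
    encoded = GAP * k + MISSING * (length - k)
    return seq[:i] + encoded + seq[i + length:]


# Toy 20-base sequence (not real SARS-CoV-2 data) with a 9-base deletion
toy = "ACGTACGTACGTACGTACGT"
as_three_snps = weight_deletion(toy, start=6, length=9, k=3)
as_missing = weight_deletion(toy, start=6, length=9, k=0)
```

Under these assumptions, k=0 corresponds to treating the whole deletion as missing, and raising k makes an independently arising deletion progressively more expensive relative to inheriting it from a parent.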

I think it's more likely that the deletions in BA.1 and BA.2 have the same origin, so I'll repeat the test on those.

hyanwong added a commit to hyanwong/sc2ts-paper that referenced this issue Dec 4, 2024
@jeromekelleher
Owner

It's also a 3 way recomb that's just snipping out the section of genome with the deletion, isn't it? Seems pretty unlikely 👍

@hyanwong
Collaborator Author

hyanwong commented Dec 4, 2024

So, repeating the same test, I added deletion-mutations at the BA.1 origin node in the pre-BA.2-treeseq, using the same positions as in BA.2 (which are the same as Alpha). I then tried matching sample SRR17712694, which is the nearest BA.2 sample to the BA.2 origination node in the current ARG.

As I expected, this is more equivocal than the Alpha / BA.1 case: we only require weighting the deletion as equivalent to 3 or more SNPs to find that there is a single origin.

In all settings, the BA.2 sample is treated as a recombinant of B.1 and BA.1.1.

  • When treating the deletion as equivalent to 4 SNPs, both the forward and backward HMM runs take the deletion from the BA.1.1 parent (i.e. they find that the deletion in BA.1 and BA.2 has a common origin).
  • When treating the deletion as equivalent to 3 SNPs, the forward HMM run creates 3 independent deletions, but the backward run takes the deletion from the BA.1.1 parent (see below).
  • When treating the deletion as equivalent to 2 SNPs, both the forward and backward HMM runs create 2 independent deletions.

Forward matched against [PathSegment(left=0, right=22674, parent=67513), PathSegment(left=22674, right=29904, parent=338914)] (likelihood=6.502475810729579e-73, cost=38)
Reverse matched against [PathSegment(left=0, right=9345, parent=67513), PathSegment(left=9345, right=29904, parent=338914)] (likelihood=6.502475810729579e-73, cost=38)
HMM direction: forward
 Mutations to a deletion: T11288-, C11289-, T11290-
HMM direction: reverse
 Mutations to a deletion: None

The notebook is in #257

hyanwong added a commit to hyanwong/sc2ts-paper that referenced this issue Dec 4, 2024
@hyanwong
Collaborator Author

hyanwong commented Dec 4, 2024

That's fantastic, seems like really strong evidence for the deletion recurring in this case.

Hmm, that might be true for Omicron vs Alpha, but I'm not sure about BA.1 and BA.2. The question there is whether a 9bp deletion should be weighted as equivalent to 3 SNPs. It seems plausible to me that a specific 9bp deletion is rarer than getting 3 SNPs, I think? I don't know what @IsobelGuthrie thinks?

@jeromekelleher
Owner

Amazing. The key thing is that the recombinant origins are stable. We can discuss the uncertainty around the deletions, but it's totally fine to say a full investigation requires future work.

@jeromekelleher
Owner

Hmm, that might be true for Omicron vs Alpha, but I'm not sure about BA.1 and BA.2. The question there is whether a 9bp deletion should be weighted as equivalent to 3 SNPs. It seems plausible to me that a specific 9bp deletion is rarer than getting 3 SNPs, I think? I don't know what @IsobelGuthrie thinks?

Our comments are crossing each other here, I was talking about the first example. I just commented above on the second.

Fundamentally we don't really care about the deletions for this paper once they don't muck up the ARG topology.

@hyanwong
Collaborator Author

hyanwong commented Dec 5, 2024

Our comments are crossing each other here

Ah, sorry for the miscommunication

Fundamentally we don't really care about the deletions for this paper once they don't muck up the ARG topology.

Here's a more detailed breakdown. It appears as if we do get a (slightly) different topology from forward & backward passes, even without looking at deletions (B.1+BA.1.1, vs B.1.1+BA.1.1)

Deletion as missing:
Matching forward: [left=0, right=22674, parent=node 67513 (B.1)], [left=22674, right=29904, parent=node 338914 (BA.1.1)] Del muts: None (likelihood=3.329267615093544e-67, cost=35)
Matching reverse: [left=0, right=22674, parent=node 9348 (B.1.1)], [left=22674, right=29904, parent=node 332479 (BA.1.1)] Del muts: None (likelihood=3.329267615093544e-67, cost=35)

Deletion modelled as 1 SNP:
Matching forward: [left=0, right=22674, parent=node 67513 (B.1)], [left=22674, right=29904, parent=node 338914 (BA.1.1)] Del muts: T11288- (likelihood=4.1615845188669307e-69, cost=36)
Matching reverse: [left=0, right=22674, parent=node 9348 (B.1.1)], [left=22674, right=29904, parent=node 332479 (BA.1.1)] Del muts: T11288- (likelihood=4.1615845188669307e-69, cost=36)

Deletion modelled as 2 SNPs:
Matching forward: [left=0, right=22674, parent=node 2774 (B.1)], [left=22674, right=29904, parent=node 338914 (BA.1.1)] Del muts: T11288-, C11289- (likelihood=5.201980648583663e-71, cost=37)
Matching reverse: [left=0, right=22674, parent=node 9348 (B.1.1)], [left=22674, right=29904, parent=node 332479 (BA.1.1)] Del muts: T11288-, C11289- (likelihood=5.201980648583663e-71, cost=37)

Deletion modelled as 3 SNPs:
Matching forward: [left=0, right=22674, parent=node 67513 (B.1)], [left=22674, right=29904, parent=node 338914 (BA.1.1)] Del muts: T11288-, C11289-, T11290- (likelihood=6.502475810729579e-73, cost=38)
Matching reverse: [left=0, right=9345, parent=node 67513 (B.1)], [left=9345, right=29904, parent=node 338914 (BA.1.1)] Del muts: None (likelihood=6.502475810729579e-73, cost=38)

Deletion modelled as 4 SNPs:
Matching forward: [left=0, right=10029, parent=node 7956 (B.1.1.369)], [left=10029, right=29904, parent=node 338914 (BA.1.1)] Del muts: None (likelihood=6.502475810729579e-73, cost=38)
Matching reverse: [left=0, right=9345, parent=node 67513 (B.1)], [left=9345, right=29904, parent=node 338914 (BA.1.1)] Del muts: None (likelihood=6.502475810729579e-73, cost=38)
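
As a quick cross-check of the numbers in this breakdown, the sketch below computes how much the reported likelihood changes per unit of cost. The values are copied from the forward matches listed above; the variable names are just for illustration, and this says nothing about how sc2ts computes the likelihood internally.

```python
# (deletion model, cost, likelihood) copied from the forward matches above
runs = [
    ("missing", 35, 3.329267615093544e-67),
    ("1 SNP", 36, 4.1615845188669307e-69),
    ("2 SNPs", 37, 5.201980648583663e-71),
    ("3 SNPs", 38, 6.502475810729579e-73),
    ("4 SNPs", 38, 6.502475810729579e-73),
]

ratios = []
for (_, cost_a, lik_a), (_, cost_b, lik_b) in zip(runs, runs[1:]):
    if cost_b == cost_a:
        continue  # same cost (3 vs 4 SNPs): the likelihoods are identical
    # likelihood ratio per unit increase in cost
    ratios.append((lik_a / lik_b) ** (1 / (cost_b - cost_a)))

for r in ratios:
    print(round(r, 6))
```

Each extra unit of cost lowers the likelihood by the same factor (about 80 in these runs), so the cost and the likelihood rank the candidate paths identically here, and the 3-SNP and 4-SNP solutions are exactly tied.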

@jeromekelleher
Owner

That's good to know, thanks. Presumably what's happening here is that there are quite a few close solutions to the HMM and there's a bit of uncertainty about where the left, very deep, parent comes from. This seems quite reasonable, and reassuring in some ways.
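
To illustrate how equally good solutions can differ between forward and reverse runs, here is a toy minimum-cost copying path. It is a sketch only, not sc2ts or tsinfer code: the unit mismatch cost, the fixed recombination cost, the tie-breaking rule, and the sequences are all assumptions made for the example.

```python
def min_cost_path(query, refs, recomb_cost):
    """Cheapest way to copy `query` from `refs`, one site at a time.

    Each mismatch costs 1 and each switch between references costs
    `recomb_cost`. Ties are broken in favour of staying on the same
    reference. Returns (total_cost, reference index copied at each site).
    """
    m = len(refs)
    # cost[j]: best cost of a path that ends on reference j at this site
    cost = [int(refs[j][0] != query[0]) for j in range(m)]
    back = []  # back[i - 1][j]: reference used at site i - 1 on that path
    for i in range(1, len(query)):
        best_prev = min(range(m), key=lambda j: cost[j])
        new_cost, ptr = [], []
        for j in range(m):
            stay = cost[j]
            switch = cost[best_prev] + recomb_cost
            prev, c = (j, stay) if stay <= switch else (best_prev, switch)
            new_cost.append(c + int(refs[j][i] != query[i]))
            ptr.append(prev)
        back.append(ptr)
        cost = new_cost
    j = min(range(m), key=lambda j: cost[j])
    total, path = cost[j], [j]
    for ptr in reversed(back):  # trace the chosen path back to site 0
        j = ptr[j]
        path.append(j)
    return total, path[::-1]


def reverse_match(query, refs, recomb_cost):
    """Run the same matching right to left, reporting in original coordinates."""
    total, path = min_cost_path(query[::-1], [r[::-1] for r in refs], recomb_cost)
    return total, path[::-1]


# Toy data: the middle two sites are identical in both references,
# so the breakpoint is not pinned down by the data.
refs = ["AAAGGTTT", "CCCGGCCC"]
query = "AAAGGCCC"
forward = min_cost_path(query, refs, recomb_cost=2)
reverse = reverse_match(query, refs, recomb_cost=2)
```

In this toy case both directions find a recombinant with the same total cost, but they place the breakpoint at opposite ends of the uninformative region. Which end each direction prefers depends entirely on the tie-breaking rule assumed here.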

This may clear up a bit in the next version where we get rid of the top 100 recurrent mutation sites. Such a long branch is going to have a bunch of unseen mutations on these sites, which is perhaps confusing things.

Overall, this is extremely encouraging to me!

hyanwong added a commit to hyanwong/sc2ts-paper that referenced this issue Dec 5, 2024