Seeds for BA.2 #265

Open
szhan opened this issue Dec 17, 2024 · 15 comments

Comments

@szhan
Collaborator

szhan commented Dec 17, 2024

A couple of simple ways to get some candidate seeds for BA.2.

  1. Get the intersection of the samples in the designated sequence list from pango-designation and the samples sequenced by COG-UK. Then, take the samples with the earliest collection dates (before 2022-01-08).
  2. Get the earliest samples collected in South Africa (before 2021-12-01).
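
A minimal sketch of the first approach, assuming both sample sets have already been loaded as plain Python collections keyed by accession. The function name, accessions, and dates below are hypothetical placeholders; reading the actual pango-designation and COG-UK files is not shown.

```python
from datetime import date


def earliest_shared_samples(designated, cog_uk_dates, cutoff, n=5):
    """Return up to n (accession, date) pairs present in both inputs and
    collected strictly before cutoff, earliest first."""
    shared = set(designated) & set(cog_uk_dates)
    candidates = [
        (acc, cog_uk_dates[acc]) for acc in shared if cog_uk_dates[acc] < cutoff
    ]
    # Sort by date, then accession, so ties are broken deterministically.
    return sorted(candidates, key=lambda item: (item[1], item[0]))[:n]


# Hypothetical toy data, not real designations or collection dates.
designated_ba2 = {"SAMPLE_A", "SAMPLE_B", "SAMPLE_C"}
cog_uk = {
    "SAMPLE_A": date(2022, 1, 5),
    "SAMPLE_B": date(2022, 1, 3),
    "SAMPLE_C": date(2022, 1, 20),  # designated, but after the cutoff
    "SAMPLE_D": date(2021, 12, 30),  # early, but not in the designated list
}
seeds = earliest_shared_samples(designated_ba2, cog_uk, cutoff=date(2022, 1, 8))
print(seeds)
```

The second approach is the same filter with a different sample set and cutoff (2021-12-01).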
@jeromekelleher
Owner

Using COG-UK sequences joined with the pango lineage data doesn't work because the first sample is ERR7965207 from 2022-01-03, which comes after a retro group of BA.2s gets added:

2021-12-30 Add retro group 6d8d5f9adbd4d85f04633470491c6b3e n=10 {'2021-12-24': 4, '2021-12-25': 1, '2021-12-26': 1, '2021-12-28': 2, '2021-12-29': 2} {'BA.2': 8, 'BA.2.10': 2} 

While it looks OK generally, it's annoying that we've got the BA.2.10 creeping in here. It may not be worth worrying about, though, and we should just let the algorithm do its thing with minimal intervention.

@jeromekelleher
Owner

Do we have the experiment accessions for the resequenced early Omicron sequences discussed in the Viridian paper, @szhan? Just looking for "South Africa" as the region is too broad, I think. Surely there's some systematic way we can track down the sequences that were actually generated?

@szhan
Collaborator Author

szhan commented Dec 20, 2024

Yes, Supplementary Table S7 in this Excel file https://figshare.com/articles/dataset/Supplementary_Tables_S2-9/27195315/2?file=49784541.

@jeromekelleher
Owner

Ah, that's much better. OK, I'll just pick the earliest BA.2 from that.

@szhan
Collaborator Author

szhan commented Dec 20, 2024

So, let's find a suitable seed among the new African samples collected before 2021-12-30? The dates for these samples should be trustworthy.

@szhan
Collaborator Author

szhan commented Dec 20, 2024

I can automate this and update the notebook accordingly?

@jeromekelleher
Owner

Yes please. Can you review #275 first so I can merge?

@szhan
Collaborator Author

szhan commented Dec 20, 2024

Okay, now for the Omicron sublineages, we are looking among the new African samples (not only those from South Africa).

@jeromekelleher
Owner

Searching manually among the BA.2s that show up in the max-daily-samples=1000 version I'm working with, the first one from the African truth-set is SRR17461930. It would be good to cross-check that this shows up in the notebook.

@szhan
Collaborator Author

szhan commented Dec 20, 2024

SRR17461930 isn't in the top 10; it comes in 20th.

@jeromekelleher
Owner

I'm restarting from the start of Omicron now, using SRR17089886 (2021-11-17) as the BA.2 seed.

That came out as the first BA.2 in my quick hacks - do you agree?

@szhan
Collaborator Author

szhan commented Dec 20, 2024

Yes, SRR17089886 is coming up as the earliest sample, after filtering out those with more than 3 hets.
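
For reference, the selection amounts to something like the sketch below. It assumes each sample carries a collection date and a count of heterozygous calls; the `Sample` class, field names, and toy values are hypothetical and are not taken from the notebook.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    accession: str
    collected: date
    num_hets: int  # number of heterozygous calls (assumed field)


def pick_seed(samples, max_hets=3):
    """Return the earliest sample with at most max_hets heterozygous calls,
    or None if no sample passes the filter."""
    passing = [s for s in samples if s.num_hets <= max_hets]
    # Ties on date are broken by accession to keep the choice deterministic.
    return min(passing, key=lambda s: (s.collected, s.accession), default=None)


# Hypothetical toy samples, not real metadata.
toy = [
    Sample("TOY_1", date(2021, 11, 15), num_hets=7),  # earliest, but too many hets
    Sample("TOY_2", date(2021, 11, 17), num_hets=1),
    Sample("TOY_3", date(2021, 11, 27), num_hets=0),
]
seed = pick_seed(toy)
print(seed.accession)
```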

@jeromekelleher
Owner

A tricky issue here is that we don't want to add BA.2s in too early if we do think it might be a recombinant, because we have to wait for BA.1 (which we think may be one of the parents) to get properly established in the ARG. If we put BA.2 sequences in too soon, then it'll be more parsimonious to just copy from one deep parent (as there are tons of mutations anyway).

This is all very difficult - I don't think we can say very much about the possible origins of these large saltational changes with these tools. The combination of the huge numbers of mutations and poor representation of the parents in the reference panel makes this a bad match for our model. We can pick out recombinations very well when the parent lineages are well represented in the ARG and when we're balancing the probability of a handful of mutations vs a recombination. Massive departures from the model like Delta and Omicron are just not a good fit.
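
To make the trade-off concrete, here is a toy parsimony comparison. It is not the actual HMM: it assumes a copying path costs one unit per mismatch plus a fixed penalty per recombination breakpoint, and the penalty and all mismatch counts are invented for illustration.

```python
def path_cost(mismatches, breakpoints, recomb_penalty):
    """Toy cost of a copying path: mismatches plus a fixed penalty per breakpoint."""
    return mismatches + recomb_penalty * breakpoints


def prefers_recombinant(single_mismatches, recomb_mismatches, recomb_penalty):
    """True if a one-breakpoint path is strictly cheaper than copying one parent."""
    single = path_cost(single_mismatches, 0, recomb_penalty)
    recomb = path_cost(recomb_mismatches, 1, recomb_penalty)
    return recomb < single


PENALTY = 4  # hypothetical cost of one breakpoint, in units of mutations

# Parents well represented: recombination removes most of the mismatches.
well_sampled = prefers_recombinant(10, 1, PENALTY)

# Parents poorly represented: dozens of mismatches remain either way, so the
# breakpoint does not pay for itself and one deep parent wins.
poorly_sampled = prefers_recombinant(60, 58, PENALTY)

print(well_sampled, poorly_sampled)
```

In the second case the recombinant path saves only two mismatches, which is less than the cost of the breakpoint, so the single deep parent is preferred.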

@jeromekelleher
Owner

OK, this has worked out quite straightforwardly I think. Taking the first few COG-UK and African samples and matching against the 2021-12-01 ARG we get the following:

ERR7965207 2022-01-03 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 86
ERR7972740 2022-01-04 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 86
SRR17461953 2021-12-07 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 86
SRR17461765 2021-12-09 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 86
SRR17712996 2021-12-09 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 84
SRR19117196 2021-12-10 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 83
SRR17461792 2021-11-27 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 84

So, the path is absolutely consistent here and I think it's reasonable to take the earliest sequence here, from 2021-11-27. The left-parent node here 301233 gets added on 2021-11-23, and is clearly a solid inference and a very successful node:
[Screenshot from 2024-12-21 23-41-39]
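
The consistency check on the match output above can be scripted along these lines. This is a sketch that assumes each line has the form `ACCESSION DATE [path] N`, as in the listing; the meaning of the trailing integer is not stated here, so it is carried along unchanged. Only three of the seven lines are included for brevity, and the helper names are made up.

```python
import ast


def parse_match_line(line):
    """Split 'ACCESSION DATE [path...] N' into its four fields."""
    accession, date_str, rest = line.split(" ", 2)
    path_text, trailing = rest.rsplit(" ", 1)
    return accession, date_str, ast.literal_eval(path_text), int(trailing)


def path_key(path):
    """Hashable summary of a copying path: (left, right, parent) per segment."""
    return tuple((seg["left"], seg["right"], seg["parent"]) for seg in path)


lines = [
    "ERR7965207 2022-01-03 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 86",
    "SRR17461953 2021-12-07 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 86",
    "SRR17461792 2021-11-27 [{'left': 0, 'right': 22674, 'parent': 66458}, {'left': 22674, 'right': 29904, 'parent': 301233}] 84",
]
records = [parse_match_line(line) for line in lines]
distinct_paths = {path_key(path) for _, _, path, _ in records}
# ISO-formatted dates sort correctly as strings.
earliest = min(records, key=lambda rec: rec[1])
print(len(distinct_paths), earliest[0], earliest[1])
```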

@jeromekelleher
Owner

jeromekelleher commented Dec 22, 2024

Just for later reference, using this seed (SRR17461792) results (after a few weeks) in a "push" node that descends from the original recombinant with 5 reversions. These are all reversions of mutations that happen on the root of the BA.1 lineage. I don't really know what this means, but it doesn't look to me like something that would result from choosing the wrong seed and is probably more to do with having insufficient sampling (much like Delta in #226). It will be something we need to look at quite closely, as the HMM is quite clear about BA.2 being a recombinant, given the reference panel we have.
