Seeds for BA.1 #264

szhan · 2024-12-17T12:55:55Z

A couple of simple ways to get some candidate seeds for BA.1.

Get the intersection of the samples in the designated sequence list from pango-designation and the samples sequenced by COG-UK. Then take the samples with the earliest collection dates (before 2021-12-01).
Get the earliest samples collected in South Africa (before 2021-10-01).
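
For reference, the first approach could be sketched roughly as follows. This is only an illustration, not the code from the notebook: the function name `candidate_seeds`, the input shapes (a set of designated IDs and a dict of collection dates), and the `ERR0000001`/`ERR0000002` IDs are all made up for the example.

```python
from datetime import date


def candidate_seeds(designated, collection_dates, cutoff, n=10):
    """Return up to n (sample_id, collection_date) pairs, earliest first.

    designated: set of sample IDs designated as the lineage (hypothetical input)
    collection_dates: dict of sample ID -> collection date (hypothetical input)
    cutoff: only samples collected strictly before this date are kept
    """
    # Samples present in both the designation list and the sequenced set.
    shared = designated.intersection(collection_dates)
    early = [(s, collection_dates[s]) for s in shared if collection_dates[s] < cutoff]
    # Sort by date, then by ID so that ties are ordered deterministically.
    early.sort(key=lambda pair: (pair[1], pair[0]))
    return early[:n]


# Toy inputs: two real accessions from this thread plus two made-up ones.
designated = {"ERR7443564", "ERR7552222", "ERR0000001"}
collection_dates = {
    "ERR7443564": date(2021, 11, 22),
    "ERR7552222": date(2021, 11, 23),
    "ERR0000002": date(2021, 11, 20),  # sequenced but not designated
}
seeds = candidate_seeds(designated, collection_dates, cutoff=date(2021, 12, 1))
for sample_id, collected in seeds:
    print(sample_id, collected)
```

The second approach is the same filter with a country restriction and a different cutoff, so it would only need an extra predicate on the metadata.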


jeromekelleher · 2024-12-19T09:31:07Z

Using the notebook in #275 we picked the first ten sequences:

    # first 10 BA.1
    "ERR7443564", # 2021-11-22  BA.1    BA.1
    "ERR7552222", # 2021-11-23  BA.1    BA.1
    "ERR7600669", # 2021-11-25  BA.1    BA.1
    "ERR7612412", # 2021-11-25  BA.1    BA.1
    "ERR7601682", # 2021-11-26  BA.1    BA.1
    "ERR7601847", # 2021-11-26  BA.1    BA.1
    "ERR7611335", # 2021-11-27  BA.1    BA.1
    "ERR7615361", # 2021-11-27  BA.1    BA.1
    "ERR7713581", # 2021-11-27  BA.1    BA.1
    "ERR7650807", # 2021-11-27  BA.1    BA.1

It seems to be working reasonably well, and we only actually need the first one (as of the 26th - I'll check on the 27th sequences later). We should follow up with a detailed analysis on the main ARG to make sure that we've got the correct mutations, which we can report on in the paper.

jeromekelleher · 2024-12-20T12:58:51Z

Should we use the African truth set for this also (like #265), or do we expect COG-UK to be as good a place as any to find early BA.1s?

The trouble with just merging with the massive set from pango-lineages is that it doesn't really rule out time travellers.

szhan · 2024-12-20T13:06:54Z

I really hope that we can find better seeds in the new African samples. I thought the point of sequencing those samples was to better understand the early evolution of Omicron.

If we do look among the new samples, then we can't intersect with the pango-designation sequences.

jeromekelleher · 2024-12-20T13:45:40Z

I'm restarting from 2021-11-09 to use these truth sequences as the Omicron seeds.

I'm using SRR17041373 (2021-11-09) as the first (SRR18533633 comes up as earlier, but that doesn't seem to be present in the current Viridian alignments).

szhan · 2024-12-20T13:47:34Z

SRR18533633 got filtered out because its number of hets is 9. If I recall right, the threshold used in the Viridian paper is 3.

jeromekelleher · 2024-12-20T22:45:30Z

This is tricky... The first few samples from the African dataset give pretty different results (the last number is the number of mutations):

SRR17041373 [{'left': 0, 'right': 29904, 'parent': 18970}] 66
SRR17041376 [{'left': 0, 'right': 29904, 'parent': 11881}] 69
SRR17089889 [{'left': 0, 'right': 29904, 'parent': 6965}] 82
SRR17089888 [{'left': 0, 'right': 29904, 'parent': 6965}] 81

whereas the first few from the COG-UK set are:

ERR7443564 [{'left': 0, 'right': 29904, 'parent': 11881}] 80
ERR7552222 [{'left': 0, 'right': 29904, 'parent': 11881}] 78
ERR7612412 [{'left': 0, 'right': 29904, 'parent': 11881}] 83
ERR7601682 [{'left': 0, 'right': 29904, 'parent': 11881}] 82

Seeding with SRR17041373 above doesn't get things started: although the COG-UK samples copy from it, they do so at a high HMM cost.

As we're pretty sure of the COG-UK ones here and the African samples are only a few days earlier, I'm inclined to just use it. The point of the exercise here isn't to pinpoint the Omicron outbreaks, but to make sure we capture them reasonably cleanly so that we can accurately identify recombinants.

jeromekelleher · 2024-12-20T22:59:06Z

OK, I'm going to try seeding with SRR17041376 and ERR7443564 as they are both pointing at the same node, and it gives us a chance to find extra BA.1 samples if they exist. As the timescale between BA.1 and BA.2 is very tight, it may be important to get more BA.1 samples in so that we can judge the likelihood of recombination.
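
To make the "pointing at the same node" comparison explicit, here is a small sketch that parses the match lines quoted above and groups samples by parent node. The helper names `parse_match` and `group_by_parent` are made up for illustration, and the line format is assumed to be exactly as pasted in this thread (sample ID, a Python-literal path list, then the mutation count).

```python
import ast
from collections import defaultdict

# Match lines copied from the earlier comment in this thread.
MATCHES = """\
SRR17041373 [{'left': 0, 'right': 29904, 'parent': 18970}] 66
SRR17041376 [{'left': 0, 'right': 29904, 'parent': 11881}] 69
SRR17089889 [{'left': 0, 'right': 29904, 'parent': 6965}] 82
SRR17089888 [{'left': 0, 'right': 29904, 'parent': 6965}] 81
ERR7443564 [{'left': 0, 'right': 29904, 'parent': 11881}] 80
ERR7552222 [{'left': 0, 'right': 29904, 'parent': 11881}] 78
ERR7612412 [{'left': 0, 'right': 29904, 'parent': 11881}] 83
ERR7601682 [{'left': 0, 'right': 29904, 'parent': 11881}] 82
"""


def parse_match(line):
    """Split one line into (sample_id, path, num_mutations)."""
    sample, rest = line.split(" ", 1)
    path_text, num_mutations = rest.rsplit(" ", 1)
    # The path is printed as a Python literal, so literal_eval can read it.
    return sample, ast.literal_eval(path_text), int(num_mutations)


def group_by_parent(lines):
    """Map parent node ID -> samples whose whole genome copies from it."""
    groups = defaultdict(list)
    for line in lines:
        sample, path, _ = parse_match(line)
        if len(path) == 1:  # single segment, i.e. no breakpoint in the path
            groups[path[0]["parent"]].append(sample)
    return dict(groups)


groups = group_by_parent(MATCHES.splitlines())
for parent, samples in groups.items():
    print(parent, samples)
```

On these lines, SRR17041376 is the only African sample that shares a parent (11881) with the COG-UK samples.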

jeromekelleher · 2024-12-20T23:07:13Z

Linking to cov-lineages/pango-designation#361 here for useful context in splitting BA.1 and BA.2

(@szhan - this is the sort of thing I meant by #278 - let's link in with pango designation issues in these threads so we can find the background info easily)

jeromekelleher · 2024-12-21T23:08:31Z

From what I can see this has worked really well and BA.1 looks clean. After SRR17041376 goes in like this:

SRR17041376 2021-11-12 BA.1 path=(0:29904, 6965) mutations(39)=[A2832G, T5386G, G8393A, C10029T, C10449A, A11537G, T13195C, A18163G, C21762T, C21846T, G22578A, T22673C, C22674T, T22679C, C22686T, C22995A, A23013C, A23040G, G23048A, A23055G, A23063T, T23075C, C23202A, C23525T, T23599G, C23604A, C24130A, A24424T, T24469A, C24503T, C25000T, C25584T, C26270T, A26530G, C26577G, A27259C, C27807T, A28271T, C28311T]

10 days later we get

ERR7443564 2021-11-22 BA.1 path=(0:29904, 291678) mutations(3)=[G5515T, C23854A, G26709A]

So there's no actual need to seed in with ERR7443564 here.

Looking at some BA.1s later on the path looks very clean, with no reversions.
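
As a rough illustration of what "no reversions" means here, the check below flags any mutation on a descendant that undoes a mutation on an ancestor at the same site (e.g. `A2832G` followed by `G2832A`). It is a sketch only: `parse_mutation` and `find_reversions` are made-up helpers, the mutation strings are assumed to follow the `<inherited><position><derived>` format printed above, and it assumes the second sample really does descend from the first.

```python
import re

# e.g. "A2832G": inherited state, 1-based position, derived state.
MUTATION = re.compile(r"^([ACGT-])(\d+)([ACGT-])$")


def parse_mutation(text):
    """Parse 'A2832G' into ('A', 2832, 'G')."""
    m = MUTATION.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised mutation: {text!r}")
    return m.group(1), int(m.group(2)), m.group(3)


def find_reversions(ancestor_mutations, descendant_mutations):
    """Return descendant mutations that exactly undo an ancestor mutation."""
    on_ancestor = {}
    for text in ancestor_mutations:
        inherited, pos, derived = parse_mutation(text)
        on_ancestor[pos] = (inherited, derived)
    reversions = []
    for text in descendant_mutations:
        inherited, pos, derived = parse_mutation(text)
        # A reversion goes from the ancestor's derived state back to its
        # inherited state at the same position.
        if on_ancestor.get(pos) == (derived, inherited):
            reversions.append(text)
    return reversions


# First three mutations of SRR17041376 and all three of ERR7443564, from above.
seed = ["A2832G", "T5386G", "G8393A"]
later = ["G5515T", "C23854A", "G26709A"]
print(find_reversions(seed, later))
```

A full check would walk every node on the path from the sample back to the seed rather than comparing just two mutation lists.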

I think this is about as good as can be done, and we can do some further analysis later to justify the choice of SRR17041376.

szhan mentioned this issue Dec 17, 2024

Add notebook for getting candidate seeds #268

Closed
