Seeds for BA.1 #264

Open
szhan opened this issue Dec 17, 2024 · 9 comments

@szhan
Collaborator

szhan commented Dec 17, 2024

A couple of simple ways to get some candidate seeds for BA.1.

  1. Take the intersection of the samples in the designated sequence list from pango-designation and the samples sequenced by COG-UK. Then, pick the samples with the earliest collection dates (before 2021-12-01).
  2. Take the earliest samples collected in South Africa (before 2021-10-01).
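
A minimal sketch of option 1, assuming the designated list and the COG-UK metadata have already been loaded into plain Python dicts. The function name `candidate_seeds`, the dict layouts and the toy sample IDs are all hypothetical; nothing here is pango-designation or COG-UK tooling.

```python
from datetime import date


def candidate_seeds(designated, cog_uk_dates, lineage, cutoff, n=10):
    """Return up to n (sample_id, collection_date) pairs, earliest first.

    designated:   dict of sample ID -> designated Pango lineage (assumed layout)
    cog_uk_dates: dict of sample ID -> collection date (assumed layout)
    """
    # Intersection: designated as `lineage` AND present in the COG-UK set.
    shared = [
        (sample_id, cog_uk_dates[sample_id])
        for sample_id, designated_lineage in designated.items()
        if designated_lineage == lineage and sample_id in cog_uk_dates
    ]
    # Keep only samples collected strictly before the cutoff date.
    early = [(sample_id, day) for sample_id, day in shared if day < cutoff]
    # Sort by date, breaking ties on sample ID so the order is deterministic.
    return sorted(early, key=lambda pair: (pair[1], pair[0]))[:n]


# Toy data only, not real samples.
designated = {"s1": "BA.1", "s2": "BA.1", "s3": "BA.2", "s4": "BA.1"}
cog_uk_dates = {
    "s1": date(2021, 11, 25),
    "s2": date(2021, 12, 5),
    "s3": date(2021, 11, 20),
    "s5": date(2021, 11, 1),
}
seeds = candidate_seeds(designated, cog_uk_dates, "BA.1", date(2021, 12, 1))
```

Option 2 would be the same filter with a country field in place of the intersection and 2021-10-01 as the cutoff.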
@jeromekelleher
Owner

Using the notebook in #275 we picked the first ten sequences:

    # first 10 BA.1
    "ERR7443564", # 2021-11-22  BA.1    BA.1
    "ERR7552222", # 2021-11-23  BA.1    BA.1
    "ERR7600669", # 2021-11-25  BA.1    BA.1
    "ERR7612412", # 2021-11-25  BA.1    BA.1
    "ERR7601682", # 2021-11-26  BA.1    BA.1
    "ERR7601847", # 2021-11-26  BA.1    BA.1
    "ERR7611335", # 2021-11-27  BA.1    BA.1
    "ERR7615361", # 2021-11-27  BA.1    BA.1
    "ERR7713581", # 2021-11-27  BA.1    BA.1
    "ERR7650807", # 2021-11-27  BA.1    BA.1

It seems to be working reasonably well, and we only actually need the first one (as of the 26th - I'll check on the 27th sequences later). We should follow up with a detailed analysis on the main ARG to make sure that we've got the correct mutations, which we can report on in the paper.

@jeromekelleher
Owner

Should we also use the African truth set for this (as in #265), or do we expect COG-UK to be as good a place as any to find early BA.1s?

The trouble with just merging with the massive set from pango-lineages is that it doesn't really rule out time travellers.

@szhan
Collaborator Author

szhan commented Dec 20, 2024

I really hope that we can find better seeds in the new African samples. I thought the point of sequencing those samples was to better understand the early evolution of Omicron.

If we do look among the new samples, then we can't intersect with the pango-designation sequences.

@jeromekelleher
Owner

I'm restarting from 2021-11-09 to use these truth sequences as the Omicron seeds.

I'm using SRR17041373 (2021-11-09) as the first seed (SRR18533633 comes up as earlier, but it doesn't seem to be present in the current Viridian alignments).

@szhan
Collaborator Author

szhan commented Dec 20, 2024

SRR18533633 was filtered out because it has 9 hets. If I recall correctly, the threshold used in the Viridian paper is 3.
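
To make the filter concrete, here is a rough sketch. It assumes the threshold of 3 recalled above (unverified) and that the cut is inclusive, i.e. samples with more than 3 hets are dropped; `passes_het_filter` and `MAX_HETS` are illustrative names, not Viridian's.

```python
MAX_HETS = 3  # assumed threshold, recalled from the Viridian paper; unverified


def passes_het_filter(num_hets, max_hets=MAX_HETS):
    # Assumption: a sample is kept when its het count is at most max_hets.
    return num_hets <= max_hets


het_counts = {"SRR18533633": 9}  # count quoted in this comment
kept = [sample for sample, n in het_counts.items() if passes_het_filter(n)]
```

Under these assumptions `kept` comes out empty, which matches SRR18533633 being absent from the alignments.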

@jeromekelleher
Owner

This is tricky... The first few samples from the African dataset give pretty different results (the last number is the number of mutations):

SRR17041373 [{'left': 0, 'right': 29904, 'parent': 18970}] 66
SRR17041376 [{'left': 0, 'right': 29904, 'parent': 11881}] 69
SRR17089889 [{'left': 0, 'right': 29904, 'parent': 6965}] 82
SRR17089888 [{'left': 0, 'right': 29904, 'parent': 6965}] 81

whereas the first few from the COG-UK set are:

ERR7443564 [{'left': 0, 'right': 29904, 'parent': 11881}] 80
ERR7552222 [{'left': 0, 'right': 29904, 'parent': 11881}] 78
ERR7612412 [{'left': 0, 'right': 29904, 'parent': 11881}] 83
ERR7601682 [{'left': 0, 'right': 29904, 'parent': 11881}] 82

Seeding with SRR17041373 above doesn't get things started: although the COG-UK samples copy from it, they do so at a high HMM cost.

As we're pretty sure of the COG-UK ones here and the African samples are only a few days earlier, I'm inclined to just use them. The point of the exercise here isn't to pinpoint the Omicron outbreaks, but to make sure we capture them reasonably cleanly so that we can accurately identify recombinants.

@jeromekelleher
Owner

OK, I'm going to try seeding with SRR17041376 and ERR7443564 as they are both pointing at the same node, and it gives us a chance to find extra BA.1 samples if they exist. As the timescale between BA.1 and BA.2 is very tight, it may be important to get more BA.1 samples in so that we can judge the likelihood of recombination.
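
The "pointing at the same node" check can be sketched by grouping the candidates by the parent node of their match, using the numbers quoted earlier in this thread. The tuple layout and `group_by_parent` are illustrative only, not the project's API or output format.

```python
from collections import defaultdict

# (sample, parent node, number of mutations), copied from the results above.
matches = [
    ("SRR17041373", 18970, 66),
    ("SRR17041376", 11881, 69),
    ("SRR17089889", 6965, 82),
    ("SRR17089888", 6965, 81),
    ("ERR7443564", 11881, 80),
    ("ERR7552222", 11881, 78),
    ("ERR7612412", 11881, 83),
    ("ERR7601682", 11881, 82),
]


def group_by_parent(matches):
    """Map each parent node to the samples that attach to it, in input order."""
    groups = defaultdict(list)
    for sample, parent, _num_mutations in matches:
        groups[parent].append(sample)
    return dict(groups)


groups = group_by_parent(matches)
# Node 11881 is shared by SRR17041376 and all four COG-UK samples listed above.
```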

@jeromekelleher
Owner

Linking to cov-lineages/pango-designation#361 here for useful context in splitting BA.1 and BA.2

(@szhan - this is the sort of thing I meant by #278 - let's link in with pango designation issues in these threads so we can find the background info easily)

@jeromekelleher
Owner

From what I can see this has worked really well and BA.1 looks clean. SRR17041376 goes in like this:

SRR17041376 2021-11-12 BA.1 path=(0:29904, 6965) mutations(39)=[A2832G, T5386G, G8393A, C10029T, C10449A, A11537G, T13195C, A18163G, C21762T, C21846T, G22578A, T22673C, C22674T, T22679C, C22686T, C22995A, A23013C, A23040G, G23048A, A23055G, A23063T, T23075C, C23202A, C23525T, T23599G, C23604A, C24130A, A24424T, T24469A, C24503T, C25000T, C25584T, C26270T, A26530G, C26577G, A27259C, C27807T, A28271T, C28311T]

10 days later we get

ERR7443564 2021-11-22 BA.1 path=(0:29904, 291678) mutations(3)=[G5515T, C23854A, G26709A]

So there's no actual need to seed in with ERR7443564 here.

Looking at some later BA.1s, the path looks very clean, with no reversions.
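
One way to make the reversion check mechanical, as a sketch: parse each mutation string of the form `A2832G` and flag any mutation that returns a site to a state it already held earlier on the path. `parse_mutation` and `find_reversions` are hypothetical helpers, and the pattern assumes plain A/C/G/T substitutions only (no deletions or ambiguity codes).

```python
import re

# Assumed format: reference base, 1-based position, derived base.
MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_mutation(text):
    match = MUTATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation: {text!r}")
    inherited, position, derived = match.groups()
    return inherited, int(position), derived


def find_reversions(path_mutations):
    """Return mutations that restore a state already seen at the same site.

    path_mutations: mutation strings ordered from the root towards the sample.
    """
    seen_states = {}  # position -> set of states held at that site so far
    reversions = []
    for text in path_mutations:
        inherited, position, derived = parse_mutation(text)
        states = seen_states.setdefault(position, set())
        if derived in states:
            reversions.append(text.strip())
        states.update((inherited, derived))
    return reversions


# The three mutations reported for ERR7443564 above hit distinct sites.
clean = find_reversions(["G5515T", "C23854A", "G26709A"])
# A made-up path with a back-mutation at one site, for contrast.
flagged = find_reversions(["A2832G", "G2832A"])
```

A full check would concatenate the mutations on every edge from the root down to each BA.1 sample before calling `find_reversions`.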

I think this is about as good as can be done, and we can do some further analysis later to justify the choice of SRR17041376.
