-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Origins of delta #226
Comments
Note that SRR11810706 is from Gujarat, and we know that Delta originated in India, so this might be believable. |
The newer all-sample ARG, as of this morning ("maskdel-v1-mm_4-f_500-mrm_2-mms_5-mrec_2-rw_7-mgs_10-2021-08-28") has no major recombination problems, but a different problem with the start of Delta. In particular, it infers 2 separate Delta origins, one comprising about 2.3rds of the AY- lineages plus B.1.617, B.1.617.1, and B.1.617.2, the other comprising about 1/3rd of the remaining AYs. There are loads of parallel mutations on the stem leading to each clade. This is clearly wrong, and we should try to figure out why, and check that it doesn't happen in future ARGs. @jeromekelleher dug into the logs and saw two retro groups being added on the same day, which could be the source of this:
Alternatively, it could be some of the tweaked HMM parameters. Here's a plot subset down to about 30 AY.4 samples (cyan), one of which is an outlier and groups under a recomination node on the far right. The others are all in the 1/3rd clade, which is independently picking up the same mutations that lead to the bulk of the delta-origin "B.1.617.2" samples (in orange, below): Here's the main plot of all AY lineages, with Delta (bottom right) showing 3 independent origins (urgh): |
An important thing to note here is that these two retrogroups are clearly a mix of time travelling lineages. Most retro groups consist of just one pango lineage (indicating that we're picking up the origin of that lineage). A mixture of lineages indicates potential problems. A mixture of highly distinct lineages across many months (here) indicates time travel an big trouble. |
Good point about a mix of lineages. In real time we might not have the lineage information for new samples (lineages may not have been devised yet), but in that case we shouldn't have so many time travelling problems either. |
Thanks Yan, super helpful. The seed sample here was ERR5876690. ERR5965862 happens to be one of the Delta samples that arrived soon after the seed sample was added, and it matched to it with 2 mutations [T10651C, G11201A]. G11201A is an immediate reversion, and a reversion push node was therefore created which became the ultimate "Delta node". The single side shoot branch here is just a function of chance I think, and odd stuff like that's going to happen. The recurrent mutations aren't brilliant, but I think we can live with that. It really is quite hard to find a good starting point with all the noise, so unless there's something badly wrong with this proposal I think we should stick with it. |
Here's a useful paper talking about Delta sequences in the UK: https://www.nature.com/articles/s41586-022-05200-3 |
We are trying to find a decent Delta seed, that is not a time traveller. I see an article that mentions the earliest Delta in GISAID is on 5th Oct 2020 (see https://www.sciencemediacentre.org/expert-reaction-to-cases-of-variant-b-1-617-the-indian-variant-being-investigated-in-the-uk/). @szhan: would it be possible to locate the GISAID submission that is discussed in that article? It seems to be from Maharashtra state. This paper uses EPI_ISL_1360382, but that's from 2021, I think. It also says "Most isolates sequenced by India originated from Maharashtra and West Bengal, but B.1.617 has been identified in several other states.", so we could potentially find other believable seeds by looking for submissions in mid-oct from those states? |
This could be useful, from https://pubmed.ncbi.nlm.nih.gov/33961693/. The preprint (https://www.biorxiv.org/content/10.1101/2021.04.23.441101v1.full.pdf) sounds like it could be helpful. |
Another possibility is to look for plots in papers published near the time. E.g. https://weekly.chinacdc.cn/fileCCDCW/journal/article/ccdcw/2021/30/PDF/CCDCW210107.pdf shows a few interesting strains: hCoV-19/India/MH-NCCS-87448/2021|EPI ISL 1415203.2|2021-02-16 (deep branching, Indian) |
BLASTed the ILSGS00308 GISAID sequence against the Viridian batch 1 sequences (min. sequence identity = 99.95%), and got ERR5461550 as an imperfect hit (99.97%) with sampling date of 2021-02-22. Note I'm excluding samples with reported collection date of 2020 NY Eve. |
For MH-NCCS-87448, I'm getting a hit with %identity = 100. It's sample SRR14388093 with collection date of 2021-02-16, which is the same as MH-NCCS-87448. There seem to be other sequences also with %identity = 100 but sampled earlier in 2020. |
Do these dates line up with what's in the paper that's referring to them? We can quite happily change the date of these specific sequences if we have good reason. |
For MH-NCCS-87448, yes. SRR14388093 (in the Viridian dataset) is reportedly sampled on the same date as MH-NCCS-87448. There are other samples in the Viridian dataset with 100% identity to MH-NCCS-87448 that we could use as seeds. @hyanwong MH-NCCS-87448 is B.1.617, not B.1.617.2, according to the China CDC Weekly paper you found. Are suggesting to maybe use a sequence identical or similar to MH-NCCS-87448 to seed (as a precursor) Delta and its related lineages? |
I don't think it can be SRR14388093 @szhan , that's coming up as a B.1.2 from Texas for me |
The other one looks better:
with the match: {
"strain": "ERR5461550",
"num_mismatches": 4,
"direction": "forward",
"match": {
"path": [
{
"left": 0,
"right": 28882,
"parent": 6303
},
{
"left": 28882,
"right": 29904,
"parent": 43646
}
],
"mutations": [
{
"site_id": 207,
"derived_state": "T",
"inherited_state": "G",
"site_position": 210,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 3443,
"derived_state": "T",
"inherited_state": "C",
"site_position": 3457,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 4950,
"derived_state": "T",
"inherited_state": "C",
"site_position": 4965,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 8117,
"derived_state": "T",
"inherited_state": "G",
"site_position": 8137,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 11173,
"derived_state": "G",
"inherited_state": "A",
"site_position": 11201,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 17485,
"derived_state": "T",
"inherited_state": "G",
"site_position": 17523,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 20351,
"derived_state": "G",
"inherited_state": "A",
"site_position": 20396,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 20356,
"derived_state": "G",
"inherited_state": "T",
"site_position": 20401,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 21842,
"derived_state": "C",
"inherited_state": "T",
"site_position": 21895,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 21968,
"derived_state": "A",
"inherited_state": "G",
"site_position": 22022,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 22859,
"derived_state": "G",
"inherited_state": "T",
"site_position": 22917,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 22954,
"derived_state": "C",
"inherited_state": "G",
"site_position": 23012,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 23545,
"derived_state": "G",
"inherited_state": "C",
"site_position": 23604,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 24714,
"derived_state": "T",
"inherited_state": "A",
"site_position": 24775,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 25407,
"derived_state": "T",
"inherited_state": "C",
"site_position": 25469,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 25532,
"derived_state": "T",
"inherited_state": "G",
"site_position": 25595,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 26699,
"derived_state": "G",
"inherited_state": "T",
"site_position": 26767,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 27563,
"derived_state": "C",
"inherited_state": "T",
"site_position": 27638,
"is_reversion": null,
"is_immediate_reversion": null
},
{
"site_id": 28794,
"derived_state": "T",
"inherited_state": "A",
"site_position": 28881,
"is_reversion": null,
"is_immediate_reversion": null
}
],
"likelihood": 2.171300934439176e-44,
"cost": 23
}
} For reference, here's the match of the early 617 that I put in:
|
Not sure what to make of this. I think we need to get some curated sequences in here in the right order for 617, 617.1, 617.2 and 617.3. It's going to be impossible to untangle otherwise. |
Hmm, that's weird. What countries are the locations for the other MH-NCCS-87448 100% matches, @szhan ? It's classified as B.1.617.3 in that China CDC paper, but with a collection date of 2020-10-02, not 2021-02-16 as you found. Maybe they have a typo in the paper? |
I got the date of 2021-02-16 in the description line above "hCoV-19/India/MH-NCCS-87448/2021|EPI ISL 1415203.2|2021-02-16". Just checked GISAID. It does say the collection date of MH-NCCS-87448 is 2021-02-16. Hmm. |
@hyanwong Are you referring to hCoV-19/India/ILSGS00308/2020|EPIISL1372093|2020-12-01, which was reportedly sampled on 2020-12-01? I'm not seeing another sample in the phylogeny in the CCDC Weekly report which was sampled on October 02, 2020. |
Tagging this ECDC report that summarises some useful epidemiological background about B.1.617.1, B.1.617.2, and B.1.617.3. |
Let's use the sequences designated to B.1.617.1 and Delta B.1.617.2 that are listed in These designated sequences were used to propose a new lineage in the first place, so they are the lineage-defining sequences. Assignment tools just use information in these designated sequences to assign variants to new sequences. For B.1.617.1 and Delta B.1.617.2, there are sequences from COG-UK, so they should be in the SRA/ENA and therefore also in the Viridian dataset. We can just use them as seeds. Sadly, there are no COG-UK sequences for B.1.617.3, but we can live without it as that lineage is not as important. |
Sounds great. Can you list them out here when you figure out the mapping to Viridian IDs please? |
There are 2,833 sequences listed in Note that these designated sequences are of good quality, and were sampled when Delta was just getting recognised. Also, they are not identical. It's not clear how to see one sequence as a seed. Building a consensus out of them wouldn't be good, I think, because there would likely lead to reversions as Delta samples get attached to it. We could also randomly choose a sequence, but that means there would be no good justification for choosing the sequence. What else could we do? |
Can you link to the pango designation issues for delta lineages please? Do they mention any specific sequences there? As the sequences are from UK, we can probably trust the dates and therefore pick a few early ones. |
It's these two issues about designated sequences for B.1.617.2 I've come across. In issue 113, a post includes a tree showing some samples collected in England. Maybe we could pick some of those as seeds? |
This is the issue about B.1.617.1. There are 93 sequences designated to B.1.617.1 in |
Can we see what happens if we consider all these sequences as a single "retro group", do the NJ thing, then add that in? I'm also keen to use not just UK sequences but also other (esp Indian/Maharashtra) ones, but only if they match a sequence that was described at the time in GISAID. Perhaps we can simply make a list of potential candidates which we can match to a Viridian sample? |
It might be important for resolving the ancestral states of B.1.617 in general though. If there are one or two B.1.617.3 samples in Viridian which have reasonable dates / QC and we can add to a whitelist, I think we should. |
Can we please just get some set of candidate sequences together so we can start trying stuff out? Looking at the 617.1 issue there's a handful of samples that should be mappable back. Can we start getting these and others lined up in a verified way? It's much more important that we have something trustworthy in the February-March date range to get things started than it is to get the exact sequences that started things off. We can specify hundreds or even thousands of sequences on the include list (whitelist is a deprecated term @hyanwong ) if we want. We should stop thinking about these as specific seed sequences, and just as a set of things that we are confident of the dates. The algorithm can (and should) deal with the rest. The clock is ticking here on getting a full run done before during the holiday, and we must get Delta origins reasonably sorted. |
Okay, I just retrieved the ENA sample accessions of the 93 sequences designated to B.1.617.1 manually from the ENA website.
|
Here are 200+ designated sequences for B.1.617.2 that we can work on for now. I'll get the rest later.
|
Great, are these all in late 2020 then? |
No, all early 2021. I'm not seeing any designated sequences in England sampled in 2020. |
Also, I'm not finding designated sequences for B.1.617.2 from India sampled in 2020. I've found a few for B.1.617.1 from India sampled in 2020, but I have no luck linking them to SRA/ENA/Viridian. |
Hmm, not having luck for those above either. |
Great, thanks. Anything for 617? |
Can you copy CSV of the data to slack also please? Would save me the trouble of parsing the markdown so I can join with viridian. Can you also include sampling date provided, so we can compare with viridian date? |
Yes, I'll provide a CSV file. I've just managed to link 900+ entries. So far, all the samples I've looked at are from COG-UK. Their dates were preferentially chosen over the dates from other sources, so they should be identical to those in the Viridian metadata file. |
Oddly, I'm only seeing one designated sequence for B.1.617. It is India/MH-NCCS-P1162000180788/2021, and I'm not finding an associated ENA/SRA accession for it. |
Okay, here is a much better approach, I think. We can download a file containing the run accession and sample alias (which is contained in the strain name) for all the genomes sequenced by the COG-UK, because they are organised under the project PRJEB37886 in the ENA. This file can be downloaded by going to https://www.ebi.ac.uk/ena/browser/view/PRJEB37886, and then by selecting the columns run accession and sample alias. There should be 2,700,702 entries. We can do some parsing to figure out which designated sequences have an ENA run accession, and based on that we can narrow down a list of candidate sequences to attach by the sampling dates. EDIT: We are using run accessions rather than sample accessions. |
Using the above approach, I'm getting 64 sequences for B.1.617.1 and 10 sequences for B.1.617.2 in the Viridian dataset with a sampling date before April 01, 2021 in the Viridian metadata file (excluding NY Eve 2020). The B.1.617.1 sequences are reportedly sampled 2021-02-22 to 2021-03-31 (inclusive). Note to self: There are 71 sequences for B.1.617.2 with sampling date of NY Eve 2020 in the Viridian metadata file. I manually checked a few against the ENA, and found that the reported sampling date is 2020. |
Just noting again that I'm not finding designated sequences for B.1.617 and B.1.617.3 from COG-UK with sampling dates before April 01, 2021. |
Great! Can you post the code for doing this in a pr so this is reproducible please? |
Can we use the ena rest API? https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html |
Turns out that we can just download all the ENA run accessions and sample aliases in one go easily by visiting the project page. No need for API access after all. |
This is what I'm doing to subset the sequences. import pandas as pd
# Pango designated sequences
# lineage, sample name
pango = pd.read_csv("lineages.csv.gz", sep=",")
pango["sample_name"] = [s.split("/")[1] for s in pango["taxon"]]
# ENA
# run accession, sample name
ena = pd.read_csv("filereport_read_run_PRJEB37886_tsv.txt.gz", sep="\t")
ena["sample_name"] = [s.split("/")[1] for s in ena["sample_alias"]]
# Viridian dataset
# strain (run accession)
viridian = pd.read_csv("../sc2ts/data/run_metadata.v04.renamed.dedup.trimmed.tsv", sep="\t")
viridian["parsed_datetime"] = pd.to_datetime(viridian["date"])
focal_pango = "B.1.617.1"
#focal_pango = "B.1.617.2"
out_file = "candidate_seeds_" + focal_pango + "_pre-2021-04-01.txt"
designated_samples = pango[pango["lineage"] == focal_pango]["sample_name"]
coguk_runs = ena[ena["sample_name"].isin(designated_samples)]["run_accession"]
viridian_samples = viridian[viridian["strain"].isin(coguk_runs)]
viridian_samples.head()
viridian_samples[
(viridian_samples["parsed_datetime"] < pd.to_datetime("2021-04-01")) & \
(viridian_samples["parsed_datetime"] != pd.to_datetime("2020-12-31"))
][["strain", "date"]].to_csv(out_file, index=False) |
Here's some useful info on 617 starting out in cov-lineages/pango-designation#38 and cov-lineages/pango-designation#49 |
xrefing #258 as there's lots of discussion there |
In the ARG as of 13th Oct 2024, at the start of the spike there is a single clade that comprises the delta samples ("B.1.617.2" or "AY.*"), which contains all 43028 delta samples plus two B.1-assigned samples (SRR23110826 (node 17569) and SRR11810706 (node 12805), both of which seem to be good candidates for samples close to the origin of delta (orange, below).
However, the origins of delta are a little messy, with reversions (in magenta) crammed into a short branch below node 140425 (expanded below), and several separate lineages. We think this is probably because of adding different retrospective groups from different countries. We should check whether this is improved in the next ARG iteration.
Note that nodes 140425 and its sister 147202 are recombination nodes (larger open circle used)
For huge clades like this, it can be helpful to subsample a few delta nodes, as per below
The text was updated successfully, but these errors were encountered: