-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding deletions, and testing compatibility with Dollo parsimony #250
Comments
This is now possible to do with the latest changes using the Zarr dataset. |
@IsobelGuthrie: we tried mapping the deletion from 11288-11297 to the ARG, and it appears to occur both at the start of the Alpha outbreak and at the start of BA.2. Do you know if this is expected? Maybe we could get you to look at some of the other deletions, now that we can map them on? |
There are a few recurrent deletions, so this isn't particularly unexpected. This paper notes it as emerging almost simultaneously in alpha, beta and gamma: https://pmc.ncbi.nlm.nih.gov/articles/PMC10804945/ "One such event relates to the almost simultaneous emergence of VOCs Alpha, Beta, and Gamma, all of them possessing a characteristic 9-bp deletion at position 11,288–11,296 in the NSP6 gene within the polyprotein ORF1a." I'm happy to look at any deletions! |
Interesting, thanks. I'm not sure if "simultaneous emergence" is code for "we think it happened multiple times independently" or "it could be ancestral or dues to recombination"? @IsobelGuthrie now that we can map deletions on, perhaps I can meet up with you next week and look through the various deletions, whether their placement makes sense, and what tools we might need to make available to help explore this? Perhaps Tues or Wed? |
Its phrased vaguely enough that they avoid having to address the point at all. No one is really sure to be honest. Happy to do Tuesday morning if that works? |
Here is some code to add deletions onto the tree sequence: def delete_site_mutations(tables, site_ids):
# copied from the delete_sites code
keep_sites = np.ones(len(tables.sites), dtype=bool)
site_ids = tskit.util.safe_np_int_cast(site_ids, np.int32)
if np.any(site_ids < 0) or np.any(site_ids >= len(tables.sites)):
raise ValueError("Site ID out of bounds")
keep_sites[site_ids] = 0
mutations = tables.mutations
keep_mutations = keep_sites[mutations.site]
new_ds, new_ds_offset = tskit.tables.keep_with_offset(
keep_mutations, mutations.derived_state, mutations.derived_state_offset
)
new_md, new_md_offset = tskit.tables.keep_with_offset(
keep_mutations, mutations.metadata, mutations.metadata_offset
)
mutation_map = np.cumsum(keep_mutations, dtype=mutations.parent.dtype) - 1
# Map parent == -1 to -1, and check this has worked (assumes tskit.NULL == -1)
mutation_map = np.append(mutation_map, -1).astype(mutations.parent.dtype)
assert mutation_map[tskit.NULL] == tskit.NULL
mutations.set_columns(
site=mutations.site[keep_mutations],
node=mutations.node[keep_mutations],
time=mutations.time[keep_mutations],
derived_state=new_ds,
derived_state_offset=new_ds_offset,
parent=mutation_map[mutations.parent[keep_mutations]],
metadata=new_md,
metadata_offset=new_md_offset,
)
def remap_regions(ts, ds, start, end):
"""
Create a new tree sequence by mapping deletions from the `ds` dataset as
single site changes onto the tree sequence `ts`
"""
sample_id = ts.metadata["sc2ts"]["samples_strain"]
tables = ts.dump_tables()
nodes_time = ts.nodes_time
tree = ts.first()
delete_site_mutations(tables, [ts.site(position=p).id for p in range(start, end)])
for var in ds.variants(sample_id[1:], np.arange(start, end)):
tree.seek(var.position)
site = ts.site(position=var.position)
# # treat non-nucleotide alleles as missing data IUPAC codes
keep = np.isin(var.alleles, np.array(['A', 'C', 'G', 'T', '-']))
g = var.genotypes.copy()
g[keep[g] == False] = tskit.MISSING_DATA
anc_state = site.ancestral_state
# First genotype is the (non-viridian) Wuhan sample, so add that
g = np.append([list(var.alleles).index(anc_state)], g)
_, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
mut_id_map = {tskit.NULL: tskit.NULL}
for list_id, mut in enumerate(mutations):
mut_id_map[list_id] = tables.mutations.append(
mut.replace(
site=site.id,
parent=mut_id_map[mut.parent],
time=nodes_time[mut.node],
))
tables.sort()
return tables.tree_sequence()
ds = sc2ts.Dataset("../data/viridian_2024-04-29.alpha1.zarr.zip")
del_ts = remap_regions(ts, ds, start=11284, end=11302) |
Looks good! Simpler to just call
Running compute_mutation_parents is less efficient than doing it by hand here, but it's one less thing to go wrong and test. |
I forget that we had implemented a function to keep certain rows (e.g. of the mutations table), which is much easier than deleting the sites. So: def remap_regions(ts, ds, start, end):
"""
Create a new tree sequence by mapping deletions from the `ds` dataset as
single site changes onto the tree sequence `ts`
"""
sample_id = ts.metadata["sc2ts"]["samples_strain"]
tables = ts.dump_tables()
nodes_time = ts.nodes_time
tree = ts.first()
del_sites = [ts.site(position=p).id for p in range(start, end)]
tables.mutations.keep_rows(np.logical_not(np.isin(ts.mutations_site, del_sites)))
for var in ds.variants(sample_id[1:], np.arange(start, end)):
tree.seek(var.position)
site = ts.site(position=var.position)
# # treat non-nucleotide alleles as missing data IUPAC codes
keep = np.isin(var.alleles, np.array(['A', 'C', 'G', 'T', '-']))
g = var.genotypes.copy()
g[keep[g] == False] = tskit.MISSING_DATA
anc_state = site.ancestral_state
# First genotype is the (non-viridian) Wuhan sample, so add that
g = np.append([list(var.alleles).index(anc_state)], g)
_, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
mut_id_map = {tskit.NULL: tskit.NULL}
for list_id, mut in enumerate(mutations):
mut_id_map[list_id] = tables.mutations.append(
mut.replace(
site=site.id,
parent=mut_id_map[mut.parent],
time=nodes_time[mut.node],
))
tables.sort()
return tables.tree_sequence() |
Using the ARG visualizer, we can highlight the deletions nicely by using: d3arg.mutations.loc[d3arg.mutations.derived == "-", "fill"] = "white" Note that it would be nice to use a different colour (?pink) for mutations whose inherited state is the deletion, but which change to a non-deleted version. This will have to wait for kitchensjn/tskit_arg_visualizer#111. I also suggest that we adopt the same colour scheme as in the tree plots, and use red for mutations that occur multiple times on the tree, and magenta for ones that are reversions of the previous state? |
We have been intending to add deletions back onto the sc2ts ARGs as a final processing step. @jeromekelleher suggests that this can be done on a base-by-base basis, but after that, we might possibly want to condense a set of adjacent deletions (of say, 6bp) into a single mutation. @IsobelGuthrie suggested that multi-base deletions of this sort will show asymmetrical mutation. That is, it is possible to mutate from the inserted state to the deleted state, but mutating back from the deletion to the original insertion is essentially impossible. Adding mutations that conform to this model is called "Dollo parsimony", after Dollo's law. If a switch from a deleted to an inserted state is observed, it is almost certainly due to some sort of recombination (unless the insertion is e.g. a tandem duplication)
This would possible be a good test of whether we have the genealogy correct. We can check whether we thing there are any "impossible mutations" of this sort.
The text was updated successfully, but these errors were encountered: