Adding deletions, and testing compatibility with Dollo parsimony #250

hyanwong · 2024-11-26T23:29:00Z

We have been intending to add deletions back onto the sc2ts ARGs as a final processing step. @jeromekelleher suggests that this can be done on a base-by-base basis, but after that, we might possibly want to condense a set of adjacent deletions (of say, 6bp) into a single mutation. @IsobelGuthrie suggested that multi-base deletions of this sort will show asymmetrical mutation. That is, it is possible to mutate from the inserted state to the deleted state, but mutating back from the deletion to the original insertion is essentially impossible. Adding mutations that conform to this model is called "Dollo parsimony", after Dollo's law. If a switch from a deleted to an inserted state is observed, it is almost certainly due to some sort of recombination (unless the insertion is e.g. a tandem duplication)

This would possible be a good test of whether we have the genealogy correct. We can check whether we thing there are any "impossible mutations" of this sort.

jeromekelleher · 2024-11-29T11:31:50Z

This is now possible to do with the latest changes using the Zarr dataset.

hyanwong · 2024-11-29T11:36:56Z

@IsobelGuthrie: we tried mapping the deletion from 11288-11297 to the ARG, and it appears to occur both at the start of the Alpha outbreak and at the start of BA.2. Do you know if this is expected? Maybe we could get you to look at some of the other deletions, now that we can map them on?

IsobelGuthrie · 2024-11-29T11:48:36Z

There are a few recurrent deletions, so this isn't particularly unexpected. This paper notes it as emerging almost simultaneously in alpha, beta and gamma: https://pmc.ncbi.nlm.nih.gov/articles/PMC10804945/

"One such event relates to the almost simultaneous emergence of VOCs Alpha, Beta, and Gamma, all of them possessing a characteristic 9-bp deletion at position 11,288–11,296 in the NSP6 gene within the polyprotein ORF1a."

I'm happy to look at any deletions!

hyanwong · 2024-11-29T11:54:18Z

Interesting, thanks. I'm not sure if "simultaneous emergence" is code for "we think it happened multiple times independently" or "it could be ancestral or dues to recombination"?

@IsobelGuthrie now that we can map deletions on, perhaps I can meet up with you next week and look through the various deletions, whether their placement makes sense, and what tools we might need to make available to help explore this? Perhaps Tues or Wed?

IsobelGuthrie · 2024-11-29T12:35:59Z

Its phrased vaguely enough that they avoid having to address the point at all. No one is really sure to be honest. Happy to do Tuesday morning if that works?

hyanwong · 2024-12-03T16:11:59Z

Here is some code to add deletions onto the tree sequence:

def delete_site_mutations(tables, site_ids):
    # copied from the delete_sites code
    keep_sites = np.ones(len(tables.sites), dtype=bool)
    site_ids = tskit.util.safe_np_int_cast(site_ids, np.int32)
    if np.any(site_ids < 0) or np.any(site_ids >= len(tables.sites)):
        raise ValueError("Site ID out of bounds")
    keep_sites[site_ids] = 0
    mutations = tables.mutations
    keep_mutations = keep_sites[mutations.site]
    new_ds, new_ds_offset = tskit.tables.keep_with_offset(
        keep_mutations, mutations.derived_state, mutations.derived_state_offset
    )
    new_md, new_md_offset = tskit.tables.keep_with_offset(
        keep_mutations, mutations.metadata, mutations.metadata_offset
    )
    mutation_map = np.cumsum(keep_mutations, dtype=mutations.parent.dtype) - 1
    # Map parent == -1 to -1, and check this has worked (assumes tskit.NULL == -1)
    mutation_map = np.append(mutation_map, -1).astype(mutations.parent.dtype)
    assert mutation_map[tskit.NULL] == tskit.NULL
    mutations.set_columns(
        site=mutations.site[keep_mutations],
        node=mutations.node[keep_mutations],
        time=mutations.time[keep_mutations],
        derived_state=new_ds,
        derived_state_offset=new_ds_offset,
        parent=mutation_map[mutations.parent[keep_mutations]],
        metadata=new_md,
        metadata_offset=new_md_offset,
    )

def remap_regions(ts, ds, start, end):
    """
    Create a new tree sequence by mapping deletions from the `ds` dataset as
    single site changes onto the tree sequence `ts`
    """
    sample_id = ts.metadata["sc2ts"]["samples_strain"]
    tables = ts.dump_tables()
    nodes_time = ts.nodes_time
    tree = ts.first()
    delete_site_mutations(tables, [ts.site(position=p).id for p in range(start, end)])
    for var in ds.variants(sample_id[1:], np.arange(start, end)):
        tree.seek(var.position)
        site = ts.site(position=var.position)
            
        # # treat non-nucleotide alleles as missing data IUPAC codes
        keep = np.isin(var.alleles, np.array(['A', 'C', 'G', 'T', '-']))
        g = var.genotypes.copy()
        g[keep[g] == False] = tskit.MISSING_DATA
        anc_state = site.ancestral_state
        # First genotype is the (non-viridian) Wuhan sample, so add that
        g = np.append([list(var.alleles).index(anc_state)], g)
        _, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
        mut_id_map = {tskit.NULL: tskit.NULL}
        for list_id, mut in enumerate(mutations):
            mut_id_map[list_id] = tables.mutations.append(
                mut.replace(
                    site=site.id,
                    parent=mut_id_map[mut.parent],
                    time=nodes_time[mut.node],
                ))
    tables.sort()
    return tables.tree_sequence()

ds = sc2ts.Dataset("../data/viridian_2024-04-29.alpha1.zarr.zip")
del_ts = remap_regions(ts, ds, start=11284, end=11302)

jeromekelleher · 2024-12-03T16:45:08Z

Looks good! Simpler to just call tables.delete_sites and add back in the site information afterwards, though? Like

   tables.delete_sites(site_ids)
    tree = ts.first()
    delete_site_mutations(tables, [ts.site(position=p).id for p in range(start, end)])
    for var in ds.variants(sample_id[1:], np.arange(start, end)):
            site = ts.site(position=var.position)
            new_site_id = tables.sites.add_row(**site)
            _, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
            for mut in  mutations:
                tables.mutations.append(
                    mut.replace(
                         site=site_id,
                         time=nodes_time[mut.node],
                ))
     tables.sort()
     tables.index()
     tables.compute_mutation_parents()

Running compute_mutation_parents is less efficient than doing it by hand here, but it's one less thing to go wrong and test.

hyanwong · 2024-12-03T16:53:07Z

I forget that we had implemented a function to keep certain rows (e.g. of the mutations table), which is much easier than deleting the sites. So:

def remap_regions(ts, ds, start, end):
    """
    Create a new tree sequence by mapping deletions from the `ds` dataset as
    single site changes onto the tree sequence `ts`
    """
    sample_id = ts.metadata["sc2ts"]["samples_strain"]
    tables = ts.dump_tables()
    nodes_time = ts.nodes_time
    tree = ts.first()
    del_sites = [ts.site(position=p).id for p in range(start, end)]
    tables.mutations.keep_rows(np.logical_not(np.isin(ts.mutations_site, del_sites)))
    for var in ds.variants(sample_id[1:], np.arange(start, end)):
        tree.seek(var.position)
        site = ts.site(position=var.position)
            
        # # treat non-nucleotide alleles as missing data IUPAC codes
        keep = np.isin(var.alleles, np.array(['A', 'C', 'G', 'T', '-']))
        g = var.genotypes.copy()
        g[keep[g] == False] = tskit.MISSING_DATA
        anc_state = site.ancestral_state
        # First genotype is the (non-viridian) Wuhan sample, so add that
        g = np.append([list(var.alleles).index(anc_state)], g)
        _, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
        mut_id_map = {tskit.NULL: tskit.NULL}
        for list_id, mut in enumerate(mutations):
            mut_id_map[list_id] = tables.mutations.append(
                mut.replace(
                    site=site.id,
                    parent=mut_id_map[mut.parent],
                    time=nodes_time[mut.node],
                ))
    tables.sort()
    return tables.tree_sequence()

hyanwong · 2024-12-03T17:14:17Z

Using the ARG visualizer, we can highlight the deletions nicely by using:

d3arg.mutations.loc[d3arg.mutations.derived == "-", "fill"] = "white"

Note that it would be nice to use a different colour (?pink) for mutations whose inherited state is the deletion, but which change to a non-deleted version. This will have to wait for kitchensjn/tskit_arg_visualizer#111.

I also suggest that we adopt the same colour scheme as in the tree plots, and use red for mutations that occur multiple times on the tree, and magenta for ones that are reversions of the previous state?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding deletions, and testing compatibility with Dollo parsimony #250

Adding deletions, and testing compatibility with Dollo parsimony #250

hyanwong commented Nov 26, 2024

jeromekelleher commented Nov 29, 2024

hyanwong commented Nov 29, 2024

IsobelGuthrie commented Nov 29, 2024

hyanwong commented Nov 29, 2024

IsobelGuthrie commented Nov 29, 2024

hyanwong commented Dec 3, 2024 •

edited

Loading

jeromekelleher commented Dec 3, 2024

hyanwong commented Dec 3, 2024 •

edited

Loading

hyanwong commented Dec 3, 2024

Adding deletions, and testing compatibility with Dollo parsimony #250

Adding deletions, and testing compatibility with Dollo parsimony #250

Comments

hyanwong commented Nov 26, 2024

jeromekelleher commented Nov 29, 2024

hyanwong commented Nov 29, 2024

IsobelGuthrie commented Nov 29, 2024

hyanwong commented Nov 29, 2024

IsobelGuthrie commented Nov 29, 2024

hyanwong commented Dec 3, 2024 • edited Loading

jeromekelleher commented Dec 3, 2024

hyanwong commented Dec 3, 2024 • edited Loading

hyanwong commented Dec 3, 2024

hyanwong commented Dec 3, 2024 •

edited

Loading

hyanwong commented Dec 3, 2024 •

edited

Loading