Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding deletions, and testing compatibility with Dollo parsimony #250

Open
hyanwong opened this issue Nov 26, 2024 · 9 comments
Open

Adding deletions, and testing compatibility with Dollo parsimony #250

hyanwong opened this issue Nov 26, 2024 · 9 comments

Comments

@hyanwong
Copy link
Collaborator

We have been intending to add deletions back onto the sc2ts ARGs as a final processing step. @jeromekelleher suggests that this can be done on a base-by-base basis, but after that, we might possibly want to condense a set of adjacent deletions (of say, 6bp) into a single mutation. @IsobelGuthrie suggested that multi-base deletions of this sort will show asymmetrical mutation. That is, it is possible to mutate from the inserted state to the deleted state, but mutating back from the deletion to the original insertion is essentially impossible. Adding mutations that conform to this model is called "Dollo parsimony", after Dollo's law. If a switch from a deleted to an inserted state is observed, it is almost certainly due to some sort of recombination (unless the insertion is e.g. a tandem duplication)

This would possible be a good test of whether we have the genealogy correct. We can check whether we thing there are any "impossible mutations" of this sort.

@jeromekelleher
Copy link
Owner

This is now possible to do with the latest changes using the Zarr dataset.

@hyanwong
Copy link
Collaborator Author

@IsobelGuthrie: we tried mapping the deletion from 11288-11297 to the ARG, and it appears to occur both at the start of the Alpha outbreak and at the start of BA.2. Do you know if this is expected? Maybe we could get you to look at some of the other deletions, now that we can map them on?

@IsobelGuthrie
Copy link

There are a few recurrent deletions, so this isn't particularly unexpected. This paper notes it as emerging almost simultaneously in alpha, beta and gamma: https://pmc.ncbi.nlm.nih.gov/articles/PMC10804945/

"One such event relates to the almost simultaneous emergence of VOCs Alpha, Beta, and Gamma, all of them possessing a characteristic 9-bp deletion at position 11,288–11,296 in the NSP6 gene within the polyprotein ORF1a."

I'm happy to look at any deletions!

@hyanwong
Copy link
Collaborator Author

Interesting, thanks. I'm not sure if "simultaneous emergence" is code for "we think it happened multiple times independently" or "it could be ancestral or dues to recombination"?

@IsobelGuthrie now that we can map deletions on, perhaps I can meet up with you next week and look through the various deletions, whether their placement makes sense, and what tools we might need to make available to help explore this? Perhaps Tues or Wed?

@IsobelGuthrie
Copy link

Its phrased vaguely enough that they avoid having to address the point at all. No one is really sure to be honest. Happy to do Tuesday morning if that works?

@hyanwong
Copy link
Collaborator Author

hyanwong commented Dec 3, 2024

Here is some code to add deletions onto the tree sequence:

def delete_site_mutations(tables, site_ids):
    # copied from the delete_sites code
    keep_sites = np.ones(len(tables.sites), dtype=bool)
    site_ids = tskit.util.safe_np_int_cast(site_ids, np.int32)
    if np.any(site_ids < 0) or np.any(site_ids >= len(tables.sites)):
        raise ValueError("Site ID out of bounds")
    keep_sites[site_ids] = 0
    mutations = tables.mutations
    keep_mutations = keep_sites[mutations.site]
    new_ds, new_ds_offset = tskit.tables.keep_with_offset(
        keep_mutations, mutations.derived_state, mutations.derived_state_offset
    )
    new_md, new_md_offset = tskit.tables.keep_with_offset(
        keep_mutations, mutations.metadata, mutations.metadata_offset
    )
    mutation_map = np.cumsum(keep_mutations, dtype=mutations.parent.dtype) - 1
    # Map parent == -1 to -1, and check this has worked (assumes tskit.NULL == -1)
    mutation_map = np.append(mutation_map, -1).astype(mutations.parent.dtype)
    assert mutation_map[tskit.NULL] == tskit.NULL
    mutations.set_columns(
        site=mutations.site[keep_mutations],
        node=mutations.node[keep_mutations],
        time=mutations.time[keep_mutations],
        derived_state=new_ds,
        derived_state_offset=new_ds_offset,
        parent=mutation_map[mutations.parent[keep_mutations]],
        metadata=new_md,
        metadata_offset=new_md_offset,
    )

def remap_regions(ts, ds, start, end):
    """
    Create a new tree sequence by mapping deletions from the `ds` dataset as
    single site changes onto the tree sequence `ts`
    """
    sample_id = ts.metadata["sc2ts"]["samples_strain"]
    tables = ts.dump_tables()
    nodes_time = ts.nodes_time
    tree = ts.first()
    delete_site_mutations(tables, [ts.site(position=p).id for p in range(start, end)])
    for var in ds.variants(sample_id[1:], np.arange(start, end)):
        tree.seek(var.position)
        site = ts.site(position=var.position)
            
        # # treat non-nucleotide alleles as missing data IUPAC codes
        keep = np.isin(var.alleles, np.array(['A', 'C', 'G', 'T', '-']))
        g = var.genotypes.copy()
        g[keep[g] == False] = tskit.MISSING_DATA
        anc_state = site.ancestral_state
        # First genotype is the (non-viridian) Wuhan sample, so add that
        g = np.append([list(var.alleles).index(anc_state)], g)
        _, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
        mut_id_map = {tskit.NULL: tskit.NULL}
        for list_id, mut in enumerate(mutations):
            mut_id_map[list_id] = tables.mutations.append(
                mut.replace(
                    site=site.id,
                    parent=mut_id_map[mut.parent],
                    time=nodes_time[mut.node],
                ))
    tables.sort()
    return tables.tree_sequence()

ds = sc2ts.Dataset("../data/viridian_2024-04-29.alpha1.zarr.zip")
del_ts = remap_regions(ts, ds, start=11284, end=11302)

@jeromekelleher
Copy link
Owner

Looks good! Simpler to just call tables.delete_sites and add back in the site information afterwards, though? Like

   tables.delete_sites(site_ids)
    tree = ts.first()
    delete_site_mutations(tables, [ts.site(position=p).id for p in range(start, end)])
    for var in ds.variants(sample_id[1:], np.arange(start, end)):
            site = ts.site(position=var.position)
            new_site_id = tables.sites.add_row(**site)
            _, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
            for mut in  mutations:
                tables.mutations.append(
                    mut.replace(
                         site=site_id,
                         time=nodes_time[mut.node],
                ))
     tables.sort()
     tables.index()
     tables.compute_mutation_parents()

Running compute_mutation_parents is less efficient than doing it by hand here, but it's one less thing to go wrong and test.

@hyanwong
Copy link
Collaborator Author

hyanwong commented Dec 3, 2024

I forget that we had implemented a function to keep certain rows (e.g. of the mutations table), which is much easier than deleting the sites. So:

def remap_regions(ts, ds, start, end):
    """
    Create a new tree sequence by mapping deletions from the `ds` dataset as
    single site changes onto the tree sequence `ts`
    """
    sample_id = ts.metadata["sc2ts"]["samples_strain"]
    tables = ts.dump_tables()
    nodes_time = ts.nodes_time
    tree = ts.first()
    del_sites = [ts.site(position=p).id for p in range(start, end)]
    tables.mutations.keep_rows(np.logical_not(np.isin(ts.mutations_site, del_sites)))
    for var in ds.variants(sample_id[1:], np.arange(start, end)):
        tree.seek(var.position)
        site = ts.site(position=var.position)
            
        # # treat non-nucleotide alleles as missing data IUPAC codes
        keep = np.isin(var.alleles, np.array(['A', 'C', 'G', 'T', '-']))
        g = var.genotypes.copy()
        g[keep[g] == False] = tskit.MISSING_DATA
        anc_state = site.ancestral_state
        # First genotype is the (non-viridian) Wuhan sample, so add that
        g = np.append([list(var.alleles).index(anc_state)], g)
        _, mutations = tree.map_mutations(g, list(var.alleles), ancestral_state=anc_state)
        mut_id_map = {tskit.NULL: tskit.NULL}
        for list_id, mut in enumerate(mutations):
            mut_id_map[list_id] = tables.mutations.append(
                mut.replace(
                    site=site.id,
                    parent=mut_id_map[mut.parent],
                    time=nodes_time[mut.node],
                ))
    tables.sort()
    return tables.tree_sequence()

@hyanwong
Copy link
Collaborator Author

hyanwong commented Dec 3, 2024

Using the ARG visualizer, we can highlight the deletions nicely by using:

d3arg.mutations.loc[d3arg.mutations.derived == "-", "fill"] = "white"
Screenshot 2024-12-03 at 17 10 51

Note that it would be nice to use a different colour (?pink) for mutations whose inherited state is the deletion, but which change to a non-deleted version. This will have to wait for kitchensjn/tskit_arg_visualizer#111.

I also suggest that we adopt the same colour scheme as in the tree plots, and use red for mutations that occur multiple times on the tree, and magenta for ones that are reversions of the previous state?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants