Reading Anndata from only parts of h5ad file: Hack solution #1517

Open
mtvector opened this issue Jun 7, 2024 · 8 comments

mtvector commented Jun 7, 2024

I came across a previous issue #436 and couldn't get the dask solution working with my application, so I came up with a somewhat hacky solution for reading only the desired fields from an h5ad file into an AnnData object (not chunking). It works by building a tree of all the fields in the HDF5 file, searching that tree for fields matching the ones you want to load, then loading the ancestors and descendants of each matching field (see code below). This is useful if you want to keep all your data together on disk but only need some fields in memory.

Basically you run:

read_h5ad_backed_selective(model_path / 'p3_adata.h5ad', mode='r', selected_keys=['spliced', 'S_score', 'batch_name', 'var', 'uns', 'X_antipode'])

and get back:

AnnData object with n_obs × n_vars = 493759 × 16738 backed at './1.9.1.8.3_Dev-DMR/p3_adata.h5ad'
    obs: 'S_score', 'batch_name'
    var: '_index', 'h_gene', 'h_gene_size', 'h_left_flank', 'h_right_flank', 'h_syntenic'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'antipode_cluster_colors'
    obsm: 'X_antipode'
    layers: 'spliced'

Just wanted to share in case it is useful to someone!

# Author: mtvector a.k.a. Matthew Schmitz
import h5py
import anndata
from anndata import AnnData
from pathlib import Path
from typing import List, Literal, Union
from ete3 import Tree
from scipy.sparse import csr_matrix
import pandas as pd

def h5_tree(val):
    tree = {}
    for key, item in val.items():
        if isinstance(item, h5py.Group):
            tree[key] = h5_tree(item)
        else:
            try:
                tree[key] = len(item)
            except TypeError:
                tree[key] = "scalar"
    return tree
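The tree that h5_tree builds is just a nested dict of group names, with each leaf holding the dataset's length (or "scalar" for zero-dimensional datasets). A minimal sketch of its output on a throwaway HDF5 file; the field names here are illustrative, not part of the AnnData layout:

```python
import os
import tempfile

import h5py


def h5_tree(val):  # same helper as defined above
    tree = {}
    for key, item in val.items():
        if isinstance(item, h5py.Group):
            tree[key] = h5_tree(item)
        else:
            try:
                tree[key] = len(item)
            except TypeError:
                tree[key] = "scalar"
    return tree


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")
    with h5py.File(path, "w") as f:
        # A group with one dataset, plus a scalar dataset at the root.
        f.create_group("obs").create_dataset("S_score", data=[0.1, 0.2, 0.3])
        f.create_dataset("n_cells", data=42)
    with h5py.File(path, "r") as f:
        tree = h5_tree(f)

print(tree)  # {'n_cells': 'scalar', 'obs': {'S_score': 3}}
```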


def dict_to_ete3_tree(d, parent=None):
    if parent is None:
        parent = Tree(name="root")
    for k, v in d.items():
        child = parent.add_child(name=k)
        if isinstance(v, dict):
            dict_to_ete3_tree(v, child)
    return parent


def ete3_tree_to_dict(tree):
    def helper(node):
        if node.is_leaf():
            return node.name
        d = {}
        for child in node.get_children():
            d[child.name] = helper(child)
        return d

    root_dict = {}
    for child in tree.get_children():
        root_dict[child.name] = helper(child)
    return root_dict


def prune_tree(tree, keys):
    t = dict_to_ete3_tree(tree)

    nodes_to_keep = set()

    # Find all nodes matching the keys and collect their ancestors and descendants
    for key in keys:
        for node in t.search_nodes(name=key):
            nodes_to_keep.update(node.iter_ancestors())
            nodes_to_keep.update(node.iter_descendants())
            nodes_to_keep.add(node)

    # Prune the original tree by removing nodes that are not in nodes_to_keep
    for node in t.traverse("postorder"):
        if node not in nodes_to_keep and node.up:
            node.detach()

    pruned_dict = ete3_tree_to_dict(t)
    return pruned_dict
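The keep-rule prune_tree implements (keep every matching node plus all of its ancestors and descendants) can also be sketched with plain dict recursion, avoiding the ete3 dependency. This is a simplified equivalent, not the author's implementation, and the field names are illustrative:

```python
def prune_dict(tree, keys):
    """Keep entries whose name is in `keys` (with their whole subtree),
    plus any ancestor group on the path to a kept entry."""
    pruned = {}
    for name, sub in tree.items():
        if name in keys:
            # Matched node: keep it and everything beneath it.
            pruned[name] = sub
        elif isinstance(sub, dict):
            # Recurse; keeping a non-empty result preserves the ancestor chain.
            kept = prune_dict(sub, keys)
            if kept:
                pruned[name] = kept
    return pruned


tree = {
    "layers": {"spliced": 100, "unspliced": 100},
    "obs": {"S_score": 10, "leiden": {"codes": 10, "categories": 3}},
    "varm": {"PCs": 50},
}
print(prune_dict(tree, ["spliced", "S_score"]))
# {'layers': {'spliced': 100}, 'obs': {'S_score': 10}}
```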


def read_h5_to_dict(h5_group, pruned_tree):
    def helper(group, subtree):
        result = {}
        for key, value in subtree.items():
            if isinstance(value, dict):
                # Ensure key exists in the group before trying to access it
                if key in group:
                    result[key] = helper(group[key], value)
                else:
                    result[key] = None  # Handle missing keys gracefully
            else:
                # Ensure key exists in the group before trying to access it
                if key in group:
                    if isinstance(group[key], h5py.Dataset):
                        if group[key].shape == ():
                            result[key] = group[key][()]  # Read scalar dataset
                        else:
                            data = group[key][:]
                            # Decode binary strings to regular strings if necessary
                            if data.dtype.kind == "S":
                                data = data.astype(str)
                            result[key] = data  # Read non-scalar dataset
                    else:
                        result[key] = None  # Handle non-dataset values gracefully
                else:
                    result[key] = None  # Handle missing keys gracefully
        return result

    return helper(h5_group, pruned_tree)


def convert_to_dataframe(data):
    df_dict = {}
    for key, value in data.items():
        if isinstance(value, dict) and "categories" in value and "codes" in value:
            categories = [
                cat.decode("utf-8") if isinstance(cat, bytes) else cat
                for cat in value["categories"]
            ]
            codes = value["codes"]
            df_dict[key] = pd.Categorical.from_codes(codes, categories)
        elif (
            isinstance(value, dict)
            and "data" in value
            and "indices" in value
            and "indptr" in value
        ):
            df_dict[key] = csr_matrix(
                (value["data"], value["indices"], value["indptr"])
            )
        else:
            if hasattr(value, "dtype") and value.dtype.kind == "S":
                value = [
                    v.decode("utf-8") if isinstance(v, bytes) else v for v in value
                ]
            df_dict[key] = value
    df = pd.DataFrame(df_dict)
    return df
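The categorical branch above relies on how h5ad stores pandas categoricals: a `codes` integer array plus a `categories` array (byte strings in HDF5), which `pd.Categorical.from_codes` reassembles. A small sketch with made-up values:

```python
import pandas as pd

codes = [0, 1, 1, 0, 2]  # one integer code per cell
categories = [b"batch_A", b"batch_B", b"batch_C"]  # stored as bytes in HDF5
decoded = [c.decode("utf-8") for c in categories]

col = pd.Categorical.from_codes(codes, decoded)
df = pd.DataFrame({"batch_name": col})
print(df["batch_name"].tolist())
# ['batch_A', 'batch_B', 'batch_B', 'batch_A', 'batch_C']
```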


def handle_special_keys(data):
    if "obs" in data:
        data["obs"] = convert_to_dataframe(data["obs"])
    if "var" in data:
        data["var"] = convert_to_dataframe(data["var"])
    # Reassemble sparse matrices from their CSR component arrays.
    if "X" in data and isinstance(data["X"], dict):
        try:
            data["X"] = csr_matrix(
                (data["X"]["data"], data["X"]["indices"], data["X"]["indptr"])
            )
        except KeyError:
            pass  # X is not stored as CSR arrays; leave it untouched
    if "layers" in data:
        for layer in data["layers"]:
            try:
                data["layers"][layer] = csr_matrix(
                    (
                        data["layers"][layer]["data"],
                        data["layers"][layer]["indices"],
                        data["layers"][layer]["indptr"],
                    )
                )
            except (KeyError, TypeError):
                pass  # leave non-CSR layers untouched
    return data
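The CSR reconstruction used for layers works because h5ad stores a sparse matrix as three arrays (`data`, `indices`, `indptr`), which `scipy.sparse.csr_matrix` reassembles directly. A tiny sketch with hand-written values:

```python
from scipy.sparse import csr_matrix

# These three arrays encode the 2x3 matrix [[1, 0, 2], [0, 0, 3]]:
# row 0 holds data[0:2] = [1, 2] at columns [0, 2],
# row 1 holds data[2:3] = [3] at column 2.
layer = {"data": [1, 2, 3], "indices": [0, 2, 2], "indptr": [0, 2, 3]}
mat = csr_matrix((layer["data"], layer["indices"], layer["indptr"]))
print(mat.toarray())
# [[1 0 2]
#  [0 0 3]]
```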


def read_h5ad_backed_selective(
    filename: Union[str, Path],
    mode: Literal["r", "r+"],
    selected_keys: Union[List[str], None] = None,
    return_dict: bool = False,
) -> AnnData:
    """Read only the selected fields of an .h5ad file into a backed AnnData."""
    f = h5py.File(filename, mode)

    tree = h5_tree(f)
    selected_tree = prune_tree(tree, selected_keys or [])
    with f:
        d = read_h5_to_dict(f, selected_tree)
        d = handle_special_keys(d)
    d["filename"] = filename
    d["filemode"] = mode
    if return_dict:
        return d
    return AnnData(**d)
@flying-sheep (Member)

This does seem useful, thank you!

As you mentioned, it doesn’t integrate well with our versioned IO yet, so it’s really more of a recipe than a PR for now.

This might be a good addition for our “How to” tutorials section.

Are you interested in writing a little notebook?


mtvector commented Jun 9, 2024

@flying-sheep Yeah, I can try to make a little notebook. I'm also working on a function to overwrite selected fields, which I can include. Should I just commit it and link you here?


rm1113 commented Jun 12, 2024

Perhaps our more general solution, which is based on `from anndata._io.specs import read_elem`, could address this use case as well: https://pypi.org/project/cap-anndata/

@flying-sheep (Member)

Interesting! You don’t have to use an internal API for it btw, we’re exporting e.g. anndata.experimental.read_elem by now.

@flying-sheep (Member)

Should I just commit it and link you here?

Yeah, I'll check whether it's a candidate for inclusion in our tutorial notebooks with a few changes, and we can go from there.


mtvector commented Jun 13, 2024 via email


gtca commented Jul 2, 2024

Just stumbled upon this issue, and as it's pretty recent, wanted to give a pointer to yet another experimental approach to only load parts of the data — https://github.com/scverse/shadows. Please give us feedback in case you end up trying it out!


mtvector commented Jul 2, 2024

I love it when you start a conversation with an ad hoc approach and end up with several robust, purpose-built solutions :)
