Reading Anndata from only parts of h5ad file: Hack solution #1517

Open
mtvector opened this issue Jun 7, 2024 · 8 comments

mtvector commented Jun 7, 2024

I came across a previous issue #436 and couldn't get the dask solution working with my application, so I came up with a somewhat hacky solution for reading only the desired fields from an h5ad file into an AnnData object (not chunking). It works by building a tree of all the fields in the HDF5 file, searching that tree for fields matching the ones you want to load, then loading the ancestors and descendants of each matching field (see code below). This is useful if you want to keep all your data together on disk but only need some fields in memory.

Basically you run:

read_h5ad_backed_selective(model_path / 'p3_adata.h5ad', mode='r', selected_keys=['spliced', 'S_score', 'batch_name', 'var', 'uns', 'X_antipode'])

and get back:

AnnData object with n_obs × n_vars = 493759 × 16738 backed at './1.9.1.8.3_Dev-DMR/p3_adata.h5ad'
    obs: 'S_score', 'batch_name'
    var: '_index', 'h_gene', 'h_gene_size', 'h_left_flank', 'h_right_flank', 'h_syntenic'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'antipode_cluster_colors'
    obsm: 'X_antipode'
    layers: 'spliced'

Just wanted to share in case it is useful to someone!

# Author: mtvector a.k.a. Matthew Schmitz
import h5py
import anndata
from anndata import AnnData
from pathlib import Path
from typing import List, Literal, Union
from ete3 import Tree
from scipy.sparse import csr_matrix
import pandas as pd

def h5_tree(val):
    tree = {}
    for key, item in val.items():
        if isinstance(item, h5py.Group):
            tree[key] = h5_tree(item)
        else:
            try:
                tree[key] = len(item)
            except TypeError:
                tree[key] = "scalar"
    return tree
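The tree that h5_tree builds is just a nested dict of group names, with each leaf holding the dataset's length (or "scalar" for zero-dimensional datasets). A minimal sketch of its output on a throwaway HDF5 file; the field names here are illustrative, not part of the AnnData layout:

```python
import os
import tempfile

import h5py


def h5_tree(val):  # same helper as defined above
    tree = {}
    for key, item in val.items():
        if isinstance(item, h5py.Group):
            tree[key] = h5_tree(item)
        else:
            try:
                tree[key] = len(item)
            except TypeError:
                tree[key] = "scalar"
    return tree


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")
    with h5py.File(path, "w") as f:
        # A group with one dataset, plus a scalar dataset at the root.
        f.create_group("obs").create_dataset("S_score", data=[0.1, 0.2, 0.3])
        f.create_dataset("n_cells", data=42)
    with h5py.File(path, "r") as f:
        tree = h5_tree(f)

print(tree)  # {'n_cells': 'scalar', 'obs': {'S_score': 3}}
```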


def dict_to_ete3_tree(d, parent=None):
    if parent is None:
        parent = Tree(name="root")
    for k, v in d.items():
        child = parent.add_child(name=k)
        if isinstance(v, dict):
            dict_to_ete3_tree(v, child)
    return parent


def ete3_tree_to_dict(tree):
    def helper(node):
        if node.is_leaf():
            return node.name
        d = {}
        for child in node.get_children():
            d[child.name] = helper(child)
        return d

    root_dict = {}
    for child in tree.get_children():
        root_dict[child.name] = helper(child)
    return root_dict


def prune_tree(tree, keys):
    t = dict_to_ete3_tree(tree)

    nodes_to_keep = set()

    # Find all nodes matching the keys and collect their ancestors and descendants
    for key in keys:
        for node in t.search_nodes(name=key):
            nodes_to_keep.update(node.iter_ancestors())
            nodes_to_keep.update(node.iter_descendants())
            nodes_to_keep.add(node)

    # Prune the original tree by removing nodes that are not in nodes_to_keep
    for node in t.traverse("postorder"):
        if node not in nodes_to_keep and node.up:
            node.detach()

    pruned_dict = ete3_tree_to_dict(t)
    return pruned_dict
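The keep-rule prune_tree implements (keep every matching node plus all of its ancestors and descendants) can also be sketched with plain dict recursion, avoiding the ete3 dependency. This is a simplified equivalent, not the author's implementation, and the field names are illustrative:

```python
def prune_dict(tree, keys):
    """Keep entries whose name is in `keys` (with their whole subtree),
    plus any ancestor group on the path to a kept entry."""
    pruned = {}
    for name, sub in tree.items():
        if name in keys:
            # Matched node: keep it and everything beneath it.
            pruned[name] = sub
        elif isinstance(sub, dict):
            # Recurse; keeping a non-empty result preserves the ancestor chain.
            kept = prune_dict(sub, keys)
            if kept:
                pruned[name] = kept
    return pruned


tree = {
    "layers": {"spliced": 100, "unspliced": 100},
    "obs": {"S_score": 10, "leiden": {"codes": 10, "categories": 3}},
    "varm": {"PCs": 50},
}
print(prune_dict(tree, ["spliced", "S_score"]))
# {'layers': {'spliced': 100}, 'obs': {'S_score': 10}}
```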


def read_h5_to_dict(h5_group, pruned_tree):
    def helper(group, subtree):
        result = {}
        for key, value in subtree.items():
            if isinstance(value, dict):
                # Ensure key exists in the group before trying to access it
                if key in group:
                    result[key] = helper(group[key], value)
                else:
                    result[key] = None  # Handle missing keys gracefully
            else:
                # Ensure key exists in the group before trying to access it
                if key in group:
                    if isinstance(group[key], h5py.Dataset):
                        if group[key].shape == ():
                            result[key] = group[key][()]  # Read scalar dataset
                        else:
                            data = group[key][:]
                            # Decode binary strings to regular strings if necessary
                            if data.dtype.kind == "S":
                                data = data.astype(str)
                            result[key] = data  # Read non-scalar dataset
                    else:
                        result[key] = None  # Handle non-dataset values gracefully
                else:
                    result[key] = None  # Handle missing keys gracefully
        return result

    return helper(h5_group, pruned_tree)


def convert_to_dataframe(data):
    df_dict = {}
    for key, value in data.items():
        if isinstance(value, dict) and "categories" in value and "codes" in value:
            categories = [
                cat.decode("utf-8") if isinstance(cat, bytes) else cat
                for cat in value["categories"]
            ]
            codes = value["codes"]
            df_dict[key] = pd.Categorical.from_codes(codes, categories)
        elif (
            isinstance(value, dict)
            and "data" in value
            and "indices" in value
            and "indptr" in value
        ):
            df_dict[key] = csr_matrix(
                (value["data"], value["indices"], value["indptr"])
            )
        else:
            if hasattr(value, "dtype") and value.dtype.kind == "S":
                value = [
                    v.decode("utf-8") if isinstance(v, bytes) else v for v in value
                ]
            df_dict[key] = value
    df = pd.DataFrame(df_dict)
    return df
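The categorical branch above relies on how h5ad stores pandas categoricals: a `codes` integer array plus a `categories` array (byte strings in HDF5), which `pd.Categorical.from_codes` reassembles. A small sketch with made-up values:

```python
import pandas as pd

codes = [0, 1, 1, 0, 2]  # one integer code per cell
categories = [b"batch_A", b"batch_B", b"batch_C"]  # stored as bytes in HDF5
decoded = [c.decode("utf-8") for c in categories]

col = pd.Categorical.from_codes(codes, decoded)
df = pd.DataFrame({"batch_name": col})
print(df["batch_name"].tolist())
# ['batch_A', 'batch_B', 'batch_B', 'batch_A', 'batch_C']
```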


def handle_special_keys(data):
    if "obs" in data:
        data["obs"] = convert_to_dataframe(data["obs"])
    if "var" in data:
        data["var"] = convert_to_dataframe(data["var"])
    # Reassemble sparse matrices from their CSR component arrays.
    if "X" in data and isinstance(data["X"], dict):
        try:
            data["X"] = csr_matrix(
                (data["X"]["data"], data["X"]["indices"], data["X"]["indptr"])
            )
        except KeyError:
            pass  # X is not stored as CSR arrays; leave it untouched
    if "layers" in data:
        for layer in data["layers"]:
            try:
                data["layers"][layer] = csr_matrix(
                    (
                        data["layers"][layer]["data"],
                        data["layers"][layer]["indices"],
                        data["layers"][layer]["indptr"],
                    )
                )
            except (KeyError, TypeError):
                pass  # leave non-CSR layers untouched
    return data
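The CSR reconstruction used for layers works because h5ad stores a sparse matrix as three arrays (`data`, `indices`, `indptr`), which `scipy.sparse.csr_matrix` reassembles directly. A tiny sketch with hand-written values:

```python
from scipy.sparse import csr_matrix

# These three arrays encode the 2x3 matrix [[1, 0, 2], [0, 0, 3]]:
# row 0 holds data[0:2] = [1, 2] at columns [0, 2],
# row 1 holds data[2:3] = [3] at column 2.
layer = {"data": [1, 2, 3], "indices": [0, 2, 2], "indptr": [0, 2, 3]}
mat = csr_matrix((layer["data"], layer["indices"], layer["indptr"]))
print(mat.toarray())
# [[1 0 2]
#  [0 0 3]]
```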


def read_h5ad_backed_selective(
    filename: Union[str, Path],
    mode: Literal["r", "r+"],
    selected_keys: Union[List[str], None] = None,
    return_dict: bool = False,
) -> AnnData:
    """Read only the selected fields of an .h5ad file into a backed AnnData."""
    f = h5py.File(filename, mode)

    tree = h5_tree(f)
    selected_tree = prune_tree(tree, selected_keys or [])
    with f:
        d = read_h5_to_dict(f, selected_tree)
        d = handle_special_keys(d)
    d["filename"] = filename
    d["filemode"] = mode
    if return_dict:
        return d
    return AnnData(**d)
@flying-sheep (Member)

This does seem useful, thank you!

As you mentioned, it doesn’t integrate well with our versioned IO yet, so it’s really more of a recipe than a PR for now.

This might be a good addition for our “How to” tutorials section.

Are you interested in writing a little notebook?


mtvector commented Jun 9, 2024

@flying-sheep Yeah, I can try to make a little notebook. I'm also working on a function to overwrite selected fields, which I can include. Should I just commit it and link you here?


rm1113 commented Jun 12, 2024

Perhaps our more general solution, which is based on `from anndata._io.specs import read_elem`, could address this use case as well: https://pypi.org/project/cap-anndata/

@flying-sheep (Member)

Interesting! You don’t have to use an internal API for it btw, we’re exporting e.g. anndata.experimental.read_elem by now.

@flying-sheep (Member)

Should I just commit it and link you here?

Yeah, I'll check whether it's a candidate for inclusion in our tutorial notebooks with a few changes, and we can go from there.


mtvector commented Jun 13, 2024 via email


gtca commented Jul 2, 2024

Just stumbled upon this issue, and as it's pretty recent, wanted to give a pointer to yet another experimental approach to only load parts of the data — https://github.com/scverse/shadows. Please give us feedback in case you end up trying it out!


mtvector commented Jul 2, 2024

I love it when you start a conversation with an ad hoc approach and end up with several robust, purpose-built solutions :)
