miloDE ideas for python #43

MaximilianNuber · 2024-12-13T09:45:41Z

Dear all,

Thank you for this great tool. I am very interested in exploring the combination of milo and miloDE in a variety of datasets.
However, analyzing datasets of 200k cells and above, I just haven´t been able to make miloDE fast enough, or at times prevent it from crashing.

Therefore, I have been attempting to bring at least some functionality to python, using the pertpy-implementation of milo.

I am not computationally trained by background learning everything on the go, so I have questions regarding algorithms and implementation and would appreciate any help you can give me.

My first questions regards assign_neighbourhoods:
To my understanding, the function builds the milo-graph at k_init for sampling vertices, rebuilds the graph at k, and then refines the vertices with the vertices previously sampled at k_init. Are these steps not the same, but similar, to building a graph with miloR and running makeNhoods with graph-refinement?
I had significant issues calling RcppGreedySetCover with rpy2, so I use the greedy set cover approach from https://www.researchgate.net/publication/362015207_ML2R_Coding_Nuggets_Greedy_Set_Cover_with_Native_Python_Data_Types

import pertpy as pt
def assign_neighbourhoods(
    adata,
    k = 25,
    reducedDim_name = "X_pca",
    prop = 0.1,
):
    milo = pt.tl.Milo()
    mdata = milo.load(adata)
    # from sklearn_ann.kneighbors.annoy import AnnoyTransformer
    # transformer = AnnoyTransformer(n_neighbors=k)
    sc.pp.neighbors(mdata["rna"], use_rep=reducedDim_name,# , transformer = transformer
                    n_neighbors = k
                   )
    
    milo.make_nhoods(mdata["rna"], prop=prop)
    
    # nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
    # plt.hist(nhood_size, bins=100)
    # plt.xlabel("# cells in nhood")
    # plt.ylabel("# nhoods");
    # plt.show()
    return mdata

def greedySetCoverV2(U, S):
    # build inverted index

    I = {x : [i for i in S if x in S[i]] for x in U}

    # pre-select necessary sets Sj into C
    C = {j : S[j] for j in [I[x][0] for x in I if len(I[x]) == 1]}

    U = U - set().union(*[C[j] for j in C])

    while U:
        j = max(S.keys(), key = lambda k: len(S[k] & U))
        C[j] = S[j]
        U = U - S[j]

    return C


def _filter_neighbourhoods(mdata):
    nhood_df = pd.DataFrame()

    for nhood_idx in range(mdata["rna"].obsm["nhoods"].shape[1]):
        cur_nhoods = mdata["rna"].obsm["nhoods"][:, nhood_idx].toarray().ravel()
    
        cur_cells = adata.obs_names[cur_nhoods.astype(bool)]
        
        cur_df = pd.DataFrame({
            "set": nhood_idx,
            "elements": cur_cells
        })
        nhood_df = pd.concat([nhood_df, cur_df], axis = 0)

    S = []
    for s in nhood_df.set.unique():
        S.append(set(nhood_df.query('set == @s').elements.unique()))
    S = dict(enumerate(S))
    U = nhood_df.elements.unique()
    U = set(U)

    gsc_result = greedySetCoverV2(U, S)
    nhs_to_use = list(gsc_result.keys())
    nhs_to_use = np.asarray(nhs_to_use)

    index_for_refined = adata.obs[adata.obs.nhood_ixs_refined == 1].index[nhs_to_use.astype(int)]
    
    mdata["rna"].obs["nhood_ixs_refined"] = 0
    mdata["rna"].obs.loc[index_for_refined, "nhood_ixs_refined"] = 1
    
    mdata["rna"].obsm["nhoods"] = mdata["rna"].obsm["nhoods"][:, nhs_to_use.astype(int)]

As the python implementation does not include graph refinement, given that my assumptions above are true, my python implementation first builds a kNN-graph with scanpy/annoy. Then we run make_nhoods, which uses (non-graph) sampling for refined neighbourhoods as in the default miloR/milopy workflow. Refined sampling is (I think) performed on the adjacency matrix.

I am aware this is quite a deviation from the approach in R, but for the time being I would like to build upon that. Or do you think the non-graph refinement is ill-suited for this task? One thing I noticed when running the python implementation is, that neighbourhood sizes are a lot smaller when compared to your original package for the same values of k on the embryo dataset from the tutorial (about 10-fold, see example below). This also leads to a higher number neighbourhoods after filtering. But is this plausible?
I tested graph building in R and making the graph in python, so I found out that the higher neighbourhood sizes stem from the R implementation of buildGraph.
Is it sufficient to simply set k higher to achieve the desired neighbourhood sizes?

For calculating the AUC, pertpy provides an implementation of Augur. However, in your implementation, calculate_auc is called per neighbourhood with the parameter n_subsamples = 0. This is not possible in the python version. I do not fully understand what exactly this option does, except being a special case within Augur. Could you explain why n_subsamples = 0, or if other values might be possible?
Subsequently, I would post a feature request on the repository.
However, some tests I did suggest, that AUC values are generally high for most, if not all neighbourhoods when using python.

Then the DE testing per neighbourhood. For this, I call edgeR using an adapted version of the EdgeR-class in pertpy. Even parallelization worked well in my tests.
And finally, the spatial adjustment of p-values across neighbourhoods. Below I am posting my python-implementation of your code. This is my own translation, so I would appreciate any help to correct it.

import numpy as np

def get_weights(nhoods_x):
  

  intersect_mat = np.dot(nhoods_x.T, nhoods_x)
  t_connect = np.sum(intersect_mat, axis=1)
  weights = 1 / t_connect
  return weights

def _return_null_df():
    null_df = pd.DataFrame(index = adata.var_names)
    null_df["log_fc"] = np.nan
    null_df["logCPM"] = np.nan
    null_df["F"] = np.nan
    null_df["p_value"] = np.nan
    null_df["adj_p_value"] = np.nan
    null_df = null_df.reset_index(names = ["variable"])
    return null_df

def _run_edger(ad, design, contrast, model):
    try:
        mod = model(ad, design)
        mod.fit()
        res_df = mod._test_single_contrast(contrast)
        return res_df
    except:
        return _return_null_df()

def de_stat_neighbourhoods(
    mdata,
    sample_col = "patient",
    design = "~condition",
    covariates = "condition",
    contrast = None,
    # subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],
    layout = "X_umap", 
    n_jobs = 1,
):
    def get_cells_in_nhoods(mdata, nhood_ids):
        '''
        Get cells in neighbourhoods of interest '''
        in_nhoods = np.array(mdata["rna"].obsm['nhoods'][:,nhood_ids.astype('int')].sum(1))
        ad = mdata["rna"][in_nhoods.astype(bool), :].copy()
        return ad

    covs = covariates

    from itertools import repeat
    from joblib import Parallel, delayed

    nhoods = mdata["rna"].obsm["nhoods"]
    n_nhoods = nhoods.shape[1]
    # n_nhoods = 100
    # get generator with all nhoods

    func = "sum"
    layer = "counts"
    
    all_nhoods = (get_cells_in_nhoods(mdata, nhood_ids = np.asarray([i])) for i in range(n_nhoods))
    aggregated_nhoods = (
        sc.get.aggregate(ad, by = [sample_col, covs], func = func, layer=layer)
        for ad in all_nhoods
    )
    def _func_to_X(ad, func):
        ad.X = ad.layers[func].copy()
        return ad
    aggregated_nhoods = (
        _func_to_X(ad, func)
        for ad in aggregated_nhoods
    )

    import diff_exp

    if contrast is None:
        from formulaic import model_matrix
        mm = model_matrix("~condition", mdata["rna"].obs)
        n_coefs = mm.shape[1]
        contrast = np.zeros(n_coefs)
        contrast[n_coefs-1] = 1
        contrast = list(contrast)
        del mm
        
    args = zip(
        aggregated_nhoods, 
        repeat(design),
        repeat(contrast),
        repeat(diff_exp.EdgeR)
    )

    start = time.time()
    pval_df_list = Parallel(n_jobs=n_jobs)(delayed(_run_edger)(ad, design, contrast, model) for ad, design, contrast, model in args)
    end = time.time()
    print("edger done in")
    print(end-start)
    print("seconds")
    # return pval_df_list

    pval_by_nhood = pd.concat([df.set_index("variable")[["p_value"]] for df in pval_df_list], axis = 1)
    FDR_across_genes = pd.concat([df.set_index("variable")[["adj_p_value"]] for df in pval_df_list], axis = 1)
    logfc_nhoods = pd.concat([df.set_index("variable")[["log_fc"]] for df in pval_df_list], axis = 1)
    
    print("Calculating FDR across nhoods")

    n_genes = pval_by_nhood.shape[0]
    FDR_across_nhoods = pval_by_nhood.copy()
    for i in range(n_genes):
        
        pvalues = pval_by_nhood.iloc[i, :].values
        idx_not_nan = np.isnan(pvalues)
        
        _weights = get_weights(mdata["rna"].obsm["nhoods"][:, ~idx_not_nan])
        _weights = np.asarray(_weights)
        _weights = _weights.ravel()
        # o = order_matrix(pvalues)
        pvalues = pvalues[~idx_not_nan]
        
        o = np.argsort(pvalues)
        weights = _weights[o]
        pvalues = pvalues[o]
        _weights
        # weights = weights[o]
        adjp = np.zeros(len(o))
        adjp[o] = np.flip(np.minimum.accumulate(np.flip(np.sum(weights) * pvalues / np.cumsum(weights))))
        adjp = np.minimum(adjp, 1)
        FDR_across_nhoods.iloc[i, :] = np.nan
    
        FDR_across_nhoods.iloc[i, ~idx_not_nan] = adjp

    return pval_by_nhood, logfc_nhoods, FDR_across_genes, FDR_across_nhoods

I would appreciate any help to make this work, and thank you in advance.
Best,
max

Here an example using the miloDE tutorial data:

r('''
library(miloDE)

# library containing toy data
suppressMessages(library(MouseGastrulationData))

# analysis libraries
library(scuttle)
suppressMessages(library(miloR))
suppressMessages(library(uwot))
library(scran)
suppressMessages(library(dplyr))
library(reshape2)

# packages for gene cluster analysis
# library(scWGCNA)
#> 
# suppressMessages(library(Seurat))

# plotting libraries
library(ggplot2)
library(viridis)
#> Loading required package: viridisLite
library(ggpubr)

''')

r('''
library(BiocParallel)

ncores = 4
mcparam = MulticoreParam(workers = ncores)
register(mcparam)

# load chimera Tal1
sce = suppressMessages(MouseGastrulationData::Tal1ChimeraData())

# downsample to few selected cell types
cts = c("Spinal cord" , "Haematoendothelial progenitors", "Endothelium" , "Blood progenitors 1" , "Blood progenitors 2")
sce = sce[, sce$celltype.mapped %in% cts]
# let's rename Haematoendothelial progenitors
sce$celltype.mapped[sce$celltype.mapped == "Haematoendothelial progenitors"] = "Haem. prog-s."

# delete row for tomato
sce = sce[!rownames(sce) == "tomato-td" , ]

# add logcounts
sce = logNormCounts(sce)

# update tomato field to be more interpretable 
sce$tomato = sapply(sce$tomato , function(x) ifelse(isTRUE(x) , "Tal1_KO" , "WT"))

# for this exercise, we focus on 3000 highly variable genes (for computational efficiency)
dec.sce = modelGeneVar(sce)
hvg.genes = getTopHVGs(dec.sce, n = 3000)
sce = sce[hvg.genes , ]
# change rownames to symbol names
rowdata = as.data.frame(rowData(sce))
rownames(sce) = rowdata$SYMBOL

# add UMAPs
set.seed(32)
umaps = as.data.frame(uwot::umap(reducedDim(sce , "pca.corrected")))
# let's store UMAPs - we will use them for visualisation
reducedDim(sce , "UMAP") = umaps

# plot UMAPs, colored by cell types
umaps = cbind(as.data.frame(colData(sce)) , reducedDim(sce , "UMAP"))
names(EmbryoCelltypeColours)[names(EmbryoCelltypeColours) == "Haematoendothelial progenitors"] = "Haem. prog-s."
# cols_ct = EmbryoCelltypeColours[names(EmbryoCelltypeColours) %in% unique(umaps$celltype.mapped)]
''')

# _r_to_py is a self-written helper function for rpy conversion
counts = _r_to_py(r('counts(sce)'))
counts = counts.T
counts
obs = _r_to_py(r('as.data.frame(colData(sce))'))
var = _r_to_py(r('as.data.frame(rowData(sce))'))
X = _r_to_py(r('logcounts(sce)'))
X = X.T
pca_corrected = _r_to_py(r('reducedDim(sce, "pca.corrected")'))
X_umap = _r_to_py(r('reducedDim(sce, "UMAP")'))

adata = anndata.AnnData(
    X = X,
    obs = obs,
    var = var
)
adata.layers["counts"] = counts
adata.obsm["pca_corrected"] = pca_corrected
adata.obsm["X_umap"] = X_umap
adata.obsm["X_umap"] = adata.obsm["X_umap"].values

sc.pl.umap(adata,
          color = "celltype.mapped")

%%time
plot = True
LATENT_KEY = "pca_corrected"
k_grid = list(range(10, 40, 5))
# k_grid.append(100)
prop = 0.5
filtering = True

size_df = pd.DataFrame()
for k in k_grid:
    print(k)
    mdata = assign_neighbourhoods(adata, k = k, prop = prop)
    if filtering:
        print("filtering")
        _filter_neighbourhoods(mdata)
    nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
    _size_df = pd.DataFrame(dict(nhood_size= nhood_size))
    _size_df = _size_df.assign(k = str(k))
    size_df = pd.concat([size_df, _size_df], axis = 0)

if plot:
    sns.boxplot(
        size_df,
        x = "k",
        y = "nhood_size",
        palette = sns.color_palette("Paired", len(k_grid))
    )
    sns.despine()

%%time
pval, logfc, FDR_genes, FDR_nhoods = de_stat_neighbourhoods(
    mdata,
    sample_col = "sample",
    design = "0+tomato",
    covariates = "tomato",
    contrast=[-1, 1],
    n_jobs = 5
)

nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
mdata["milo"].var["Nhood_size"] = nhood_size
mdata["milo"].var["nhood_adjusted"] = np.sum(FDR_nhoods <= 0.05, axis = 0).values
mdata["milo"].var["gene_adjusted"] = np.sum(FDR_genes <= 0.05, axis = 0).values

sc.pl.embedding(mdata["milo"].T,
                basis = "X_milo_graph",
                size = mdata["milo"].var["Nhood_size"],
                color = "nhood_adjusted"
               )
sc.pl.embedding(mdata["milo"].T,
                basis = "X_milo_graph",
                size = mdata["milo"].var["Nhood_size"],
                color = "gene_adjusted"
               )

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

miloDE ideas for python #43

miloDE ideas for python #43

MaximilianNuber commented Dec 13, 2024

miloDE ideas for python #43

miloDE ideas for python #43

Comments

MaximilianNuber commented Dec 13, 2024