Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

miloDE ideas for python #43

Open
MaximilianNuber opened this issue Dec 13, 2024 · 0 comments
Open

miloDE ideas for python #43

MaximilianNuber opened this issue Dec 13, 2024 · 0 comments

Comments

@MaximilianNuber
Copy link

Dear all,

Thank you for this great tool. I am very interested in exploring the combination of milo and miloDE in a variety of datasets.
However, analyzing datasets of 200k cells and above, I just haven´t been able to make miloDE fast enough, or at times prevent it from crashing.

Therefore, I have been attempting to bring at least some functionality to python, using the pertpy-implementation of milo.

I am not computationally trained by background learning everything on the go, so I have questions regarding algorithms and implementation and would appreciate any help you can give me.

My first questions regards assign_neighbourhoods:
To my understanding, the function builds the milo-graph at k_init for sampling vertices, rebuilds the graph at k, and then refines the vertices with the vertices previously sampled at k_init. Are these steps not the same, but similar, to building a graph with miloR and running makeNhoods with graph-refinement?
I had significant issues calling RcppGreedySetCover with rpy2, so I use the greedy set cover approach from https://www.researchgate.net/publication/362015207_ML2R_Coding_Nuggets_Greedy_Set_Cover_with_Native_Python_Data_Types

import pertpy as pt
def assign_neighbourhoods(
    adata,
    k = 25,
    reducedDim_name = "X_pca",
    prop = 0.1,
):
    milo = pt.tl.Milo()
    mdata = milo.load(adata)
    # from sklearn_ann.kneighbors.annoy import AnnoyTransformer
    # transformer = AnnoyTransformer(n_neighbors=k)
    sc.pp.neighbors(mdata["rna"], use_rep=reducedDim_name,# , transformer = transformer
                    n_neighbors = k
                   )
    
    milo.make_nhoods(mdata["rna"], prop=prop)
    
    # nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
    # plt.hist(nhood_size, bins=100)
    # plt.xlabel("# cells in nhood")
    # plt.ylabel("# nhoods");
    # plt.show()
    return mdata

def greedySetCoverV2(U, S):
    # build inverted index

    I = {x : [i for i in S if x in S[i]] for x in U}

    # pre-select necessary sets Sj into C
    C = {j : S[j] for j in [I[x][0] for x in I if len(I[x]) == 1]}

    U = U - set().union(*[C[j] for j in C])

    while U:
        j = max(S.keys(), key = lambda k: len(S[k] & U))
        C[j] = S[j]
        U = U - S[j]

    return C


def _filter_neighbourhoods(mdata):
    nhood_df = pd.DataFrame()

    for nhood_idx in range(mdata["rna"].obsm["nhoods"].shape[1]):
        cur_nhoods = mdata["rna"].obsm["nhoods"][:, nhood_idx].toarray().ravel()
    
        cur_cells = adata.obs_names[cur_nhoods.astype(bool)]
        
        cur_df = pd.DataFrame({
            "set": nhood_idx,
            "elements": cur_cells
        })
        nhood_df = pd.concat([nhood_df, cur_df], axis = 0)

    S = []
    for s in nhood_df.set.unique():
        S.append(set(nhood_df.query('set == @s').elements.unique()))
    S = dict(enumerate(S))
    U = nhood_df.elements.unique()
    U = set(U)

    gsc_result = greedySetCoverV2(U, S)
    nhs_to_use = list(gsc_result.keys())
    nhs_to_use = np.asarray(nhs_to_use)

    index_for_refined = adata.obs[adata.obs.nhood_ixs_refined == 1].index[nhs_to_use.astype(int)]
    
    mdata["rna"].obs["nhood_ixs_refined"] = 0
    mdata["rna"].obs.loc[index_for_refined, "nhood_ixs_refined"] = 1
    
    mdata["rna"].obsm["nhoods"] = mdata["rna"].obsm["nhoods"][:, nhs_to_use.astype(int)]

As the python implementation does not include graph refinement, given that my assumptions above are true, my python implementation first builds a kNN-graph with scanpy/annoy. Then we run make_nhoods, which uses (non-graph) sampling for refined neighbourhoods as in the default miloR/milopy workflow. Refined sampling is (I think) performed on the adjacency matrix.

I am aware this is quite a deviation from the approach in R, but for the time being I would like to build upon that. Or do you think the non-graph refinement is ill-suited for this task? One thing I noticed when running the python implementation is, that neighbourhood sizes are a lot smaller when compared to your original package for the same values of k on the embryo dataset from the tutorial (about 10-fold, see example below). This also leads to a higher number neighbourhoods after filtering. But is this plausible?
I tested graph building in R and making the graph in python, so I found out that the higher neighbourhood sizes stem from the R implementation of buildGraph.
Is it sufficient to simply set k higher to achieve the desired neighbourhood sizes?

For calculating the AUC, pertpy provides an implementation of Augur. However, in your implementation, calculate_auc is called per neighbourhood with the parameter n_subsamples = 0. This is not possible in the python version. I do not fully understand what exactly this option does, except being a special case within Augur. Could you explain why n_subsamples = 0, or if other values might be possible?
Subsequently, I would post a feature request on the repository.
However, some tests I did suggest, that AUC values are generally high for most, if not all neighbourhoods when using python.

Then the DE testing per neighbourhood. For this, I call edgeR using an adapted version of the EdgeR-class in pertpy. Even parallelization worked well in my tests.
And finally, the spatial adjustment of p-values across neighbourhoods. Below I am posting my python-implementation of your code. This is my own translation, so I would appreciate any help to correct it.

import numpy as np

def get_weights(nhoods_x):
  

  intersect_mat = np.dot(nhoods_x.T, nhoods_x)
  t_connect = np.sum(intersect_mat, axis=1)
  weights = 1 / t_connect
  return weights

def _return_null_df():
    null_df = pd.DataFrame(index = adata.var_names)
    null_df["log_fc"] = np.nan
    null_df["logCPM"] = np.nan
    null_df["F"] = np.nan
    null_df["p_value"] = np.nan
    null_df["adj_p_value"] = np.nan
    null_df = null_df.reset_index(names = ["variable"])
    return null_df

def _run_edger(ad, design, contrast, model):
    try:
        mod = model(ad, design)
        mod.fit()
        res_df = mod._test_single_contrast(contrast)
        return res_df
    except:
        return _return_null_df()

def de_stat_neighbourhoods(
    mdata,
    sample_col = "patient",
    design = "~condition",
    covariates = "condition",
    contrast = None,
    # subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],
    layout = "X_umap", 
    n_jobs = 1,
):
    def get_cells_in_nhoods(mdata, nhood_ids):
        '''
        Get cells in neighbourhoods of interest '''
        in_nhoods = np.array(mdata["rna"].obsm['nhoods'][:,nhood_ids.astype('int')].sum(1))
        ad = mdata["rna"][in_nhoods.astype(bool), :].copy()
        return ad

    covs = covariates

    from itertools import repeat
    from joblib import Parallel, delayed

    nhoods = mdata["rna"].obsm["nhoods"]
    n_nhoods = nhoods.shape[1]
    # n_nhoods = 100
    # get generator with all nhoods

    func = "sum"
    layer = "counts"
    
    all_nhoods = (get_cells_in_nhoods(mdata, nhood_ids = np.asarray([i])) for i in range(n_nhoods))
    aggregated_nhoods = (
        sc.get.aggregate(ad, by = [sample_col, covs], func = func, layer=layer)
        for ad in all_nhoods
    )
    def _func_to_X(ad, func):
        ad.X = ad.layers[func].copy()
        return ad
    aggregated_nhoods = (
        _func_to_X(ad, func)
        for ad in aggregated_nhoods
    )

    import diff_exp

    if contrast is None:
        from formulaic import model_matrix
        mm = model_matrix("~condition", mdata["rna"].obs)
        n_coefs = mm.shape[1]
        contrast = np.zeros(n_coefs)
        contrast[n_coefs-1] = 1
        contrast = list(contrast)
        del mm
        
    args = zip(
        aggregated_nhoods, 
        repeat(design),
        repeat(contrast),
        repeat(diff_exp.EdgeR)
    )

    start = time.time()
    pval_df_list = Parallel(n_jobs=n_jobs)(delayed(_run_edger)(ad, design, contrast, model) for ad, design, contrast, model in args)
    end = time.time()
    print("edger done in")
    print(end-start)
    print("seconds")
    # return pval_df_list

    pval_by_nhood = pd.concat([df.set_index("variable")[["p_value"]] for df in pval_df_list], axis = 1)
    FDR_across_genes = pd.concat([df.set_index("variable")[["adj_p_value"]] for df in pval_df_list], axis = 1)
    logfc_nhoods = pd.concat([df.set_index("variable")[["log_fc"]] for df in pval_df_list], axis = 1)
    
    print("Calculating FDR across nhoods")

    n_genes = pval_by_nhood.shape[0]
    FDR_across_nhoods = pval_by_nhood.copy()
    for i in range(n_genes):
        
        pvalues = pval_by_nhood.iloc[i, :].values
        idx_not_nan = np.isnan(pvalues)
        
        _weights = get_weights(mdata["rna"].obsm["nhoods"][:, ~idx_not_nan])
        _weights = np.asarray(_weights)
        _weights = _weights.ravel()
        # o = order_matrix(pvalues)
        pvalues = pvalues[~idx_not_nan]
        
        o = np.argsort(pvalues)
        weights = _weights[o]
        pvalues = pvalues[o]
        _weights
        # weights = weights[o]
        adjp = np.zeros(len(o))
        adjp[o] = np.flip(np.minimum.accumulate(np.flip(np.sum(weights) * pvalues / np.cumsum(weights))))
        adjp = np.minimum(adjp, 1)
        FDR_across_nhoods.iloc[i, :] = np.nan
    
        FDR_across_nhoods.iloc[i, ~idx_not_nan] = adjp

    return pval_by_nhood, logfc_nhoods, FDR_across_genes, FDR_across_nhoods

I would appreciate any help to make this work, and thank you in advance.
Best,
max

Here an example using the miloDE tutorial data:

r('''
library(miloDE)

# library containing toy data
suppressMessages(library(MouseGastrulationData))

# analysis libraries
library(scuttle)
suppressMessages(library(miloR))
suppressMessages(library(uwot))
library(scran)
suppressMessages(library(dplyr))
library(reshape2)

# packages for gene cluster analysis
# library(scWGCNA)
#> 
# suppressMessages(library(Seurat))

# plotting libraries
library(ggplot2)
library(viridis)
#> Loading required package: viridisLite
library(ggpubr)

''')

r('''
library(BiocParallel)

ncores = 4
mcparam = MulticoreParam(workers = ncores)
register(mcparam)

# load chimera Tal1
sce = suppressMessages(MouseGastrulationData::Tal1ChimeraData())

# downsample to few selected cell types
cts = c("Spinal cord" , "Haematoendothelial progenitors", "Endothelium" , "Blood progenitors 1" , "Blood progenitors 2")
sce = sce[, sce$celltype.mapped %in% cts]
# let's rename Haematoendothelial progenitors
sce$celltype.mapped[sce$celltype.mapped == "Haematoendothelial progenitors"] = "Haem. prog-s."

# delete row for tomato
sce = sce[!rownames(sce) == "tomato-td" , ]

# add logcounts
sce = logNormCounts(sce)

# update tomato field to be more interpretable 
sce$tomato = sapply(sce$tomato , function(x) ifelse(isTRUE(x) , "Tal1_KO" , "WT"))

# for this exercise, we focus on 3000 highly variable genes (for computational efficiency)
dec.sce = modelGeneVar(sce)
hvg.genes = getTopHVGs(dec.sce, n = 3000)
sce = sce[hvg.genes , ]
# change rownames to symbol names
rowdata = as.data.frame(rowData(sce))
rownames(sce) = rowdata$SYMBOL

# add UMAPs
set.seed(32)
umaps = as.data.frame(uwot::umap(reducedDim(sce , "pca.corrected")))
# let's store UMAPs - we will use them for visualisation
reducedDim(sce , "UMAP") = umaps

# plot UMAPs, colored by cell types
umaps = cbind(as.data.frame(colData(sce)) , reducedDim(sce , "UMAP"))
names(EmbryoCelltypeColours)[names(EmbryoCelltypeColours) == "Haematoendothelial progenitors"] = "Haem. prog-s."
# cols_ct = EmbryoCelltypeColours[names(EmbryoCelltypeColours) %in% unique(umaps$celltype.mapped)]
''')
# _r_to_py is a self-written helper function for rpy conversion
counts = _r_to_py(r('counts(sce)'))
counts = counts.T
counts
obs = _r_to_py(r('as.data.frame(colData(sce))'))
var = _r_to_py(r('as.data.frame(rowData(sce))'))
X = _r_to_py(r('logcounts(sce)'))
X = X.T
pca_corrected = _r_to_py(r('reducedDim(sce, "pca.corrected")'))
X_umap = _r_to_py(r('reducedDim(sce, "UMAP")'))

adata = anndata.AnnData(
    X = X,
    obs = obs,
    var = var
)
adata.layers["counts"] = counts
adata.obsm["pca_corrected"] = pca_corrected
adata.obsm["X_umap"] = X_umap
adata.obsm["X_umap"] = adata.obsm["X_umap"].values

sc.pl.umap(adata,
          color = "celltype.mapped")
image
%%time
plot = True
LATENT_KEY = "pca_corrected"
k_grid = list(range(10, 40, 5))
# k_grid.append(100)
prop = 0.5
filtering = True

size_df = pd.DataFrame()
for k in k_grid:
    print(k)
    mdata = assign_neighbourhoods(adata, k = k, prop = prop)
    if filtering:
        print("filtering")
        _filter_neighbourhoods(mdata)
    nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
    _size_df = pd.DataFrame(dict(nhood_size= nhood_size))
    _size_df = _size_df.assign(k = str(k))
    size_df = pd.concat([size_df, _size_df], axis = 0)

if plot:
    sns.boxplot(
        size_df,
        x = "k",
        y = "nhood_size",
        palette = sns.color_palette("Paired", len(k_grid))
    )
    sns.despine()
image
%%time
pval, logfc, FDR_genes, FDR_nhoods = de_stat_neighbourhoods(
    mdata,
    sample_col = "sample",
    design = "0+tomato",
    covariates = "tomato",
    contrast=[-1, 1],
    n_jobs = 5
)

nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
mdata["milo"].var["Nhood_size"] = nhood_size
mdata["milo"].var["nhood_adjusted"] = np.sum(FDR_nhoods <= 0.05, axis = 0).values
mdata["milo"].var["gene_adjusted"] = np.sum(FDR_genes <= 0.05, axis = 0).values

sc.pl.embedding(mdata["milo"].T,
                basis = "X_milo_graph",
                size = mdata["milo"].var["Nhood_size"],
                color = "nhood_adjusted"
               )
sc.pl.embedding(mdata["milo"].T,
                basis = "X_milo_graph",
                size = mdata["milo"].var["Nhood_size"],
                color = "gene_adjusted"
               )
image image
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant