Dear all,
Thank you for this great tool. I am very interested in exploring the combination of milo and miloDE in a variety of datasets.
However, when analyzing datasets of 200k cells and above, I just haven't been able to make miloDE fast enough, or at times to prevent it from crashing.
Therefore, I have been attempting to bring at least some of the functionality to Python, using the pertpy implementation of milo.
I am not computationally trained by background and am learning everything on the go, so I have questions regarding both the algorithms and the implementation, and I would appreciate any help you can give me.
My first question regards `assign_neighbourhoods`:
To my understanding, the function builds the milo graph at `k_init` for sampling vertices, rebuilds the graph at `k`, and then refines the neighbourhoods around the vertices previously sampled at `k_init`. Are these steps not the same as, or at least similar to, building a graph with `miloR` and running `makeNhoods` with graph refinement?
I had significant issues calling `RcppGreedySetCover` with rpy2, so I use the greedy set cover approach from https://www.researchgate.net/publication/362015207_ML2R_Coding_Nuggets_Greedy_Set_Cover_with_Native_Python_Data_Types
import pertpy as pt
import scanpy as sc


def assign_neighbourhoods(
    adata,
    k=25,
    reducedDim_name="X_pca",
    prop=0.1,
):
    milo = pt.tl.Milo()
    mdata = milo.load(adata)
    # optionally use an approximate kNN backend, e.g.:
    # from sklearn_ann.kneighbors.annoy import AnnoyTransformer
    # transformer = AnnoyTransformer(n_neighbors=k)
    # and pass transformer=transformer to sc.pp.neighbors
    sc.pp.neighbors(mdata["rna"], use_rep=reducedDim_name, n_neighbors=k)
    # non-graph refined sampling of index cells, as in the default milopy workflow
    milo.make_nhoods(mdata["rna"], prop=prop)
    # nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
    # plt.hist(nhood_size, bins=100)
    # plt.xlabel("# cells in nhood")
    # plt.ylabel("# nhoods")
    # plt.show()
    return mdata
def greedySetCoverV2(U, S):
    # greedy set cover: U is the universe (a set), S maps set ids to sets of elements
    # build inverted index: element -> ids of the sets that contain it
    I = {x: [i for i in S if x in S[i]] for x in U}
    # pre-select necessary sets Sj (the only possible cover for some element) into C
    C = {j: S[j] for j in [I[x][0] for x in I if len(I[x]) == 1]}
    U = U - set().union(*[C[j] for j in C])
    # greedily add the set that covers the most still-uncovered elements
    while U:
        j = max(S.keys(), key=lambda k: len(S[k] & U))
        C[j] = S[j]
        U = U - S[j]
    return C
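To check that the ported set cover behaves as expected, here is a tiny toy example (hypothetical sets) illustrating the input/output format:

toy_U = {1, 2, 3, 4, 5}
toy_S = {0: {1, 2}, 1: {2, 3, 4}, 2: {4, 5}, 3: {1, 5}}
print(greedySetCoverV2(toy_U, toy_S))
# element 3 only occurs in set 1, so set 1 is pre-selected;
# the remaining elements {1, 5} are then covered greedily by set 3:
# {1: {2, 3, 4}, 3: {1, 5}}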
import numpy as np
import pandas as pd


def _filter_neighbourhoods(mdata):
    # build a long data frame mapping each neighbourhood (set) to its member cells (elements)
    nhood_df = pd.DataFrame()
    for nhood_idx in range(mdata["rna"].obsm["nhoods"].shape[1]):
        cur_nhoods = mdata["rna"].obsm["nhoods"][:, nhood_idx].toarray().ravel()
        cur_cells = mdata["rna"].obs_names[cur_nhoods.astype(bool)]
        cur_df = pd.DataFrame({
            "set": nhood_idx,
            "elements": cur_cells,
        })
        nhood_df = pd.concat([nhood_df, cur_df], axis=0)
    # sets for the greedy set cover: one set of cell names per neighbourhood
    S = []
    for s in nhood_df.set.unique():
        S.append(set(nhood_df.query("set == @s").elements.unique()))
    S = dict(enumerate(S))
    # universe: all cells that belong to at least one neighbourhood
    U = set(nhood_df.elements.unique())
    gsc_result = greedySetCoverV2(U, S)
    nhs_to_use = np.asarray(list(gsc_result.keys()))
    # keep only the index cells / neighbourhoods selected by the set cover
    index_for_refined = mdata["rna"].obs[mdata["rna"].obs.nhood_ixs_refined == 1].index[nhs_to_use.astype(int)]
    mdata["rna"].obs["nhood_ixs_refined"] = 0
    mdata["rna"].obs.loc[index_for_refined, "nhood_ixs_refined"] = 1
    mdata["rna"].obsm["nhoods"] = mdata["rna"].obsm["nhoods"][:, nhs_to_use.astype(int)]
As the Python implementation does not include graph refinement (assuming my understanding above is correct), my Python implementation first builds a kNN graph with scanpy/annoy. Then I run `make_nhoods`, which uses (non-graph) sampling for refined neighbourhoods as in the default miloR/milopy workflow. Refined sampling is (I think) performed on the adjacency matrix.
I am aware this is quite a deviation from the approach in R, but for the time being I would like to build upon it. Or do you think the non-graph refinement is ill-suited for this task? One thing I noticed when running the Python implementation is that neighbourhood sizes are a lot smaller than with your original package for the same values of `k` on the embryo dataset from the tutorial (about 10-fold, see the example below). This also leads to a higher number of neighbourhoods after filtering. Is this plausible?
I compared graph building in R with graph building in Python and found that the larger neighbourhood sizes stem from the R implementation of `buildGraph`.
Is it sufficient to simply set `k` higher to achieve the desired neighbourhood sizes?
For calculating the AUC, pertpy provides an implementation of Augur. However, in your implementation, `calculate_auc` is called per neighbourhood with the parameter `n_subsamples = 0`. This is not possible in the Python version. I do not fully understand what exactly this option does, other than being a special case within Augur. Could you explain why `n_subsamples = 0` is used, and whether other values might be possible?
I would then post a feature request on the repository.
However, some tests I did suggest that AUC values are generally high for most, if not all, neighbourhoods when using Python.
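For reference, this is roughly how I currently compute an AUC per neighbourhood with pertpy's Augur. It is only a sketch: the `label_col` value and the dummy cell-type column are placeholders for my data, and the default subsampling settings are used rather than anything equivalent to `n_subsamples = 0`.

import numpy as np
import pertpy as pt


def auc_per_nhood(mdata, label_col="condition", cell_type_col="nhood_dummy"):
    ag = pt.tl.Augur("random_forest_classifier")
    nhoods = mdata["rna"].obsm["nhoods"]
    aucs = []
    for i in range(nhoods.shape[1]):
        in_nhood = nhoods[:, i].toarray().ravel().astype(bool)
        ad = mdata["rna"][in_nhood].copy()
        # treat the whole neighbourhood as a single "cell type" so Augur returns one AUC
        ad.obs[cell_type_col] = "nhood"
        loaded = ag.load(ad, label_col=label_col, cell_type_col=cell_type_col)
        # default subsampling; assumes summary_metrics has a mean_augur_score row
        _, results = ag.predict(loaded, subsample_size=20, n_threads=1, select_variance_features=False)
        aucs.append(float(results["summary_metrics"].loc["mean_augur_score"].iloc[0]))
    return np.asarray(aucs)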
Next, the DE testing per neighbourhood: for this, I call edgeR using an adapted version of the `EdgeR` class in pertpy. Even parallelization worked well in my tests.
And finally, the spatial adjustment of p-values across neighbourhoods. Below I am posting my Python implementation of your code. This is my own translation, so I would appreciate any help in correcting it.
import numpy as np
import pandas as pd


def get_weights(nhoods_x):
    # weight each neighbourhood by the inverse of its total overlap with all
    # neighbourhoods (including itself), as in the spatial FDR weighting
    intersect_mat = nhoods_x.T.dot(nhoods_x)
    t_connect = np.asarray(intersect_mat.sum(axis=1)).ravel()
    weights = 1 / t_connect
    return weights


def _return_null_df(ad):
    # placeholder result for neighbourhoods where edgeR fails
    null_df = pd.DataFrame(index=ad.var_names)
    null_df["log_fc"] = np.nan
    null_df["logCPM"] = np.nan
    null_df["F"] = np.nan
    null_df["p_value"] = np.nan
    null_df["adj_p_value"] = np.nan
    null_df = null_df.reset_index(names=["variable"])
    return null_df


def _run_edger(ad, design, contrast, model):
    try:
        mod = model(ad, design)
        mod.fit()
        res_df = mod._test_single_contrast(contrast)
        return res_df
    except Exception:
        return _return_null_df(ad)
def de_stat_neighbourhoods(
    mdata,
    sample_col="patient",
    design="~condition",
    covariates="condition",
    contrast=None,
    # subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],  # not ported yet
    layout="X_umap",  # currently unused
    n_jobs=1,
):
    def get_cells_in_nhoods(mdata, nhood_ids):
        """Get cells in neighbourhoods of interest."""
        in_nhoods = np.array(mdata["rna"].obsm["nhoods"][:, nhood_ids.astype("int")].sum(1))
        ad = mdata["rna"][in_nhoods.astype(bool), :].copy()
        return ad

    import time
    from itertools import repeat
    from joblib import Parallel, delayed

    covs = covariates
    nhoods = mdata["rna"].obsm["nhoods"]
    n_nhoods = nhoods.shape[1]
    # n_nhoods = 100  # uncomment to test on a subset of neighbourhoods
    # generator over per-neighbourhood AnnData objects
    all_nhoods = (get_cells_in_nhoods(mdata, nhood_ids=np.asarray([i])) for i in range(n_nhoods))
    # pseudobulk per sample and covariate within each neighbourhood
    func = "sum"
    layer = "counts"
    aggregated_nhoods = (
        sc.get.aggregate(ad, by=[sample_col, covs], func=func, layer=layer)
        for ad in all_nhoods
    )

    def _func_to_X(ad, func):
        # sc.get.aggregate stores the aggregate in a layer; move it to X for edgeR
        ad.X = ad.layers[func].copy()
        return ad

    aggregated_nhoods = (_func_to_X(ad, func) for ad in aggregated_nhoods)
    import diff_exp  # my adapted pertpy EdgeR class lives here

    if contrast is None:
        # default contrast: select the last coefficient of the design matrix
        from formulaic import model_matrix

        mm = model_matrix(design, mdata["rna"].obs)
        n_coefs = mm.shape[1]
        contrast = np.zeros(n_coefs)
        contrast[n_coefs - 1] = 1  # e.g. [0, 1] for "~condition" with two levels
        contrast = list(contrast)
        del mm

    args = zip(
        aggregated_nhoods,
        repeat(design),
        repeat(contrast),
        repeat(diff_exp.EdgeR),
    )
    start = time.time()
    pval_df_list = Parallel(n_jobs=n_jobs)(
        delayed(_run_edger)(ad, design, contrast, model) for ad, design, contrast, model in args
    )
    end = time.time()
    print(f"edgeR done in {end - start:.1f} seconds")
    # return pval_df_list
    pval_by_nhood = pd.concat([df.set_index("variable")[["p_value"]] for df in pval_df_list], axis=1)
    FDR_across_genes = pd.concat([df.set_index("variable")[["adj_p_value"]] for df in pval_df_list], axis=1)
    logfc_nhoods = pd.concat([df.set_index("variable")[["log_fc"]] for df in pval_df_list], axis=1)

    print("Calculating FDR across nhoods")
    n_genes = pval_by_nhood.shape[0]
    FDR_across_nhoods = pval_by_nhood.copy()
    for i in range(n_genes):
        pvalues = pval_by_nhood.iloc[i, :].values
        idx_nan = np.isnan(pvalues)
        # weights of the neighbourhoods with a valid p-value for this gene
        weights = np.asarray(get_weights(mdata["rna"].obsm["nhoods"][:, ~idx_nan])).ravel()
        pvalues = pvalues[~idx_nan]
        # weighted Benjamini-Hochberg across neighbourhoods (spatial FDR)
        o = np.argsort(pvalues)
        weights = weights[o]
        pvalues = pvalues[o]
        adjp = np.zeros(len(o))
        adjp[o] = np.flip(np.minimum.accumulate(np.flip(np.sum(weights) * pvalues / np.cumsum(weights))))
        adjp = np.minimum(adjp, 1)
        FDR_across_nhoods.iloc[i, :] = np.nan
        FDR_across_nhoods.iloc[i, ~idx_nan] = adjp
    return pval_by_nhood, logfc_nhoods, FDR_across_genes, FDR_across_nhoods
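As a sanity check of my translation of the weighted BH step: with equal weights it should reduce to the standard Benjamini-Hochberg correction, so it can be compared against statsmodels (the helper below just factors out the computation used inside de_stat_neighbourhoods):

import numpy as np
from statsmodels.stats.multitest import multipletests


def weighted_bh(pvalues, weights):
    # same weighted BH computation as inside de_stat_neighbourhoods, factored out for testing
    o = np.argsort(pvalues)
    w, p = weights[o], pvalues[o]
    adjp = np.zeros(len(o))
    adjp[o] = np.flip(np.minimum.accumulate(np.flip(np.sum(w) * p / np.cumsum(w))))
    return np.minimum(adjp, 1)


pvals = np.array([0.001, 0.02, 0.03, 0.2, 0.5])
equal_weights = np.ones_like(pvals)
# with equal weights, weighted BH equals plain BH
assert np.allclose(weighted_bh(pvals, equal_weights), multipletests(pvals, method="fdr_bh")[1])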
I would appreciate any help to make this work, and thank you in advance.
Best,
max
Here is an example using the miloDE tutorial data:
r('''
library(miloDE)
# library containing toy data
suppressMessages(library(MouseGastrulationData))
# analysis libraries
library(scuttle)
suppressMessages(library(miloR))
suppressMessages(library(uwot))
library(scran)
suppressMessages(library(dplyr))
library(reshape2)
# packages for gene cluster analysis
# library(scWGCNA)
# suppressMessages(library(Seurat))
# plotting libraries
library(ggplot2)
library(viridis)
library(ggpubr)
''')
r('''
library(BiocParallel)
ncores = 4
mcparam = MulticoreParam(workers = ncores)
register(mcparam)
# load chimera Tal1
sce = suppressMessages(MouseGastrulationData::Tal1ChimeraData())
# downsample to few selected cell types
cts = c("Spinal cord" , "Haematoendothelial progenitors", "Endothelium" , "Blood progenitors 1" , "Blood progenitors 2")
sce = sce[, sce$celltype.mapped %in% cts]
# let's rename Haematoendothelial progenitors
sce$celltype.mapped[sce$celltype.mapped == "Haematoendothelial progenitors"] = "Haem. prog-s."
# delete row for tomato
sce = sce[!rownames(sce) == "tomato-td" , ]
# add logcounts
sce = logNormCounts(sce)
# update tomato field to be more interpretable
sce$tomato = sapply(sce$tomato , function(x) ifelse(isTRUE(x) , "Tal1_KO" , "WT"))
# for this exercise, we focus on 3000 highly variable genes (for computational efficiency)
dec.sce = modelGeneVar(sce)
hvg.genes = getTopHVGs(dec.sce, n = 3000)
sce = sce[hvg.genes , ]
# change rownames to symbol names
rowdata = as.data.frame(rowData(sce))
rownames(sce) = rowdata$SYMBOL
# add UMAPs
set.seed(32)
umaps = as.data.frame(uwot::umap(reducedDim(sce , "pca.corrected")))
# let's store UMAPs - we will use them for visualisation
reducedDim(sce , "UMAP") = umaps
# plot UMAPs, colored by cell types
umaps = cbind(as.data.frame(colData(sce)) , reducedDim(sce , "UMAP"))
names(EmbryoCelltypeColours)[names(EmbryoCelltypeColours) == "Haematoendothelial progenitors"] = "Haem. prog-s."
# cols_ct = EmbryoCelltypeColours[names(EmbryoCelltypeColours) %in% unique(umaps$celltype.mapped)]
''')
# _r_to_py is a self-written helper function for rpy conversion
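# NOTE: a minimal sketch of what _r_to_py could look like; this is my assumption about
# the helper, not its actual definition. It wraps rpy2's converters and relies on
# anndata2ri for R (sparse) matrices; assumes rpy2 >= 3.5 and anndata2ri installed.
from rpy2.robjects import default_converter, numpy2ri, pandas2ri
from rpy2.robjects.conversion import get_conversion, localconverter
import anndata2ri

def _r_to_py(r_obj):
    # convert an R object (matrix, data.frame, dgCMatrix, ...) to its Python counterpart
    conv = default_converter + numpy2ri.converter + pandas2ri.converter + anndata2ri.converter
    with localconverter(conv):
        return get_conversion().rpy2py(r_obj)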
import anndata

counts = _r_to_py(r('counts(sce)'))
counts = counts.T
obs = _r_to_py(r('as.data.frame(colData(sce))'))
var = _r_to_py(r('as.data.frame(rowData(sce))'))
X = _r_to_py(r('logcounts(sce)'))
X = X.T
pca_corrected = _r_to_py(r('reducedDim(sce, "pca.corrected")'))
X_umap = _r_to_py(r('reducedDim(sce, "UMAP")'))

adata = anndata.AnnData(X=X, obs=obs, var=var)
adata.layers["counts"] = counts
adata.obsm["pca_corrected"] = pca_corrected
adata.obsm["X_umap"] = X_umap
adata.obsm["X_umap"] = adata.obsm["X_umap"].values

sc.pl.umap(adata, color="celltype.mapped")
%%time
import seaborn as sns

plot = True
LATENT_KEY = "pca_corrected"
k_grid = list(range(10, 40, 5))
# k_grid.append(100)
prop = 0.5
filtering = True
size_df = pd.DataFrame()
for k in k_grid:
    print(k)
    mdata = assign_neighbourhoods(adata, k=k, reducedDim_name=LATENT_KEY, prop=prop)
    if filtering:
        print("filtering")
        _filter_neighbourhoods(mdata)
    nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
    _size_df = pd.DataFrame(dict(nhood_size=nhood_size))
    _size_df = _size_df.assign(k=str(k))
    size_df = pd.concat([size_df, _size_df], axis=0)
if plot:
    sns.boxplot(
        size_df,
        x="k",
        y="nhood_size",
        palette=sns.color_palette("Paired", len(k_grid)),
    )
    sns.despine()