Add metric morans i #303
base: main
Conversation
I still have some questions about the rationale for the rescaling and for the pre/post-integration comparison. It seems that, in the end, you default to not comparing the score to pre-integration? Do we have ground truth that a Moran's I of 1 is desirable for all the HVGs we're using? That seems hard to argue for. Why not go for the pre/post comparison instead? Also, I'm not sure why we should care about distinguishing values of -1 and 0.
    adata_pre,
    adata_post,
    batch_key,
    n_hvg=1000,
1000 HVGs sounds like a lot as a default here. Is there any reason for this? Can it be reduced?
I agree, this can be reduced - maybe to 100?
sc.pp.neighbors(adata_post, use_rep=embed)
# Prepare pre data
adata_pre = adata_pre.copy()
adata_pre.obs[batch_key] = adata_pre.obs[batch_key].astype("category")
Please add a call to check_sanity(), or the check_adata / check_batch util functions, from here:

Line 25 in f35b0c9

def check_sanity(adata, batch, hvg):
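A minimal sketch of what such a check could do (the real check_batch/check_sanity utilities live in this repo's utils module; the function below is a simplified, hypothetical stand-in operating on a plain DataFrame):

```python
import pandas as pd

def check_batch(batch, obs):
    """Simplified stand-in for the suggested check_batch utility:
    verify the batch column exists and make it categorical."""
    if batch not in obs.columns:
        raise ValueError(f"column {batch} is not in obs")
    # cast to categorical, as the diff above does manually
    if not isinstance(obs[batch].dtype, pd.CategoricalDtype):
        obs[batch] = obs[batch].astype("category")
    return obs

# toy obs table standing in for adata_pre.obs
obs = pd.DataFrame({"batch": ["a", "b", "a"], "n_genes": [10, 12, 9]})
obs = check_batch("batch", obs)
```

Calling this once at the top of the metric would replace the manual astype("category") line and fail early with a clear message on bad input.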
cond = (
    np.array(np.mean(data[:, hvgs].X, axis=0))
    != np.array(data[0, hvgs].X.todense())
).squeeze()
The mean being equal to the first cell's value could also happen for genes that are not constant, no? Why not define this via the standard deviation?
You are right! std is better!
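A sketch of the std-based check on a toy dense matrix (the real code operates on adata.X, which may be sparse, so an extra densifying step would be needed there):

```python
import numpy as np

# Toy expression matrix: rows are cells, columns are genes.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 2.0],
              [3.0, 0.0, 3.0]])

# A gene is kept only if it actually varies across cells.
cond = X.std(axis=0) > 0

# Failure mode of the mean-vs-first-cell test: gene 0 varies
# (std > 0) but its mean (2.0) equals the first cell's value, so
# the original condition would wrongly flag it as constant.
mean_test = X.mean(axis=0) != X[0, :]
```

Here cond keeps genes 0 and 2 (gene 1 is constant), while mean_test wrongly drops the varying gene 0.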
)
adata_sample_scl = adata_sample.copy()
# PP each sample: scaling, pca, neighbours
sc.pp.scale(adata_sample_scl, max_value=10)
Just out of curiosity: is there a reason for scaling here specifically, or is it out of habit?
Most of the code was adapted from Karin. @Hrovatin, do you have an opinion here?
batch_mis = pd.concat(batch_mis, axis=1)
# Difference with integrated data
mi_diffs = adata_post.var["morans_i"] - batch_mis.max(axis=1)
avg_mi_diff = mi_diffs.mean()
I don't entirely understand this part. You are expecting optimal integration to mean that Moran's I is the same in the full integration as the max in any individual batch? Is the max used because of the possibility of having unique cell types in a single batch? Is this affected by cell-type composition differences across batches?
I guess the max gives an upper bound on the degree to which an individual gene can be clustered. I changed it in #304 such that only worse Moran's I scores are counted - if you agree that the max is an upper bound per gene, then a difference > 0 would mean that the integration was successful in maintaining this trait of the data.
To compare different integrations, I believe that the compare_pre=False computation would be sensible, as the scores (their mean) are directly comparable.
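A toy sketch of the "only count worse scores" variant described here (gene names and values are made up; the real code uses adata_post.var["morans_i"] and the per-batch batch_mis frame):

```python
import pandas as pd

# Per-gene Moran's I on the integrated data (hypothetical values)
mi_post = pd.Series({"geneA": 0.8, "geneB": 0.3, "geneC": 0.5})

# Per-gene Moran's I computed within each batch separately
batch_mis = pd.DataFrame({
    "batch1": {"geneA": 0.7, "geneB": 0.6, "geneC": 0.5},
    "batch2": {"geneA": 0.9, "geneB": 0.4, "geneC": 0.2},
})

# Difference to the per-batch maximum (the per-gene upper bound)
mi_diffs = mi_post - batch_mis.max(axis=1)

# Count only losses: genes where integration kept or improved the
# autocorrelation contribute 0 instead of inflating the score
avg_mi_diff = mi_diffs.clip(upper=0).mean()
```

With these toy numbers geneC is preserved exactly (contributes 0) while geneA and geneB lose 0.1 and 0.3 respectively, so the average is about -0.133.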
# Rescale so that it is between [0,1] where 1 is better
# Moran's I will be in [-1,1] and thus difference can be in [-2,2]
if rescale:
    res = (avg_mi_diff + 2) / 4
Another note on rescaling. Here it seems that a difference of 2 (perfect autocorrelation of markers in the integrated data while the max per-batch autocorrelation is -1) is optimal. I'm not sure this should be the case. Wouldn't the idea be to detect the same spatial correlation structure in a single batch as across all of them? I guess you want 1 - (avg_mi_diff + 2) / 4 here, no?
Also, should a Moran's I of -1 and one of 0 be distinguished for the comparison at all? Both mean that no biologically meaningful information is encoded, no?
-1 means perfectly dispersed; if we are looking for a clustered signal we could do max(0, score) to have the score in [0,1] from the beginning - what do you think?
Wrt the 1 - ..., I totally agree. Minimal differences should yield a high score.
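The max(0, score) idea sketched on toy values (assuming per-gene scores are clipped before any difference is taken, so the difference would already lie in [-1, 1] and the [-2, 2] rescaling above would no longer be needed):

```python
import numpy as np

def clip_morans_i(scores):
    """Sketch of the suggestion above: treat dispersed (negative)
    Moran's I like no autocorrelation, keeping scores in [0, 1]."""
    return np.clip(scores, 0, None)

scores = np.array([-1.0, -0.2, 0.0, 0.7])
clipped = clip_morans_i(scores)
```

Under this clipping, -1 (perfectly dispersed) and 0 (no autocorrelation) are deliberately indistinguishable, which matches the point that neither encodes a clustered biological signal.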
EDIT: I changed it so that only "bad" differences are considered; please check in #304.
res = adata_post.var["morans_i"].mean()
if rescale:
    # Moran's I will be in [-1,1]
    res = (res + 1) / 2
Also here: should a Moran's I of 0 be better than one of -1? Could that difference be due only to the sparsity of the marker?
I changed it to consider only scores > 0, as we are mostly interested in clustering.
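In the compare_pre=False branch, that change might look like this sketch (per-gene values are made up):

```python
import numpy as np

# Hypothetical per-gene Moran's I on the integrated data
morans_i = np.array([-0.2, 0.0, 0.6, 0.8])

# Clip negative (dispersed) scores to 0 so they count the same as
# "no autocorrelation", then average; the result is already in
# [0, 1], so the (res + 1) / 2 rescaling is no longer needed.
res = np.clip(morans_i, 0, None).mean()
```

With these values the two negative-or-zero genes contribute nothing and the mean is 0.35, directly interpretable as "average clustered signal retained".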