
Add metric morans i #303

Draft · wants to merge 7 commits into main

Conversation

@MxMstrmn (Contributor) commented May 2, 2022

No description provided.

@MxMstrmn requested a review from mumichae May 2, 2022 14:56
@LuckyMD (Collaborator) left a comment:

I still have some questions about the rationale for the rescaling and the pre/post comparison. It seems that, in the end, you default to not comparing the score to pre-integration? Do we have a ground truth that 1 is desirable for all HVGs we're using? That seems a bit tough to argue for. Why not go for the pre-post comparison instead? Also, I'm not sure why we should care about distinguishing values of -1 and 0.

scib/metrics/metrics.py (outdated)
adata_pre,
adata_post,
batch_key,
n_hvg=1000,
Collaborator:

1000 HVGs sounds like a lot as a default for this. Is there any reason for it? Can it be reduced?

Contributor Author:

I agree, this can be reduced - maybe to 100?
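A minimal sketch of how a smaller n_hvg default could feed into HVG selection via the scanpy API (the helper itself is hypothetical; adata_pre and batch_key follow the signature above):

import scanpy as sc

def select_hvgs(adata_pre, batch_key, n_hvg=100):
    # Batch-aware HVG selection; n_top_genes caps the number of genes kept
    sc.pp.highly_variable_genes(adata_pre, n_top_genes=n_hvg, batch_key=batch_key)
    return adata_pre.var_names[adata_pre.var["highly_variable"]]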

sc.pp.neighbors(adata_post, use_rep=embed)
# Prepare pre data
adata_pre = adata_pre.copy()
adata_pre.obs[batch_key] = adata_pre.obs[batch_key].astype("category")
Collaborator:

Please add a call to check_sanity() or the check_adata or check_batch util functions from here:

def check_sanity(adata, batch, hvg):

cond = (
    np.array(np.mean(data[:, hvgs].X, axis=0))
    != np.array(data[0, hvgs].X.todense())
).squeeze()
Collaborator:

The mean could equal the first value even if the gene is not constant, no? Why not define this via the standard deviation?

Contributor Author:

You are right! std is better!
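A minimal sketch of the std-based check, assuming data is an AnnData and hvgs indexes the candidate genes as in the snippet above (the helper name is made up):

import numpy as np
import scipy.sparse as sp

def non_constant_mask(data, hvgs):
    # A gene is non-constant iff its standard deviation across cells is > 0
    X = data[:, hvgs].X
    X = X.toarray() if sp.issparse(X) else np.asarray(X)
    return np.asarray(X.std(axis=0) > 0).squeeze()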

)
adata_sample_scl = adata_sample.copy()
# PP each sample: scaling, pca, neighbours
sc.pp.scale(adata_sample_scl, max_value=10)
Collaborator:

Just out of curiosity: any reason for scaling here specifically, or is it out of habit?

Contributor Author:

Most of the code was adapted from Karin. @Hrovatin, do you have an opinion here?
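For context, a sketch of the per-sample preprocessing the snippet above performs (function name hypothetical; whether to keep the sc.pp.scale step is exactly the open question here):

import scanpy as sc

def preprocess_sample(adata_sample, scale=True):
    adata = adata_sample.copy()
    if scale:
        # Unit variance per gene, values clipped at 10 to dampen outlier cells
        sc.pp.scale(adata, max_value=10)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)
    return adata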

batch_mis = pd.concat(batch_mis, axis=1)
# Difference with integrated data
mi_diffs = adata_post.var["morans_i"] - batch_mis.max(axis=1)
avg_mi_diff = mi_diffs.mean()
Collaborator:

I don't entirely understand this part. You are expecting optimal integration to mean that Moran's I is the same in the full integration as the max in any individual batch? Is the max used due to the possibility of having unique cell types in a single batch? Is this affected by cell-type composition differences across batches?

Contributor Author:

I guess the max gives an upper bound on the degree to which an individual gene can be clustered. I changed it in #304 such that only worse Moran's I scores are counted - if you agree that the max is an upper bound per gene, then a difference > 0 would mean that the integration was successful in maintaining this trait of the data.

To compare different integrations, I believe the compare_pre=False computation would be sensible, as the scores (their mean) are directly comparable.
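A sketch of the comparison with the #304 change of counting only decreases (function and argument names are hypothetical; post_mi holds the per-gene Moran's I on the integrated data, batch_mis the per-gene values per batch):

import numpy as np
import pandas as pd

def morans_i_diff(post_mi: pd.Series, batch_mis: pd.DataFrame) -> float:
    upper = batch_mis.max(axis=1)        # per-gene upper bound across batches
    diffs = post_mi - upper              # <= 0 whenever the bound holds
    return np.minimum(diffs, 0).mean()   # count only genes that got worse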

# Rescale so that it is between [0,1] where 1 is better
# Moran's I will be in [-1,1] and thus difference can be in [-2,2]
if rescale:
    res = (avg_mi_diff + 2) / 4
Collaborator:

Another note on rescaling. Here it seems that a difference of 2 (perfect auto-correlation of markers in the integrated data but max per-batch auto-correlation is -1) is optimal. I'm not sure this should be the case. Wouldn't the idea be to detect the same spatial correlation structure in a single batch as across all of them? I guess you want a 1 - (avg_mi_diff + 2) / 4 here, no?

Also, should a Moran's I of -1 and 0 be distinguished for the comparison at all? Both mean no biologically meaningful information is encoded, no?

Contributor Author:

-1 means perfectly dispersed; if we are looking for a clustered signal, we could do max(0, score) to have the score in [0,1] from the beginning - what do you think?

Regarding the 1 - ..., I totally agree: minimal differences should yield a high score.

Contributor Author:

EDIT: I changed it to only consider "bad" differences; please check #304.

res = adata_post.var["morans_i"].mean()
if rescale:
    # Moran's I will be in [-1,1]
    res = (res + 1) / 2
Collaborator:

Also here: should a Moran's I of 0 be better than one of -1? Could that difference be due only to sparsity of the marker?

Contributor Author:

I changed it to only consider scores > 0, as we are mostly interested in clustered signals.
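A sketch of that clipping, assuming per-gene Moran's I values in [-1, 1] (names hypothetical): dispersed (negative) and random (zero) signals both map to 0, so the mean stays in [0, 1] with 1 best.

import numpy as np

def rescaled_morans_i(scores):
    # Clip negative (dispersed) values to 0; the mean then lies in [0, 1]
    return float(np.clip(np.asarray(scores), 0, 1).mean())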

@MxMstrmn mentioned this pull request May 3, 2022