Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wilcoxon-Rank-Sum test in rank_genes_groups: suspicious code & ties not corrected #698

Closed
idavydov opened this issue Jun 19, 2019 · 8 comments
Labels
Area – Differential Expression Differential expression

Comments

@idavydov
Copy link

Hi,
Thanks a lot for the library.

We're having some issues with Wilcoxon-Rank-Sum test in rank_genes_groups. And I noticed a suspicious code in the implementation:

scores[left:right] = np.sum(ranks.loc[0:n_active, :])

Shouldn't it be .iloc?

Additionally, it seems there is no tie correction in the code. I think for sparse data correction this could be an issue.

image

There is an implementation of tiecorrect in scipy.

Thanks,
Iakov

@idavydov idavydov changed the title Wilcoxon-Rank-Sum test in rank_genes_groups: suspicious code & tie not corrected Wilcoxon-Rank-Sum test in rank_genes_groups: suspicious code & ties not corrected Jun 19, 2019
@idavydov
Copy link
Author

idavydov commented Jun 19, 2019

I wrote this code for tie correction; it is inspired by scipy's implementation.

def matrix_tiecorrect(rankvals):
    size = np.float64(rankvals.shape[1])
    if size < 2:
        return np.repeat(rankvals.shape[0], 1.0)

    arr = np.sort(rankvals, axis=1)
    tf = np.insert(arr[:, 1:] != arr[:, :-1], (0, arr.shape[1]-1), True, axis=1)
    idx = np.where(tf, np.arange(tf.shape[1]), 0)
    idx = np.sort(idx, axis=1)
    cnt = np.diff(idx, axis=1).astype(np.float64)

    return 1.0 - (cnt**3 - cnt).sum(axis=1) / (size**3 - size)

@ivirshup
Copy link
Member

ivirshup commented Jul 5, 2019

@idavydov, I think .loc and .iloc would be equivalent in this case, since the index should just be sorted integers. Do you have a case where changing it makes a difference?

I'm not too sure about why we don't do tie correction. I think there was some discussion of that here: #460

@idavydov
Copy link
Author

Hi @ivirshup,

It took us some time to come up with a reproducible example. But then we realized that this behavior only is present in scanpy==1.4 (and perhaps earlier). In 1.4.1 and 1.4.3 crazy Z-scores are gone. The results seem to be correct.

Regarding ties. Quickly scanning through the thread I didn't find any mentions of tie correction (maybe I missed something).

In any case, please consider the crazy Z-score issue resolved. Tie correction still could be discussed, I think.

@ivirshup
Copy link
Member

Great, just wanted to make sure that we had that out of the way first.

About the tie correction, I'm not the most knowledgeable person about our differential expression testing. Maybe @falexwolf or @a-munoz-rojas would be able to comment on this?

@idavydov, what do you think our results should be? Is there a gold standard in scipy.stats which we should be returning the same results as?

@idavydov
Copy link
Author

scipy.stats.mannwhitneyu uses tie correction (and there is no way to disable it). I think the better option would be to make it enabled by default and have an option to disable it.

@a-munoz-rojas
Copy link
Contributor

Hi! Sorry for the very long delay in replying. Indeed, the version in the code doesn't do tie correction. A while ago when originally implementing some changes to this function, we tried using the scipy.stats.mannwhitneyumethod, but it was significantly slower so we kept the current version instead. If there is a way to improve the performance of scipy version, it might be worth trying

@idavydov
Copy link
Author

idavydov commented Sep 6, 2019

Hi @a-munoz-rojas,

I don't think there is a way to speed-up scipy.stats.mannwhitney, as it expects 1d vectors; not a matrix.

Regarding ties, this is a simple multiplier. So should be easy to implement or use from scipy.stats.

I have a matrix version of scipy.stats.mannwhitney and scipy.stats.tiecorrect which is almost a 1-to-1 rewrite. I can share it in case you are interested.

@a-munoz-rojas
Copy link
Contributor

Just brining this thread back up - I think it would be useful to have tie-correction in the code. @falexwolf what do you think? If we agree, @idavydov would you be able submit a pull request to implement it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area – Differential Expression Differential expression
Projects
None yet
Development

No branches or pull requests

4 participants