Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional parameters in some functions - new functions #460

Open
andrea-tango opened this issue Feb 4, 2019 · 23 comments · May be fixed by #2803
Open

Additional parameters in some functions - new functions #460

andrea-tango opened this issue Feb 4, 2019 · 23 comments · May be fixed by #2803

Comments

@andrea-tango
Copy link

Dear all,

I am writing to ask you some other functionalities.
I have just moved from Seurat to Scanpy and I am finding Scanpy a very nice and well done Python package.

  1. I wrote a function to show the 3D plot of the UMAP, tSNE and PCA spaces. In the scanpy.tl.tsne function is not possible to change the number of components, it calculates only the first two components, even if the scanpy.pl.tsne function has a parameter component. May you add a parameter like the n_components of the scanpy.tl.umap function?

  2. In the rank_genes_groups function the log2FC values are provided only for ‘t-test’ based methods. May you return the log2FC values (maybe named log2FC) for all the implemented statistical methods?

  3. I think that two parameters in the rank_genes_groups function should be added.

    • min_pCells to test only the genes that are detected in a minimum fraction of cells of either of the two populations (e.g., cluster 0 vs rest). For instance, min_pCells=0.3 means that at least 30% of the cells must express that gene.
    • positive, if it is True, the function should return only positive marker genes for each population.
  4. A function showing the volcano plots (based on the log2FC) can help (I can write it if the log2FC values are provided).

Thank you in advance.
Best,
Andrea

@falexwolf
Copy link
Member

Thank you for the constructive wish list! All of it sounds doable. Would you feel comfortable with adding this functionality yourself? A pull request would be welcome!

@andrea-tango
Copy link
Author

Thank you for the prompt reply, I am going to open a pull request.
I can work on points 1 and 4. May you help me with the other 2 points?

Thanks!

@MichaelPeibo
Copy link

MichaelPeibo commented Mar 10, 2019

@andrea-tango
Really awesome!
I am also wondering to find some parameters to tune in scanpy's rank_genes_groups like in Seurat.

Because I found there is some difference in makers by scanpy's default(using wilcoxon ) and Seurat's default parameters only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25. (Seurat's default method is wilcoxon), in this case, I can find interesting markers calculated by Seurat but not in Scanpy's.

However, when I tried scanpy's logreg method, I found many overlap DEGs between two calculations, aka, scanpy's logreg and Seurat's wilcox.

Have you ever came into similar results?

@falexwolf
Copy link
Member

Interesting, @MichaelPeibo! Would you share these results somewhere publicly? A notebook on GitHub?

@andrea-tango. Yes, we should get the functionality of 2 and 3 functionality into rank_genes_groups. @Koncopd, could you work on this?

@andrea-tango
Copy link
Author

@MichaelPeibo @falexwolf I started working on points 2 and 3, but it is better if you will work on these points.
I wrote the code for points 1 and 4.
In order to generate volcano plots, I calculated the log2FC relying on the diffxpy library.
I can push again the code for tSNE and also the code for volcano plots.

Please, check the rank_genes_groups function.
Considering 2 groups of cells and using the Wilcoxon test (de.test.wilcoxon) provided by the diffxpy library, I obtained different marker genes with respect to those calculated by using rank_genes_groups function (Wilcoxon test).

Many thanks.

@Koncopd
Copy link
Member

Koncopd commented Mar 10, 2019

@falexwolf
Hi, yes.

@falexwolf
Copy link
Member

Great, thank you, @andrea-tango and @Koncopd!

@andrea-tango, would you make a PR? We can then look at how you solved this. In principle, I'm very hesitant to add diffxpy as a dependency of Scanpy. It depends on Tensorflow itself, which is a large dependency. What would be OK would be to have a wrapper in scanpy.external, but I don't know whether this makes sense. Why not using diffxpys Volcano plots right away?

Regarding the discrepancy between wilxocon in diffxpy and scanpy. There obviously shouldn't be any and there also shouldn't be duplicated code, here, at all. The only reason that Scanpy has its own Wilcoxon implementation was that there was no implementation available that would scale to large sparse data. That's why @tcallies wrote the present implementation about 1.5 years ago. He benchmarked with scipy's Wilcoxon test. @davidsebfischer, can you shed light on why and how you implemented your Wilcoxon? Shouldn't we have a comparison? At the time, @tcallies wrote this and these tests. How did you write your tests?

We were just talking about log2FC, which is such a simple quantity and should evidently be properly computed by rank_genes_groups. We just had this other PR on it (#519).

@tcallies, any thoughts from your side?

@andrea-tango
Copy link
Author

andrea-tango commented Mar 10, 2019

@falexwolf I agree with you about the diffxpy a scanpy dependencies, Tensorflow is a very important dependency!

would you make a PR?

I did it, I pushed the code where I added the parameter n_components for scanpy.tl.tsne function.

Why not using diffxpy Volcano plots right away?

I wrote a function in which you can change the colour of the genes, you can add the names of the genes etc.

How did you write your tests?

I tried them on data coming from the lab in which I am working.
I can write a jupyter notebook using public dataset and push it on my copy of the scanpy repository.
Give me a couple of days.

@falexwolf
Copy link
Member

Oh, thanks! Sorry for the long downtime, the whole family was sick... I'm going through the PR now.

The tests question was actually targeted towards @davidsebfischer, but thanks anyways! The comparison question was also targeted to @davidsebfischer, @tcallies. But if you do it, @andrea-tango, awesome!

@gokceneraslan
Copy link
Collaborator

Very good suggestions. One small question:

  • positive, if it is True, the function should return only positive marker genes for each population.

Isn't this rankby_abs=False which we have by default in rank_genes_groups? (and which was actually called only_positive=True previously https://github.com/theislab/scanpy/blob/master/scanpy/tools/_rank_genes_groups.py#L100)

@MichaelPeibo
Copy link

Hi @falexwolf
I am sending you the simpler code ,my anndata which can be reproduced in some way, and csv marker file calculated by Seurat to your email, you might to repeat my marker analysis if you would like

Sorry for doing it in this way, I not familiar about how to make public notebook...and our data is too preliminary to be public.

@davidsebfischer
Copy link

Hi all, thanks for the mention @andrea-tango! If have been making multiple changes in diffxpy and batchglm recently, the following refers to the branch diffxpy dev, I haven' merged all of this into master yet as I am waiting for some last issues to be fixed.

  1. Are diffxpy's tests correct: We are using unit tests to check a) that all tests fullfill the null model. b) for standard tests that do not require model fits (Welch's t-test and rank sum test) we check that we produce the same results as scipy. We vectorise where possible but we do actually directly call scipy in the rank sum test right now. Both of these tests check out on the dev branch for the rank sum test, so this is working correctly. @falexwolf this might be a unit test that you could also introduce? I did not see this in the ones that you linked but I just glanced over. @andrea-tango please use dev right now.

  2. Nature of the rank sum test. We previously called the Wilcoxon rank sum test de.test.wilcoxon, note that this is also alternatively named Mann-Whitney U test (MWU). Importantly, MWU is for independent samples, which we always have in scRNAseq, the "wilcoxon" test in scipy is for paired samples. We therefore renamed this to de.test.rank_sum to avoid confusion. Which one are you using @falexwolf?

  3. Comparison scanpy vs diffxpy (in unit tests). I would discourage this and compare against scipy because I would consider scipy the gold standard for statistical testing. I can run comparisons if the above comments do not resolve all issues discussed here, which would imply that we might differ in our wrapping in the tests.

@andrea-tango
Copy link
Author

andrea-tango commented Mar 11, 2019

Hi @davidsebfischer, I am writing a simple jupyter notebook where I am analysing the 10x_pbmc68k_reduced.h5ad data.

I selected only clusters 0 and 1:
twoClusters = adata[np.logical_or(adata.obs.louvain == '0', adata.obs.louvain == '1')]

Running sc.tl.rank_genes_groups(twoClusters, groupby='louvain, method='wilcoxon', corr_method=''bonferroni), I obtained the following genes

image

Trying with diffxxy library,
test = de.test.wilcoxon(data=twoClusters, grouping="louvain")
there is the following error: All numbers are identical in mannwhitneyu

@andrea-tango please use dev right now.

For this test, I used the version downloaded with pip.
I can clone the repository and use the diffxpy dev branch.

@LuckyMD
Copy link
Contributor

LuckyMD commented Mar 11, 2019

Just a note... I also have some code to make a volcano plot in line [111] here. Don't know if this is still needed, but I thought I would see if someone cares.

@davidsebfischer do you allow labels in your volcano plot function? And can it take any object, or only some custom diffxpy ones? If these things aren't a problem I'd prefer to use yours tbh.

@fidelram
Copy link
Collaborator

@andrea-tango @MichaelPeibo To address the filtering of rank_genes_groups (eg. min.pct = 0.25, logfc.threshold = 0.25) I recently added a function called sc.tl.filter_rank_genes_groups. See #425

@falexwolf I don't know whysc.tl.filter_rank_genes_groups does not show up in the docs. I will take a look. Also, I just noticed that this PR with updated examples is still open. I think it would be useful to merge: scverse/scanpy_usage#11

@LuckyMD
Copy link
Contributor

LuckyMD commented Mar 11, 2019

does sc.tl.filter_rank_genes_groups filter the sc.tl.rank_genes_groups result? Or does it recompute? The former would not alleviate the multiple testing burden.

@fidelram
Copy link
Collaborator

fidelram commented Mar 11, 2019 via email

@LuckyMD
Copy link
Contributor

LuckyMD commented Mar 11, 2019

If p-values are regarded as a valuable output rather than just the ranks, it might be worth recomputing as thresholding would ease the multiple testing burden. I guess that's the idea behind the min_pCells parameter request.

@andrea-tango
Copy link
Author

@LuckyMD your tutorial is very interesting!
I agree with you regarding the min_pCells parameter, it should be used as a filtering step before calculating the marker genes.

@Koncopd
Copy link
Member

Koncopd commented Mar 12, 2019

@falexwolf @andrea-tango
I have a question regarding point 2 (log2FC values in rank_genes_groups). I see that only 'logreg' method doesn't return logfoldchanges. But logfoldchanges don't seem natural for 'logreg' as this method doesn't even use reference for calculation.

@ivirshup
Copy link
Member

Sorry this is a little off topic, but it's something I've found useful:

@LuckyMD and anyone else looking for an interactive volcano plot, I've been using hvplot in my notebooks. A volcano plot can be made from DE data frame with something like:

de_df.hvplot.scatter(
    "logfoldchanges", "pvals_adj", 
    flip_yaxis=True, logy=True, 
    hover_cols=["names"]
)
Complete example using scanpy
import pandas as pd
import numpy as np
import hvplot.pandas
import scanpy as sc

def rank_genes_groups_df(adata, group, pval_cutoff : float =None, logfc_cutoff=None): 
    d = pd.DataFrame() 
    for k in ['scores', 'names', 'logfoldchanges', 'pvals', 'pvals_adj']: 
        d[k] = adata.uns["rank_genes_groups"][k][group] 
    if pval_cutoff is not None: 
        d = d[d["pvals_adj"] < pval_cutoff] 
    if logfc_cutoff is not None: 
        d = d[d["logfoldchanges"].abs() > logfc_cutoff] 
    return d

pbmcs = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(pbmcs, "bulk_labels", n_genes=pbmcs.var_names.size)
de_df = rank_genes_groups_df(pbmcs, "CD34+")

de_df.hvplot.scatter(
    "logfoldchanges", "pvals_adj", 
    flip_yaxis=True, logy=True, 
    hover_cols=["names"]
)

@falexwolf
Copy link
Member

@ivirshup, this is cool and very useful. Would you make a notebook "Interactive plotting" for http://scanpy-tutorials.readthedocs.io? And we'd link to it from https://scanpy.readthedocs.io/en/latest/tutorials.html as done for the other notebooks. You could also add the example from #510 (comment). As time passes, this could grow. But these two little examples are a very useful start, I think.

@falexwolf
Copy link
Member

@Koncopd, updating logreg with a proper implementation accounting for reference is another story. We can talk about it sometime soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants