Additional parameters in some functions - new functions #460

andrea-tango · 2019-02-04T17:57:35Z

Dear all,

I am writing to ask you some other functionalities.
I have just moved from Seurat to Scanpy and I am finding Scanpy a very nice and well done Python package.

I wrote a function to show the 3D plot of the UMAP, tSNE and PCA spaces. In the scanpy.tl.tsne function is not possible to change the number of components, it calculates only the first two components, even if the scanpy.pl.tsne function has a parameter component. May you add a parameter like the n_components of the scanpy.tl.umap function?
In the rank_genes_groups function the log2FC values are provided only for ‘t-test’ based methods. May you return the log2FC values (maybe named log2FC) for all the implemented statistical methods?
I think that two parameters in the rank_genes_groups function should be added.
- min_pCells to test only the genes that are detected in a minimum fraction of cells of either of the two populations (e.g., cluster 0 vs rest). For instance, min_pCells=0.3 means that at least 30% of the cells must express that gene.
- positive, if it is True, the function should return only positive marker genes for each population.
A function showing the volcano plots (based on the log2FC) can help (I can write it if the log2FC values are provided).

Thank you in advance.
Best,
Andrea

The text was updated successfully, but these errors were encountered:

falexwolf · 2019-02-05T01:10:43Z

Thank you for the constructive wish list! All of it sounds doable. Would you feel comfortable with adding this functionality yourself? A pull request would be welcome!

andrea-tango · 2019-02-05T09:50:47Z

Thank you for the prompt reply, I am going to open a pull request.
I can work on points 1 and 4. May you help me with the other 2 points?

Thanks!

MichaelPeibo · 2019-03-10T02:47:43Z

@andrea-tango
Really awesome!
I am also wondering to find some parameters to tune in scanpy's rank_genes_groups like in Seurat.

Because I found there is some difference in makers by scanpy's default(using wilcoxon ) and Seurat's default parameters only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25. (Seurat's default method is wilcoxon), in this case, I can find interesting markers calculated by Seurat but not in Scanpy's.

However, when I tried scanpy's logreg method, I found many overlap DEGs between two calculations, aka, scanpy's logreg and Seurat's wilcox.

Have you ever came into similar results?

falexwolf · 2019-03-10T15:56:26Z

Interesting, @MichaelPeibo! Would you share these results somewhere publicly? A notebook on GitHub?

@andrea-tango. Yes, we should get the functionality of 2 and 3 functionality into rank_genes_groups. @Koncopd, could you work on this?

andrea-tango · 2019-03-10T16:34:50Z

@MichaelPeibo @falexwolf I started working on points 2 and 3, but it is better if you will work on these points.
I wrote the code for points 1 and 4.
In order to generate volcano plots, I calculated the log2FC relying on the diffxpy library.
I can push again the code for tSNE and also the code for volcano plots.

Please, check the rank_genes_groups function.
Considering 2 groups of cells and using the Wilcoxon test (de.test.wilcoxon) provided by the diffxpy library, I obtained different marker genes with respect to those calculated by using rank_genes_groups function (Wilcoxon test).

Many thanks.

Koncopd · 2019-03-10T16:37:37Z

@falexwolf
Hi, yes.

falexwolf · 2019-03-10T16:49:25Z

Great, thank you, @andrea-tango and @Koncopd!

@andrea-tango, would you make a PR? We can then look at how you solved this. In principle, I'm very hesitant to add diffxpy as a dependency of Scanpy. It depends on Tensorflow itself, which is a large dependency. What would be OK would be to have a wrapper in scanpy.external, but I don't know whether this makes sense. Why not using diffxpys Volcano plots right away?

Regarding the discrepancy between wilxocon in diffxpy and scanpy. There obviously shouldn't be any and there also shouldn't be duplicated code, here, at all. The only reason that Scanpy has its own Wilcoxon implementation was that there was no implementation available that would scale to large sparse data. That's why @tcallies wrote the present implementation about 1.5 years ago. He benchmarked with scipy's Wilcoxon test. @davidsebfischer, can you shed light on why and how you implemented your Wilcoxon? Shouldn't we have a comparison? At the time, @tcallies wrote this and these tests. How did you write your tests?

We were just talking about log2FC, which is such a simple quantity and should evidently be properly computed by rank_genes_groups. We just had this other PR on it (#519).

@tcallies, any thoughts from your side?

andrea-tango · 2019-03-10T17:08:32Z

@falexwolf I agree with you about the diffxpy a scanpy dependencies, Tensorflow is a very important dependency!

would you make a PR?

I did it, I pushed the code where I added the parameter n_components for scanpy.tl.tsne function.

Why not using diffxpy Volcano plots right away?

I wrote a function in which you can change the colour of the genes, you can add the names of the genes etc.

How did you write your tests?

I tried them on data coming from the lab in which I am working.
I can write a jupyter notebook using public dataset and push it on my copy of the scanpy repository.
Give me a couple of days.

falexwolf · 2019-03-10T17:39:18Z

Oh, thanks! Sorry for the long downtime, the whole family was sick... I'm going through the PR now.

The tests question was actually targeted towards @davidsebfischer, but thanks anyways! The comparison question was also targeted to @davidsebfischer, @tcallies. But if you do it, @andrea-tango, awesome!

gokceneraslan · 2019-03-10T19:10:53Z

Very good suggestions. One small question:

positive, if it is True, the function should return only positive marker genes for each population.

Isn't this rankby_abs=False which we have by default in rank_genes_groups? (and which was actually called only_positive=True previously https://github.com/theislab/scanpy/blob/master/scanpy/tools/_rank_genes_groups.py#L100)

MichaelPeibo · 2019-03-11T01:05:56Z

Hi @falexwolf
I am sending you the simpler code ,my anndata which can be reproduced in some way, and csv marker file calculated by Seurat to your email, you might to repeat my marker analysis if you would like

Sorry for doing it in this way, I not familiar about how to make public notebook...and our data is too preliminary to be public.

davidsebfischer · 2019-03-11T10:36:27Z

Hi all, thanks for the mention @andrea-tango! If have been making multiple changes in diffxpy and batchglm recently, the following refers to the branch diffxpy dev, I haven' merged all of this into master yet as I am waiting for some last issues to be fixed.

Are diffxpy's tests correct: We are using unit tests to check a) that all tests fullfill the null model. b) for standard tests that do not require model fits (Welch's t-test and rank sum test) we check that we produce the same results as scipy. We vectorise where possible but we do actually directly call scipy in the rank sum test right now. Both of these tests check out on the dev branch for the rank sum test, so this is working correctly. @falexwolf this might be a unit test that you could also introduce? I did not see this in the ones that you linked but I just glanced over. @andrea-tango please use dev right now.
Nature of the rank sum test. We previously called the Wilcoxon rank sum test de.test.wilcoxon, note that this is also alternatively named Mann-Whitney U test (MWU). Importantly, MWU is for independent samples, which we always have in scRNAseq, the "wilcoxon" test in scipy is for paired samples. We therefore renamed this to de.test.rank_sum to avoid confusion. Which one are you using @falexwolf?
Comparison scanpy vs diffxpy (in unit tests). I would discourage this and compare against scipy because I would consider scipy the gold standard for statistical testing. I can run comparisons if the above comments do not resolve all issues discussed here, which would imply that we might differ in our wrapping in the tests.

andrea-tango · 2019-03-11T12:28:19Z

Hi @davidsebfischer, I am writing a simple jupyter notebook where I am analysing the 10x_pbmc68k_reduced.h5ad data.

I selected only clusters 0 and 1:
twoClusters = adata[np.logical_or(adata.obs.louvain == '0', adata.obs.louvain == '1')]

Running sc.tl.rank_genes_groups(twoClusters, groupby='louvain, method='wilcoxon', corr_method=''bonferroni), I obtained the following genes

Trying with diffxxy library,
test = de.test.wilcoxon(data=twoClusters, grouping="louvain")
there is the following error: All numbers are identical in mannwhitneyu

@andrea-tango please use dev right now.

For this test, I used the version downloaded with pip.
I can clone the repository and use the diffxpy dev branch.

LuckyMD · 2019-03-11T12:38:30Z

Just a note... I also have some code to make a volcano plot in line [111] here. Don't know if this is still needed, but I thought I would see if someone cares.

@davidsebfischer do you allow labels in your volcano plot function? And can it take any object, or only some custom diffxpy ones? If these things aren't a problem I'd prefer to use yours tbh.

fidelram · 2019-03-11T13:09:31Z

@andrea-tango @MichaelPeibo To address the filtering of rank_genes_groups (eg. min.pct = 0.25, logfc.threshold = 0.25) I recently added a function called sc.tl.filter_rank_genes_groups. See #425

@falexwolf I don't know whysc.tl.filter_rank_genes_groups does not show up in the docs. I will take a look. Also, I just noticed that this PR with updated examples is still open. I think it would be useful to merge: scverse/scanpy_usage#11

LuckyMD · 2019-03-11T14:47:05Z

does sc.tl.filter_rank_genes_groups filter the sc.tl.rank_genes_groups result? Or does it recompute? The former would not alleviate the multiple testing burden.

fidelram · 2019-03-11T17:37:15Z

does not recompute, simply saves the filtered data under adata.uns['rank_genes_groups_filtered']. Thus, different parameters can be tested quickly. Off course, sc.tl.rank_genes_groups has to be call first.

…

On Mon, Mar 11, 2019 at 3:47 PM MalteDLuecken ***@***.***> wrote: does sc.tl.filter_rank_genes_groups filter the sc.tl.rank_genes_groups result? Or does it recompute? The former would not alleviate the multiple testing burden. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#460 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEu_1debC8DotLkQywhO8zJpEvfkBbSHks5vVmxpgaJpZM4ahuSs> .

LuckyMD · 2019-03-11T18:30:41Z

If p-values are regarded as a valuable output rather than just the ranks, it might be worth recomputing as thresholding would ease the multiple testing burden. I guess that's the idea behind the min_pCells parameter request.

andrea-tango · 2019-03-12T01:14:59Z

@LuckyMD your tutorial is very interesting!
I agree with you regarding the min_pCells parameter, it should be used as a filtering step before calculating the marker genes.

Koncopd · 2019-03-12T10:14:19Z

@falexwolf @andrea-tango
I have a question regarding point 2 (log2FC values in rank_genes_groups). I see that only 'logreg' method doesn't return logfoldchanges. But logfoldchanges don't seem natural for 'logreg' as this method doesn't even use reference for calculation.

ivirshup · 2019-03-16T04:37:40Z

Sorry this is a little off topic, but it's something I've found useful:

@LuckyMD and anyone else looking for an interactive volcano plot, I've been using hvplot in my notebooks. A volcano plot can be made from DE data frame with something like:

de_df.hvplot.scatter(
    "logfoldchanges", "pvals_adj", 
    flip_yaxis=True, logy=True, 
    hover_cols=["names"]
)

Complete example using scanpy

import pandas as pd
import numpy as np
import hvplot.pandas
import scanpy as sc

def rank_genes_groups_df(adata, group, pval_cutoff : float =None, logfc_cutoff=None): 
    d = pd.DataFrame() 
    for k in ['scores', 'names', 'logfoldchanges', 'pvals', 'pvals_adj']: 
        d[k] = adata.uns["rank_genes_groups"][k][group] 
    if pval_cutoff is not None: 
        d = d[d["pvals_adj"] < pval_cutoff] 
    if logfc_cutoff is not None: 
        d = d[d["logfoldchanges"].abs() > logfc_cutoff] 
    return d

pbmcs = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(pbmcs, "bulk_labels", n_genes=pbmcs.var_names.size)
de_df = rank_genes_groups_df(pbmcs, "CD34+")

de_df.hvplot.scatter(
    "logfoldchanges", "pvals_adj", 
    flip_yaxis=True, logy=True, 
    hover_cols=["names"]
)

falexwolf · 2019-03-19T10:37:46Z

@ivirshup, this is cool and very useful. Would you make a notebook "Interactive plotting" for http://scanpy-tutorials.readthedocs.io? And we'd link to it from https://scanpy.readthedocs.io/en/latest/tutorials.html as done for the other notebooks. You could also add the example from #510 (comment). As time passes, this could grow. But these two little examples are a very useful start, I think.

falexwolf · 2019-03-19T10:55:41Z

@Koncopd, updating logreg with a proper implementation accounting for reference is another story. We can talk about it sometime soon.

davidsebfischer mentioned this issue Mar 11, 2019

volcano plots specifications theislab/diffxpy#81

Open

ivirshup mentioned this issue Jul 5, 2019

Wilcoxon-Rank-Sum test in rank_genes_groups: suspicious code & ties not corrected #698

Closed

This was referenced Jan 11, 2024

Add tSNE components parameter #2802

Closed

tSNE components parameter #2803

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Additional parameters in some functions - new functions #460

Additional parameters in some functions - new functions #460

andrea-tango commented Feb 4, 2019

falexwolf commented Feb 5, 2019

andrea-tango commented Feb 5, 2019

MichaelPeibo commented Mar 10, 2019 •

edited

Loading

falexwolf commented Mar 10, 2019

andrea-tango commented Mar 10, 2019

Koncopd commented Mar 10, 2019

falexwolf commented Mar 10, 2019

andrea-tango commented Mar 10, 2019 •

edited

Loading

falexwolf commented Mar 10, 2019

gokceneraslan commented Mar 10, 2019

MichaelPeibo commented Mar 11, 2019

davidsebfischer commented Mar 11, 2019

andrea-tango commented Mar 11, 2019 •

edited

Loading

LuckyMD commented Mar 11, 2019 •

edited

Loading

fidelram commented Mar 11, 2019

LuckyMD commented Mar 11, 2019

fidelram commented Mar 11, 2019 via email

LuckyMD commented Mar 11, 2019

andrea-tango commented Mar 12, 2019

Koncopd commented Mar 12, 2019

ivirshup commented Mar 16, 2019

falexwolf commented Mar 19, 2019

falexwolf commented Mar 19, 2019

Additional parameters in some functions - new functions #460

Additional parameters in some functions - new functions #460

Comments

andrea-tango commented Feb 4, 2019

falexwolf commented Feb 5, 2019

andrea-tango commented Feb 5, 2019

MichaelPeibo commented Mar 10, 2019 • edited Loading

falexwolf commented Mar 10, 2019

andrea-tango commented Mar 10, 2019

Koncopd commented Mar 10, 2019

falexwolf commented Mar 10, 2019

andrea-tango commented Mar 10, 2019 • edited Loading

falexwolf commented Mar 10, 2019

gokceneraslan commented Mar 10, 2019

MichaelPeibo commented Mar 11, 2019

davidsebfischer commented Mar 11, 2019

andrea-tango commented Mar 11, 2019 • edited Loading

LuckyMD commented Mar 11, 2019 • edited Loading

fidelram commented Mar 11, 2019

LuckyMD commented Mar 11, 2019

fidelram commented Mar 11, 2019 via email

LuckyMD commented Mar 11, 2019

andrea-tango commented Mar 12, 2019

Koncopd commented Mar 12, 2019

ivirshup commented Mar 16, 2019

falexwolf commented Mar 19, 2019

falexwolf commented Mar 19, 2019

MichaelPeibo commented Mar 10, 2019 •

edited

Loading

andrea-tango commented Mar 10, 2019 •

edited

Loading

andrea-tango commented Mar 11, 2019 •

edited

Loading

LuckyMD commented Mar 11, 2019 •

edited

Loading