SC2P
is a package designed for testing differential expression (DE) for data from single-cell RNA-seq experiment. It provides functionalities for testing DE in two phases: (1) phase transition (difference in the probabilities of being expressed); and (2) magnitude tuning (difference in the levels of expression once the gene is on).
Single-cell RNA-sequencing (scRNA-seq) has emerged recently as a powerful technology to investigate transcriptomic variation at the individual cell level. Compared to traditional bulk RNA-seq, scRNA-seq reveals much detailed information for inter-cellular heterogeneities. scRNA-seq data show clear evidence of binary status of transcription, which we refer to as phases in transcription: Phase I corresponds to low level non-specific transcription (for example, as a result of random initiation), and Phase II corresponds to targeted specific transcription. The regulation of transcription includes a phase transition between Phase I to Phase II, as well as continuous regulation within Phase II. Both are important regulatory mechanisms that need to be identified in the DE analysis.
The SC2P package identify DE genes in two phases seperately. It implements a rigorous statistical method to determine the phases for all genes in all cells in a data-driven way, with consideration of cell- and gene-specific characteristics. Compared with methods using an ad hoc thresholds to determine phases, SC2P achieves better sensitivity and accuracy.
Run following commands in R:
library(devtools)
install_github("haowulab/SC2P", build_vignettes=TRUE)
The second line might take a little time and install some extra packages for building the package vignette.
To view the package vignette, do
library(SC2P)
vignette("SC2P")
The content in this README file is essentially the same as the package vignette.
SC2P starts from a count matrix for gene expressionsm, and a data frame for cell information. In the count matrix, each row corresponds to a gene and each column corresponds to a cell. Each row of the cell information data frame contains the annotation for a cell. The number of columns of the count matrix and the number of rows for the cell data frame must match. Below we will use a small part of a public human brain dataset (GSE67835) to illustrate the workflow of SC2P. The data is distributed with SC2P as brain_scRNAseq
.
1. Load library and example data
library(SC2P)
data(brain_scRNAseq)
2. Create an ExpressionSet
object out of the count matrix and cell information data frame.
colnames(Y) <- rownames(design)
phenoData <- new("AnnotatedDataFrame", data=design)
eset <- ExpressionSet(assayData=Y, phenoData=phenoData)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 10483 features, 100 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: 15 62 ... 310 (100 total)
## varLabels: tissue celltype ... adult (6 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
3. estimate the phases for all genes in all cells using eset2Phase
function. This returns an object of sc2pSet
.
data <- eset2Phase(eset)
data
## sc2pSet (storageMode: lockedEnvironment)
## assayData: 10483 features, 100 samples
## element names: exprs, Offset, Z
## protocolData: none
## phenoData
## sampleNames: 15 62 ... 310 (100 total)
## varLabels: tissue celltype ... L (9 total)
## varMetadata: labelDescription
## featureData
## featureNames: A2M A2ML1 ... alignment_not_unique (10483 total)
## fvarLabels: mean sd
## fvarMetadata:
## c..shrinkage.estimated.foreground.mean....shrinkage.estimated.foreground.standard.deviation.
## labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
The phase estimation result can be visualized using zyPlot
function, which plots the posterior probability of being expressed versus log expression. Two groups will be shown in different colors.
zyPlot(rownames(data)[1], data, group.name="celltype")
4. test DE in two phases using twoPhaseDE
function.
Here - design
is a character vector of variable names in pData(norm2) to provide regression covariates in the DE test. - test.which
is an integer that points to the location of the to-be-tested binary variable in argument design
. - offset
indicates the method to compute normalization factor.
de.sc2p <- twoPhaseDE(data, design="celltype", test.which=1, offset="sf")
The function returns a data frame, each row is for a gene. The rows are sorted by the gene name, not significance level.
5. Top ranked genes can be obtained using topGene
function, and visualized by visGene
function.
There are differnt options in topGene
:
- To list the top 5 DE genes in phase I:
topGene(de.sc2p, phase=1, number=5)
## Gene.name p1 p2 Ph1.coef Ph1.fdr m1 m2 Ph2.coef marLogFC
## 1 GPR98 1.000 0.053 -24.5 8.0e-23 10.6 5.6 -4.98 -10.12
## 2 F3 0.984 0.026 -7.7 9.4e-23 10.0 7.1 -2.90 -9.47
## 3 ENPP2 0.048 1.000 23.5 1.3e-22 5.7 10.2 4.49 8.96
## 4 LRRC16A 0.952 0.000 -23.5 1.3e-22 9.7 NA NA -9.11
## 5 RYR3 0.968 0.026 -7.0 7.1e-22 9.2 3.1 -6.05 -8.71
## Ph2.fdr Phase1 Phase2 Comb.fdr Category
## 1 5.3e-04 Y Y 8.0e-23 --
## 2 4.9e-01 Y N 1.3e-22 --
## 3 7.9e-04 Y Y 2.0e-22 ++
## 4 NA Y NA 2.0e-22 NA
## 5 2.4e-02 Y Y 1.2e-21 --
- To list the top 5 DE genes in phase II:
topGene(de.sc2p, phase=2, number=5)
## Gene.name p1 p2 Ph1.coef Ph1.fdr m1 m2 Ph2.coef marLogFC
## 1 SLC1A3 1.000 0.684 -19.79 6.9e-06 11.8 7.2 -4.6 -6.528
## 2 CLU 1.000 0.605 -20.14 2.0e-07 11.5 6.8 -4.7 -6.978
## 3 SPARCL1 1.000 0.316 -21.34 2.9e-14 12.2 7.3 -4.9 -9.247
## 4 PIP4K2A 0.548 1.000 19.37 2.3e-07 8.3 12.3 4.0 6.872
## 5 TMEM144 0.371 1.000 20.09 3.6e-11 6.8 11.5 4.7 8.215
## Ph2.fdr Phase1 Phase2 Comb.fdr Category
## 1 5.5e-23 Y Y 8.0e-23 --
## 2 1.6e-16 Y Y 2.0e-17 --
## 3 1.5e-15 Y Y 2.1e-16 --
## 4 1.9e-15 Y Y 3.2e-16 ++
## 5 3.7e-15 Y Y 6.3e-16 ++
- To list the top 5 DE genes in both phases:
topGene(de.sc2p, phase="both", number=5)
## Gene.name p1 p2 Ph1.coef Ph1.fdr m1 m2 Ph2.coef marLogFC
## 1 GPR98 1.000 0.053 -24.46 8.0e-23 10.6 5.6 -4.98 -10.119
## 2 SLC1A3 1.000 0.684 -19.79 6.9e-06 11.8 7.2 -4.59 -6.528
## 3 F3 0.984 0.026 -7.72 9.4e-23 10.0 7.1 -2.90 -9.473
## 4 ENPP2 0.048 1.000 23.54 1.3e-22 5.7 10.2 4.49 8.955
## 5 LRRC16A 0.952 0.000 -23.54 1.3e-22 9.7 NA NA -9.109
## Ph2.fdr Phase1 Phase2 Comb.fdr Category
## 1 5.3e-04 Y Y 8.0e-23 --
## 2 5.5e-23 Y Y 8.0e-23 --
## 3 4.9e-01 Y N 1.3e-22 --
## 4 7.9e-04 Y Y 2.0e-22 ++
## 5 NA Y NA 2.0e-22 NA
To visualize expression distribution using visGene
function:
- Top DE gene in phase I
visGene(topGene(de.sc2p, 1)$Gene.name[1], data, group.name="celltype")
- Top DE gene in phase II
visGene(topGene(de.sc2p, 2)$Gene.name[1], data, group.name="celltype")