de_analysis_speaqeasy.Rmd

---
title: 'Differential Expression Analysis'
author: 
  - name: Joshua M. Stolz
    affiliation:
    - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: '9/18/2020'
---

```{r 'setup', echo = FALSE, warning = FALSE, message = FALSE}
timestart <- Sys.time()

## Bib setup
library("knitcitations")

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(
    hyperlink = "to.doc",
    citation_format = "text",
    style = "html"
)

## Write bibliography information
bibs <- c(
    BiocStyle = citation("BiocStyle"),
    clusterProfiler = citation("clusterProfiler"),
    devtools = citation("devtools"),
    edgeR = citation("edgeR"),
    here = citation("here"),
    jaffelab = citation("jaffelab"),
    knitcitations = citation("knitcitations"),
    limma = citation("limma"),
    org.Hs.eg.db = citation("org.Hs.eg.db"),
    pheatmap = citation("pheatmap"),
    R = citation(),
    RColorBrewer = citation("RColorBrewer"),
    rmarkdown = citation("rmarkdown"),
    recount = citation("recount"),
    SummarizedExperiment = citation("SummarizedExperiment"),
    voom = RefManageR::BibEntry(
        "article",
        key = "voom",
        author = "CW Law and Y Chen and W Shi and GK Smyth",
        year = "2014",
        title = "Voom: precision weights unlock linear model analysis tools for RNA-seq read counts",
        journal = "Genome Biology",
        volume = "15",
        pages = "R29"
    )
)

write.bibtex(bibs, file = "de_analysis.bib")
bib <- read.bibtex("de_analysis.bib")

## Assign short names
names(bib) <- names(bibs)
```

# Analysis

The following analysis explores a `RangedSummarizedExperiment` object from the SPEAQeasy pipeline. Note that we will use a modified version of the object, which resolved sample identity issues which were present in the raw output from SPEAQeasy. This object also includes phenotype data added after resolving identity issues. Though SPEAQeasy produces objects for several feature types (genes, exons, exon-exon junctions), we will demonstrate an example analysis for just genes. We will perform differential expression across some typical variables of interest (e.g. sex, age, race) and show how to perform principal component analysis (PCA) and visualize findings with plots.

## Load required libraries

```{r loadpackages, message = FALSE, warning = FALSE}
library("SummarizedExperiment")
library("recount")
library("edgeR")
library("limma")
library("jaffelab") # GitHub: LieberInstitute/jaffelab
library("RColorBrewer")
library("clusterProfiler")
library("org.Hs.eg.db")
library("pheatmap")
library("here")
library("devtools")
library("BiocStyle")
```

## Load data and prepare directories to place outputs in

For those who ran SPEAQeasy from the example FASTQ data set, the `RangedSummarizedExperiment` will have a different path, as specified with the `--output` flag.

```{r loaddata}
# Load the RSE gene object
load(here("rse_speaqeasy.RData"), verbose = TRUE)

# Create directories to organize outputs from this analysis
dir.create(here("DE_analysis", "pdfs"), showWarnings = FALSE)
dir.create(here("DE_analysis", "tables"), showWarnings = FALSE)
dir.create(here("DE_analysis", "rdas"), showWarnings = FALSE)

## Clean up the PrimaryDx variable
rse_gene$PrimaryDx <- factor(rse_gene$PrimaryDx)
```


## statistics PCs

Here we are using principal component analysis to control for the listed variables impact on expression. This will be later added into our linear model.

```{r stats_pcs, echo=TRUE}
col_names <- c(
    "trimmed",
    "numReads",
    "numMapped",
    "numUnmapped",
    "overallMapRate",
    "concordMapRate",
    "totalMapped",
    "mitoMapped",
    "mitoRate",
    "totalAssignedGene"
)
set.seed(20201006)
statsPca <- prcomp(as.data.frame(colData(rse_gene)[, col_names]))
rse_gene$PC <- statsPca$x[, 1]
getPcaVars(statsPca)[1]
```

## Stats vs. diagnosis and brain region

Here we explore the relationship between some quality control covariates produced by SPEAQeasy against phenotype data that we have on the samples such as brain region, primary diagnosis and race.

```{r dx_region, echo=TRUE}
## We only have one race here
table(rse_gene$Race)

# Check if certain statistics changed by primary diagnosis or brain region

# Display box plots here
create_boxplots <- function() {
    boxplot(rse_gene$rRNA_rate ~ rse_gene$BrainRegion, xlab = "")
    boxplot(rse_gene$mitoRate ~ rse_gene$BrainRegion, xlab = "")
    boxplot(rse_gene$gene_Assigned ~ rse_gene$BrainRegion, xlab = "")
    boxplot(rse_gene$mitoRate ~ rse_gene$PrimaryDx,
        las = 3,
        xlab = ""
    )
    boxplot(rse_gene$gene_Assigned ~ rse_gene$PrimaryDx,
        las = 3,
        xlab = ""
    )
    boxplot(rse_gene$mitoRate ~ rse_gene$PrimaryDx + rse_gene$BrainRegion,
        las = 3,
        xlab = ""
    )
    boxplot(rse_gene$gene_Assigned ~ rse_gene$PrimaryDx + rse_gene$BrainRegion,
        las = 3,
        xlab = ""
    )
    return(NULL)
}


# Save box plots to PDF
pdf(file = here("DE_analysis", "pdfs", "Region_Dx_cellcheck.pdf"))
create_boxplots()
dev.off()
```

## Explore and visualize gene expression

```{r gene_expression, echo=TRUE}
# Filter for expressed
rse_gene <- rse_gene[rowMeans(getRPKM(rse_gene, "Length")) > 0.2, ]

# Explore gene expression
geneExprs <- log2(getRPKM(rse_gene, "Length") + 1)
set.seed(20201006)
pca <- prcomp(t(geneExprs))
pca_vars <- getPcaVars(pca)
pca_vars_lab <- paste0(
    "PC", seq(along = pca_vars), ": ",
    pca_vars, "% Var Expl"
)

# Group together code for generating plots of interest
generate_plots <- function() {
    par(
        mar = c(8, 6, 2, 2),
        cex.axis = 1.8,
        cex.lab = 1.8
    )
    palette(brewer.pal(4, "Dark2"))

    # PC1 vs. PC2
    plot(
        pca$x,
        pch = 21,
        bg = factor(rse_gene$PrimaryDx),
        cex = 1.2,
        xlab = pca_vars_lab[1],
        ylab = pca_vars_lab[2]
    )
    legend(
        "bottomleft",
        levels(rse_gene$PrimaryDx),
        col = 1:2,
        pch = 15,
        cex = 2
    )


    # By line
    for (i in 1:10) {
        boxplot(
            pca$x[, i] ~ rse_gene$Sex,
            ylab = pca_vars_lab[i],
            las = 3,
            xlab = "Sex",
            outline = FALSE
        )
        points(
            pca$x[, i] ~ jitter(as.numeric(factor(
                rse_gene$Sex
            ))),
            pch = 21,
            bg = rse_gene$PrimaryDx,
            cex = 1.2
        )
    }

    # By brain region
    for (i in 1:10) {
        boxplot(
            pca$x[, i] ~ rse_gene$BrainRegion,
            ylab = pca_vars_lab[i],
            las = 3,
            xlab = "Brain Region",
            outline = FALSE
        )
        points(
            pca$x[, i] ~ jitter(as.numeric(factor(
                rse_gene$BrainRegion
            ))),
            pch = 21,
            bg = rse_gene$PrimaryDx,
            cex = 1.2
        )
    }
}

# Display plots
set.seed(20201006)
generate_plots()

# Write plots to PDF
pdf(here("DE_analysis", "pdfs", "PCA_plotsExprs.pdf"), width = 9)
set.seed(20201006)
generate_plots()
dev.off()
```

## Modeling

```{r modeling, echo=TRUE, warning=FALSE}
dge <- DGEList(
    counts = assays(rse_gene)$counts,
    genes = rowData(rse_gene)
)
dge <- calcNormFactors(dge)

# Mean-variance
mod <- model.matrix(~ PrimaryDx + PC + BrainRegion,
    data = colData(rse_gene)
)

vGene <- invisible(voom(dge, mod, plot = TRUE))

# Also write mean-variance plot to PDF
pdf(file = here("DE_analysis", "pdfs", "vGene.pdf"))
invisible(voom(dge, mod, plot = TRUE))
dev.off()

# Get duplicate correlation
gene_dupCorr <- duplicateCorrelation(vGene$E, mod,
    block = colData(rse_gene)$BrNum
)
gene_dupCorr$consensus.correlation

## We can save this for later since it can take a while
## to compute and we might need it. This will be useful
## for larger projects.
save(gene_dupCorr,
    file = here("DE_analysis", "rdas", "gene_dupCorr_neurons.rda")
)

# Fit linear model
fitGeneDupl <- lmFit(
    vGene,
    correlation = gene_dupCorr$consensus.correlation,
    block = colData(rse_gene)$BrNum
)

# Here we perform an empirical Bayesian calculation to obtain our significant genes
ebGeneDupl <- eBayes(fitGeneDupl)
outGeneDupl <- topTable(
    ebGeneDupl,
    coef = 2,
    p.value = 1,
    number = nrow(rse_gene),
    sort.by = "none"
)

pdf(file = here("DE_analysis", "pdfs", "hist_pval.pdf"))
hist(outGeneDupl$P.Value)
dev.off()

hist(outGeneDupl$P.Value)
table(outGeneDupl$adj.P.Val < 0.05)
table(outGeneDupl$adj.P.Val < 0.1)

sigGeneDupl <- topTable(
    ebGeneDupl,
    coef = 2,
    p.value = 0.2,
    number = nrow(rse_gene)
)

sigGeneDupl[, c("Symbol", "logFC", "P.Value", "AveExpr")]
sigGeneDupl[sigGeneDupl$logFC > 0, c("Symbol", "logFC", "P.Value")]
sigGeneDupl[sigGeneDupl$logFC < 0, c("Symbol", "logFC", "P.Value")]

write.csv(outGeneDupl,
    file = here("DE_analysis", "tables", "de_stats_allExprs.csv")
)
write.csv(sigGeneDupl,
    file = here("DE_analysis", "tables", "de_stats_fdr20_sorted.csv")
)
```

## Check plots

```{r check_plots, echo=TRUE}
exprs <- vGene$E[rownames(sigGeneDupl), ]

# Group together code for displaying boxplots
generate_plots <- function() {
    par(
        mar = c(8, 6, 4, 2),
        cex.axis = 1.8,
        cex.lab = 1.8,
        cex.main = 1.8
    )
    palette(brewer.pal(4, "Dark2"))

    for (i in 1:nrow(sigGeneDupl)) {
        yy <- exprs[i, ]
        boxplot(
            yy ~ rse_gene$PrimaryDx,
            outline = FALSE,
            ylim = range(yy),
            ylab = "Normalized log2 Exprs",
            xlab = "",
            main = paste(sigGeneDupl$Symbol[i], "-", sigGeneDupl$gencodeID[i])
        )
        points(
            yy ~ jitter(as.numeric(rse_gene$PrimaryDx)),
            pch = 21,
            bg = rse_gene$PrimaryDx,
            cex = 1.3
        )
        ll <-
            ifelse(sigGeneDupl$logFC[i] > 0, "topleft", "topright")
        legend(ll, paste0("p=", signif(sigGeneDupl$P.Value[i], 3)), cex = 1.3)
    }
}

# Show boxplots
set.seed(20201006)
generate_plots()

# Write plots to PDF
pdf(here("DE_analysis", "pdfs", "DE_boxplots_byDiagnosis.pdf"),
    width = 10
)
set.seed(20201006)
generate_plots()
dev.off()


e <- geneExprs[rownames(sigGeneDupl), ]

generate_plots <- function() {
    par(
        mar = c(8, 6, 4, 2),
        cex.axis = 1.8,
        cex.lab = 1.8,
        cex.main = 1.8
    )
    palette(brewer.pal(4, "Dark2"))
    for (i in 1:nrow(sigGeneDupl)) {
        yy <- e[i, ]
        boxplot(
            yy ~ rse_gene$PrimaryDx,
            las = 3,
            outline = FALSE,
            ylim = range(yy),
            ylab = "log2(RPKM+1)",
            xlab = "",
            main = paste(sigGeneDupl$Symbol[i], "-", sigGeneDupl$gencodeID[i])
        )
        points(
            yy ~ jitter(as.numeric(rse_gene$PrimaryDx)),
            pch = 21,
            bg = rse_gene$PrimaryDx,
            cex = 1.3
        )
        ll <-
            ifelse(sigGeneDupl$logFC[i] > 0, "topleft", "topright")
        legend(ll, paste0("p=", signif(sigGeneDupl$P.Value[i], 3)), cex = 1.3)
    }
}

# Display plots
set.seed(20201006)
generate_plots()

# Write the same plots to PDF
pdf(here("DE_analysis", "pdfs", "DE_boxplots_byGenome_log2RPKM.pdf"),
    w = 10
)
set.seed(20201006)
generate_plots()
dev.off()
```

## Gene ontology

`clusterProfiler` is a gene ontology package we will use to see if our genes are specifically differentially expressed in certain pathways.

```{r gene_ontology, echo=TRUE}
# Get significant genes by sign
sigGene <- outGeneDupl[outGeneDupl$P.Value < 0.005, ]
sigGeneList <-
    split(as.character(sigGene$EntrezID), sign(sigGene$logFC))
sigGeneList <- lapply(sigGeneList, function(x) {
      x[!is.na(x)]
  })
geneUniverse <- as.character(outGeneDupl$EntrezID)
geneUniverse <- geneUniverse[!is.na(geneUniverse)]

# Do GO
goBP_Adj <- compareCluster(
    sigGeneList,
    fun = "enrichGO",
    universe = geneUniverse,
    OrgDb = org.Hs.eg.db,
    ont = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff = 1,
    qvalueCutoff = 1,
    readable = TRUE
)

goMF_Adj <- compareCluster(
    sigGeneList,
    fun = "enrichGO",
    universe = geneUniverse,
    OrgDb = org.Hs.eg.db,
    ont = "MF",
    pAdjustMethod = "BH",
    pvalueCutoff = 1,
    qvalueCutoff = 1,
    readable = TRUE
)

goCC_Adj <- compareCluster(
    sigGeneList,
    fun = "enrichGO",
    universe = geneUniverse,
    OrgDb = org.Hs.eg.db,
    ont = "CC",
    pAdjustMethod = "BH",
    pvalueCutoff = 1,
    qvalueCutoff = 1,
    readable = TRUE
)

save(
    goBP_Adj,
    goCC_Adj,
    goMF_Adj,
    file = here("DE_analysis", "rdas", "gene_set_objects_p005.rda")
)

goList <-
    list(
        BP = goBP_Adj,
        MF = goMF_Adj,
        CC = goCC_Adj
    )
goDf <-
    dplyr::bind_rows(lapply(goList, as.data.frame), .id = "Ontology")
goDf <- goDf[order(goDf$pvalue), ]

write.csv(
    goDf,
    file = here("DE_analysis", "tables", "geneSet_output.csv"),
    row.names = FALSE
)

options(width = 130)
goDf[goDf$p.adjust < 0.05, c(1:5, 7)]
```

## Visualize differentially expressed genes

Here we visualize DEGs with a heatmap.

```{r heatmap, echo=TRUE}
exprs_heatmap <- vGene$E[rownames(sigGene), ]

df <- as.data.frame(colData(rse_gene)[, c("PrimaryDx", "Sex", "BrainRegion")])
rownames(df) <- colnames(exprs_heatmap) <- gsub("_.*", "", colnames(exprs_heatmap))
colnames(df) <- c("Diagnosis", "Sex", "Region")

#  Manually determine coloring for plot annotation
palette_names = c('Dark2', 'Paired', 'YlOrRd')
ann_colors = list()
for (i in 1:ncol(df)) {
    col_name = colnames(df)[i]
    n_uniq_colors = length(unique(df[,col_name]))
    
    #   Use a unique palette with the correct number of levels, named with
    #   those levels
    ann_colors[[col_name]] = RColorBrewer::brewer.pal(n_uniq_colors, palette_names[i])[1:n_uniq_colors]
    names(ann_colors[[col_name]]) = unique(df[,col_name])
}

# Display heatmap
pheatmap(
    exprs_heatmap,
    cluster_rows = TRUE,
    show_rownames = FALSE,
    cluster_cols = TRUE,
    annotation_col = df,
    annotation_colors = ann_colors
)

```

```{r pdf_heatmap, eval=FALSE, echo=FALSE}
## For some reason, I need to run this manually to save the PDF

# Write heatmap to PDF
pdf(file = here("DE_analysis", "pdfs", "de_heatmap.pdf"))
pheatmap(
    exprs_heatmap,
    cluster_rows = TRUE,
    show_rownames = FALSE,
    cluster_cols = TRUE,
    annotation_col = df,
    annotation_colors = ann_colors
)
dev.off()
```

# Reproducibility

This analysis report was made possible thanks to:

* R `r citep(bib[['R']])`
* `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])`
* `r Biocpkg('clusterProfiler')` `r citep(bib[['clusterProfiler']])`
* `r CRANpkg('devtools')` `r citep(bib[['devtools']])`
* `r Biocpkg('edgeR')` `r citep(bib[['edgeR']])`
* `r CRANpkg('here')` `r citep(bib[['here']])`
* `r Githubpkg('LieberInstitute/jaffelab')` `r citep(bib[['jaffelab']])`
* `r Biocpkg('limma')` `r citep(bib[['limma']])`
* `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`
* `r Biocpkg('org.Hs.eg.db')` `r citep(bib[['org.Hs.eg.db']])`
* `r CRANpkg('pheatmap')` `r citep(bib[['pheatmap']])`
* `r CRANpkg('RColorBrewer')` `r citep(bib[['RColorBrewer']])`
* `r Biocpkg('recount')` `r citep(bib[['recount']])`
* `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])`
* `r Biocpkg('SummarizedExperiment')` `r citep(bib[['SummarizedExperiment']])`
* `r Biocpkg('voom')` `r citep(bib[['voom']])`

[Bibliography file](de_analysis.bib)

```{r bibliography, results='asis', echo=FALSE, warning = FALSE}
# Print bibliography
bibliography()
```

```{r reproducibility}
# Time spent creating this report:
diff(c(timestart, Sys.time()))

# Date this report was generated
message(Sys.time())

# Reproducibility info
options(width = 120)
devtools::session_info()
```