correlation_workflow.Rmd

---
title: "Correlation workflow"
author: "Noor Sohail"
output:
   html_document:
      code_folding: hide
      df_print: paged
      highlights: pygments
      number_sections: true
      self_contained: true
      theme: default
      toc: true
      toc_float:
         collapsed: true
         smooth_scroll: true
---

```{r setup, include=FALSE} 
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# Turn off Warnings and other console output messages from the whole document
```

```{r}
library(reticulate)
use_virtualenv("magic2")
```

```{r, cache=FALSE, message=FALSE}
library(tidyverse)
library(knitr)
library(glue)
library(Seurat)
library(pheatmap)
library(devtools)
library(gridExtra)
library(RColorBrewer)

library(CSCORE)
library(SAVER)

ggplot2::theme_set(theme_light(base_size = 11))
opts_chunk[["set"]](
    cache = FALSE,
    dev = c("png", "pdf"),
    error = TRUE,
    highlight = TRUE,
    message = FALSE,
    prompt = FALSE,
    tidy = FALSE,
    warning = FALSE)
```

# User define values

```{r}
path_seurat <- "/path/to/seurat.RDS"
path_outs <- "../output/"

## add column names for the sample id and the celltype column. ct is the cluster/celltype that you want to focus on. Replace with your information that match seurat object.
col_sample <- "orig.ident"
col_celltype <- "SCT_snn_res.0.25"
ct <- "2"

# Filtration parameters
filter <- TRUE
min_exp <- 0.2
min_cells <- 40
min_perc <- 0.2
```

- Path seurat: `r path_seurat`
- Metadata column with celltypes: `r col_celltype`
- Celltype to subset to: `r ct`
- Filter genes based on expression and frequency: `r filter`

List of all genes:
```{r}
# Fill in the list of genes you are interested in calculating correlations for
corr_genes_all <- c("Col1a1","Col1a2","Dcn","Ly6c1","Ly6a", "Ebf2","Il33","Prx")
as.data.frame(corr_genes_all)
```

Starting off with `r length(corr_genes_all)` genes of interest.

# Load seurat object

```{r}
seurat <- readRDS(path_seurat)

# Saving celltype information as column named celltype
# In order to use subset function later
seurat$celltype <- seurat@meta.data[col_celltype]
```


```{r}
Idents(seurat) <- col_celltype
DimPlot(seurat) + ggtitle("Celltypes")
```

```{r}
seurat@meta.data %>%
        ggplot() +
        geom_bar(aes(
            x = get(col_celltype),
            fill = get(col_celltype)),
            stat = "count", color = "black") +
        theme_classic() +
        NoLegend() +
        xlab("Celltype") +
        ylab("Number of Cells") +
        ggtitle("Celltypes") +
        theme(plot.title = element_text(hjust = 0.5)) +
        geom_text(aes(x = get(col_celltype), label = after_stat(count)), stat = "count", vjust = -0.5) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

# Subset to `r ct` cells

```{r}
seurat2 <- subset(seurat, celltype == ct)

## Set default assay to RNA, downstream code fails if it is on SCT. Remove SCT as well
DefaultAssay(object = seurat_imputed) <- "MAGIC"
seurat2[['SCT']] <- NULL # to make sure it doesn't fail downstream due to new version of Seurat

# Removing genes that have 0 counts across all cells of the celltype
raw_rna <- GetAssayData(object =  seurat2[['RNA']], layer = 'counts')
genes.use <- rowSums(raw_rna) > 0
genes.use <- names(genes.use[genes.use])
seurat2 <- seurat2[genes.use, ]
n_cells <- ncol(seurat2)
```

Working with `r n_cells` cells.

```{r}
seurat2@meta.data %>%
        ggplot() +
        geom_bar(aes(
            x = get(col_sample),
            fill = get(col_sample)),
            stat = "count", color = "black") +
        theme_classic() +
        NoLegend() +
        ggtitle(glue("{ct} cells: Sample distribution")) +
        ylab("Number of Cells") +
        xlab("Sample") +
        theme(plot.title = element_text(hjust = 0.5)) +
        geom_text(aes(x = get(col_sample), label = after_stat(count)), stat = "count", vjust = -0.5) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

# Subset genes of interest

From the genes that were supplied, removing any that are not expressed in this dataset.

```{r}
corr_genes <- corr_genes_all[corr_genes_all %in% genes.use]
as.data.frame(corr_genes)
```

Next we look at the basic distribution of remaining genes of interest in terms of expression and number of cells they are expressed in.

These are the following filtration metrics that are set:

- Average expression < `r min_exp`
- Number of cells expressed in > `r min_cells`
- Percentage of cells expressed in > `r min_perc`

Filtration parameter was set to `r filter`. If FALSE, no further filtration will be done at this step.

```{r}
data_rna <- FetchData(seurat2[["RNA"]], vars = corr_genes)

# Number of cells a gene is expressed in
num_cells <- colSums(data_rna > 0)
# Percentage of cells a gene is expressed in
perc_cells <- num_cells / ncol(seurat2)
# Average expression of a gene
avg_expression <- colMeans(data_rna)

df_genes <- data.frame(num_cells, perc_cells, avg_expression)
df_genes <- df_genes %>% mutate(filter = !((perc_cells > min_perc) & (num_cells > min_cells) & (avg_expression > min_exp)))
df_genes$gene <- row.names(df_genes)
df_genes <- df_genes %>% arrange(desc(avg_expression), desc(perc_cells))

df_genes
```

```{r}
df_genes %>% ggplot() +
    geom_point(aes(x = perc_cells, y = avg_expression, color = filter)) +
    theme_classic()
```

```{r}
if (filter == TRUE) {
    corr_genes <- (df_genes %>% subset(filter == FALSE))$gene
}
```

`r length(corr_genes)` genes of interest remaining.

# Imputation and normalization

We compare three alternative methods of estimating expression levels to log normalization and assess their ability to account for dropout.

1. SCTransform (raw counts -> normalized counts)
2. MAGIC (raw counts -> imputed, normalized counts)
3. SAVER (raw counts -> imputed, normalized counts)

```{r}
# Store output so we don't have to re-run imputation each time
filename <- glue("{path_outs}/imputed_{ct}.RDS")

if (!file.exists(filename)) {
    # Get raw counts
    raw_rna <- LayerData(seurat2, assay = "RNA", layer = "counts")

    # SCT
    # Re-run SCT on subset data
    seurat <- SCTransform(seurat2, return.only.var.genes = FALSE, min_cells = 1)

    # Creating new seurat object for genes of interest only
    data_raw <-  FetchData(seurat2, assay="RNA", layer="counts", vars=corr_genes)
    data_rna <-  FetchData(seurat2, assay="RNA", layer="data", vars=corr_genes)
    data_sct <-  FetchData(seurat2, assay="SCT", layer="data", vars=corr_genes)

    seurat_imputed <- CreateSeuratObject(counts=t(data_raw), data=t(data_rna), meta.data=seurat@meta.data)
    seurat_imputed[["SCT"]] <- CreateAssayObject(data=t(data_sct))
    seurat_imputed[["RAW"]] <- CreateAssayObject(counts=raw_rna)

    # Delete the original seurat object to save memory
    rm(seurat)

    # MAGIC
    
    # Load conda environment
    # myenvs <- reticulate::conda_list()
    reticulate::use_virtualenv("magic2",required = T)
    library(Rmagic)
    data_magic <- magic(t(raw_rna), genes = corr_genes)$result
    seurat_imputed[["MAGIC"]] <- CreateAssayObject(data=t(data_magic))

    # SAVER
    # Generate SAVER predictions for those genes
    genes.ind <- which(rownames(raw_rna) %in% corr_genes)
    data_saver <- saver(raw_rna, pred.genes = genes.ind, pred.genes.only = TRUE, estimates.only = TRUE, ncores = 8)

    seurat_imputed[["SAVER"]] <- CreateAssayObject(data=data_saver)

    saveRDS(seurat_imputed, filename)

}

seurat_imputed <- readRDS(filename)
```

## Average expression for each method

```{r}
assays <- c("RNA", "SCT", "MAGIC", "SAVER")

df_avg <- data.frame(gene = corr_genes)
for (assay in assays) {
    data <- GetAssayData(object = seurat_imputed[[assay]], layer = 'data')
    avg <- data.frame(rowMeans(data))
    colnames(avg) <- assay
    avg$gene <- row.names(avg)

   df_avg <-  left_join(df_avg, avg, by = "gene")
}

pheatmap(df_avg %>% column_to_rownames(var = "gene"), scale = "column",
            cluster_col = TRUE, cluster_row = TRUE, show_rownames = TRUE)
```

# Correlation Estimates

We have a few different ways to compute correlation scores with their associated p-values:

1. Spearman correlation 
  - SCTransform counts -> spearman correlation matrix
  - MAGIC imputed -> spearman correlation matrix
  - SAVER imputed -> spearman correlation matrix
2. CS-CORE 
    - Raw RNA counts -> co-expression matrix

```{r}
# Store output so we don't have to re-run correlation each time
filename <- glue("{path_outs}/corr_{ct}.csv")

if (!file.exists(filename)) {

    # Compute spearman correlation for each method (except CS-CORE which is run later)
    # Unique combination of each gene pair
    genes_comb <- data.frame(t(combn(corr_genes, 2)))
    n_comb <- nrow(genes_comb)

    # Create dataframe with correlation and p-values scores
    df_corr <- genes_comb %>% rename("Var1" = X1, "Var2" = X2)
    df_corr[assays] <- NA
    df_p_val <- df_corr

    for (idx in 1:n_comb) {

        if (idx %% 200 == 0) {
            print(glue("{idx}/{n_comb} correlations computed."))
        }

        # Name of genes to run correlation on
        gene_1 <- genes_comb[idx, 1]
        gene_2 <- genes_comb[idx, 2]

        for (assay_ in assays) {
            gene_exp <- t(seurat_imputed[[assay_]]$data[c(gene_1, gene_2), ]) %>% as.data.frame()

            if (all(gene_exp[[gene_1]] == 0) | all(gene_exp[[gene_2]] == 0)) {
                # If a gene has no expression, set correlation = 0 and p-value = 1
                corr_val <- 0.0
                p_val <- 1.0
            } else {
                # Calculate spearman correlation and p-value otherwise
                tmp <- cor.test(gene_exp[[gene_1]], gene_exp[[gene_2]], method = "spearman", exact = FALSE)
                corr_val <- as.numeric(unname(tmp$estimate))
                p_val <- as.numeric(tmp$p.value)
            }

            # Store correlation and p-values
            df_corr[idx, assay_] <- corr_val
            df_p_val[idx, assay_] <- p_val
        }
    }

    # Run CS-CORE
    DefaultAssay(seurat_imputed) <- "RAW"
    CSCORE_result <- CSCORE(seurat_imputed, genes=corr_genes)

    # Store CS-CORE results
    tmp <- reshape2::melt(as.matrix(CSCORE_result$est)) %>% rename(CSCORE = value)
    df_corr <- left_join(df_corr, tmp)
    tmp <- reshape2::melt(as.matrix(CSCORE_result$p_value)) %>% rename(CSCORE = value)
    df_p_val <- left_join(df_p_val, tmp)

    # Save output
    write.csv(df_corr, filename)
    write.csv(df_p_val, glue("{path_outs}/p_corr_{ct}.csv"))
}

df_corr <- read.csv(filename, row.names=1)
df_p_val <- read.csv(glue("{path_outs}/p_corr_{ct}.csv"), row.names=1)
```

## Heatmap of correlation estimates

Showing the patterns of correlation for each method. The x-axis and y-axis are the genes of interest with the corresponding correlation value for the pair as the value. Keep in mind that this is symmetric matrix.

```{r results = "asis"}
methods <- c("RNA", "SCT", "MAGIC", "SAVER", "CSCORE")

for (method in methods) {
    corr <- df_corr[c("Var1", "Var2", method)]
    corr_cp <- corr %>% rename(Var1 = Var2, Var2 = Var1)
    corr <- rbind(corr, corr_cp)
    mtx <- reshape2::dcast(corr, Var2 ~ Var1) %>% column_to_rownames("Var2")

    # Set the diagonal values: Correlation = 1, p-value = 1
    mtx <- as.matrix(mtx)
    diag(mtx) <- 1

    breaks <-  seq(-1, 1, by = 0.1)
    color <- colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaks))


    p <- pheatmap(mtx, show_rownames = FALSE, show_colnames = FALSE,
                color = color, breaks = breaks, silent = FALSE, main = method)
    knit_print(p)
}
```

# Compare correlation estimates across methods

Comparing the correlation scores for each gene pair for MAGIC, SAVER, and CS-CORE.

In these scatterplots, the gene-pairs that are colored red have different results for significance.

```{r fig.width=15}
methods <- c("MAGIC", "SAVER", "CSCORE")
methods_comb <- data.frame(t(combn(methods, 2)))
plot_list <- list()

for (idx in 1:nrow(methods_comb)) {
    method_1 <- methods_comb[idx, 2]
    method_2 <- methods_comb[idx, 1]

    corr <- df_corr[c("Var1", "Var2", method_1, method_2)]
    p_val <- df_p_val[,c("Var1", "Var2", method_1, method_2)]
    corr$sig_1 <- p_val[[method_1]]
    corr$sig_2 <- p_val[[method_2]]

    corr <- corr %>% mutate(sig = (sig_1 < 0.5) & (sig_2 < 0.05))

    p <- ggplot(corr) +
            geom_point(aes(x = get(method_1), y = get(method_2), color = sig)) +
            theme_classic() +
            NoLegend() +
            scale_color_manual(values = c("FALSE" = "red", "TRUE" = "black")) +
            labs(x = method_1, y = method_2, title = paste(method_1, "vs", method_2)) +
            theme(plot.title = element_text(size=rel(2))) +
            ylim(-1, 1) + xlim(-1, 1) +
            geom_abline(slope = 1, intercept = 0, color = "blue")

    plot_list[[idx]] <- ggplotGrob(p)

}

grid.arrange(grobs = plot_list, ncol = 3)

```