batch effects and PCs blog posts #40

Open · wants to merge 14 commits into base: master
195 changes: 195 additions & 0 deletions blog/content/2023-11-19-PCASingleCell/PCs_SingleCell.Rmd
@@ -0,0 +1,195 @@
---
title: "PCs_SingleCell"
output: html_document
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Interpreting PCs in relation to clusters in single cell analysis

Single-cell RNA sequencing (scRNA-seq) analysis is used to identify both known and novel cell types, classify tumor types, investigate heterogeneity across cells, and predict and describe cell fate. In this context, unsupervised learning methods such as single-cell clustering are an important part of these applications, since clustering is a key component of cell identification and of characterizing gene expression patterns. The upstream data processing includes quality control (QC), normalization, and dimension reduction. There are several dimension reduction methods, but here we focus on principal component analysis (PCA) because of its simplicity and efficiency. The dimension reduction step matters because it reduces the computational work in the subsequent clustering steps, reduces noise, and enables more efficient data plotting.

Clustering is a tool to explore data, and its main objective is to summarize complex scRNA-seq data so that it is easier for humans to understand. This is achieved by computing Euclidean distances between cells across genes in order to identify cells with similar transcriptomic profiles, which allows us to describe population heterogeneity. Regardless of the method used, clustering is a critical step for extracting biological insights from scRNA-seq data. In R, the Seurat package is used for QC and exploration of single-cell RNA-seq data. Here, we will use Seurat as well as the Tidyverse packages for the analysis.

```{r results = FALSE}
library(Seurat)
library(tidyverse)
library(GEOquery)
```
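
Before loading the data, here is a toy illustration (made-up counts, not part of the real dataset) of the distance computation clustering relies on: Euclidean distances between cells, computed across genes.

```{r}
# Hypothetical counts: rows are cells, columns are genes
toy <- matrix(c(2, 0, 9,
                3, 1, 8,
                7, 6, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("cellA", "cellB", "cellC"), c("gene1", "gene2", "gene3")))
dist(toy)  # cellA and cellB have similar profiles, cellC does not
```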

We are going to read the data; in this case I am using the GSE132771 dataset from GEO, selecting only the normal human lung tissue by downloading the barcodes, genes, and matrix files for NLM1, NLM2, and NLM3. The easiest way to do this is to visit `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132771` and select them under `Supplementary Files` -> `Download` -> `Custom`. Then you can use the `Read10X` function to load them into R.
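
Alternatively, since GEOquery is already loaded, the supplementary archive can be fetched programmatically. This is an untested sketch based on GEO conventions: the per-sample file names inside the archive are assumptions, and the barcodes/features/matrix files still have to be sorted into the `NML1`-`NML3` directories by hand before `Read10X` can find them.

```{r eval=FALSE}
# Download all supplementary files for the series into ./GSE132771/
supp <- getGEOSuppFiles("GSE132771")
# Unpack the RAW archive (rownames of the returned data frame are the file paths)
tar.file <- rownames(supp)[grepl("\\.tar$", rownames(supp))]
untar(tar.file, exdir = "./GSE132771_RAW")
list.files("./GSE132771_RAW")  # inspect, then move the NLM sample files into ./NML1, ./NML2, ./NML3
```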

```{r}
# gse <- getGEO("GSE132771", GSEMatrix = TRUE)

#Create directories
dir.create(file.path("./NML1"))
dir.create(file.path("./NML2"))
dir.create(file.path("./NML3"))

#Read data
NMLI <- Read10X(data.dir= "./NML1")
NMLII <- Read10X(data.dir= "./NML2")
NMLIII <- Read10X(data.dir= "./NML3")

```

`Read10X` returns sparse matrices (dgCMatrix format), so we need to create Seurat objects from them with the `CreateSeuratObject` function. We will keep all features that are expressed in at least 3 cells, and all cells that have at least 200 features (genes).

```{r}
NMLI <- CreateSeuratObject(counts= NMLI, project= "NMLI", min.cells=3, min.features=200)
NMLII <- CreateSeuratObject(counts= NMLII, project= "NMLII", min.cells=3, min.features=200)
NMLIII <- CreateSeuratObject(counts= NMLIII, project= "NMLIII", min.cells=3, min.features=200)
```

Next we will merge the Seurat objects with the `merge` function.

```{r}
MergedNML <- merge(NMLI, y= c(NMLII, NMLIII),
add.cell.ids= c("NMLI", "NMLII", "NMLIII"), # explicit prefixes are safer than ls()[1:3]
project= 'MergedNML')
MergedNML
```

## Quality control metrics
In this step we will filter out low-quality cells. It is important to look at the number of genes detected in each cell (nFeature_RNA) and the total number of molecules (nCount_RNA). These parameters give us an idea of the quality of a cell, because a poor-quality cell will have a low number of genes or molecules detected. An extremely high number of genes or molecules can also occur, due to doublets or multiple cells being sequenced together. The percentage of mitochondrial genes is also important, because dying or low-quality cells show higher mitochondrial contamination.

```{r}
View(MergedNML@meta.data)
```

Let's calculate the percentage of mitochondrial genes with the `PercentageFeatureSet` function; for this function we need to provide a pattern. We are going to compute the percentage over all genes whose names start with "MT-".
```{r}
MergedNML <- PercentageFeatureSet(MergedNML, pattern= "^MT-", col.name = "percent.mt")
View(MergedNML@meta.data)
```

We can visualize these QC metrics as violin plots; in `features` we need to include all the columns that we want to visualize.
```{r}
VlnPlot(MergedNML, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3)
```

In the *percent.mt* panel we can see that the dispersed cells with a high mitochondrial percentage are the ones that need to be filtered out.

We can also plot two metrics against each other with the `FeatureScatter` function.
```{r}
FeatureScatter(MergedNML, feature2= "nFeature_RNA", feature1 = "nCount_RNA")+
geom_smooth(method = 'lm')
```

Here we are plotting the **number of genes** on the Y axis and the **number of molecules** on the X axis. In a good-quality dataset the points should follow the fitted line; most of the data do, but there are some sparse points that need to be filtered.

## Filtering data
Now we need to filter the data. In this case we will keep cells with more than 400 detected genes, fewer than 2,000 total molecules, and less than 5% mitochondrial reads. A quick tally of how many cells these thresholds would remove is sketched below; after applying the filter, the regenerated violin plots show the difference with the low-quality cells removed.
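
Before applying the filter, this optional pre-check counts how many cells the thresholds would remove (a minimal sketch; the thresholds match the `subset` call in the next chunk).

```{r}
# Cells failing at least one of the QC thresholds used below
sum(MergedNML$nFeature_RNA <= 400 |
    MergedNML$nCount_RNA >= 2000 |
    MergedNML$percent.mt >= 5)
```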

```{r}
MergedNML <- subset(MergedNML, subset= nFeature_RNA>400 & nCount_RNA <2000 & percent.mt <5)
VlnPlot(MergedNML, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3)
MergedNML
```

## Data normalization
Now we need to normalize the data in order to compare gene expression across different cells. The `NormalizeData` function does this; here *normalization.method = "LogNormalize"* and *scale.factor = 10000* are the default values.
```{r}
MergedNML <- NormalizeData(MergedNML,
normalization.method = "LogNormalize", scale.factor = 10000)
```
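
For intuition, here is a minimal sketch of the arithmetic behind LogNormalize on a toy count matrix (the real function operates on the sparse counts stored in the Seurat object): divide each cell's counts by that cell's total, multiply by the scale factor, then take the natural log of 1 plus the result.

```{r}
# Toy counts: rows are genes, columns are cells
toy.counts <- matrix(c(4, 0, 10,
                       1, 2,  3),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("gene1", "gene2"), c("cell1", "cell2", "cell3")))
log1p(sweep(toy.counts, 2, colSums(toy.counts), "/") * 10000)
```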

Next, we are going to select the features that show high cell-to-cell variation, so we will perform feature selection with the `FindVariableFeatures` function. *selection.method = "vst"* and *nfeatures = 2000* are the default values.
```{r}
MergedNML <- FindVariableFeatures(MergedNML,
selection.method = "vst", nfeatures = 2000)
```

We can visualize these top features with the `VariableFeaturePlot` function.
```{r}
VariableFeaturePlot(MergedNML)
```
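
If you also want the names of the most variable genes on the plot, Seurat's `LabelPoints` can annotate them; a quick sketch labeling the top 10:

```{r}
top10 <- head(VariableFeatures(MergedNML), 10)
LabelPoints(plot = VariableFeaturePlot(MergedNML), points = top10, repel = TRUE)
```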

The variable features are shown in red; the points higher up in the plot are the top variable features, and the points in black are the non-variable features.

## Data scaling
This step shifts and scales the expression of each gene so that its mean across cells is 0 and its variance is 1, which gives all genes equal weight in downstream analyses and keeps highly expressed genes from dominating. We can do this with the `ScaleData` function. We are going to use all genes as features; their names are extracted from the Seurat object.
```{r}
all.genes <- rownames(MergedNML)
MergedNML <- ScaleData(MergedNML, features= all.genes)
```
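
As a toy illustration (made-up values) of what this scaling does in spirit, each gene ends up with mean 0 and unit variance across cells:

```{r}
toy.expr <- matrix(c(1, 5, 3, 7,
                     0, 0, 2, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))
t(scale(t(toy.expr)))  # rows (genes) are now centered and scaled
```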

## Perform linear dimensionality reduction (PCA) on the scaled data
It is important to perform PCA on scRNA-seq data because the true dimensionality of the data is much lower than the number of genes. Many genes are co-expressed or highly correlated; for example, if the data contained only 4 homogeneous cell types, they could in principle be separated in about 3 dimensions. PCA also removes technical noise from scRNA-seq data. We will run PCA with the `RunPCA` function; we just need to provide the Seurat object.

```{r}
MergedNML <- RunPCA(MergedNML)
```

## Examine and visualize PCA results
Here we will look at the top 5 principal components; we can change which PCs are shown with *dims* and how many genes are listed per PC with *nfeatures*. For each PC we see the features with the most positive and most negative loadings.
```{r}
print(MergedNML[["pca"]], dims= 1:5, nfeatures = 5)
```
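
These loadings can also be plotted. A quick sketch with Seurat's `VizDimLoadings`, showing the genes with the largest positive and negative loadings on the first two PCs:

```{r}
VizDimLoadings(MergedNML, dims = 1:2, reduction = "pca")
```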

Another way to visualize the PCs is with a heatmap. We can do this with the `DimHeatmap` function, changing the number of PCs via *dims*. Here we are plotting 500 cells.
```{r}
DimHeatmap(MergedNML, dims = 1, cells = 500, balanced = TRUE)
```
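
To scan several components at once, a range can be passed to *dims*; for example, the first six PCs:

```{r}
DimHeatmap(MergedNML, dims = 1:6, cells = 500, balanced = TRUE)
```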

The heatmaps are colored by the PC scores, and they highlight the features that exhibit heterogeneity along each component.

## Determine the dimensionality of the data
Now we are going to choose only the statistically significant PCs that capture the majority of the signal for downstream analysis. The elbow plot shows an elbow after which additional PCs contribute little extra variance. It is better to err on the side of too many PCs than too few, since using too few can affect the downstream analysis.

```{r}
ElbowPlot(MergedNML) # only shows the first 20 PCs by default
ElbowPlot(MergedNML, ndims = 50, reduction = "pca")
```
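
As a rough quantitative complement to the elbow plot, we can look at the variance explained by each PC. This is a heuristic sketch, with the assumption that we treat the variance captured by the computed PCs as the denominator rather than the total variance of the data.

```{r}
pc.stdev <- Stdev(MergedNML, reduction = "pca")
pct.var  <- pc.stdev^2 / sum(pc.stdev^2) * 100
cum.var  <- cumsum(pct.var)
# First PC past 90% cumulative variance that itself contributes less than 5%
which(cum.var > 90 & pct.var < 5)[1]
```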

Here we can see that the elbow forms at around 10 PCs. Since it is better to use more PCs rather than fewer, we are going to use the first 20 PCs.

## Clustering
We want to cluster cells with similar feature expression patterns. For that we use the `FindNeighbors` function with the first 20 PCs.
```{r}
MergedNML <- FindNeighbors(MergedNML, dims= 1:20)
```
Now we want to assign the cells to clusters, so we use the `FindClusters` function. Here, *resolution* controls the granularity of the clustering: the lower the number, the fewer the clusters.
```{r}
MergedNML <- FindClusters(MergedNML, resolution= c(0.1, 0.3))
```
Next we are going to visualize how many clusters we have at each resolution so that we can choose the resolution that works best. We do that with `DimPlot`, grouping the cells by resolution, first at resolution 0.1 and then at resolution 0.3.
```{r}
DimPlot(MergedNML, group.by = "RNA_snn_res.0.1", label = TRUE) # resolution 0.1
DimPlot(MergedNML, group.by = "RNA_snn_res.0.3", label = TRUE) # resolution 0.3
```
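
To put numbers on this, the cluster assignments for each resolution are stored as factor columns in the object's metadata, so we can simply count their levels:

```{r}
sapply(c("RNA_snn_res.0.1", "RNA_snn_res.0.3"),
       function(res) nlevels(MergedNML@meta.data[[res]]))
```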

With resolution 0.1 we get 9 clusters, which are generally well separated even though a few of them appear to overlap. With resolution 0.3 we get 13 clusters and there are more regions where they overlap, so in this case we will choose resolution 0.1.

## Setting identity of clusters
We are going to set the identity of the cells to the 9 clusters.
```{r}
Idents(MergedNML) <- "RNA_snn_res.0.1"
```

## Non-linear dimensionality reduction
We can also project our data into a lower-dimensional space to explore it further; for that we can run a UMAP.
```{r}
MergedNML <- RunUMAP(MergedNML, dims= 1:20)
DimPlot(MergedNML, reduction = "umap", label = TRUE, repel = TRUE)
```

In this UMAP we can see 9 clusters because we chose resolution 0.1.

## References
- Seth S, Mallik S, Bhadra T, Zhao Z. Dimensionality Reduction and Louvain Agglomerative Hierarchical Clustering for Cluster-Specified Frequent Biomarker Discovery in Single-Cell Sequencing Data. Front Genet. 2022 Feb 7;13:828479. doi: 10.3389/fgene.2022.828479. PMID: 35198011; PMCID: PMC8859265.
- Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA. 2023 May;29(5):517-530. doi: 10.1261/rna.078965.121. Epub 2023 Feb 3. PMID: 36737104; PMCID: PMC10158997.
- Amezquita R, Lun A, Hicks S, Gottardo S, 2021. Bioconductor, Source: <https://github.com/OSCA-source/OSCA.basic>
100 changes: 100 additions & 0 deletions blog/content/2023-11-19_BatchEffects/BatchEffects.Rmd
@@ -0,0 +1,100 @@
---
title: "BatchEffects"
output: html_document
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Batch Effect Visualization

One important step in single-cell RNA-seq data analysis is batch correction. It matters because batch effects occur when variation in the data is caused by technical factors instead of biological ones. Technical factors can result from different sequencing platforms, capture times, reagents, laboratories, and, in general, from experimental design and handling. They occur in batches because they affect a group of samples that were processed differently from other samples or data. Moreover, batch effects can be highly nonlinear, which makes it difficult to align the datasets while keeping biological variation. Left uncorrected, batch effects can lead to erroneous conclusions. But how do we detect them?

There are various methods to detect them; two simple ones are inspecting the PCA and the clusters. Here we will detect batch effects with clusters. To do so, we will run the clustering analysis and compare UMAP plots before and after batch correction. If there is a batch effect in the data, the first plot will show cells separating by batch rather than mixing, with cells from the same batch clustering together. After batch correction, that batch-driven fragmentation should disappear.

Various algorithms are used to perform batch correction; three of the most common are mutual nearest neighbors (MNN), canonical correlation analysis (CCA), and Harmony. Here we will perform CCA with Seurat, which reduces the dimensionality of the data and captures the most correlated features to align the data batches. We will continue with the data from the `PCs_SingleCell` tutorial; if you haven't run that analysis yet, head over to that blog post first, since here we perform batch detection on the results obtained from it.


To start, let's plot the data again with the `DimPlot` function, grouping the cells by their original identity to distinguish the cells from each sample.

```{r library calls, results = FALSE}
library(Seurat)
library(tidyverse)
```

```{r}
DimPlot(MergedNML, reduction = 'umap', group.by = 'orig.ident')
```

In the plot we can see that the cells from the three samples were not well mixed by the analysis performed previously. Therefore, we need to perform an integrated analysis to correct the batch effect.

To perform data integration we need to create a list of objects, one per sample. We do that with the `SplitObject` function, providing the Seurat object and splitting by original identity, since that is where we detected the batch effect.
```{r}
MergedNML.list <- SplitObject(MergedNML, split.by = 'orig.ident')
```

Next, we need to normalize each of the objects in the list, so we will do that with the `lapply` function, which returns a list of the same length as *X*. Here we define X as the object list we created previously, *MergedNML.list*. Then we find the variable features for each dataset with the `FindVariableFeatures` function; *selection.method = "vst"* and *nfeatures = 2000* are the default values.
```{r}
MergedNML.list <- lapply(X= MergedNML.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method= "vst", nfeatures=2000)
})
```

Now we will select the integration features, i.e. the features that are repeatedly variable across datasets. For this we use the `SelectIntegrationFeatures` function and provide our object list.
```{r}
features <- SelectIntegrationFeatures(object.list = MergedNML.list)
```

Next, we will find the integration anchors with the canonical correlation analysis (CCA) method. For that we use the `FindIntegrationAnchors` function, providing our object list and the features we selected previously.
```{r}
MergedNML.anchors <- FindIntegrationAnchors(object.list = MergedNML.list, anchor.features = features)
```

Then we will use these anchors to integrate the data and create an 'integrated' data assay. We do that with the `IntegrateData` function, providing the anchors we obtained.
```{r}
MergedNML.integrated <- IntegrateData(anchorset= MergedNML.anchors)
```
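
As a quick sanity check (a minimal sketch), the integrated object should now contain both the original `RNA` assay and the new `integrated` assay:

```{r}
Assays(MergedNML.integrated)
```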

## Integrated analysis
We need to specify that downstream analysis will run on the integrated data. For that we set the default assay with the `DefaultAssay` function.
```{r}
DefaultAssay(MergedNML.integrated) <- "integrated"
```

Now we need to run the visualization and clustering workflow again, the same workflow we performed in the PCs tutorial.
```{r}
MergedNML.integrated <- MergedNML.integrated %>%
ScaleData(verbose = FALSE) %>%
RunPCA(npcs=50, verbose= FALSE) %>%
FindNeighbors(reduction = "pca", dims = 1:20) %>%
FindClusters(resolution = 0.1) %>%
RunUMAP(reduction = "pca", dims = 1:20)
```

## Visualization
Now we can visualize the data with a `DimPlot`. Using the same parameters as before for the workflow, we can now see the clusters obtained on the integrated data.
```{r}
DimPlot(MergedNML.integrated, reduction = "umap", label = TRUE)
```

Finally, let's compare the results from the previous workflow with the results after data integration by plotting the two UMAPs with the `DimPlot` function.
In the left plot the cells were not well integrated and clearly separate by sample; in the right plot, after integration, cells from the three samples are well mixed.
```{r}
plot1 <- DimPlot(MergedNML, reduction = "umap", group.by = 'orig.ident')
plot2 <- DimPlot(MergedNML.integrated, reduction = "umap", group.by = 'orig.ident')
plot1+plot2
```
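
For a quick, informal check beyond the plots (a simple tabulation rather than a formal mixing metric), we can count how cells from each sample distribute across the integrated clusters; roughly proportional counts in every cluster suggest good mixing.

```{r}
table(MergedNML.integrated$orig.ident, Idents(MergedNML.integrated))
```
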
## References
- Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9
- Batch effect corrections, (2023), 10X Genomics, Source: <https://www.10xgenomics.com/resources/analysis-guides/introduction-batch-effect-correction>
- Hoffman P. Introduction to scRNA-seq integration (2023). Source: <https://satijalab.org/seurat/articles/integration_introduction.html>