Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RNA-seq DGE dataset switch #416

Merged
merged 6 commits into from
Dec 7, 2020
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
151 changes: 76 additions & 75 deletions 03-rnaseq/differential-expression_rnaseq_01.Rmd
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: "Differential Expression - RNA-seq"
author: "CCDL for ALSF"
date: "October 2020"
date: "December 2020"
output:
html_notebook:
toc: true
Expand All @@ -11,7 +11,13 @@ output:

# Purpose of this analysis

This notebook takes RNA-seq data and metadata from refine.bio and identifies differentially expressed genes between experimental groups.
This notebook takes RNA-seq expression data and metadata from refine.bio and identifies differentially expressed genes between two experimental groups.

Differential expression analysis is popular technique for expression data that will tell you what genes are most much higher or much lower between your experimental groups.
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
The simplest version of this analysis is comparing two groups where one of those groups is a control group.

Our refine.bio RNA-seq examples use DESeq2 for these analyses because it handles RNA-seq data well and has great documentation.
Read more about DESeq2 and why we like it on our [Getting Started page](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2).

⬇️ [**Jump to the analysis code**](#analysis) ⬇️

Expand Down Expand Up @@ -66,7 +72,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty

For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data).

Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples).
Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP123625).

Click the "Download Now" button on the right side of this screen.

Expand All @@ -87,9 +93,9 @@ You will get an email when it is ready.

## About the dataset we are using for this example

For this example analysis, we will use this [acute myeloid leukemia (AML) dataset](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples) [@Micol2017]

@Micol2017 performed RNA-seq on primary peripheral blood and bone marrow samples from AML patients with and without _ASXL1/2_ mutations.
For this example analysis, we are using RNA-seq data from an [acute lymphoblastic leukemia (ALL) mouse lymphoid cell model](https://www.refine.bio/experiments/SRP123625) [@Kampen2019].
All lymphoid mouse cells have human RPL10 but three of the mice have a knock-in R98S mutated RPL10 and three have the human wildtype RPL10.
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
We will perform our differential expression using these knock-in and wildtype mice designations.

## Place the dataset in your new `data/` folder

Expand All @@ -104,15 +110,15 @@ For more details on the contents of this folder see [these docs on refine.bio](h
The `<experiment_accession_id>` folder has the data and metadata TSV files you will need for this example analysis.
Experiment accession ids usually look something like `GSE1235` or `SRP12345`.

Copy and paste the `SRP078441` folder into your newly created `data/` folder.
Copy and paste the `SRP123625` folder into your newly created `data/` folder.

## Check out our file structure!

Your new analysis folder should contain:

- The example analysis `.Rmd` you downloaded
- A folder called "data" which contains:
- The `SRP078441` folder which contains:
- The `SRP123625` folder which contains:
- The gene expression
- The metadata TSV
- A folder for `plots` (currently empty)
Expand All @@ -131,17 +137,17 @@ This is handy to do because if we want to switch the dataset (see next section f
```{r}
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "SRP078441")
data_dir <- file.path("data", "SRP123625")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "SRP078441.tsv")
data_file <- file.path(data_dir, "SRP123625.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_SRP078441.tsv")
metadata_file <- file.path(data_dir, "metadata_SRP123625.tsv")
```

Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
Expand Down Expand Up @@ -226,19 +232,19 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th
metadata <- readr::read_tsv(metadata_file)

# Read in data TSV file
df <- readr::read_tsv(data_file) %>%
expression_df <- readr::read_tsv(data_file) %>%
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
tibble::column_to_rownames("Gene")
```

Let's ensure that the metadata and data are in the same sample order.

```{r}
# Make the data in the order of the metadata
df <- df %>%
expression_df <- expression_df %>%
dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(df), metadata$refinebio_accession_code)
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

The information we need to make the comparison is in the `refinebio_title` column of the metadata data.frame.
Expand All @@ -249,61 +255,75 @@ head(metadata$refinebio_title)

## Set up metadata

This dataset includes data from patients with and without ASXL gene mutations.
The authors of this data have ASXL mutation status along with other information is stored all in one string (this is not very convenient for us).
This dataset includes data from mouse lymphoid cells with human RPL10, with and without a `R98S` mutation.
The authors of this data have the mutation status along with other information stored all in one string (this is not very convenient for us).
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
We need to extract the mutation status information into its own column to make it easier to use.

```{r}
metadata <- metadata %>%
# The last bit of the title, separated by "-" contains the mutation
# information that we want to extract
dplyr::mutate(asxl_mutation_status = stringr::word(refinebio_title,
-1,
sep = "-"
)) %>%
# Now let's summarized the ASXL1 mutation status from this variable
dplyr::mutate(asxl_mutation_status = dplyr::case_when(
grepl("ASXL1|ASXL2", asxl_mutation_status) ~ "asxl_mutation",
grepl("ASXLwt", asxl_mutation_status) ~ "no_mutation"
# Let's get the RPL10 mutation status from this variable
dplyr::mutate(mutation_status = dplyr::case_when(
stringr::str_detect(refinebio_title, "R98S") ~ "R98S",
stringr::str_detect(refinebio_title, "WT") ~ "wildtype"
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
))
```

Let's take a look at `metadata_df` to see if this worked.
Let's take a look at `metadata` to see if this worked by looking at the `refinebio_title` and `mutation_status` columns.

```{r}
# looking at the first 6 rows of the metadata_df and only at the columns that
# contain the title and the mutation status we extracted from the title
head(dplyr::select(metadata, refinebio_title, asxl_mutation_status))
# Let's taake a look at the original metadata column's info
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
# and our new `mutation_status` column
dplyr::select(metadata, refinebio_title, mutation_status)
```

Before we set up our model in the next step, we want to check if our modeling variable is set correctly.
We want our "control" to to be set as the first level in the variable we provide as our experimental variable.
Here we will use the `str()` function to print out a preview of the **str**ucture of our variable

```{r}
# Print out a preview of `asxl_mutation_status`
str(metadata$asxl_mutation_status)
# Print out a preview of `mutation_status`
str(metadata$mutation_status)
```

Currently, `asxl_mutation_status` is a character.
Currently, `mutation_status` is a character, which is not necessarily what we want.
To make sure it is set how we want for the `DESeq` object and subsequent testing, let's mutate it to a factor so we can explicitly set the levels.
cansavvy marked this conversation as resolved.
Show resolved Hide resolved

In the `levels` argument, we list `wildtype` first since that is our control group.
cansavvy marked this conversation as resolved.
Show resolved Hide resolved

```{r}
# Make asxl_mutation_status a factor and set the levels appropriately
# Make mutation_status a factor and set the levels appropriately
metadata <- metadata %>%
dplyr::mutate(
# Here we will set up the factor aspect of our new variable.
asxl_mutation_status = factor(asxl_mutation_status, levels = c("no_mutation", "asxl_mutation"))
mutation_status = factor(mutation_status, levels = c("wildtype", "R98S"))
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
)
```

Note if you don't specify `levels`, the `factor()` function will set levels in alphabetical order -- which sometimes means your control group will not be listed first!

Let's double check if the levels are what we want using the `levels()` function.

```{r}
levels(metadata$asxl_mutation_status)
levels(metadata$mutation_status)
```

Yes! `wildtype` is the first level as we want it to be.
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
We're all set and ready to move on to making our `DESeq2Dataset` object.

## Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts, since these do not have high enough counts to yield reliable differential expression results.
Removing these genes saves on memory usage during the tests.
We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.

```{r}
# Define a minimum counts cutoff and filter the data to include
# only rows (genes) that have total counts above the cutoff
filtered_expression_df <- expression_df %>%
dplyr::filter(rowSums(.) >= 10)
```

Yes! `no_mutation` is the first level as we want it to be. We're all set and ready to move on to making our `DESeq2Dataset` object.
If you have a bigger dataset, you will probably want to make this cutoff larger.

## Create a DESeq2Dataset

Expand All @@ -312,39 +332,29 @@ First we need to prep our gene expression data frame so it's in the format that

```{r}
# We are making our data frame into a matrix and rounding the numbers
gene_matrix <- round(as.matrix(df))
gene_matrix <- round(as.matrix(filtered_expression_df))
```

Now we need to create `DESeqDataSet` from our expression dataset.
We use the `asxl_mutation_status` variable we created in the design formula because that will allow us
to model the presence/absence of _ASXL1/2_ mutation.
We use the `mutation_status` variable we created in the design formula because that will allow us
to model the presence/absence of _R98S_ mutation.

```{r}
ddset <- DESeqDataSetFromMatrix(
# Here we supply non-normalized count data
countData = gene_matrix,
# Supply the `colData` with our metadata data frame
colData = metadata,
design = ~asxl_mutation_status
# Supply our experimental variable to `design`
design = ~mutation_status
)
```

## Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts, since these do not have high enough counts to yield reliable differential expression results.
Removing these genes saves on memory usage during the tests.
We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.

```{r}
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
# only rows that have counts above the cutoff
genes_to_keep <- rowSums(counts(ddset)) >= 10
ddset <- ddset[genes_to_keep, ]
```

## Run differential expression analysis

We'll use the wrapper function `DESeq()` to do our differential expression analysis.
In our `DESeq2` object we designated our `asxl_mutation_status` variable as the `model` argument.
Because of this, the `DESeq` function will use groups defined by `asxl_mutation_status` to test for differential expression.
In our `DESeq2` object we designated our `mutation_status` variable as the `model` argument.
Because of this, the `DESeq` function will use groups defined by `mutation_status` to test for differential expression.

```{r}
deseq_object <- DESeq(ddset)
Expand Down Expand Up @@ -386,7 +396,7 @@ deseq_df <- deseq_results %>%
tibble::rownames_to_column("Gene") %>%
dplyr::mutate(threshold = padj < 0.05) %>%
# let's sort by statistic -- the highest values should be what is up in the
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
# ASXL mutated samples
# RPL10 mutated samples
dplyr::arrange(dplyr::desc(log2FoldChange))
```

Expand All @@ -401,10 +411,10 @@ head(deseq_df)
To double check what a differentially expressed gene looks like, we can plot one with `DESeq2::plotCounts()` function.

```{r}
plotCounts(ddset, gene = "ENSG00000196074", intgroup = "asxl_mutation_status")
plotCounts(ddset, gene = "ENSMUSG00000026623", intgroup = "mutation_status")
```

The `mutation` group samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.
The `R98S` mutated samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.

## Save results to TSV

Expand All @@ -415,31 +425,22 @@ readr::write_tsv(
deseq_df,
file.path(
results_dir,
"SRP078441_differential_expression_results.tsv" # Replace with a relevant output file name
"SRP123625_differential_expression_results.tsv" # Replace with a relevant output file name
)
)
```

## Create a volcano plot

We'll use the `EnhancedVolcano` package's main function to plot our data [@Zhu2018].

Here we are plotting the `log2FoldChange` (which was estimated by `lfcShrink` step) on the x axis and `padj` on the y axis.
The `padj` variable are the p values corrected with `Benjamini-Hochberg` (the default from the `results()` step).

```{r}
EnhancedVolcano::EnhancedVolcano(
deseq_df,
lab = deseq_df$Gene, # A vector that contains our gene names
x = "log2FoldChange", # The variable in `deseq_df` you want to be plotted on the x axis
y = "padj" # The variable in `deseq_df` you want to be plotted on the y axis
)
```

Here the red point is the gene that meets both the default p value and log2 fold change cutoff (which are 10e-6 and 1 respectively).

We used the adjusted p values for our plot above, so you may want to loosen this cutoff with the `pCutoff` argument (Take a look at all the options for tailoring this plot using `?EnhancedVolcano`).
Because we are using adjusted p values we can feel safe in making our `pCutoff` argument `0.01` (default is `1e-05`).
Take a look at all the options for tailoring this plot using `?EnhancedVolcano`.

Let's make the same plot again, but adjust the `pCutoff` since we are using multiple-testing corrected p values and this time we will assign the plot to our environment as `volcano_plot`.
We will assign the plot to our environment as `volcano_plot`.
cansavvy marked this conversation as resolved.
Show resolved Hide resolved

```{r}
# We'll assign this as `volcano_plot` this time
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
Expand All @@ -455,13 +456,13 @@ volcano_plot <- EnhancedVolcano::EnhancedVolcano(
volcano_plot
```

This looks pretty good.
This looks pretty good!
Let's save it to a PNG.

```{r}
ggsave(
plot = volcano_plot,
file.path(plots_dir, "SRP078441_volcano_plot.png")
file.path(plots_dir, "SRP123625_volcano_plot.png")
) # Replace with a plot name relevant to your data
```

Expand Down
Loading