Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move filtering to before DESeq2 object creation #425

Merged
merged 4 commits into from
Dec 11, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 28 additions & 24 deletions 03-rnaseq/clustering_rnaseq_01_heatmap.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -217,7 +217,7 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th
metadata <- readr::read_tsv(metadata_file)
# Read in data TSV file
df <- readr::read_tsv(data_file) %>%
expression_df <- readr::read_tsv(data_file) %>%
# Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
tibble::column_to_rownames("Gene")
```
Expand All @@ -232,47 +232,51 @@ Now let's ensure that the metadata and data are in the same sample order.

```{r}
# Make the data in the order of the metadata
df <- df %>% dplyr::select(metadata$refinebio_accession_code)
expression_df <- expression_df %>%
dplyr::select(metadata$refinebio_accession_code)
# Check if this is in the same order
all.equal(colnames(df), metadata$refinebio_accession_code)
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

Now we are going to use a combination of functions from the `DESeq2` and `pheatmap` packages to look at how are samples and genes are clustering.

## Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis.
We are going to do some pre-filtering to keep only genes with 10 or more reads total.
Note that rows represent gene data and the columns represent sample data in our dataset.

```{r}
# Define a minimum counts cutoff and filter the data to include
# only rows (genes) that have total counts above the cutoff
filtered_expression_df <- expression_df %>%
dplyr::filter(rowSums(.) >= 10)
```

We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function.

```{r}
# The `DESeqDataSetFromMatrix()` function needs the values to be integers
filtered_expression_df <- round(filtered_expression_df)
```

## Create a DESeqDataset

We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object.
We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014].
In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis.

```{r}
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
df <- df %>%
# Mutate numeric variables to be integers
dplyr::mutate_if(is.numeric, round)
# Create a `DESeqDataSet` object
dds <- DESeqDataSetFromMatrix(
countData = df, # This is the data.frame with the counts values for all replicates in our dataset
colData = metadata, # This is the data.frame with the annotation data for the replicates in the counts data.frame
design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
countData = filtered_expression_df, # the counts values for all samples in our dataset
colData = metadata, # annotation data for the samples in the counts data frame
design = ~1 # Here we are not specifying a model
# Replace with an appropriate design variable for your analysis
)
```

## Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our heatmap.
We are going to do some pre-filtering to keep only genes with 10 or more reads total.
Note that rows represent gene data and the columns represent sample data in our dataset.

```{r}
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
# only rows that have counts above the cutoff
genes_to_keep <- rowSums(counts(dds)) >= 10
dds <- dds[genes_to_keep, ]
```

## Perform DESeq2 normalization and transformation

We are going to use the `rlog()` function from the `DESeq2` package to normalize and transform the data.
Expand Down
Loading