Skip to content

Commit

Permalink
Move filtering to before DESeq2 object creation (#425)
Browse files Browse the repository at this point in the history
* re-render it

* Further fix merge conflicts and re-render

* Address jashapiro comments
  • Loading branch information
cansavvy authored Dec 11, 2020
1 parent 4b043ce commit f188c7c
Show file tree
Hide file tree
Showing 6 changed files with 614 additions and 606 deletions.
52 changes: 28 additions & 24 deletions 03-rnaseq/clustering_rnaseq_01_heatmap.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -217,7 +217,7 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th
metadata <- readr::read_tsv(metadata_file)
# Read in data TSV file
df <- readr::read_tsv(data_file) %>%
expression_df <- readr::read_tsv(data_file) %>%
# Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
tibble::column_to_rownames("Gene")
```
Expand All @@ -232,47 +232,51 @@ Now let's ensure that the metadata and data are in the same sample order.

```{r}
# Make the data in the order of the metadata
df <- df %>% dplyr::select(metadata$refinebio_accession_code)
expression_df <- expression_df %>%
dplyr::select(metadata$refinebio_accession_code)
# Check if this is in the same order
all.equal(colnames(df), metadata$refinebio_accession_code)
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

Now we are going to use a combination of functions from the `DESeq2` and `pheatmap` packages to look at how are samples and genes are clustering.

## Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis.
We are going to do some pre-filtering to keep only genes with 10 or more reads total.
Note that rows represent gene data and the columns represent sample data in our dataset.

```{r}
# Define a minimum counts cutoff and filter the data to include
# only rows (genes) that have total counts above the cutoff
filtered_expression_df <- expression_df %>%
dplyr::filter(rowSums(.) >= 10)
```

We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function.

```{r}
# The `DESeqDataSetFromMatrix()` function needs the values to be integers
filtered_expression_df <- round(filtered_expression_df)
```

## Create a DESeqDataset

We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object.
We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014].
In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis.

```{r}
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
df <- df %>%
# Mutate numeric variables to be integers
dplyr::mutate_if(is.numeric, round)
# Create a `DESeqDataSet` object
dds <- DESeqDataSetFromMatrix(
countData = df, # This is the data.frame with the counts values for all replicates in our dataset
colData = metadata, # This is the data.frame with the annotation data for the replicates in the counts data.frame
design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
countData = filtered_expression_df, # the counts values for all samples in our dataset
colData = metadata, # annotation data for the samples in the counts data frame
design = ~1 # Here we are not specifying a model
# Replace with an appropriate design variable for your analysis
)
```

## Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our heatmap.
We are going to do some pre-filtering to keep only genes with 10 or more reads total.
Note that rows represent gene data and the columns represent sample data in our dataset.

```{r}
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
# only rows that have counts above the cutoff
genes_to_keep <- rowSums(counts(dds)) >= 10
dds <- dds[genes_to_keep, ]
```

## Perform DESeq2 normalization and transformation

We are going to use the `rlog()` function from the `DESeq2` package to normalize and transform the data.
Expand Down
Loading

0 comments on commit f188c7c

Please sign in to comment.