index.rmarkdown

---
title: "Bioconductor classes for working <br> with microarrays or similar data"
author: "Alex Sanchez-Pla"
date: "`r Sys.Date()`"
format:
    html: 
      toc: true
      toc-depth: 4
      number-sections: true
      embed-resources: true
    pdf:
      toc: true
      toc-depth: 3
      number-sections: true      
knit:
  quarto:
    chunk_options:
      echo: true
      cache: false
      prompt: false
      tidy: true
      comment: NA
      message: false
      warning: false
    knit_options:
      width: 75
# reference-location: margin
editor_options: 
  chunk_output_type: console
editor: 
  markdown: 
    wrap: 72
---


# Introduction

Many omics data, once they have been pre-processed, can be stored as
numeric data that can be represented as the typical "data matrix". This
matrix is, however, usually transposed, that is genes (variables) are in
rows and samples (individuals) are in columns.

A person who is familiar with statistics and R can therefore explore an
omics dataset using standard univariate and multivariate statistical
methods.

In practice, omics datasets have more information than just what can be
stored in a table. This can be annotation data, multiple covariates
other than what is in the column names, or information about th
eexperimental design or simply the experiment.

Even for a person who is proficient with software, managing
simultaneously distinct objects, that contain related information, can
be "tricky" and there is always a danger that the distinct components
lose synchronization. For instance removing one sample from the
expression matrix requires that the corresponding information is removed
or updated in the covariates table. And an error at doing this can yield
different problems.

In this lab we introduce the <tt>ExpressionSet</tt> class as an option
for managing all these pieces of information simultaneously, which not
only simplifies the process, but also prevents mistakes derived from
lack of consistency between the parts.

The lab has two parts

1.  Introduces bioconductor classes to store and access microarray data.

2.  Shows how to use the `GEOquery` bioconductor package to download
    microarray data into an analysis-ready form.

## Availability

This document can be re-created using the repository

# Bioconductor classes to manage micrarray and similar data

## The OOP paradigm

Object-oriented design provides a convenient way to represent data
structures and actions performed on them.

-   A <em>class</em> can be tought of as a template, a description of
    what constitutes each instance of the class.
-   An <em>instance</em> of a class is a realization of what describes
    the class.
-   Attributes of a class are data components, and methods of a class
    are functions, or actions the instance/class is capable of.

The R language has several implementations of the OO paradigm but, in
spite of its success in other languages, it is relatively minoritary.

## Bioconductor Classes

One case where OOP has succeeded in R or, at least, is more used than in
others is in the Bioconductor Project
([bioconductor.org](http://bioconductor.org)). In Bioconductor we have
to deal with complex data structures such as the results of a microarray
experiment, a genome and its annotation or a complex multi-omics
dataset. These are situations where using OOP to create classes to
manage those complex types of data is clearly appropriate.

## The `Biobase` package

The `R`package{Biobase} package implements one of the best known
Bioconductor classes: <tt>ExpressionSet</tt>. It was originally intended
to contain microarray data and information on the study that generated
them and it has become a standard for similar data structures.


```{r loadBiobase}
library(Biobase)
```


Figure @ref(ExpressionSet) shows the structure of this class. It is
essentially a <em>container</em> that has distinct slots to store some
of the most usual components in an omics dataset.


```{r ExpressionSet, fig.cap="Structure of the <tt>ExpressionSet</tt> class, showing its slots and their meaning. Reproduced from Klaus, B., & Reisenauer, S. (2018)", echo=FALSE}
knitr::include_graphics("images/Structure-of-Bioconductors-ExpressionSet-class.png")
```


The advantage of the OOP approach is that, if a new type of omics data
needs a similar but different structure it can be created using
inheritance, which means much less work than and better consistency than
creating it from scratch.

## A toy dataset

For the purpose of this lab we are going to simulate a toy (fake)
dataset that consists of the following:

-   <b>Expression values</b> A matrix of 30 rows and 10 columns
    containing expression values from a gene expression experiment.
    Matrix column names are sample identifiers

-   <b>Covariates</b> A table of ten rows and four columns containing
    the sample identifiers, the treatment groups and the age and sex of
    individuals.

-   <b>Genes</b> Information about the features contained in the data.
    May be the gene names, the probeset identifiers etc. Usually stored
    in a character vector but may also be a table with distinct
    annotations per feature.

-   <b>Information about the experiment</b> Additional information about
    the study, such as the authors and their contact details or the
    title and url of the study that originated them.


```{r simulateData}
expressionValues <- matrix (rnorm (300), nrow=30)
colnames(expressionValues) <- paste0("sample",1:10)
head(expressionValues)
```


**VERY IMPORTANT**: To create the ExpressionSet the following has to be
verified:

-   The names of the columns of the object that contains the
    expressions, that will be stored in `assayData`
-   must match the names of the rows of the object that contains the
    covariates, that will be stored in `phenoData`.

In this example it is saved in the variable `sampleNames` but this field
will be used as the *name of the rows*, not as another column


```{r simulateCovariates}
targets <- data.frame(sampleNames = paste0("sample",1:10),
                      group=c(paste0("CTL",1:5),paste0("TR",1:5)),
                      age = rpois(10, 30), 
                      sex=as.factor(sample(c("Male", "Female"),10,replace=TRUE)),
                      row.names=1)
head(targets, n=10)
```

```{r simulateGeneInfo}
myGenes <-  paste0("gene",1:30)
```

```{r simulateInfo}
myInfo=list(myName="Alex Sanchez", 
            myLab="Bioinformatics Lab",
            myContact="alex@somemail.com", 
            myTitle="Practical Exercise on ExpressionSets")
show(myInfo)
```


Having data stored in this way is usually enough for most of the analyes
we may want to do. The only unconvenient comes from the fact that the
information about the same individuals is in separate R objects so that,
for certain applications, we will have to access several objects and
<em>assume they are well related</em>.

For example if we want to make a principal components analysis and plot
the groups by treatment we need to use both
`expressionValues" and`targets."


```{r PCA1}
pcs <- prcomp(expressionValues)
names(pcs)
barplot(pcs$sdev)
plot(pcs$rotation[,1], pcs$rotation[,2], 
     main="Representation of first two principal components")
text(pcs$rotation[,1], pcs$rotation[,2], targets$group, cex=0.8, pos=3)
```


Or, if we sort the genes from most to least variable and whant to see
which are the top variable genes. We need to use both objects
`expressionValues" and`myGenes" assuming they are well linked:


```{r sortGenesByVar}
variab <- apply(expressionValues, 1, sd)
orderedGenes <- myGenes[order(variab, decreasing=TRUE)]
head(variab[order(variab, decreasing=TRUE)])
head(orderedGenes)
```


Imagine we are informed that individual has to be removed. We have to do
it in "expressionValues" and "targets".


```{r subsetExpressions}
newExpress<- expressionValues[,-9]
newTargets <- targets[-9,]
wrongNewTargets <- targets [-10,]
```


It is relatively easy to make an unnoticeable mistake in removing
unrelated values from the data matrix and the targets table. If instead
of removing individual 9 we remove individual 10 it may be difficult to
realize what has happened unless it causes a clear unconsistency!

## Creating and using objects of class ExpressionSet

In order to use a class we need to <em>instantiate</em> it, that is we
need to create an object of this class.

This can be done using the generic constructor <tt>new</tt> or with the
function <tt>ExpressionSet</tt>.

Both the constructor or the function require a series of parameters
which roughly correspond to the slots of the class (type <tt>?
ExpressionSet</tt> to see a list of compulsory and optional arguments).

In the following subsections we describe how to create an
<tt>ExpressionSet</tt> using the components of the toy dataset. Some of
the elements will directly be the element in the toy dataset, such as
the expression matrix. For others such as the covariates or the
experiment information, specific classes have been introduced so that we
have to instantiate these classes first and then use the the objects
created to create the <tt>ExpressionSet</tt> object.

### Slot <tt>AssayData</tt>

The main element, and indeed the only one to be provided to create an
<tt>ExpressionSet</tt>, is <tt>AssayData</tt>. For our practical
purposes it can be seen as a matrix with as many rows as genes or
generically "features" and as many columns as samples or individuals.


```{r creaExpressionSet1}
myEset <- ExpressionSet(expressionValues)
class(myEset)
show(myEset)
```


### Information about covariates

Covariates, such as those contained in the "targets" data frame are not
included in the "ExpressionSet" "as.is". Instead we have first to create
an intermediate object of class <tt>AnnotatedDataFrame</tt>.

Class `R`class{AnnotatedDataFrame} is intended to contain a data frame
where we may want to provide enhanced information for columns, i.e.
besides the short column names, longer labels to describe them better.

The information about covariates, contained in an instance of class
<tt>AnnotatedDataFrame</tt>, is stored in the slot <tt>phenoData</tt>.


```{r AnnotatedDataFrame2}
columnDesc <-  data.frame(labelDescription= c("Treatment/Control", 
                                                "Age at disease onset", 
                                                "Sex of patient (Male/Female"))
myAnnotDF <- new("AnnotatedDataFrame", data=targets, varMetadata= columnDesc)
show(myAnnotDF)
```


Notice that we have not included a label for sample names because this
information is not a column of the `phenoData` object.

Once we have an <tt>AnnotatedDataFrame</tt> we can add it to the
<tt>ExpressionSet</tt>


```{r addAnnotatedDataFrame}
phenoData(myEset) <- myAnnotDF
```


Alternatively we could have created the<tt>AnnotatedDataFrame</tt>
object first and then create the <tt>ExpressionSet</tt> object with both
the expression values and the covariates. In this case it would be
required that the expression matrix colum names are the same as the
targets row names.


```{r creaEset2}
myEset <- ExpressionSet(assayData=expressionValues, phenoData=myAnnotDF)
show(myEset)
```


### Adding information about features

Similarly to what we do to store information about covariates,
information about genes (or generically "features") may be stored in the
optional slot <tt>featureData</tt> as an <tt>AnnotatedDataFrame</tt>.

The number of rows in <tt>featureData</tt> must match the number of rows
in <tt>assayData.</tt> Row names of <tt>featureData</tt> must match row
names of the matrix / matrices in assayData.

This slot is good if one has an annotations table that one wishes to
store and manage jointly with the other values. ALternatively we can
simple store the names of the features using a character vector in the
slot <tt>featureNames</tt>.


```{r creaEset3}
myEset <- ExpressionSet(assayData=expressionValues,
                        phenoData=myAnnotDF,
                        featureNames =myGenes)
# show(myEset)
```


### Storing information about the experiment

In a similar way to what happens with the <tt>AnnotatedDataFrame</tt>
class there has been developed a class to store information about the
experiment. The structure of the class, called <tt>MIAME</tt> follows
the structur of what has been described as the "Minimum Information
About a Microarray Experiment" see
[www.ncbi.nlm.nih.gov/pubmed/11726920](https://www.ncbi.nlm.nih.gov/pubmed/11726920)

This is useful information but it is clearly optional for data analysis.


```{r label=MIAME}
myDesc <- new("MIAME", name= myInfo[["myName"]],
            lab= myInfo[["myLab"]],
            contact= myInfo[["myContact"]] ,
            title=myInfo[["myTitle"]])
print(myDesc)
```


Again we could add this object to the <tt>ExpressionSet</tt> or use it
when creating it from scratch.


```{r creaEset4}
myEset <- ExpressionSet(assayData=expressionValues,
                        phenoData=myAnnotDF,
                        fetureNames =myGenes,
                        experimentData = myDesc)
# show(myEset)
```


## Using objects of class <tt>ExpressionSet</tt>

The advantage of working with <tt>ExpressionSets</tt> lies in the fact
that action on the objects are done in such a way that its consistency
is ensured. That means for instance that if we subset the
<tt>ExpressionSet</tt> it is automatically done on the columns of the
expressions and on the rows of the covariates and it is no possible that
a distinct row/column are removed.

The following lines illustrate some management of data in an
<tt>ExpressionSet</tt>.

### Accessing Slot values

Notice that to access the values we use special functions called
"accessors" instead of the dollar symbol (which would not work for
classes) or the \@ symbol that does substitute the \$ symbol.

Notice also that, in order to access the data frame contained in the
<tt>phenoData</tt> slot, which is an <tt>AnnotatedDataFrame</tt>, we
need to use two accessors: <tt>phenoData</tt> to access the
<tt>ExpressionSet</tt>'s<tt>phenoData</tt> slot and <tt>pData</tt> to
access the <tt>data</tt> slot in it. It is strange until you get used to
it!


```{r usingExpressionSets}
dim(exprs(myEset))
class(phenoData(myEset))
class(pData(phenoData(myEset)))
head(pData(phenoData(myEset)))
head(pData(myEset))
```


### Subsetting `ExpressionSets`

This is where the interest of using <tt>ExpressionSets</tt> is most
clearly realized.

The <tt>ExpressionSet</tt> object has been cleverly-designed to make
data manipulation consistent with other basic R object types. For
example, creating a subset of an ExpressionsSet will subset the
expression matrix, sample information and feature annotation (if
available) simultaneously in an appropriate manner. The user does not
need to know how the object is represented "under-the-hood". In effect,
we can treat the <tt>ExpressionSet</tt> as if it is a standard R data
frame


```{r smallEset}
smallEset <- myEset[1:15,c(1:3,6:8)]
dim(exprs(smallEset))
dim(pData(smallEset))
head(pData(smallEset))
all(colnames(exprs(smallEset))==rownames(pData(smallEset)))
```


We can for instance create a new dataset for all individuals younger
than 30 or for all females without having to worry about doing it in
every component.


```{r creaEset5}
youngEset <- myEset[,pData(myEset)$age<30]
dim(exprs(youngEset))
head(pData(youngEset))
```


## Exercises

4.  Create an `ExpressionSet` object to contain the data for the example
    study using the data you have downloaded and used in the first
    section. That is, adapt the steps taken to creat the ExpressionSet
    with the toy dataset to create one with the data from the study.

5.  Do some subsetting and check the consistency of the results
    obtained. For example remove some sample from the covariates slot
    (the `phenoData`) and see if it is automatically removed from the
    expression matrix\`.

6.  Check that you are able to reproduce the analysis in the first part
    accessing the components of the object created.

# The `GEOquery` package to download data from GEO

The NCBI Gene Expression Omnibus (GEO) serves as a public repository for
a wide range of high-throughput experimental data. These data include
single and dual channel microarray-based experiments measuring mRNA,
genomic DNA, and protein abundance, as well as non-array techniques such
as serial analysis of gene expression (SAGE), mass spectrometry
proteomic data, and high-throughput sequencing data.

At the most basic level of organization of GEO, there are four basic
entity types. The first three (Sample, Platform, and Series) are
supplied by users; the fourth, the dataset, is compiled and curated by
GEO staff from the user-submitted data. More information is available in
the [GEO site](https://www.ncbi.nlm.nih.gov/geo/info/overview.html) and
in the document
[Analisis_de_datos_omicos-Ejemplo_0-Microarrays](https://github.com/ASPteaching/Analisis_de_datos_omicos-Ejemplo_0-Microarrays)
available in github.

Data can be downloaded from GEO in a wide variety of formats and using a
variety of mechanisms. See the download page in [this
link](https://www.ncbi.nlm.nih.gov/geo/info/download.html).

Here we focus on an alternative based on Bioconductor, the `GEOquery`
package
(<http://bioconductor.org/packages/release/bioc/html/GEOquery.html>)

This package has been developed **to facilitate downloading data from
GEO and turning them into objects of Bioconductor classes such as
`expressionSets`**

The best way to learn how to use this package is following its
[vignette, available at the package
site](http://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html).

Here we only describe how to download a datset using either its series
("GSExxx") or its Dataset ("GDSxxx") identifier.

In the following lines we illustrate how to get the data for this
example using the dataset used in the case study
[Analisis_de_datos_omicos-Ejemplo_0-Microarrays](https://github.com/ASPteaching/Analisis_de_datos_omicos-Ejemplo_0-Microarrays),
avilable from github.

As can be seen there the the dataset has the following identifiers:

-   Series accesion ID for : GSE27174
-   Dataset accesion ID for : GDS4155
-   Plattform accession ID : GPL6246

## Downloading a dataset in GSE format

Getting a series dataset from GEO is straightforward. There is only one
command that is needed: `getGEO`.

This function interprets its input (depending on the data format) to
determine how to get the data from GEO and then parse the data into
useful R data structures.


```{r GEOquery}
if (!require(GEOquery)) {
  BiocManager::install("GEOquery")
}
require(GEOquery)
gse <- getGEO("GSE27174", GSEMatrix=TRUE, AnnotGPL=TRUE)
```


If the data format required is a "Series" (GSExxxx) the function returns
a list, each of which elements is an expressionSet (this is so because
sometimes a Series may have several collections of samples).


```{r gseObject}
class(gse)
names(gse)
length(gse)
gse[[1]]
esetFromGEO <- gse[[1]]
```


By creating the expressionSet automatically the slow process of creating
the object step by step, as in the previous section, can be avoided.

The expressioSet can now be used as usual:


```{r esetFromGEO}
head(exprs(esetFromGEO))
```


We can look at the covariates information, but the phenoData object
created automatically contains lot of repeated information. Eventually
we can explore it and decide which columns we keep and whichs may be
removed. For instance we keep the last two columns and see that column
39 contains the information that defines the groups.


```{r esetfromGEO2}
colnames(pData(esetFromGEO))
pData(esetFromGEO)[,39:40]
```


## Downloading a dataset in GSD format

Eventually, we may prefer to download the data in GSD format.


```{r getGSD}
gds <- getGEO("GDS4155")
```


The object that has been created now is not a list but it is of a
special class "GDS"


```{r gdsObject}
class(gds)
slotNames(gds)
```


Class 'GDS' is comprised of a metadata header (taken nearly verbatim
from the SOFT format header) and a GEODataTable. The GEODataTable has
two simple parts, a Columns part which describes the column headers on
the Table part. There is also a show method ("Meta") for the class.


```{r gdsMetaData}
head(Meta(gds))
```


The gds object can be turned into an expressionSet that contains the
same information as in the previous case:


```{r esetFromGDS}
eset <- GDS2eSet(gds,do.log2=FALSE)
eset
```


# Exercises

1.  Select a *GEO (*Gene Expression Omnibus[) dataset
    ](https://www.ncbi.nlm.nih.gov/geo/)from the list presented in the
    "GEOdatasets_enhanced.xls" document available in the resources of
    the activity.

2.  Read the data from GEO using the GEOquery package. This will provide
    you with an expressionSet class object with the normalized data and
    an additional table with information about the study.
    
3.  Determine the structure of the data (rows, columns) and the design
    of the study (groups of samples or individuals, treatments if any,
    etc.) that generated them.
    -   The information of the experiment can also be downloaded from
        GEO, either with GEOquery if you provide the dataset identifier
        GDSxxxx or by accessing the study page.

# References

-   Cui, Dapeng, K. J. Dougherty, DW Machacek, S. Hochman, and D. J
    Baro. 2006. "Divergence Between Motoneurons: Gene Expression
    Profiling Provides a Molecular Characterization of Functionally
    Discrete Somatic and Autonomic Motoneurons." Physiol Genomics 24
    (3): 276--89. https://doi.org/ 10.1152/physiolgenomics.00109.2005.

-   Clough, E., & Barrett, T. (2016). The Gene Expression Omnibus
    Database. In Methods in molecular biology (Clifton, N.J.) (Vol.
    1418, pp. 93--110). <https://doi.org/10.1007/978-1-4939-3578-9_5>

-   Davis, S., & Meltzer, P. (2007). GEOquery: a bridge between the Gene
    Expression Omnibus (GEO) and BioConductor. Bioinformatics, 14,
    1846--1847.

-   W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan. Orchestrating
    high-throughput genomic analysis with Bioconductor. Nature Methods,
    2015:12, 115.

-   Klaus, B., & Reisenauer, S. (2018). An end to end workflow for
    differential gene expression using Affymetrix microarrays \[version
    2; referees: 2 approved\]. F1000Research, 5, 1384.

    <https://doi.org/10.12688/f1000research.8967.2>

# Additional info


```{r sessionInfo}
sessionInfo()
```

```{r creaIndex}
# An "index.html" file is created to allow visualitzation in the web using github pages
file.copy(from="Introduction_2_Bioc_classes_4_tabular_data.html", to="index.html", overwrite=TRUE)
```

```{r eval=FALSE}
# The R code for the document can be extracted from the document with the 
# knitr::purl() command
# knitr::purl("Introduction_2_Bioc_classes_4_tabular_data.qmd")
```