
# (PART) Introduction {-}

# Data Infrastructure {#data-introduction}

```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```

The
[`SummarizedExperiment`](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)
(`SE`) class is widely used for analyzing data obtained by common sequencing
techniques. `SE` is a common structure for several Bioconductor packages that are
used for analyzing RNA-seq and ChIP-seq data. The `SE` class is also used in R packages
for analyzing microarray, flow cytometry, proteomics, and single-cell sequencing
data, among others. Single-cell analysis is facilitated by the
[`SingleCellExperiment`
class](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html)
(`SCE`), which allows the user to store results of dimensionality reduction and
alternative experiments. Alternative experiments (`altExps`) can hold differently
processed versions of the data within an analysis workflow.

More recently,
[`TreeSummarizedExperiment`](https://www.bioconductor.org/packages/release/bioc/html/TreeSummarizedExperiment.html)
(`TSE`) and
[`MicrobiomeExperiment`](https://github.com/FelixErnst/MicrobiomeExperiment)
were developed to extend the `SE` and `SCE` classes by incorporating hierarchical
information (including a phylogenetic tree) and reference sequences.

The `mia` package implements tools using these classes for analysis of
microbiome sequencing data.

## Installation

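Install the development version from GitHub with `BiocManager`, which relies on
the `remotes` R package for GitHub repositories.

```{r eval=FALSE, message=FALSE}
# install BiocManager and remotes first, if they are not yet available
# install.packages(c("BiocManager", "remotes"))
BiocManager::install("FelixErnst/mia")
```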

### Packages    

1. `mia`    : Microbiome analysis tools   
2. `miaViz` : Microbiome analysis specific visualization

**See also:**    

[`microbiome`](https://bioconductor.org/packages/devel/bioc/html/microbiome.html)

## Background

The `phyloseq` package and class predate the [`SummarizedExperiment`](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)
and the derived
[`TreeSummarizedExperiment`](https://www.bioconductor.org/packages/release/bioc/html/TreeSummarizedExperiment.html)
class. To ease the transition,
here is a short description of how the `phyloseq` and `*Experiment` classes relate to
each other.

`assays`     : This slot is similar to `otu_table` in `phyloseq`. A `SummarizedExperiment`
               object can store multiple assays, for example raw counts and transformed counts. See also
               [`MultiAssayExperiment`](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) for storing data from multiple `experiments` such as RNA-seq, proteomics, etc.  
`rowData`    : This slot is similar to `tax_table` in `phyloseq` and stores taxonomic information.  
`colData`    : This slot is similar to `sample_data` in `phyloseq` and stores information related to samples.  
`rowTree`    : This slot is similar to `phy_tree` in `phyloseq` and stores a phylogenetic tree.  

In this book, you will come across terms like `FeatureIDs` and `SampleIDs`.   
`FeatureIDs` : These are basically OTU/ASV ids which are row names in `assays` and `rowData`.    
`SampleIDs`  : As the name suggests, these are sample ids which are column names in `assays` and row names in `colData`.  

Although `FeatureIDs` and `SampleIDs` are used in this book, the technical terms
`rownames` and `colnames` are preferred, since they correspond to the actual
objects we work with.
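
The mapping above can be tried out on a small hand-made object. The following is a minimal sketch, not part of the `mia` workflow: the counts, taxa, and sample annotations are invented, and all `toy_*` names are hypothetical. It builds a plain `SummarizedExperiment` and shows that the same IDs index the assay and the annotation tables.

```{r}
library(SummarizedExperiment)

# Toy data; every name and value here is made up for illustration
toy_counts <- matrix(
    c(10, 0, 3,
       5, 2, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
toy_taxa <- DataFrame(
    Phylum = c("Firmicutes", "Bacteroidetes"),
    row.names = rownames(toy_counts)
)
toy_samples <- DataFrame(
    SampleType = c("Soil", "Feces", "Soil"),
    row.names = colnames(toy_counts)
)

toy_se <- SummarizedExperiment(
    assays  = list(counts = toy_counts),  # compare with otu_table
    rowData = toy_taxa,                   # compare with tax_table
    colData = toy_samples                 # compare with sample_data
)

# FeatureIDs are the row names, SampleIDs the column names
rownames(toy_se)
colnames(toy_se)

# The same IDs are the row names of the annotation tables
identical(rownames(toy_se), rownames(rowData(toy_se)))
identical(colnames(toy_se), rownames(colData(toy_se)))
```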

## Loading experimental microbiome data

## Metadata

## Microbiome and tree data specific aspects

```{r, message=FALSE}
library(mia)
data("GlobalPatterns")
se <- GlobalPatterns 
se
```

### Assays  

The `assays` slot contains the experimental data as count matrices. Multiple
matrices can be stored, so the result of `assays` is actually a list of matrices.

```{r}
assays(se)
```

Individual assays can be accessed via `assay`.

```{r}
assay(se, "counts")[1:5,1:7]
```
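
An assay can also be added by hand with the `assay<-` replacement function from `SummarizedExperiment`. The sketch below uses an invented toy matrix rather than `GlobalPatterns`, and the assay name `log10p` is a hypothetical choice, so that it stands alone and leaves `se` untouched.

```{r}
library(SummarizedExperiment)

# Hypothetical toy counts: 2 features x 3 samples
toy_counts <- matrix(
    c(10, 0, 3,
       5, 2, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
toy_se <- SummarizedExperiment(assays = list(counts = toy_counts))

# Store a log-transformed copy next to the raw counts
# (a pseudocount of 1 avoids log10(0))
assay(toy_se, "log10p") <- log10(assay(toy_se, "counts") + 1)

# Both assays are now listed and share the same dimensions
assayNames(toy_se)
assay(toy_se, "log10p")
```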

To illustrate the use of multiple assays, the relative abundance data can be
calculated and stored alongside the original count data using `relAbundanceCounts`.

```{r}
se <- relAbundanceCounts(se)
assays(se)
```

Now there are two assays available in the `se` object, `counts` and 
`relabundance`.

```{r}
assay(se, "relabundance")[1:5,1:7]
```
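
To make the idea of relative abundance concrete, here is a base R sketch on an invented toy matrix. It assumes that relative abundance means each count divided by the total count of its sample; this is an illustration of the concept and not necessarily how `relAbundanceCounts` is implemented. All `toy_*` names are hypothetical.

```{r}
library(SummarizedExperiment)

# Hypothetical toy counts: 2 features x 3 samples
toy_counts <- matrix(
    c(10, 0, 3,
       5, 2, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
toy_se <- SummarizedExperiment(assays = list(counts = toy_counts))

# Divide every column (sample) by its column sum
toy_totals <- colSums(assay(toy_se, "counts"))
toy_rel <- sweep(assay(toy_se, "counts"), 2, toy_totals, "/")

# Store the result alongside the counts
assay(toy_se, "relabundance") <- toy_rel

# Each sample should now sum to 1
colSums(assay(toy_se, "relabundance"))
```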

### colData

`colData` contains data on the samples.

```{r coldata}
colData(se)
```
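
Sample annotations can be used to subset the whole object. The sketch below works on an invented toy object (the `SampleType` values and all `toy_*` names are hypothetical) and shows that selecting columns by a `colData` variable keeps the assay in sync.

```{r}
library(SummarizedExperiment)

# Hypothetical toy data: 2 features x 3 samples with one sample annotation
toy_counts <- matrix(
    c(10, 0, 3,
       5, 2, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
toy_se <- SummarizedExperiment(
    assays  = list(counts = toy_counts),
    colData = DataFrame(
        SampleType = c("Soil", "Feces", "Soil"),
        row.names = colnames(toy_counts)
    )
)

# `$` on the object is a shortcut to the columns of colData
toy_se$SampleType

# Keep only the soil samples; the assay columns are subset accordingly
toy_soil <- toy_se[, toy_se$SampleType == "Soil"]
colnames(toy_soil)
dim(assay(toy_soil, "counts"))
```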

### rowData

`rowData` contains data on the features of the samples analyzed. Of particular
interest for the microbiome field, it is used to store taxonomic information.

```{r rowdata}
rowData(se)
```
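
Taxonomic information in `rowData` can be summarized and used for subsetting in the same way. The following sketch uses an invented toy object; the `Phylum` assignments and all `toy_*` names are hypothetical.

```{r}
library(SummarizedExperiment)

# Hypothetical toy data: 3 features x 2 samples with one taxonomic rank
toy_counts <- matrix(
    1:6,
    nrow = 3,
    dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2"))
)
toy_se <- SummarizedExperiment(
    assays  = list(counts = toy_counts),
    rowData = DataFrame(
        Phylum = c("Firmicutes", "Bacteroidetes", "Firmicutes"),
        row.names = rownames(toy_counts)
    )
)

# Number of features per phylum
table(rowData(toy_se)$Phylum)

# Keep only the Firmicutes features; the assay rows follow
toy_firmicutes <- toy_se[rowData(toy_se)$Phylum == "Firmicutes", ]
rownames(toy_firmicutes)
```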

### rowTree  

Phylogenetic trees also play an important role in the microbiome field. The
`TreeSummarizedExperiment` class keeps track of feature and node
relations via two functions, `rowTree` and `rowLinks`.

A tree can be accessed via `rowTree` as a `phylo` object.

```{r rowtree}
rowTree(se)
```

The links to the individual features are available through `rowLinks`.

```{r rowlinks}
rowLinks(se)
```

Please note that there can be a 1:1 relationship between tree nodes and
features, but this is not required. There can be features that
are not linked to nodes, and nodes that are not linked to features. To change
the links in an existing object, the `changeTree` function is available.
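
The relation between rows and tree nodes can be explored on a small example. The sketch below is a toy setup and not taken from `GlobalPatterns`: it assumes a random tree generated with `ape::rtree` whose tip labels equal the row names of the assay, so every feature is linked to a leaf. All `toy_*` names are hypothetical.

```{r}
library(TreeSummarizedExperiment)

# Hypothetical feature IDs, used both as row names and as tip labels
toy_ids <- paste0("OTU", 1:4)

# Random tree for illustration only
set.seed(1)
toy_tree <- ape::rtree(4, tip.label = toy_ids)

toy_counts <- matrix(
    1:8,
    nrow = 4,
    dimnames = list(toy_ids, c("S1", "S2"))
)

# Rows are matched to tree nodes by their row names
toy_tse <- TreeSummarizedExperiment(
    assays  = list(counts = toy_counts),
    rowTree = toy_tree
)

# The tree itself and the row-to-node links
rowTree(toy_tse)
rowLinks(toy_tse)
```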

## Data conversion

Sometimes custom solutions are needed for analyzing the data. `mia` contains a
few functions to help in these situations.

### Tidy data

For several custom analysis and visualization packages, such as those from the
`tidyverse`, the `SE` data can be converted to a long `data.frame` format with
`meltAssay`.

```{r}
library(mia)
molten_se <- meltAssay(se,
                       add_row_data = TRUE,
                       add_col_data = TRUE,
                       abund_values = "relabundance")
molten_se
```
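
What "long format" means can be shown with base R alone. The sketch below reshapes an invented toy matrix into one row per feature and sample pair. The column names `FeatureID`, `SampleID`, and `counts` are chosen for illustration and are not claimed to match the output of `meltAssay`.

```{r}
# Hypothetical toy counts: 2 features x 3 samples
toy_counts <- matrix(
    c(10, 0, 3,
       5, 2, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)

# One row per (feature, sample) combination
toy_long <- as.data.frame(as.table(toy_counts), stringsAsFactors = FALSE)
colnames(toy_long) <- c("FeatureID", "SampleID", "counts")

toy_long
```

In this layout every observation is a row, which is the shape that `tidyverse` tools such as `ggplot2` and `dplyr` generally expect.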

## Conclusion

Some wrapping up...

## Session Info {-}

```{r sessionInfo, echo=FALSE, results='asis'}
prettySessionInfo()
```