---
reference-section-title: References
bibliography: bibliography.bib
---

# Manipulating Hi-C data in R

```{r}
#| echo: false
#| results: "hide"
#| message: false
#| warning: false
source("_common.R")
library(ggplot2)
library(GenomicRanges)
library(InteractionSet)
library(HiCExperiment)
library(HiContactsData)
coolf <- HiContactsData('yeast_wt', 'mcool')
cf <- CoolFile(coolf)
pairsf <- HiContactsData('yeast_wt', 'pairs.gz')
```

::: {.callout-note}
## Aims
This chapter focuses on: 

- Modifying information associated with an existing `HiCExperiment` object
- Subsetting a `HiCExperiment` object
- Coercing a `HiCExperiment` object into a base data structure
:::

::: {.callout-important}
## Important reminder
- An `HiCExperiment` object allows random access parsing of a disk-stored contact matrix.
- An `HiCExperiment` object operates by wrapping together (1) a `ContactFile` 
(i.e. a connection to a disk-stored data file) and (2) a `GInteractions` 
generated by parsing the data file.  
:::

::: {.callout-warning collapse="true"}
## Recap on `HiCExperiment` objects 👇

- Creating a connection to a disk-stored contact matrix: 

```{r eval = FALSE}
# coolf <- "<path-to-disk-stored-contact-matrix.cool>"
coolf <- HiContactsData('yeast_wt', 'mcool')
cf <- CoolFile(coolf)
availableResolutions(cf)

availableChromosomes(cf)
```

- Importing a contact matrix over a specific genomic location, at a given resolution: 

```{r}
hic <- import(cf, focus = 'II:10000-50000', resolution = 4000)
hic
```

- Recovering genomic interactions stored in a `HiCExperiment`: 

```{r}
interactions(hic)
```
:::

::: {.callout-tip collapse="true"}
## Generating the example `hic` object 👇

To demonstrate how to manipulate a `HiCExperiment` object, we will create 
a `HiCExperiment` object from an example `.mcool` file provided 
in the `HiContactsData` package. 

```{r}
library(HiCExperiment)
library(HiContactsData)

# ---- This downloads an example `.mcool` file and caches it locally 
coolf <- HiContactsData('yeast_wt', 'mcool')

# ---- This creates a connection to the disk-stored `.mcool` file
cf <- CoolFile(coolf)
cf

# ---- This imports contacts from the long arm of chromosome `II`, at resolution `2000`
hic <- import(cf, focus = 'II:300001-813184', resolution = 2000)
hic
```
:::

## Subsetting a contact matrix

Two entirely different approaches can be used to subset a Hi-C contact matrix:

- **Subsetting before importing**: leveraging random access to a **disk-stored** 
contact matrix to **only import interactions overlapping with a genomic locus 
of interest**.

- **Subsetting after importing**: parsing the **entire contact matrix** in memory, 
and **subsequently** subsetting interactions overlapping with a genomic locus 
of interest.

![](images/20230525134200.png)

### Subsetting **before** import: with `focus`

Specifying a `focus` **when** importing a dataset in R 
(i.e. `"Subset first, then parse"`) is generally the 
recommended approach to import Hi-C data in R. 

The `focus` argument can be set when `import`ing a `ContactFile` in R, as follows: 

```{r eval = FALSE}
import(cf, focus = "...")
```

This ensures that only the needed data is parsed in R, reducing memory load and 
accelerating the import. Thus, this should be the preferred way of 
parsing `HiCExperiment` data, as disk-stored contact matrices allow 
**efficient random access to indexed data**.  

`focus` can be any of the following string types:

```{r}
#   "II"                                  --> import contacts over an entire chromosome
#   "II:300001-800000"                    --> import on-diagonal contacts within a chromosome
#   "II:300001-400000|II:600001-700000"   --> import off-diagonal contacts within a chromosome
#   "II|III"                              --> import contacts between two chromosomes
#   "II:300001-800000|V:1-500000"         --> import contacts between segments of two chromosomes
```
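
To make the structure of these strings explicit, here is a minimal base R sketch that splits a `focus` string into its anchors. `parse_focus()` is a hypothetical helper written for illustration only; it is not part of `HiCExperiment`, and it assumes chromosome names contain neither `:` nor `-`.

```{r}
# Hypothetical helper (NOT a HiCExperiment function): split a focus string
# into one or two anchors, then each anchor into chromosome / start / end.
parse_focus <- function(focus) {
    # "|" separates the two anchors of an off-diagonal or inter-chromosomal query
    anchors <- strsplit(focus, "|", fixed = TRUE)[[1]]
    parsed <- lapply(anchors, function(a) {
        parts <- strsplit(a, "[:-]")[[1]]
        has_coords <- length(parts) == 3
        data.frame(
            chr = parts[1],
            start = if (has_coords) as.numeric(parts[2]) else NA_real_,
            end = if (has_coords) as.numeric(parts[3]) else NA_real_
        )
    })
    do.call(rbind, parsed)
}

parse_focus("II")
parse_focus("II:300001-800000")
parse_focus("II:300001-800000|V:1-500000")
```

A single row corresponds to a query on one chromosome or one on-diagonal segment; two rows correspond to a query between two segments or two chromosomes.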

::: {.callout-important collapse="true"}
## More examples for import with `focus` argument 👇

- Subsetting to a specific **on-diagonal** genomic location using a standard UCSC-style coordinate query: 

```{r}
import(cf, focus = 'II:300001-800000', resolution = 2000)
```

- Subsetting to a specific **off-diagonal** genomic location using a query made of a pair of coordinates: 

```{r}
import(cf, focus = 'II:300001-400000|II:600001-700000', resolution = 2000)
```

- Subsetting interactions to retain those constrained within a single chromosome: 

```{r}
import(cf, focus = 'II', resolution = 2000)
```

- Subsetting interactions to retain those between two chromosomes: 

```{r}
import(cf, focus = 'II|III', resolution = 2000)
```

- Subsetting interactions to retain those between parts of two chromosomes:

```{r}
import(cf, focus = 'II:300001-800000|V:1-500000', resolution = 2000)
```
:::

<!-- 
### `refocus(<HiCExperiment>, focus = "...")`

Bear in mind that every call to `import` creates a `HiCExperiment` from 
scratch. However, in some cases, one may want to **change** the `focus` of a 
pre-existing `HiCExperiment` object, rather than creating a new one. This 
allows to preserve `metadata`, `topologicalFeatures` and `pairsFile` slots of 
the existing `HiCExperiment` object. This can be achieved with the `refocus()` 
function. `refocus` takes a `HiCExperiment` object rather than a `ContactFile`, 
as well as a `focus`. 

```{r}
# --- Note existing `metadata`, `topologicalFeatures` and `pairsFile` fields 
hic 

# --- Note how these fields are erased upon parsing of the original 
#     `cf` contact matrix with a new focus
import(cf, focus = 'III', resolution = 2000)

# --- However, these fields can be preserved using the `refocus` 
#     function on the original `HiCExperiment` object
refocus(hic, 'III')
```
-->

### Subsetting **after** import

It may sometimes be desirable to import a full dataset from
disk first, and **only then** perform in-memory subsetting of the `HiCExperiment` 
object (i.e. `"Parse first, then subset"`). **This is 
for example necessary when the end user aims to investigate subsets of 
interactions across a large number of different areas of a contact matrix.**    

Two strategies are available to subset imported data: 
`subsetByOverlaps` or `[`. 

#### `subsetByOverlaps(<HiCExperiment>, <GRanges>)`

`subsetByOverlaps` can take a `HiCExperiment` as the object to subset and a `GRanges` as the ranges to overlap it with. 
In this case, the `GRanges` is used to extract a subset of a `HiCExperiment` 
**constrained** within a specific genomic location. 

```{r}
telomere <- GRanges("II:700001-813184")
subsetByOverlaps(hic, telomere) |> interactions()
```

::: {.callout-important icon='true'}
## `type` argument

By default, `subsetByOverlaps(hic, telomere)` will only recover interactions 
**constrained** within `telomere`, i.e. interactions for which both ends are 
in `telomere`. 

Alternatively, `type = "any"` can be specified to get all interactions with 
at least one of their anchors within `telomere`. 

```{r}
subsetByOverlaps(hic, telomere, type = "any") |> interactions()
```
:::
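
As a rough illustration of the difference between the two modes, the sketch below counts the interactions retained in each case. It only reuses calls shown in this chapter; the `hic_demo` and `telomere_demo` names are introduced here for illustration, and the example assumes the `yeast_wt` `.mcool` file from `HiContactsData` can be downloaded or is already cached.

```{r}
library(GenomicRanges)
library(HiCExperiment)
library(HiContactsData)

# Same import as the example `hic` object used throughout this chapter
hic_demo <- import(
    CoolFile(HiContactsData('yeast_wt', 'mcool')), 
    focus = 'II:300001-813184', resolution = 2000
)
telomere_demo <- GRanges("II:700001-813184")

# Number of interactions before subsetting, with the default behaviour, 
# and with `type = "any"`
n_all <- length(hic_demo)
n_within <- length(subsetByOverlaps(hic_demo, telomere_demo))
n_any <- length(subsetByOverlaps(hic_demo, telomere_demo, type = "any"))
c(all = n_all, within = n_within, any = n_any)
```

Since interactions with both anchors in `telomere_demo` also have at least one anchor in it, the `"any"` subset should never be smaller than the default one.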

#### `<HiCExperiment>["..."]`

The square bracket operator `[` allows for more advanced textual queries, similarly 
to `focus` arguments that can be used when importing contact matrices in memory. 

Unlike `focus`, which restricts what is read from disk at import time, `[` operates 
on a `HiCExperiment` that has **already been imported in memory**: it only filters 
the interactions currently stored in the object.  

The following string types can be used to subset a `HiCExperiment` object
with the `[` notation:

```{r}
#   "II"                                  --> retain contacts over an entire chromosome
#   "II:300001-800000"                    --> retain on-diagonal contacts within a chromosome
#   "II:300001-400000|II:600001-700000"   --> retain off-diagonal contacts within a chromosome
#   "II|III"                              --> retain contacts between two chromosomes
#   "II:300001-800000|V:1-500000"         --> retain contacts between segments of two chromosomes
#   c("II", "III", "IV")                  --> retain contacts within and between several chromosomes
```

::: {.callout-important collapse="true"}
## More examples for subsetting with `[` 👇

- Subsetting to a specific **on-diagonal** genomic location using a standard UCSC-style coordinate query: 

```{r}
hic["II:800001-813184"]
```

- Subsetting to a specific **off-diagonal** genomic location using a query made of a pair of coordinates: 

```{r}
hic["II:300001-320000|II:800001-813184"]
```

- Subsetting interactions to retain those constrained within a single chromosome: 

```{r}
hic["II"]
```

- Subsetting interactions to retain those between two chromosomes: 

```{r}
hic["II|IV"]
```

- Subsetting interactions to retain those between segments of two chromosomes: 

```{r}
hic["II:300001-320000|IV:1-100000"]
```

- Subsetting interactions to retain those constrained within several chromosomes: 

```{r}
hic[c('II', 'III', 'IV')]
```
:::

::: {.callout-note}
## Note
- This last example (subsetting for a vector of several chromosomes) is the 
only scenario that requires `[`-based in-memory subsetting of pre-imported data, 
as such subsetting is not possible with `focus` 
from disk-stored data. 
- All the other `[` subsetting scenarios illustrated above can be achieved more 
efficiently using the `focus` argument when `import`ing data into a 
`HiCExperiment` object. 
- However, keep in mind that subsetting preserves extra data, e.g. added `scores`, 
`topologicalFeatures`, `metadata` or `pairsFile`, whereas this information is 
lost using `focus` with `import`.
:::
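
The last point of the note above can be checked with a short sketch comparing in-memory subsetting with a fresh import over the same locus. The `hic_demo`, `sub_mem` and `sub_disk` names are introduced here for illustration; every function call is one already used in this chapter.

```{r}
library(GenomicRanges)
library(HiCExperiment)
library(HiContactsData)

cf_demo <- CoolFile(HiContactsData('yeast_wt', 'mcool'))
hic_demo <- import(cf_demo, focus = 'II:300001-813184', resolution = 2000)

# Attach some information to the in-memory object
metadata(hic_demo) <- list(note = "added in memory")

# In-memory subsetting with `[` vs. a new import with `focus`
sub_mem <- hic_demo["II:700001-813184"]
sub_disk <- import(cf_demo, focus = 'II:700001-813184', resolution = 2000)

# Compare the metadata carried by each object
metadata(sub_mem)
metadata(sub_disk)
```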

### Zooming on a `HiCExperiment`

"Zooming" refers to dynamically changing the **resolution** of a `HiCExperiment`. 
By `zoom`ing a `HiCExperiment`, one can refine or coarsen the contact matrix. 
This operation takes the `ContactFile` and `focus` of an existing `HiCExperiment` 
and generates a new `HiCExperiment` with updated `resolution`, 
`interactions` and `scores`. 
Note that `zoom` preserves existing `metadata`, `topologicalFeatures` and `pairsFile` 
information.

```{r}
hic

zoom(hic, 4000)

zoom(hic, 1000)
```

::: {.callout-note}
## Note
The sum of raw counts does not change after `zoom`ing; however, the number of 
individual `interactions` and `regions` does. 

```{r}
length(hic)
length(zoom(hic, 1000))
length(zoom(hic, 4000))
sum(scores(hic, "count"))
sum(scores(zoom(hic, 1000), "count"))
sum(scores(zoom(hic, 4000), "count"))
```
:::
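
The intuition behind the note above can be reproduced with a toy example in base R. This is only a sketch of why total counts are conserved when bins are merged: `coarsen()` is a hypothetical helper, and it does not reflect how `zoom` is implemented (as described above, `zoom` re-generates the object from the `ContactFile`).

```{r}
# Toy interactions at 1000 bp resolution: bin start coordinates and raw counts
toy <- data.frame(
    start1 = c(0, 0, 1000, 1000, 2000, 3000),
    start2 = c(0, 1000, 1000, 3000, 3000, 3000),
    count  = c(10, 4, 12, 2, 5, 9)
)

# Hypothetical helper: re-bin both anchors at a coarser resolution and
# sum the counts of interactions falling in the same pair of bins
coarsen <- function(df, resolution) {
    df$start1 <- floor(df$start1 / resolution) * resolution
    df$start2 <- floor(df$start2 / resolution) * resolution
    aggregate(count ~ start1 + start2, data = df, FUN = sum)
}

toy_2kb <- coarsen(toy, 2000)

# Fewer interactions after coarsening...
nrow(toy)
nrow(toy_2kb)

# ... but the same total number of contacts
sum(toy$count)
sum(toy_2kb$count)
```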

::: {.callout-important}
- `zoom` does not change the `focus`! It only affects the `resolution` 
(and consequently, the `interactions`).
- `zoom` will only work for multi-resolution contact matrices, e.g. `.mcool` or 
`.hic`. 
:::

## Updating an `HiCExperiment` object

::: {.callout-tip}
## TL;DR: Which `HiCExperiment` slots are mutable (✅) / immutable (⛔️)?

- `fileName(hic)`: ⛔️ (obtained from disk-stored file)
- `focus(hic)`: 🤔 (see [subsetting section](#subsetting-a-contact-matrix))
- `resolutions(hic)`: ⛔️ (obtained from disk-stored file)
- `resolution(hic)`: 🤔 (see [zooming section](#zooming-on-a-hicexperiment))
- `interactions(hic)`: ⛔️ (obtained from disk-stored file)
- `scores(hic)`: ✅ 
- `topologicalFeatures(hic)`: ✅
- `pairsFile(hic)`: ✅
- `metadata(hic)`: ✅
:::

### Immutable slots 

A `HiCExperiment` object acts as an interface exposing disk-stored data. 
This implies that the `fileName` slot itself is immutable 
(i.e. **cannot** be changed): a `HiCExperiment` 
*has to* be associated with a disk-stored contact matrix to function properly 
(except in some advanced cases developed in the next chapters).  

For this reason, methods to manually modify 
`interactions` and `resolutions` slots are also **not** exposed in the 
`HiCExperiment` package.  

A corollary of this is that the associated `regions` and `anchors` of an 
`HiCExperiment` should **not** be modified by hand either, since they are 
directly linked to `interactions`. 

### Mutable slots 

That being said, `HiCExperiment` objects are flexible and can be partially modified 
in memory without having to change/overwrite the original, disk-stored contact matrix.  

Several slots can be modified in memory: 
`scores`, `topologicalFeatures`, `pairsFile` and `metadata`.

#### `scores`

We have seen in the previous chapter that scores are stored in a `list` and 
are available using the `scores` function. 

```{r}
scores(hic)

head(scores(hic, "count"))

head(scores(hic, "balanced"))
```

Extra scores can be added to this list (e.g. to describe the "expected" interaction 
frequency for each interaction stored in the `HiCExperiment` object). This can be 
achieved using the `scores()<-` function.

```{r}
scores(hic, "random") <- runif(length(hic))

scores(hic)

head(scores(hic, "random"))
```
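
As a slightly more meaningful example than random values, the sketch below adds a naive distance-dependent "expected" score. This is an illustrative simplification, not the method used by any dedicated package: the expected value is taken as the mean raw count of all interactions stored in the object that span the same genomic distance, and pixels absent from the object are ignored. It assumes `pairdist()` from `InteractionSet` is used to compute the distance between anchors; the `hic_demo` and `expected_naive` names are introduced here for illustration.

```{r}
library(InteractionSet)
library(HiCExperiment)
library(HiContactsData)

hic_demo <- import(
    CoolFile(HiContactsData('yeast_wt', 'mcool')), 
    focus = 'II:300001-813184', resolution = 2000
)

# Genomic distance between the two anchors of each interaction
d <- pairdist(interactions(hic_demo))

# Naive expected score: mean raw count over interactions sharing the same distance
counts <- scores(hic_demo, "count")
scores(hic_demo, "expected_naive") <- ave(counts, d, FUN = mean)

scores(hic_demo)

head(scores(hic_demo, "expected_naive"))
```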

#### `topologicalFeatures`

The end-user can create additional `topologicalFeatures` or modify the existing 
ones using the `topologicalFeatures()<-` function. 

```{r}
topologicalFeatures(hic, 'CTCF') <- GRanges(c(
    "II:340-352", 
    "II:3520-3532", 
    "II:7980-7992", 
    "II:9240-9252" 
))
topologicalFeatures(hic, 'CTCF')

topologicalFeatures(hic, 'loops') <- GInteractions(
    topologicalFeatures(hic, 'CTCF')[rep(1:3, each = 3)],
    topologicalFeatures(hic, 'CTCF')[rep(1:3, 3)]
)
topologicalFeatures(hic, 'loops')

hic
```

::: {.callout-note}
## Note
All these objects can be used in `*Overlaps` methods, as overlap methods 
are defined for both `GRanges` and `GInteractions` objects. 

```{r}
# ---- This counts the number of times `CTCF` anchors are being used in the 
#      `loops` `GInteractions` object
countOverlaps(
    query = topologicalFeatures(hic, 'CTCF'), 
    subject = topologicalFeatures(hic, 'loops')
)
```
:::

#### `pairsFile`

If `pairsFile` is not specified when importing the `ContactFile` into a 
`HiCExperiment` object, one can add it later. 

```{r eval = FALSE}
pairsf <- HiContactsData('yeast_wt', 'pairs.gz')
```

```{r}
pairsFile(hic) <- pairsf
hic
```

#### `metadata`

Metadata associated with a `HiCExperiment` can be updated at any point. 

```{r}
metadata(hic) <- list(
    info = "HiCExperiment created from an example .mcool file from `HiContactsData`", 
    date = date()
)
metadata(hic)
```

## Coercing `HiCExperiment` objects

Convenient coercing functions exist to transform data stored as a `HiCExperiment`
into another class. 

- `as.matrix()`: coerces the `HiCExperiment` into a sparse or dense 
matrix (using the `sparse` logical argument, `TRUE` by default), using specific `scores` 
of interest (set with the `use.scores` argument, `"balanced"` by default). 

```{r}
# ----- `as.matrix` coerces a `HiCExperiment` into a `sparseMatrix` by default 
as.matrix(hic) |> class()

as.matrix(hic) |> dim()

# ----- One can specify which scores should be used when coercing into a matrix
as.matrix(hic, use.scores = "balanced")[1:5, 1:5]

as.matrix(hic, use.scores = "count")[1:5, 1:5]

# ----- If **expressly required**, one can coerce a HiCExperiment into a dense matrix
as.matrix(hic, use.scores = "count", sparse = FALSE)[1:5, 1:5]
```

- `as.data.frame()`: coerces `interactions` into a rectangular data frame

```{r}
as.data.frame(hic) |> head()
```

::: {.callout-warning}
These coercing methods only operate on `interactions` and `scores`, 
and discard all other information, e.g. regarding genomic `regions`, 
available `resolutions`, associated `metadata`, `pairsFile` or 
`topologicalFeatures`. 
:::
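
To get a feel for the difference between the two matrix representations returned by `as.matrix()`, one can compare their dimensions and memory footprint. This is a minimal sketch reusing the calls shown above, with `hic_demo`, `m_sparse` and `m_dense` introduced here for illustration; the actual sizes depend on the locus and resolution imported.

```{r}
library(HiCExperiment)
library(HiContactsData)

hic_demo <- import(
    CoolFile(HiContactsData('yeast_wt', 'mcool')), 
    focus = 'II:300001-813184', resolution = 2000
)

# Same scores, coerced into a sparse (default) and a dense matrix
m_sparse <- as.matrix(hic_demo, use.scores = "count")
m_dense <- as.matrix(hic_demo, use.scores = "count", sparse = FALSE)

class(m_sparse)
class(m_dense)

# Both representations describe the same square of bins
dim(m_sparse)
dim(m_dense)

# Memory used by each representation
object.size(m_sparse)
object.size(m_dense)
```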