data-representation.qmd

---
reference-section-title: References
bibliography: bibliography.bib
---

# Hi-C data structures in R

```{r}
#| echo: false
#| results: "hide"
#| message: false
#| warning: false
source("_common.R")
library(ggplot2)
library(GenomicRanges)
library(InteractionSet)
library(HiCExperiment)
library(HiContactsData)
coolf <- HiContactsData('yeast_wt', 'mcool')
hicf <- HiContactsData('yeast_wt', 'hic')
hicpromatrixf <- HiContactsData('yeast_wt', 'hicpro_matrix')
hicproregionsf <- HiContactsData('yeast_wt', 'hicpro_bed')
pairsf <- HiContactsData('yeast_wt', 'pairs.gz')
yeast_hic <- contacts_yeast(full = TRUE)
pairsFile(yeast_hic) <- pairsf
metadata(yeast_hic) <- list(
    ID = "SRR13994279", 
    org = "S288c", 
    date = "2019"
)
```

::: {.callout-note}
## Aims
This chapter introduces the four main classes offered by `Bioconductor` 
leveraged to perform Hi-C analysis, describes their structure and how to interact with them:  

- `GRanges` (jump to [the section](#granges-class))  
- `GInteractions` (jump to [the section](#ginteractions-class))  
- `ContactFile` (jump to [the section](#contactfile-class))  
- `HiCExperiment` (jump to [the section](#hicexperiment-class))  
:::

::: {.callout-tip}
## TL;DR
Directly jump to the [last section of this chapter](#visual-summary-of-the-hicexperiment-data-structure) 
to get a visual representation of these data structures. 
:::

## `GRanges` class

`GRanges` is a shorthand for `GenomicRanges`, a core class in `Bioconductor`. 
This class is primarily used to describe genomic ranges of any nature, e.g. 
sets of promoters, SNPs, chromatin loop anchors, ....  
The data structure has been published in the seminal 2015 publication by 
the `Bioconductor` team (@Huber2015Feb). 

### `GRanges` fundamentals 

The easiest way to generate a `GRanges` object is to coerce it from a vector of 
genomic coordinates in the UCSC format (e.g. `"chr2:2004-4853"`):

```{r}
library(GenomicRanges)
gr <- GRanges(c(
    "chr2:2004-7853:+", 
    "chr4:4482-9873:-", 
    "chr5:1943-4203:+", 
    "chr5:4103-5004:+"  
))
gr
```

A single `GRanges` object can contain one or several **"ranges"**, or genomic intervals. 
To navigate between these ranges, `GRanges` can be subset using the standard `R` single bracket notation `[`: 

```{r}
gr[1]

gr[1:3]
```

`GenomicRanges` objects aim to provide a natural description of genomic 
intervals (ranges) and are incredibly versatile. They extend the 
`data.frame` object and have four required pieces of information: 

- `seqnames` (i.e. chromosome names) (accessible with `seqnames()`)
- `start` (accessible with `start()`)
- `end` (accessible with `end()`)
- `strand` (accessible with `strand()`)

```{r}
seqnames(gr)

start(gr)

end(gr)

strand(gr)
```

Here is a graphical representation of a `GRanges` object, taken from 
[`Bioconductor` course material](https://www.bioconductor.org/help/course-materials/2015/UseBioconductorFeb2015/): 

![](images/20230306102639.png)

We will now delve into the detailed structure and operability of `GRanges` objects. 

### `GRanges` metadata

An important aspect of `GRanges` objects is that each entry (range) can have 
extra optional metadata. This metadata is stored in a rectangular 
`DataFrame`. Each column can contain a different type of information, 
e.g. a `numerical` vector, a `factor`, a list, ...  

One can directly access this `DataFrame` using the `mcols()` function, 
and individual columns of metadata using the `$` notation: 

```{r}
mcols(gr)
mcols(gr)$GC <- c(0.45, 0.43, 0.44, 0.42)
mcols(gr)$annotation <- factor(c(NA, 'promoter', 'enhancer', 'centromere'))
mcols(gr)$extended.info <- c(
    list(c(NA)), 
    list(c(date = 2023, source = 'manual')), 
    list(c(date = 2021, source = 'manual')), 
    list(c(date = 2019, source = 'homology'))
)
mcols(gr)
```

When metadata columns are defined for a `GRanges` object, they are pasted next 
to the minimal 4 required `GRanges` fields, separated by a `|` character. 

```{r}
gr
```

### Genomic arithmetics on individual `GRanges` objects

A `GRanges` object primarily describes a set of genomic ranges (*it is in the name!*). 
Useful genomic-oriented methods have been implemented to investigate individual 
`GRanges` object from a genomic perspective.   

#### Intra-range methods 

Standard genomic arithmetics are possible with `GRanges`, e.g. 
shifting ranges, resizing, trimming, ... 
These methods are referred to as "intra-range" methods 
as they work *"one-region-at-a-time"*. 

::: {.callout-note}
## Note 
- Each range of the input `GRanges` object is modified independently from the 
other ranges in the following code chunks.
- Intra-range operations are **endomorphisms**: they all take `GRanges` inputs 
and always return `GRanges` objects. 
:::

- Shifting each genomic range in a `GRanges` object by a certain number of bases: 

```{r}
gr

# ----- Shift all genomic ranges towards the "right" (downstream in `+` strand), by 1000bp:
shift(gr, 1000)

# ----- Shift all genomic ranges towards the "left" (upstream in `+` strand), by 1000bp:
shift(gr, -1000)
```

- Narrowing each genomic range in a `GRanges` object by a certain number of bases: 

```{r}
gr

# ----- Extract 21st-40th subrange for each range in `gr`:
narrow(gr, start = 21, end = 40)

width(narrow(gr, start = 21, end = 40))
```

- Resizing each genomic range in a `GRanges` object to a certain number of bases: 

```{r}
gr

# ----- Resize `gr` entries to 100, fixed at the start of each range:
resize(gr, 100, fix = "start")

# ----- Resize `gr` entries to 100, fixed at the start of each range, disregarding strand information:
resize(gr, 100, fix = "start", ignore.strand = TRUE)

# ----- Resize `gr` entries to 1 bp, fixed at the center of each range:
resize(gr, 1, fix = "center")
```

- Extracting flanking coordinates for each entry in `gr`:

```{r}
gr

# ----- Extract 100bp UPSTREAM of each genomic range, according to range strandness:
flank(gr, 100, start = TRUE)

# ----- Extract 1bp DOWNSTREAM of each genomic range, according to range strandness:
flank(gr, 1, start = FALSE)
```

Note how here again, strand information is crucial and correctly leveraged 
to extract "upstream" or "downstream" flanking regions in agreement with 
genomic range orientation.

- Several arithmetics operators can also directly work with `GRanges`: 

```{r}
gr

gr + 100 # ----- Extend each side of the `GRanges` by a given number of bases

gr - 200 # ----- Shrink each side of the `GRanges` by a given number of bases 

gr * 1000 # ----- Zoom in by a given factor (effectively decreasing the `GRanges` width by the same factor)
```

::: {.callout-warning}
## Going further
To fully grasp how to operate `GRanges` objects, we highly recommend reading the 
detailed documentation for this class by typing `?GenomicRanges` and 
`` ?GenomicRanges::`intra-range-methods` ``.
:::

#### Inter-range methods {#inter-range}

Compared to "intra-range" methods described above, **inter-range** methods 
involve comparisons *between* ranges **in a single** GRanges object.

::: {.callout-note}
## Note 
Compared to previous section, the result of each function described below 
depends on the entire set of ranges in the input `GRanges` object.
:::

- Computing the "inverse" genomic ranges, i.e. ranges in-between the input ranges:

```{r}
gaps(gr)
```

- For each entry in a `GRanges`, finding the index of the preceding/following/nearest genomic range:

```{r}
precede(gr)

follow(gr)

nearest(gr)
```

- Computing a coverage over a genome, optionally indicated a "score" column from metadata: 

```{r}
coverage(gr, weight = 'GC')
```

::: {.callout-warning}
## Going further
To fully grasp how to operate `GRanges` objects, we highly recommend reading the 
detailed documentation for this class by typing `` ?GenomicRanges::`inter-range-methods` ``. 
:::

### Comparing multiple `GRanges` objects

Genomic analysis typically requires intersection of two sets of genomic ranges, 
e.g. to find which ranges from one set overlap with those from another set.  

In the next examples, we will use two `GRanges`: 

- `peaks` represents dummy 8 ChIP-seq peaks 

```{r}
peaks <- GRanges(c(
    'chr1:320-418',
    'chr1:512-567',
    'chr1:843-892',
    'chr1:1221-1317', 
    'chr1:1329-1372', 
    'chr1:1852-1909', 
    'chr1:2489-2532', 
    'chr1:2746-2790'
))
peaks
```

- `TSSs` represents dummy 3 gene promoters (± 10bp around the TSS)

```{r}
genes <- GRanges(c(
    'chr1:358-1292:+',
    'chr1:1324-2343:+', 
    'chr1:2732-2751:+'
))
TSSs <- resize(genes, width = 1, fix = 'start') + 10
TSSs
```

Let's see how they overlap by plotting them: 

```{r}
library(ggplot2)
peaks$type <- 'peaks'
TSSs$type <- 'TSSs'
ggplot() + 
    ggbio::geom_rect(c(peaks, TSSs), aes(fill = type), facets = type~.) + 
    ggbio::theme_alignment() + 
    coord_fixed(ratio = 300)
```

#### Finding overlaps between two `GRanges` sets

- Finding overlaps between a query and a subject

In our case, we want to identify which ChIP-seq peaks overlap with a TSS: 
the query is the set of peaks and the subject is the set of TSSs. 

`findOverlaps` returns a `Hits` object listing which `query` ranges overlap with which `subject` ranges. 

```{r}
ov <- findOverlaps(query = peaks, subject = TSSs)
ov
```

The `Hits` output clearly describes what overlaps with what: 

- The query (peak) `#1` overlaps with subject (TSS) `#1`
- The query (peak) `#5` overlaps with subject (TSS) `#2`

::: {.callout-note}
## Note 
Because no other query index or subject index is listed in the `ov` output, 
none of the remaining ranges from `query` overlap with ranges from `subject`.
:::

- Subsetting by overlaps between a query and a subject

To directly **subset** ranges from `query` overlapping with ranges from a 
`subject` (e.g. to only keep *peaks* overlapping a *TSS*), we can use the 
`subsetByOverlaps` function. 

```{r}
subsetByOverlaps(peaks, TSSs)
```

::: {.callout-note}
## Note 
The output of `subsetByOverlaps` is a subset of the original `GRanges` object
provided as a `query`, with retained ranges being unmodified. 
:::

- Counting overlaps between a query and a subject

Finally, the `countOverlaps` is used to count, for each range in a `query`, how 
many ranges in the `subject` it overlaps with. 

```{r}
countOverlaps(query = peaks, subject = TSSs)
```

::: {.callout-note}
## Note 
Note that which `GRanges` goes in `query` or `subject` is crucial! Counting 
**for each peak, the number of TSSs it overlaps with** is very different from 
**for each TSS, how many peaks it overlaps with**. 

In our case example, it would also be informative to count how many peaks overlap 
with each TSS, so we'd need to swap `query` and `subject`: 

```{r}
countOverlaps(query = TSSs, subject = peaks)
```

We can add these counts to the original `query` object: 

```{r}
TSSs$n_peaks <- countOverlaps(query = TSSs, subject = peaks)
TSSs
```
:::

- `%over%`, `%within%`, `%outside%` : handy operators

Handy operators exist that return logical vectors (same length as the `query`). 
They essentially are short-hands for specific `findOverlaps()` cases. 

`<query> %over% <subject>`: 

```{r}
peaks %over% TSSs

peaks[peaks %over% TSSs] # ----- Equivalent to `subsetByOverlaps(peaks, TSSs)`
```

`<query> %within% <subject>`: 

```{r}
peaks %within% TSSs

TSSs %within% peaks
```

`<query> %outside% <subject>`: 

```{r}
peaks %outside% TSSs
```

::: {.callout-warning}
## Going further
To fully grasp how to find overlaps between `GRanges` objects, we highly recommend reading the 
detailed documentation by typing `` ?IRanges::`findOverlaps-methods` ``.
:::

#### Find nearest range from a subject for each range in a query

`*Overlaps` methods are not always enough to match a `query` to a `subject`. 
For instance, some peaks in the `query` might be very near to some TSSs in 
the `subject`, but not quite overlapping. 

```{r}
peaks[8]

TSSs[3]
```

- `nearest()`

Rather than finding the *overlapping* range in a `subject` for each range in a `query`, 
we can find the `nearest` range. 

For each range in the `query`, this returns the *index* of the range in the `subject` 
to which the `query` is the nearest. 

```{r}
nearest(peaks, TSSs)

TSSs[nearest(peaks, TSSs)]
```

- `distance()`

Alternatively, one can simply ask to calculate the `distanceToNearest` between 
ranges in a `query` and ranges in a `subject`. 

```{r}
distanceToNearest(peaks, TSSs)

peaks$distance_to_nearest_TSS <- mcols(distanceToNearest(peaks, TSSs))$distance
```

Note how close from a TSS the 8th peak was. It could be worth considering this 
as an overlap!

## `GInteractions` class

`GRanges` describe genomic ranges and hence are of general use to study 1D genome organization. 
To study chromatin interactions, we need a way to link pairs of `GRanges`. This is exactly what 
the `GInteractions` class does. This data structure is defined in the 
`InteractionSet` package and has been published in the 
2016 paper by `Lun et al.` (@Lun2016May). 

![](images/20230309114047.png)

### Building a `GInteractions` object from scratch

Let's first define two parallel `GRanges` objects (i.e. two `GRanges` of same length). 
Each `GRanges` will contain 5 ranges. 

```{r}
gr_first <- GRanges(c(
    'chr1:1-100', 
    'chr1:1001-2000', 
    'chr1:5001-6000', 
    'chr1:8001-9000', 
    'chr1:7001-8000'  
))
gr_second <- GRanges(c(
    'chr1:1-100', 
    'chr1:3001-4000', 
    'chr1:8001-9000', 
    'chr1:7001-8000', 
    'chr2:13000-14000'  
))
```

Because these two `GRanges` objects are of same length (5), one can "bind" them 
together by using the `GInteractions`function. This effectively associate each entry from 
one `GRanges` to the entry aligned in the other `GRanges` object. 

```{r}
library(InteractionSet)
gi <- GInteractions(gr_first, gr_second)
gi
```

The way `GInteractions` objects are printed in an R console mimics that of 
`GRanges`, but pairs two "ends" (a.k.a. *anchors*) of an interaction together, 
each end being represented as a separate `GRanges` range. 

::: {.callout-note}
## Notes
- Note that it is possible to have interactions joining two identical anchors. 

```{r}
gi[1]
```

- It is also technically possible (though not advised) to have interactions for which the "first" end is located after the "second" end along the chromosome. 

```{r}
gi[4]
```

- Finally, it is possible to define inter-chromosomal interactions (a.k.a. trans interactions).

```{r}
gi[5]
```
:::

### `GInteractions` specific slots

Compared to `GRanges`, extra slots are available for `GInteractions` objects, e.g.
`anchors` and `regions`. 

#### Anchors

"Anchors" of a single genomic interaction refer to the two ends of this interaction. 
These anchors can be extracted from a `GInteractions` object using the `anchors()` function. 
This outputs a list of two `GRanges`, the first corresponding to the "left" end of interactions 
(when printed to the console) and the second corresponding to the "right" end of interactions 
(when printed to the console). 

```{r}
# ----- This extracts the two sets of anchors ("first" and "second") from a GInteractions object
anchors(gi)

# ----- We can query for the "first" or "second" set of anchors directly
anchors(gi, "first")

anchors(gi, "second")
```

#### Regions

"Regions" of a *set* of interactions refer to the universe of unique 
anchors represented in a set of interactions. Therefore, the length 
of the `regions` can only be equal to or strictly lower than twice the 
length of `anchors`.  

The `regions` function returns the regions associated with a `GInteractions` 
object, stored as a `GRanges` object. 

```{r}
regions(gi)

length(regions(gi))

length(anchors(gi, "first"))
```

### `GInteractions` methods

`GInteractions` behave as an extension of `GRanges`. For this reason, many methods 
that work with `GRanges` will work seamlessly with `GInteractions`. 

#### Metadata 

One can add metadata columns directly to a `GInteractions` object. 

```{r}
mcols(gi)
mcols(gi) <- data.frame(
    idx = seq(1, length(gi)),
    type = c("cis", "cis", "cis", "trans", "cis")
)
gi

gi$type
```

Importantly, metadata columns can also be directly added to **regions** of a 
`GInteractions` object, since these `regions` are a `GRanges` object themselves! 

```{r}
regions(gi)
regions(gi)$binID <- seq_along(regions(gi))
regions(gi)$type <- c("P", "P", "P", "E", "E", "P", "P")
regions(gi)
```

#### Sorting `GInteractions` 

The `sort` function works seamlessly with `GInteractions` objects. It sorts the 
interactions using a similar approach to that performed by `pairtools sort ...` for disk-stored `.pairs`
files, sorting on the "first" anchor first, then for interactions with the same 
"first" anchors, sorting on the "second" anchor. 

```{r}
gi

sort(gi)
```

#### Swapping `GInteractions` anchors

For an individual interaction contained in a `GInteractions` object, the "first" and 
"second" anchors themselves can be sorted as well. This is called "pairs swapping", 
and it is performed similarly to `pairtools flip ...` for disk-stored `.pairs`
files. This ensures that interactions, when represented as a contact matrix, 
generate an upper-triangular matrix. 

```{r}
gi

swapAnchors(gi)
```

::: {.callout-warning}
## Note
"Sorting" and "swapping" a `GInteractions` object are two 
**entirely different actions**: 

- "sorting" reorganizes all *rows* (interactions);
- "swapping" *anchors* reorganizes *"first" and "second" anchors* for each interaction independently.
:::

#### `GInteractions` distance method

"Distance", when applied to genomic interactions, typically refers to the 
genomic distance between the two anchors of a single interaction. For 
`GInteractions`, this is computed using the `pairdist` function.

```{r}
gi

pairdist(gi)
```

Note that for "trans" inter-chromosomal interactions, i.e. interactions with anchors on 
different chromosomes, the notion of genomic distance is meaningless and for 
this reason, `pairdist` returns a `NA` value. 

::: {.callout-tip}
## Advanced `pairdist` arguments
The `type` argument can be tweaked to specify which type of "distance" should 
be computed:  

- `mid`: The distance between the midpoints of the two regions
          (rounded down to the nearest integer) is returned (Default).
- `gap`: The length of the gap between the closest points of the
          two regions is computed - negative lengths are returned for
          overlapping regions, indicating the length of the overlap.
- `span`: The distance between the furthermost points of the two
          regions is computed.
- `diag`: The difference between the anchor indices is returned.
          This corresponds to a diagonal on the interaction space when
          bins are used in the 'regions' slot of 'x'.
:::

#### `GInteractions` overlap methods

"Overlaps" for genomic interactions could be computed in different contexts: 

- Case 1: Overlap between any of the two anchors of an interaction with a genomic range
- Case 2: Overlap between anchors of an interaction with anchors of another interaction
- Case 3: Spanning of the interaction "across" a genomic range

> Case 1: Overlap between any of the two anchors of an interaction with a genomic range

This is the default behavior of `findOverlaps` when providing a `GInteractions`
object as `query` and a `GRanges` as a `subject`. 

```{r}
gr <- GRanges(c("chr1:7501-7600", "chr1:8501-8600"))
findOverlaps(query = gi, subject = gr)

countOverlaps(gi, gr)

subsetByOverlaps(gi, gr)

```

Here again, the order matters! 

```{r}
countOverlaps(gr, gi)
```

And again, the `%over%` operator can be used here: 

```{r}
gi %over% gr

gi[gi %over% gr] # ----- Equivalent to `subsetByOverlaps(gi, gr)`
```

> Case 2: Overlap between anchors of an interaction with anchors of another interaction

This slightly different scenario involves overlapping two sets of interactions, 
to see whether any interaction in `Set-1` has its two anchors overlapping 
anchors from an interaction in `Set-2`. 

```{r}
gi2 <- GInteractions(
    GRanges("chr1:1081-1090"), 
    GRanges("chr1:3401-3501")
)
gi %over% gi2
```

Note that both anchors of an interaction from a `query` have to overlap to a 
pair of anchors of a single interaction from a `subject` with this method!

```{r}
gi3 <- GInteractions(
    GRanges("chr1:1-1000"), 
    GRanges("chr1:3401-3501")
)
gi %over% gi3
```

> Case 3 : Spanning of the interaction "accross" a genomic range

This requires a bit of wrangling, to mimic an overlap between two `GRanges` objects: 

```{r}
gi <- swapAnchors(gi) # ----- Make sure anchors are correctly sorted
gi <- sort(gi) # ----- Make sure interactions are correctly sorted
gi <- gi[!is.na(pairdist(gi))] # ----- Remove inter-chromosomal interactions
spanning_gi <- GRanges(
    seqnames = seqnames(anchors(gi)[[1]]), 
    ranges = IRanges(
        start(anchors(gi)[[1]]), 
        end(anchors(gi)[[2]])
    )
)
spanning_gi 

spanning_gi %over% gr
```

::: {.callout-warning}
## Going further
A detailed manual of overlap methods available for `GInteractions` object can 
be read by typing `` ?`Interaction-overlaps` `` in R.
:::

## `ContactFile` class

Hi-C contacts can be stored in four different formats 
(see [previous chapter](principles.qmd#binned-contact-matrix-files)): 

- As a `.(m)cool` matrix (multi-scores, multi-resolution, indexed)
- As a `.hic` matrix (multi-scores, multi-resolution, indexed)
- As a HiC-pro derived matrix (single-score, single-resolution, non-indexed)
- Unbinned, Hi-C contacts can be stored in `.pairs` files

### Accessing example Hi-C files 

Example contact files can be downloaded using `HiContactsData` function. 

```{r eval = FALSE}
library(HiContactsData)
coolf <- HiContactsData('yeast_wt', 'mcool')
```

This fetches files from the cloud, download them locally and returns the path of 
the local file. 

```{r}
coolf
```

Similarly, example files are available for other file formats: 

```{r eval = FALSE}
hicf <- HiContactsData('yeast_wt', 'hic')
hicpromatrixf <- HiContactsData('yeast_wt', 'hicpro_matrix')
hicproregionsf <- HiContactsData('yeast_wt', 'hicpro_bed')
pairsf <- HiContactsData('yeast_wt', 'pairs.gz')
```

We can even check the content of some of these files to make sure they are 
actually what they are: 

```{r}
# ---- HiC-Pro generates a tab-separated `regions.bed` file
readLines(hicproregionsf, 25)

# ---- Pairs are also tab-separated 
readLines(pairsf, 25)
```

### `ContactFile` fundamentals

A `ContactFile` object **establishes a connection with a disk-stored Hi-C file** 
(e.g. a `.cool` file, or a `.pairs` file, ...). `ContactFile` classes are 
defined in the `HiCExperiment` package.

`ContactFile`s come in four different flavors:  

- `CoolFile`: connection to a `.(m)cool` file
- `HicFile`: connection to a `.hic` file
- `HicproFile`: connection to output files generated by HiC-Pro 
- `PairsFile`: connection to a `.pairs` file

To create each flavor of `ContactFile`, one can use the corresponding function: 

```{r}
library(HiCExperiment)

# ----- This creates a connection to a `.(m)cool` file (path stored in `coolf`)
CoolFile(coolf)

# ----- This creates a connection to a `.hic` file (path stored in `hicf`)
HicFile(hicf)

# ----- This creates a connection to output files from HiC-Pro
HicproFile(hicpromatrixf, hicproregionsf)

# ----- This creates a connection to a pairs file
PairsFile(pairsf)
```

### `ContactFile` slots

Several "slots" (i.e. pieces of information) are attached to a `ContactFile` object: 

- The path to the disk-stored contact matrix;
- The active resolution (by default, the finest resolution available in a multi-resolution contact matrix);
- Optionally, the path to a matching `pairs` file (see below);
- Some metadata.

Slots of a `CoolFile` object can be accessed as follow: 

```{r slots}
cf <- CoolFile(coolf)
cf

resolution(cf)

pairsFile(cf)

metadata(cf)
```

::: {.callout-warning}
## Important!
`ContactFile` objects are only **connections** to a disk-stored HiC file. 
Although metadata is available, they do not contain actual data!
:::

### `ContactFile` methods 

Two useful methods are available for `ContactFile`s: 

- `availableResolutions` checks which resolutions are available in a `ContactFile`.

```{r}
availableResolutions(cf)
```

- `availableChromosomes` checks which chromosomes are available in a `ContactFile`, along with their length.

```{r}
availableChromosomes(cf)
```

## `HiCExperiment` class

Based on the previous sections, we have different Bioconductor classes 
relevant for Hi-C: 

- `GInteractions` which can be used to represent genomic interactions in R
- `ContactFile`s which can be used to establish a connection with disk-stored Hi-C files

`HiCExperiment` objects are created when *parsing* a `ContactFile` in R. 
The `HiCExperiment` class reads a `ContactFile` in memory and store genomic 
interactions as `GInteractions`. The `HiCExperiment` class is, quite obviously, 
defined in the `HiCExperiment` package.

### Creating a `HiCExperiment` object

#### Importing a `ContactFile` 

In practice, to create a `HiCExperiment` object from a `ContactFile`, 
one can use the `import` method. 

::: {.callout-important}
## Caution
- Creating a `HiCExperiment` object means ***importing data from a Hi-C matrix*** (e.g. 
from a `ContactFile`) in memory in R.  
- Creating a `HiCExperiment` object from large disk-stored contact matrices 
can potentially take a long time. 
:::

```{r}
cf <- CoolFile(coolf)
hic <- import(cf)
hic
```

Printing a `HiCExperiment` to the console will not reveal the actual data 
stored in the object (it would most likely crash your R session!). 
Instead, it gives a **summary of the data** stored in the object: 

- The `fileName`, i.e. the path to the disk-stored data file 
- The `focus`, i.e. the genomic location for which data has been imported (in the example above, `"whole genome"` implies that all the data has been imported in R)
- `resolutions` available in the disk-stored data file (this will be identical to `availableResolutions(cf)`)
- `active resolution` indicates at which resolution the data is currently imported
- `interactions` refers to the actual `GInteractions` imported in R and "hidden" (for now!) in the `HiCExperiment` object
- `scores` refer to different interaction frequency estimates. These can be raw
`count`s, `balanced` (if the contact matrix has been previously normalized), or 
whatever score the end-user want to attribute to each interaction 
(e.g. ratio of counts between two Hi-C maps, ...)
- `topologicalFeatures` is a `list` of `GRanges` or `GInteractions` objects
to describe important topological features. 
- `pairsFile` is a pointer to an optional disk-stored `.pairs` file from which 
the contact matrix has been created. This is often useful to estimate some Hi-C metrics. 
- `metadata` is a `list` to further describe the experiment. 

::: {.callout-tip}
## `HiCExperiment` slots
These pieces of information are called `slots`. They can be directly accessed 
using `getter` functions, bearing the same name than the slot. 

```{r}
fileName(hic)

focus(hic)

resolutions(hic)

resolution(hic)

interactions(hic)

scores(hic)

topologicalFeatures(hic)

pairsFile(hic)

metadata(hic)
```
:::

::: {.callout-note}
## Notes

`import` also works for other types of `ContactFile` (`HicFile`, 
`HicproFile`, `PairsFile`), e.g. 

- For `HicFile` and `HicproFile`, `import` seamlessly returns a `HiCExperiment` as well: 

```{r}
hf <- HicFile(hicf)
hic <- import(hf)
hic
```

- For `PairsFile`, the returned object is a representation of Hi-C "pairs" in 
R, i.e. `GInteractions`

```{r}
pf <- PairsFile(pairsf)
pairs <- import(pf)
pairs
```
:::

#### Customizing the `import`

To reduce the `import` to only parse the data that is relevant to the study,
two arguments can be passed to `import`, along with a `ContactFile`. 

::: {.callout-tip}
## Key `import` arguments:
- `focus`: This can be used to **only parse data for a specific genomic location**.
- `resolution`: This can be used to choose which resolution to parse the contact matrix at (this is ignored if the `ContactFile` is not multi-resolution, e.g. `.cool` or HiC-Pro generated matrices)
:::

- Import interactions within a single chromosome:

```{r}
hic <- import(cf, focus = 'II', resolution = 2000)

regions(hic) # ---- `regions()` work on `HiCExperiment` the same way than on `GInteractions`

table(seqnames(regions(hic)))

anchors(hic) # ---- `anchors()` work on `HiCExperiment` the same way than on `GInteractions`
```

- Import interactions within a segment of a chromosome:

```{r}
hic <- import(cf, focus = 'II:40000-60000', resolution = 1000)

regions(hic) 

anchors(hic)
```

- Import interactions between two chromosomes:

```{r}
hic2 <- import(cf, focus = 'II|XV', resolution = 4000)

regions(hic2)

anchors(hic2)
```

- Import interactions between segments of two chromosomes:

```{r}
hic3 <- import(cf, focus = 'III:10000-40000|XV:10000-40000', resolution = 2000)

regions(hic3)

anchors(hic3)
```

### Interacting with `HiCExperiment` data

- An `HiCExperiment` object allows parsing of a disk-stored contact matrix.
- An `HiCExperiment` object operates by wrapping together (1) a `ContactFile` 
(i.e. a connection to a disk-stored data file) and (2) a `GInteractions` 
generated by parsing the data file.  

We will use the `yeast_hic` `HiCExperiment` object to demonstrate how to 
parse information from a `HiCExperiment` object. 

```{r eval = FALSE}
yeast_hic <- contacts_yeast()
```

```{r}
yeast_hic
```

#### Interactions

The imported genomic interactions can be directly **exposed** using the 
`interactions` function and are returned as a `GInteractions` object. 

```{r}
interactions(yeast_hic)
```

::: {.callout-note}
## Note 
Because genomic interactions are actually stored as `GInteractions`, `regions` 
and `anchors` work on `HiCExperiment` objects just as they work with 
`GInteractions`!
:::

```{r}
regions(yeast_hic)

anchors(yeast_hic)
```

#### Bins and seqinfo

Additional useful information can be recovered from a `HiCExperiment` object. 
This includes: 

- The `seqinfo` of the `HiCExperiment`: 

```{r}
seqinfo(yeast_hic)
```

This lists the different chromosomes available to parse along with their length. 

- The `bins` of the `HiCExperiment`: 

```{r}
bins(yeast_hic)
```

::: {.callout-warning}
## Difference between `bins` and `regions`
`bins` are **not** equivalent to `regions` of an `HiCExperiment`. 

- `bins` refer to all the **possible** `regions` of a `HiCExperiment`. For instance, 
for a `HiCExperiment` with a total genome size of `1,000,000` and a resolution of 
`2000`, `bins` will always return a `GRanges` object with `500` ranges. 
- `regions`, on the opposite, refer to the union of `anchors` of all the 
`interactions` imported in a `HiCExperiment` object. 

Thus, all the `regions` will necessarily be a subset of the `HiCExperiment` `bins`, 
or equal to `bins` if no focus has been specified when importing a `ContactFile`. 
:::

#### Scores

Of course, what the end-user would be looking for is the **frequency** for each 
genomic interaction. Such frequency scores are available using the `scores` function. 
`scores` returns a list with a number of different types of scores. 

```{r}
head(scores(yeast_hic))

head(scores(yeast_hic, "count"))

head(scores(yeast_hic, "balanced"))
```

::: {.callout-tip}
## Tip
Calling `interactions(hic)` returns a `GInteractions` with `scores` already 
stored in extra columns. This short-hand allows one to dynamically check `scores` 
directly from the `interactions` output.

```{r}
interactions(yeast_hic)

head(interactions(yeast_hic)$count)
```
:::

#### topologicalFeatures

In Hi-C studies, "topological features" refer to genomic structures identified 
(usually from a Hi-C map, but not necessarily). For instance, one may want to 
study known structural loops anchored at CTCF sites, or interactions around 
or over centromeres, or simply specific genomic "viewpoints". 

`HiCExperiment` objects can store `topologicalFeatures` to facilitate this analysis. 
By default, four empty `topologicalFeatures` are stored in a list: 

- `compartments`
- `borders`
- `loops`
- `viewpoints`

Additional `topologicalFeatures` can be added to this list (read 
[next chapter](parsing.html) for more detail). 

```{r}
topologicalFeatures(yeast_hic)

topologicalFeatures(yeast_hic, 'centromeres')
```

#### pairsFile

As a contact matrix is typically obtained from binning a `.pairs` file, 
it is often the case that the matching `.pairs` file is available to then end-user. 
A `PairsFile` can thus be created and associated to the corresponding 
`HiCExperiment` object. This allows more accurate estimation of contact 
distribution, e.g. when calculating distance-dependent genomic interaction 
frequency. 

```{r}
pairsFile(yeast_hic)

readLines(pairsFile(yeast_hic), 25)
```

::: {.callout-important}
## Importing a `PairsFile`
The `.pairs` file linked to a `HiCExperiment` object can itself be imported in 
a `GInteractions` object: 

```{r}
import(pairsFile(yeast_hic), format = 'pairs')
```

Note that these `GInteractions` are **not** binned, contrary to `interactions` 
extracted from a `HiCExperiment`. Anchors of the interactions listed in the 
`GInteractions` imported from a disk-stored `.pairs` file are all of width `1`. 
:::

## Visual summary of the `HiCExperiment` data structure

The `HiCExperiment` data structure provided by the `HiCExperiment` package 
inherits methods from core `GInteractions` and `BiocFile` classes to provide 
a flexible representation of Hi-C data in `R`. It allows random access-based
queries to seamlessly import parts or all the data contained in 
disk-stored Hi-C contact matrices in a variety of formats. 

![](images/20230309114202.png)