orphan:

Version: 2

This schema describes a compressed sparse row (CSR) storage scheme for a symmetric matrix with genomic dimension/axis annotations.

Notes:

  • Any number of additional optional data columns can be added to each table.
  • Genomic coordinates are assumed to be 0-based and intervals half-open (1-based ends).

Cooler

We refer to the data representation of a single contact matrix as a "Cooler".

Following the convention of the odo package, we identify a Cooler using a Cooler URI string, which separates the path to the container file from the data path within the container with a double colon (::):

/path/to/container.cool::/path/to/cooler/group
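As a minimal sketch of how such a URI could be taken apart, the snippet below splits on the first double colon. The helper name parse_cooler_uri is hypothetical, and treating a URI without a double colon as pointing at the root group "/" is an assumption made here for illustration, not something this schema states.

```python
def parse_cooler_uri(uri):
    """Split a Cooler URI into (container_path, group_path).

    Hypothetical helper for illustration only.
    Assumption: a URI with no '::' refers to the root group '/'.
    """
    container_path, sep, group_path = uri.partition("::")
    if not sep or not group_path:
        # No separator (or nothing after it): fall back to the root group.
        group_path = "/"
    elif not group_path.startswith("/"):
        # Normalize to an absolute path inside the container.
        group_path = "/" + group_path
    return container_path, group_path


print(parse_cooler_uri("/path/to/container.cool::/path/to/cooler/group"))
```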

Contact matrix

The tables and indexes can be represented in the Datashape layout language:

{
  chroms: {
    name:     typevar['Nchroms'] * string['ascii'],
    length:   typevar['Nchroms'] * int32,
  },
  bins: {
    chrom:    typevar['Nbins'] * categorical[typevar['name'], type=string, ordered=True],
    start:    typevar['Nbins'] * int32,
    end:      typevar['Nbins'] * int32,
    weight:   typevar['Nbins'] * float64
  },
  pixels: {
    bin1_id:  typevar['Nnz'] * int64,
    bin2_id:  typevar['Nnz'] * int64,
    count:    typevar['Nnz'] * int32
  },
  indexes: {
    chrom_offset:  (typevar['Nchroms'] + 1) * int64,
    bin1_offset:   (typevar['Nbins'] + 1) * int64
  }
}

Notes:

  • Given the bin1_offset index, the bin1_id column is redundant, but we keep it for convenience because it is highly compressible. It may be dropped in future versions.
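To illustrate why bin1_id carries no extra information, the following sketch expands a bin1_offset array back into one bin1 ID per pixel row. The function name and the toy offsets are made up for this example; plain Python lists stand in for the stored datasets.

```python
def bin1_id_from_offsets(bin1_offset):
    """Expand a CSR row-pointer array into one bin1 ID per pixel row."""
    bin1_id = []
    for row in range(len(bin1_offset) - 1):
        # Pixels of this row occupy [bin1_offset[row], bin1_offset[row + 1]).
        n_pixels = bin1_offset[row + 1] - bin1_offset[row]
        bin1_id.extend([row] * n_pixels)
    return bin1_id


# Toy example: 4 bins (Nbins + 1 = 5 offsets) and 5 pixels.
toy_bin1_offset = [0, 2, 2, 3, 5]
toy_bin1_id = bin1_id_from_offsets(toy_bin1_offset)
print(toy_bin1_id)
```

Bin 1 has no pixels in this toy data, so its start and end offsets coincide and it contributes nothing to the expanded column.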

Metadata

Essential key-value properties are stored as root-level HDF5 attributes. A specific bucket called metadata is reserved for arbitrary JSON-compatible user metadata.

nchroms         : <int> Number of rows in chroms table
nbins           : <int> Number of rows in bins table
nnz             : <int> Number of rows in pixels table
bin-type        : {"fixed" or "variable"}
bin-size        : <int or null> Size of bins in base pairs if bin-type is "fixed"
genome-assembly : <string> Name of genome assembly
generated-by    : <string> Agent that created the file (e.g. 'cooler-x.y.z')
creation-date   : <datetime> Date the file was built
format-version  : <string> The version of the format used
format-url      : <url> URL to page providing format details
metadata        : <json> custom user metadata about the experiment
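The sketch below assembles these properties as a plain Python dictionary, which could then be written as attributes by whatever HDF5 library is in use. Everything specific here is an assumption for illustration: the helper name make_root_attrs, the ISO 8601 date encoding, storing the user metadata as a JSON string, the "2" version string, and the placeholder values for generated-by and format-url.

```python
import json
from datetime import datetime


def make_root_attrs(nchroms, nbins, nnz, bin_size, assembly, user_metadata):
    """Build the root-level key-value properties as a dict (hypothetical helper)."""
    return {
        "nchroms": nchroms,
        "nbins": nbins,
        "nnz": nnz,
        # bin-size is only meaningful for fixed-size bins; None stands in for null.
        "bin-type": "fixed" if bin_size is not None else "variable",
        "bin-size": bin_size,
        "genome-assembly": assembly,
        "generated-by": "example-tool-0.0.0",         # placeholder agent name
        "creation-date": datetime.now().isoformat(),  # assumed encoding
        "format-version": "2",                        # assumed string form
        "format-url": "https://example.org/format",   # placeholder URL
        # Assumption: user metadata is serialized to a JSON string.
        "metadata": json.dumps(user_metadata),
    }


attrs = make_root_attrs(2, 5, 5, 1000, "hg19", {"cell-type": "example"})
print(attrs["bin-type"], attrs["metadata"])
```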

Indexes

Indexes are stored as 1D datasets in a separate group. The current indexes can be thought of as run-length encodings of the bins/chrom and pixels/bin1_id columns, respectively.

  • chrom_offset : indicates the row in the bins table at which each chromosome first appears.
  • bin1_offset : indicates the row in the pixels table at which each bin1 ID first appears. This is often called indptr in CSR data structures.
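Both indexes can be derived the same way: count how many rows carry each ID, then take a cumulative sum. The sketch below does this with plain Python lists and then uses bin1_offset to slice out one matrix row. It assumes the ID column is sorted in ascending order and that chromosome names have already been mapped to integer codes; the function names and toy data are invented for illustration.

```python
def offsets_from_sorted_ids(ids, n):
    """Return n + 1 offsets such that rows of ID i span [off[i], off[i + 1])."""
    offsets = [0] * (n + 1)
    for i in ids:
        offsets[i + 1] += 1          # count rows per ID
    for i in range(n):
        offsets[i + 1] += offsets[i]  # cumulative sum turns counts into offsets
    return offsets


def pixels_for_bin(bin1, bin1_offset, bin2_id, count):
    """Return (bin2_id, count) pairs stored for a given bin1 ID."""
    lo, hi = bin1_offset[bin1], bin1_offset[bin1 + 1]
    return list(zip(bin2_id[lo:hi], count[lo:hi]))


# Toy data: 2 chromosomes, 5 bins, 5 pixels.
bins_chrom = [0, 0, 0, 1, 1]   # integer chromosome code per bin
bin1_id = [0, 0, 2, 3, 3]
bin2_id = [0, 2, 3, 3, 4]
count = [5, 1, 2, 7, 3]

chrom_offset = offsets_from_sorted_ids(bins_chrom, 2)
bin1_offset = offsets_from_sorted_ids(bin1_id, 5)
print(chrom_offset)
print(bin1_offset)
print(pixels_for_bin(3, bin1_offset, bin2_id, count))
```

Looking up a row this way reads only the slice of the pixels table between two consecutive offsets, which is the usual reason for keeping an indptr-style index alongside the data.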