Skip to content

Relational algebra Genome Intervals

Stuart Lee edited this page Aug 31, 2017 · 1 revision

Relational algebra for Genome Intervals

Notes on Formalization of Genome Interval Relations by Chris J. Mungall

In order to get precise answers to biological questions, precise definitions of the relations between biological entities are required. These can be specified using First-Order Logic (FOL) axioms. The benefit of this system is that relations can be inferred and established by a theorem solver/program. From an API perspective they are useful for thinking about what functions should act on genomic data structures.

Example of relationships in Sequence Ontology

Region Connection Calculus (RCC-8)

The relations are defined to hold between two continuous regions (this can be any abstract notion of space i.e. Euclidean space) - actually used in geographical information systems.

DC - disconnected

EC - externally connected

EQ - equal

PO - partially overlapping

TPP - tangential proper part

TPPi - tangential proper part inverse

NTPP - non-tangential proper part

NTPPi - non-tangential proper part inverse

Allan's Interval algebra

Originally defined for reasoning about temporal intervals and like RCC-8 assumes continous intervals, Allan's interval algebra has been implemented in the Bioconductor package GenomicRanges (seeing ?GenomicRanges::findOverlaps for details.

There are 13 relationships, 6 of which are inverses

  • precedes (p)
  • meets (m)
  • overlaps (o)
  • finishes (f) [equivalently called ends]
  • starts (s)
  • equals (e)
  • containedBy (c) [equivalently called during]
  • precededBy (pi) [inverse of precedes]
  • metBy (mi) [inverse of meets]
  • overlappedBy (oi) [inverse of overlaps]
  • finishedBy (fi) [inverse of finishes]
  • startedBy (si) [inverse of starts]

The algebra is also equipped with intersection, union and complement operators.

Mungall's Genome Interval Algebra

There are sixteen relations in Mungall's algebra that extend and modify Allen's algebra:

The algebra is equipped with intersection, union and complement operators but is not isomorphic to Allan's algebra, nor is the set of relations pairwise disjoint. One advantage is this algebra can be defined to act on intervals (and take into account strandedness for a total of 32 relations, using the reverse complement (RC) relation) or defined to act on junctions.

Junctions are defined in terms of point-positions: a proper junction "is a discrete point connecting two nucleotide bases." and junctions are formed by unions of proper junctions and the boundary points of a sequence. In the junction defintion of the algebra two functions are defined, α and ω which map the start and end points of an interval to a junction, respectively. An other relation operator succeeds, abreviated to succ, which acts between two junctions seperated by a base and signifies that first junction is at the 5' end, while the second is at the 3' end. The inequality operators are overloaded so they are transitiive. The functions α and ω can be eliminated by translating them to the interval operator.

In it's final form the algebra consists of three sets of relations: R: the orginal set, R' the stranded set via the RC operator, and the union set R^U = R union R', for a total of 48 relations.

Collections of intervals

Usually you do not work with an isolated interval but a collection of them - the algebra introduces the relation interleaves as distinct from overlaps (i.e. exons can interleave on opposite strands even if they do not overlap).