SEP | |
---|---|
Title | Make all sequence associations explicit |
Authors | Jacob Beal ([email protected]) |
Editor | |
Type | Data Model |
SBOL Version | 3.0 |
Replaces | |
Status | Accepted |
Created | 22-Jan-2020 |
Last modified | 22-Jan-2020 |
Issue | #89 |
This SEP proposes to require that the association of a Sequence to complete Component be explicit, rather than implicit, and to enable this via changes to Location, Component, and a new EntireComponent location class.
Currently, when a ComponentDefinition includes a Sequence, that Sequence is implicitly assumed to apply to the entire ComponentDefinition. If ComponentDefinition and ModuleDefinition are merged into Component, however, (as proposed by SEP 025), then we will often have circumstances where there is more than one Sequence in a Component.
Example: a system on two plasmids:
* DNA Plasmid 1: constitutive CDS for TetR
* DNA Plasmid 2: pTet regulation of GFP
* Proteins: TetR, GFP
* Interactions: P1 CDS produces TetR; TetR represses pTet; GFP CDS produces GFP
We need to have two Sequences to describe this Component, one for each plamid. There are two alternatives for doing so:
- The Sequences cannot be in this Component, but must be in the SubComponents for the plasmids.
- Advantage: this is what we do now.
- Disadvantage: this limits flattening, since we have to have two layers of system. Clever counter-examples might even generate unflattenable structures of arbitrary depth.
- Disadvantage: raises the question of whether "Component" should be separated into subclasses that do or don't allow a sequence.
- The Sequences can be in included in the Component, but must have an explicit link between a Sequence and the thing that it describes.
- Advantage: complete flattening is possible
- Advantage: uniform handling of sequences in all types of Component
- Disadvantage: requires some sort of extra relationship in simple components (like the simple promoter pBAD)
This SEP makes a formal proposal for option #2, in contrast to to SEP 025, which effectively proposes Option 1.
The Location
class's sequence
property changes from OPTIONAL (cardinality [0..1]). to REQUIRED (cardinality [1]).
The sequence
property is removed from Component. The sequences associated with a component are now linked only via the location
fields of its child SequenceAnnotation and SubComponent objects (per SEP 10 addition of location
to SubComponent).
This change ensures that no Sequence
is ever associated with a Component
without being given an explicit location.
The EntireComponent
class is a new subclass of Location with no additional properties. Use of this class indicates that the linked Sequence describes the entirety of the Component
or SubComponent
parent of this Location object.
Note that this class is not strictly necessary, as EntireComponent
can be taken as equivalent to Range(1 .. sequence size). Using EntireComponent, however, will make validation rules easier.
The following new validation rules are added for Component and SubComponent respectively:
If a Component has a SequenceAnnotation that refers to Sequence X, then either the Component or one of its SubComponents must also have a SequenceAnnotation that gives Sequence X an EntireComponent Location.
If a SubComponent has a Location that refers to Sequence X, then the either the SubComponent or its parent Component must also have a SequenceAnnotation that gives Sequence X an EntireComponent Location.
This validation rule ensures that at every sequence referred to has at least one EntireComponent relationship associating it with a particular entity in the representation.
Note that it is not inconsistent to have multiple SubComponents with an EntireComponent pointing to the same Sequence, or both a Component and its SubComponent. A simple example of the first case would be a system that uses the same plasmid in two different compartments. The second case is not expected to arise in practice, but would not be incorrect---such a SubComponent would simply be redundant and pointless.
The Component for the pTet promoter is given a SequenceAnnotation with an EntireComponent location for its Sequence.
The Component contains the following:
- DNA Plasmid 1: constitutive CDS for TetR
- DNA Plasmid 2: pTet regulation of GFP
- Proteins: TetR, GFP
- Interactions: P1 CDS produces TetR; TetR represses pTet; GFP CDS produces GFP
Plasmid 1 is given an EntireComponent location for its Sequence
Plasmid 2 is given an EntireComponent location for its (different) Sequence.
The Component contains the following:
- SubComponent for a system expressing Cas9 and a targeting gRNA
- definition is Cas9-gRNA Component:
- SubComponent: promoter - CDS for Cas9
- SubComponent: U6promoter - gRNA coding sequence
- SubComponent: Cas9 (protein)
- SubComponent: gRNA (RNA)
- SubComponent: Cas9-gRNA complex
- Interaction: Association - Cas9 + gRNA -> Cas9-gRNA
- definition is Cas9-gRNA Component:
- SubComponent for a system with a genome that is going to be edited.
- location is EntireComponent for Sequence: E-coli-genome-sequence
- definition is a Component for E. coli genome
- SequenceAnnotation with location EntireComponent for Sequence: E-coli-genome-sequence
- SequenceAnnotation for each gRNA binding site:
- role: SO binding site
- location: Range [nn .. nn] on Sequence: E-coli-genome-sequence
- ... similar for each Cas9m break site
In order to indicate where the binding targets and edit locations on the genome will be, we add SequenceAnnotations to this Component, each annotation referencing the Sequence for the genome.
Now, however, we also need to satisfy the rule that there is an EntireComponent location for that genome in either the Component or the SubComponent. This cannot be done as a SequenceAnnotation on the Component, since the Component also contains the Cas9 system, the gRNA, whatever else is in the genome system, etc.---the genome isn't the whole Component. Instead (assuming SEP 037 is accepted) we make a SubComponent of type ComponentReference that points to the genome within the genomic system, and give that the EntireComponent location for the Sequence.
SBOL 2 systems can be readily converted to follow this convention by changing each sequence
property to a SequenceAnnotation with an EntireComponent
Location linking to the Sequence.
SBOL 3 systems with precisely one EntireComponent SequenceAnnotation can be converted back to SBOL 2 by the converse. Otherwise, they will need a more complex process of extracting required sub-components.
Potential alternative names for EntireComponent:
- EntireLocation
- EntireSequence
- Any of the above with "Whole" instead of "Entire"
This SEP proposes a modification of the unification of ComponentDefinition and ModuleDefinition proposed in SEP 025. Technically, it stacks on top of SEP 025.
To the extent possible under law,
SBOL developers
has waived all copyright and related or neighboring rights to
SEP 041.
This work is published from:
United States.