updated term. and info. model
DanielPuthawala committed Apr 9, 2024
1 parent eda95a1 commit 99a6875
Showing 3 changed files with 130 additions and 43 deletions.
47 changes: 7 additions & 40 deletions docs/source/introduction.rst
@@ -66,7 +66,7 @@ While a single categorical variant may have many assayed variant members, the sa
:alt: The figure depicts a single centralized assayed variant, with arrows radiating out to a number of categorical variants to which it is a member. Among these, the assayed variant NC_000007.13:g.140453136A>T is a BRAF V600E variant, a BRAF gene variant, and a chromosome 7 variant.


Because a single categoricla variant may have many assayed variants as members, while a single assayed variant can be a member of many categorical variants, different categorical have complex heirarchical relationships with each other. the figure below depicts some of the relationships between some of the categorical variants to which NC_000007.13:g.140453136A>T is a member. For example, all BRAF V600E variants are also BRAF gene variants. And all BRAF V600E variants and BRAF gene variants are chromosome 7 variants. A BRAF V600E variant is also an inframe protein variant, which is itself a type of sequence variant.
Because a single categorical variant may have many assayed variants as members, while a single assayed variant can be a member of many categorical variants, different categorical variants have complex hierarchical relationships with each other. The figure below depicts some of the relationships among the categorical variants of which NC_000007.13:g.140453136A>T is a member. For example, all BRAF V600E variants are also BRAF gene variants. Likewise, all BRAF V600E variants and BRAF gene variants are chromosome 7 variants. A BRAF V600E variant is also an inframe protein variant, which is itself a type of sequence variant.


.. image:: images/relations-between-assayed-and-CatVars-and-CatVars-to-other-CatVars(1).png
@@ -87,16 +87,20 @@ To make categoricla variant matching even more complicated, it is often the case
:alt: The figure depicts a hypothetical variant where an ACT sequence has been inserted directly 3' of an ACTG sequence. While this would not be considered a duplication variant in the HGVS nomenclature due to the intervening G base pair, it could appear in other resources as a duplication of the preceding ACT sequence, or alternately simply as an insertion of ACT. This implies that the categorical variant descriptor "duplication" has different meanings across different resources.


On the other hand, it is also the often the case that spurious ambiguity exists within resources. The figure depicts a hypothetical case where compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HVGS, this variant could either be described as an insertion of 4 C nucleotides, or else a five repetitions of the single nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as both categorical variants desribe two sets with all and only the same member variants.
On the other hand, it is also often the case that spurious ambiguity exists within resources. The figure depicts a hypothetical case where, compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HGVS, this variant could validly be described either as an insertion of four C nucleotides or as five repetitions of the single-nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as the two descriptors denote sets with all and only the same member variants.


.. image:: images/CatVar-CatVar-spurious-ambiguity.png
:width: 60%
:width: 40%
:align: center
:alt: The figure depicts a hypothetical case where, compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HGVS, this variant could be described either as an insertion of four C nucleotides or as five repetitions of the single-nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as the two descriptors denote sets with all and only the same member variants.
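
The spurious ambiguity in the ACT to ACCCCCT case can be checked mechanically. The following is a minimal Python sketch, not part of Cat-VRS: the helper names ``apply_insertion`` and ``apply_repeat`` and their coordinate conventions are assumptions made only for illustration. It applies both descriptions to the same reference and confirms that they produce the same variant sequence.

```python
def apply_insertion(reference: str, position: int, inserted: str) -> str:
    """Insert `inserted` after the first `position` bases of `reference`.

    Hypothetical helper for illustration; not an HGVS or Cat-VRS API.
    """
    return reference[:position] + inserted + reference[position:]


def apply_repeat(reference: str, start: int, end: int, unit: str, count: int) -> str:
    """Replace reference[start:end] with `count` copies of `unit`.

    Hypothetical helper for illustration; coordinates are 0-based, end-exclusive.
    """
    return reference[:start] + unit * count + reference[end:]


reference = "ACT"

# Description 1: four C nucleotides inserted after the existing C.
as_insertion = apply_insertion(reference, 2, "CCCC")

# Description 2: the single C is restated as five repetitions of C.
as_repeat = apply_repeat(reference, 1, 2, "C", 5)

# Both descriptions resolve to the same sequence, so the two
# categorical descriptors have the same members.
same_sequence = as_insertion == as_repeat
```

Because both descriptors resolve to one sequence, a matcher that compares resulting sequences rather than descriptor strings would treat them as equivalent.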



Discussion
@@@@@@@@@@


In summary, a crucial step in the course of genomic variant interpretation is assayed-categorical variant matching, where one determines all and only those categorical variants of which the assayed variant in question is a member. Successful assayed-categorical variant matching makes it possible to connect evidence to support or refute determinations of pathogenicity and/or oncogenicity of the assayed variants. In a different but related use case, categorical-categorical variant matching is crucial to the process of data harmonization and knowledgebase curation.
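
As a rough illustration of assayed-categorical variant matching, the sketch below models each categorical variant as a membership test over an assayed variant. Everything here is an assumption made for the example: the ``AssayedVariant`` fields, the ``CATEGORICAL_VARIANTS`` table, and the "KRAS gene variant" entry (included only as a non-matching case) are hypothetical and do not come from the Cat-VRS schemas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AssayedVariant:
    """Hypothetical, pre-annotated assayed variant (fields are assumptions)."""
    hgvs: str
    chromosome: str
    gene: str
    protein_change: str


# Each categorical variant is modelled as a predicate that decides membership.
CATEGORICAL_VARIANTS: Dict[str, Callable[[AssayedVariant], bool]] = {
    "BRAF V600E variant": lambda v: v.gene == "BRAF" and v.protein_change == "V600E",
    "BRAF gene variant": lambda v: v.gene == "BRAF",
    "chromosome 7 variant": lambda v: v.chromosome == "7",
    "KRAS gene variant": lambda v: v.gene == "KRAS",  # hypothetical non-match
}


def match_categorical_variants(variant: AssayedVariant) -> List[str]:
    """Return the names of all categorical variants the assayed variant belongs to."""
    return sorted(
        name for name, is_member in CATEGORICAL_VARIANTS.items() if is_member(variant)
    )


assayed = AssayedVariant(
    hgvs="NC_000007.13:g.140453136A>T",
    chromosome="7",
    gene="BRAF",
    protein_change="V600E",
)
matches = match_categorical_variants(assayed)
```

In practice the hard part is the annotation and normalization that this sketch takes as given; the predicate table only shows the "all and only" membership step.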


@@ -156,43 +160,6 @@ This repository is the for the GA4GH Categorical Variation Study Group.  As a st
Relatedly, the contents of this repository represent a very early pre-alpha version of Cat-VRS. First, this means that the schemas contained herein are not yet an officially released version of the specification. Second, this means that the spec is expected to undergo frequent and potentially breaking updates until a more stable beta version is released. Caveat emptor.


.. _CategoricalVariation:

[THIS SECTION WILL GET UPDATEED WHEN I HAVE SCHEMAS READY]


Categorical Variation
@@@@@@@@@@@@@@@@@@@@@

.. include:: defs/CategoricalVariation.rst

.. _Canonical:

Canonical Allele
################

.. include:: defs/CanonicalAllele.rst

.. _Described:

Described Variation
###################

.. include:: defs/DescribedVariation.rst

.. _CatCNV:

Categorical Copy Number
#######################

.. include:: defs/CategoricalCnv.rst

.. _ProtConsequence:

Protein Sequence Consequence
############################

.. include:: defs/ProteinSequenceConsequence.rst


.. _CA123643: https://reg.genome.network/redmine/projects/registry/genboree_registry/by_caid?caid=CA123643
37 changes: 36 additions & 1 deletion docs/source/schema.rst
@@ -9,5 +9,40 @@ Overview
Machine Readable Specifications
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.. _CategoricalVariation:

blah
[THIS SECTION WILL GET UPDATED WHEN I HAVE SCHEMAS READY]


Categorical Variation
@@@@@@@@@@@@@@@@@@@@@

.. include:: defs/CategoricalVariation.rst

.. _Canonical:

Canonical Allele
################

.. include:: defs/CanonicalAllele.rst

.. _Described:

Described Variation
###################

.. include:: defs/DescribedVariation.rst

.. _CatCNV:

Categorical Copy Number
#######################

.. include:: defs/CategoricalCnv.rst

.. _ProtConsequence:

Protein Sequence Consequence
############################

.. include:: defs/ProteinSequenceConsequence.rst
89 changes: 87 additions & 2 deletions docs/source/terms_and_model.rst
@@ -1,11 +1,96 @@
Terminology & Information Model
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

.. information on the terminology and information model go here. subsections include:
information on the terminology and information model go here. subsections include:
When biologists and clinical researchers define terms in order to describe phenomena and
observations, they rely on a background of human experience and
intelligence for interpretation. Definitions may be abstract, perhaps
correctly reflecting uncertainty of our understanding at the
time. Unfortunately, such terms are not readily translatable into an
unambiguous representation of knowledge.

As discussed in the :ref:`Introduction`, categorical variation labels are homophonous, ambiguous, and vague, often all three simultaneously. This poses a great difficulty for the precise representation of categorical variation. In contrast, **the computational representation of categorical variation concepts requires
translating precise categorical definitions into information models and
data structures that may be used in software.** This translation
should result in a representation of information that is consistent
with conventional variant ontologies and, ideally, be able to
accommodate future data as well. The resulting *computational
representation* of information should also be cognizant of
computational performance, the minimization of opportunities for
misunderstanding, and ease of manipulating and transforming data.

Accordingly, for each term we define below, we begin by describing the
term as used by the genetics and/or bioinformatics communities as
available. When a term has multiple such definitions, we
explicitly choose one of them for the purposes of computational
modelling. We then define the **computational definition** that
reformulates the community definition in terms of information content.
Finally, we translate each of these computational definitions into precise
specifications for the **information model**.

.. Terms are ordered
"bottom-up" so that definitions depend only on previously-defined terms.

.. note:: The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as
described in `RFC 2119`_.


Information Model Principles
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

* **Cat-VRS objects are minimal** `value objects
<https://en.wikipedia.org/wiki/Value_object>`_. Two objects are
considered equal if and only if their respective attributes are
equal. As value objects, Cat-VRS objects are used as primitive types
and MUST NOT be used as containers for related data, such as primary
database accessions, representations in particular formats, or links
to external data. Instead, related data should be associated with
VRS objects through identifiers. See :ref:`computed-identifiers`.

* **Error handling is intentionally unspecified and delegated to
implementation.** VRS provides foundational data types that
enable significant flexibility. Except where required by this
specification, implementations may choose whether and how to
validate data. For example, implementations MAY choose to validate
that particular combinations of objects are compatible, but such
validation is not required.

* **Cat-VRS uses** `snake_case
<https://simple.wikipedia.org/wiki/Snake_case>`__ **to represent
compound words.** Although the schema is currently JSON-based (which
would typically use camelCase), Cat-VRS itself is intended to be neutral
with respect to languages and databases.

* **Optional attributes start with an underscore.** Optional
attributes are not part of the value object. Such attributes are
not considered when evaluating equality or creating computed
identifiers.
.. The ``_id`` attribute is available to identifiable
objects, and MAY be used by an implementation to store the
identifier for a Cat-VRS object. If used, the stored ``_id`` element
MUST be a `CURIE`_. If used for creating a :ref:`truncated-digest`
for parent objects, the stored element must be a :ref:`GA4GH
Computed Identifier <identify>`. Implementations MUST ignore
attributes beginning with an underscore and they SHOULD NOT transmit
objects containing them.
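
The value-object and underscore-attribute principles above can be sketched in Python. This is only an illustration under stated assumptions: the class ``ExampleValueObject``, its ``type`` and ``label`` fields, and the ``example:123`` identifier are hypothetical and are not defined by Cat-VRS. A frozen dataclass gives attribute-based equality, and marking ``_id`` with ``compare=False`` keeps it out of equality and hashing.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ExampleValueObject:
    """Hypothetical value object; class and field names are assumptions."""
    type: str
    label: str
    # Optional attribute: leading underscore, excluded from equality and hashing.
    _id: Optional[str] = field(default=None, compare=False)


a = ExampleValueObject(type="ExampleCategoricalVariant", label="BRAF V600E")
b = ExampleValueObject(
    type="ExampleCategoricalVariant", label="BRAF V600E", _id="example:123"
)
c = ExampleValueObject(type="ExampleCategoricalVariant", label="BRAF gene")

# a and b differ only in the optional `_id`, so they are the same value.
equal_ignoring_id = a == b
# c differs in a required attribute, so it is a different value.
differs_on_label = a != c
```

Freezing the dataclass also makes instances hashable, so equal values collapse to one entry when used as dictionary keys or set members.
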
Basic types
@@@@@@@@@@@

THIS SECTION WILL BE UPDATED WITH FEEDBACK OF TYPE-LOGICAL FRAGMENT.


Primitives
(When relevant) Deprecated and obsolete classes.
@@@@@@@@@@

THIS SECTION WILL BE UPDATED WITH FEEDBACK OF TYPE-LOGICAL FRAGMENT.



.. Deprecated and obsolete classes.
