RFO: "gene_functions" collection #40

Open
StevenCannon-USDA opened this issue May 4, 2023 · 13 comments

@StevenCannon-USDA
Contributor

I propose formats and methods for collecting and storing information about genes experimentally associated with phenotypes. See the description in the README and examples of the three file types in this datastore-specifications directory.

You can also see a few more examples, and the two associated scripts, in this repository (which will go away once the RFO is settled).

A few comments about my objectives and philosophy behind the specification:

  • This will be for genes associated with phenotypes, for which there is strong experimental evidence. This wouldn't be for collections of genes from a family involved in a trait, or lists of candidate genes in a GWAS region. The evidence needs to be stronger and more particular: for example, fine-mapping of a gene within a GWAS region, and identification of a causal mutation that explains the observed phenotype.
  • A "confidence" field indicates strength of evidence. There is some subjectivity here, but I prefer this kind of scale to evidence codes, which I find hard to use and interpret.
  • The system should be curator-friendly. The curator fills out a small yaml "traits" template; then scripts flesh out the template and populate a file of citations and a file of references. The yaml template has only five essential fields: gene_model_full_id, confidence, traits: entity, references: citation, references: [doi or pmid]. In total, there are nine top-level keys, and essentially five second-level keys.
  • The traits file contains as many "documents" (headed by ---) as there are genes-with-described-functions. Each document (kind of a "function card") is unnamed, but a primary key could be composed from two required fields: gene_model_full_id and the first ontology accession, e.g. glyma.Wm82.gnm2.ann1.Glyma.10G221500 and TO:0002616 (for flowering time).
  • Citations and references are all derivable (and derived) from the DOI or PMID in the traits file (every publication has a DOI but not every one has a PMID). Two scripts retrieve the data using NCBI E-utilities. Thus, the citations.txt and references.txt files are somewhat superfluous, but probably have utility for users, curators, and QC.
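
To illustrate the primary-key idea from the fourth bullet, here is a minimal Python sketch. It assumes one traits document has already been parsed into a dict (for example by a YAML parser). The function name `function_card_key` and the `::` separator are illustrative assumptions, not part of the proposed specification.

```python
def function_card_key(doc, sep="::"):
    """Compose a key from gene_model_full_id and the first ontology accession.

    `sep` is an arbitrary choice for this sketch; the spec does not define one.
    """
    gene_id = doc["gene_model_full_id"]
    traits = doc.get("traits") or []
    if not traits or "entity" not in traits[0]:
        raise ValueError(f"{gene_id}: no ontology accession found in traits")
    return f"{gene_id}{sep}{traits[0]['entity']}"


# Example values taken from the bullet above (flowering time).
doc = {
    "gene_model_full_id": "glyma.Wm82.gnm2.ann1.Glyma.10G221500",
    "traits": [{"entity": "TO:0002616"}],
}
print(function_card_key(doc))
```

Because the key depends on which accession is listed first, a curator reordering the traits list would change it; that is a property of this sketch worth keeping in mind, not a decision made in the RFO.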
@sammyjava
Contributor

> The references block contains one or more blocks of citations, each containing three key-value pairs: "citation", "doi", and "pmid". Of these, either the pmid or doi is required (some publications lack a pmid, but all should have a doi). The citation should be in one of the following forms (depending on whether there are one, two, or three-or-more authors):

Let's make DOI required, since it is in the other READMEs and I use DOI to fill out the Publication object. PMID must be optional, of course. There are some older papers that don't have DOIs, and I say let's not cite them.

This is because folks forget to put the DOI in. If it's optional, then it doesn't fail validation.
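
A minimal sketch of what "DOI required, PMID optional" could look like as a validation step, assuming the references block has been parsed into a list of dicts. The function name `reference_errors` and the message wording are hypothetical; this is not the validator actually used for the datastore.

```python
def reference_errors(references):
    """Return a list of problems; an empty list means the block passes."""
    errors = []
    for i, ref in enumerate(references, start=1):
        if not str(ref.get("citation") or "").strip():
            errors.append(f"reference {i}: missing citation")
        if not str(ref.get("doi") or "").strip():
            errors.append(f"reference {i}: missing doi (required)")
        # pmid is optional, so its absence is deliberately not reported.
    return errors


# First entry is from this thread; the second is a made-up entry lacking a DOI.
refs = [
    {"citation": "Chen, Liu, et al., 2015", "doi": "10.3389/fpls.2015.00575", "pmid": 26284091},
    {"citation": "Hypothetical, Author, et al., 1990", "pmid": 1},
]
for problem in reference_errors(refs):
    print(problem)
```

With the DOI marked as required, an entry like the second one fails instead of slipping through, which is the behavior described above.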

@StevenCannon-USDA
Contributor Author

The journal I come across frequently that lacks PMID is Crop Science. But I'm fine with requiring DOI and making PMID optional.

@StevenCannon-USDA
Contributor Author

I'd like to add an optional key, "phenotype_description", to hold a free-text brief description of the phenotype described by the gene_function record. Examples:

phenotype_description: fragrant seeds
phenotype_description: Red-brown seed coat color
phenotype_description: Early flowering
phenotype_description: photoperiod insensitivity to short day conditions

@sammyjava
Contributor

So those are in addition to, but not linked to in any way, the ontology terms. I'd argue that any specific "phenotype description" should be associated with an ontology term, such as:

  - entity_name: flowering time
    entity: TO:0002616
    phenotype_description: Early flowering
  - entity_name: days to maturity
    entity: TO:0000469
    phenotype_description: Days from planting to 10 inch seedling height
  - entity_name: seed coat color
    entity: TO:0000190
    phenotype_description: Red-brown seed coat color

Otherwise, they're just orphaned text attributes that don't link to anything higher up.

(And, reminder, the spec needs to be updated to put relations with the entities that they refer to. Order doesn't have meaning in YAML.)

@StevenCannon-USDA
Contributor Author

I'm picturing a single "phenotype_description" key-value pair to hold the human-readable gestalt description. These may sometimes be fairly complex, whereas the ontology terms are "pointillistic" and often difficult to select appropriately. The phenotype_description would, indeed, be orphaned relative to the atomic ontology terms. Here are some examples from work in progress:

phenotype_description: Small and nonfunctional nodules arrested in growth when both normally spliced and alternatively spliced variants are repressed. When only the alternatively spliced form is repressed, the nodules are small but still fix nitrogen successfully.
traits:
  - entity_name: root nodule morphology trait
    entity: TO:0000898
  - entity_name: root nodule
    entity: PO:0003023
references:
  - citation: Chen, Liu, et al., 2015
    doi: 10.3389/fpls.2015.00575
    pmid: 26284091
  - citation: Oellrich, Walls et al., 2015
    doi: 10.1186/s13007-015-0053-y
    pmid: 25774204
---
phenotype_description: Doesn't make nodules; infection thread aborts
traits:
  - entity_name: root nodule number
    entity: TO:0000900
  - entity_name: root system
    entity: PO:0025025
  - entity_name: root nodule
    entity: PO:0003023
references:
  - citation: Herrbach, Chirinos, et al., 2017
    doi: 10.1093/jxb/erw474
    pmid: 28073951
  - citation: Oellrich, Walls et al., 2015
    doi: 10.1186/s13007-015-0053-y
    pmid: 25774204

@sammyjava
Contributor

Ahh, OK, so a single YAML has a single phenotype_description which is therefore associated with all the listed traits. Gotcha. Kinda like a description or summary.
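
As a rough illustration of that reading, the sketch below attaches the one top-level synopsis to every trait listed in the same record. It assumes the record is already parsed into a dict; `traits_with_synopsis` and its `synopsis_key` parameter are hypothetical names, and the key is a parameter only because the key name is still under discussion.

```python
def traits_with_synopsis(doc, synopsis_key="phenotype_description"):
    """Pair each listed trait with the record's single top-level synopsis."""
    synopsis = doc.get(synopsis_key, "")
    return [
        {
            "entity": trait["entity"],
            "entity_name": trait.get("entity_name", ""),
            "synopsis": synopsis,
        }
        for trait in doc.get("traits", [])
    ]


# Record copied from the second example in the previous comment.
record = {
    "phenotype_description": "Doesn't make nodules; infection thread aborts",
    "traits": [
        {"entity_name": "root nodule number", "entity": "TO:0000900"},
        {"entity_name": "root system", "entity": "PO:0025025"},
        {"entity_name": "root nodule", "entity": "PO:0003023"},
    ],
}
rows = traits_with_synopsis(record)
for row in rows:
    print(row["entity"], "|", row["synopsis"])
```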

@StevenCannon-USDA
Contributor Author

@sammyjava - right. So maybe "phenotype_summary" conveys the idea better.

@sammyjava
Contributor

Well, sometimes we have a summary "Doesn't make nodules; infection thread aborts" and a longer description that describes the measurement, e.g. "Nodule formation was inspected using a confocal microscope; if fewer than 10 nodules are present on a full root strand then the phenotype is defined as Doesn't make nodules." (I'm sure I got that wrong, but you get the idea.)

Something to consider since you're adding in bespoke trait attributes.

@StevenCannon-USDA
Contributor Author

Brevity is a virtue.

@StevenCannon-USDA
Contributor Author

Sorry: for continuity with other READMEs, let's make it "phenotype_synopsis" rather than "...description" or "...summary". I'll make it so.

@adf-ncgr
Contributor

adf-ncgr commented Jun 5, 2023

Would it make sense to associate the phenotype in this sense with the reference that described it? Just thinking that the specifics of the phenotype will depend on the type of mutation of the gene (induced knockout/overexpression/natural variation) in which deviation from wild type is observed. In any case, presumably such a description is derived from a specific reference, but if it would be a synthesis across several that we don't plan to tie to specific alleles, then top-level as you have suggested is appropriate. Just something to consider.

@StevenCannon-USDA
Contributor Author

> would it make sense to associate the phenotype in this sense with the reference that described it

It would - but at the cost of more "method and protocol". We would end up doing it wrong or inconsistently. Overall, my preference is to try to keep things simple where possible.

Somewhat relatedly: one of my take-aways from the pain of this paper ... Oellrich et al., 2015(url) ... is that ontologies are cumbersome and difficult to apply well, difficult to compose into meaningful "sentences," etc. So, I'll encourage focusing on the entities (anatomy or trait terms) and discourage use of relation and quality terms. I am revising the README now, and will write a protocols document.

@sammyjava
Contributor

sammyjava commented Jun 5, 2023

Yeah, FWIW we only have regular terms associated with stuff in the mines, not quality or relation terms. The ontologies themselves have their hierarchy, of course, but I just find a term that goes with a trait, and if it's up- or down- or whatever, I don't add that. Every term is standalone; they are not linked.
