RFO: "gene_functions" collection #40

StevenCannon-USDA · 2023-05-04T13:14:00Z

I propose formats and methods for collecting and storing information about genes experimentally associated with phenotypes. See the description in the README and examples of the three file types in this datastore-specifications directory.

You can also see a few more examples, and the two associated scripts, in this repository (which will go away once the RFO is settled).

A few comments about my objectives and philosophy behind the specification:

This will be for genes associated with phenotypes, for which there is strong experimental evidence. This wouldn't be for collections of genes from a family involved in a trait, or lists of candidate genes in a GWAS region. The evidence needs to be stronger and more particular: for example, fine-mapping of a gene within a GWAS region, and identification of a causal mutation that explains the observed phenotype.
A "confidence" field indicates strength of evidence. There is some subjectivity here, but I prefer this kind of scale to evidence codes, which I find hard to use and interpret.
The system should be curator-friendly. The curator fills out a small YAML "traits" template; scripts then flesh out the template and populate a file of citations and a file of references. The YAML template has only five essential fields: gene_model_full_id, confidence, traits: entity, references: citation, and references: [doi or pmid]. In total, there are nine top-level keys and essentially five second-level keys.
The traits file contains as many "documents" (headed by ---) as there are genes-with-described-functions. Each document (kind of a "function card") is unnamed, but a primary key could be composed from two required fields: gene_model_full_id and the first ontology accession, e.g. glyma.Wm82.gnm2.ann1.Glyma.10G221500 and TO:0002616 (for flowering time).
Citations and references are all derivable (and derived) from the DOI or PMID in the traits file (every publication has a DOI, but not every one has a PMID). Two scripts retrieve the data using NCBI E-utilities. Thus, the citations.txt and references.txt files are somewhat superfluous, but probably have utility for users, curators, and QC.
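Two of the points above can be sketched in a few lines of Python. This is only an illustration, not the scripts in the repository: the helper names `function_card_key` and `esummary_url` are hypothetical, the card is assumed to be already parsed from YAML into a dict, and the choice of the ESummary endpoint is an assumption (the actual scripts may call a different E-utilities service).

```python
from urllib.parse import urlencode

# Hypothetical function card, already parsed from YAML into a dict.
# The gene ID and ontology accession are the ones used as an example above.
card = {
    "gene_model_full_id": "glyma.Wm82.gnm2.ann1.Glyma.10G221500",
    "traits": [
        {"entity_name": "flowering time", "entity": "TO:0002616"},
    ],
}


def function_card_key(card):
    """Compose a key from gene_model_full_id and the first ontology accession."""
    first_accession = card["traits"][0]["entity"]
    return (card["gene_model_full_id"], first_accession)


# Base URL of NCBI E-utilities; using esummary.fcgi here is an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(pmid):
    """Build (but do not fetch) an ESummary request URL for one PubMed ID."""
    query = urlencode({"db": "pubmed", "id": str(pmid), "retmode": "json"})
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


print(function_card_key(card))
print(esummary_url(26284091))
```

The key is returned as a tuple rather than a joined string so that no separator character has to be chosen; that is a design choice of this sketch, not part of the proposal.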


sammyjava · 2023-05-04T15:48:04Z

> The references block contains one or more blocks of citations, each containing three key-value pairs: "citation", "doi", and "pmid". Of these, either the pmid or doi is required (some publications lack a pmid, but all should have a doi). The citation should be in one of the following forms (depending on whether there are one, two, or three-or-more authors):

Let's make DOI required, since it is in the other READMEs and I use DOI to fill out the Publication object. PMID must be optional, of course. There are some older papers that don't have DOIs, and I say let's not cite them.

This is because folks forget to put the DOI in. If it's optional, then it doesn't fail validation.
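A minimal sketch of the kind of check this implies, assuming the references block has already been parsed into a list of dicts. The function name `validate_references` and the error-message wording are hypothetical; treating "citation" as required follows the list of essential fields in the proposal, and the numeric check on "pmid" is an extra assumption.

```python
def validate_references(references):
    """Return a list of error strings; an empty list means the block passes."""
    errors = []
    for i, ref in enumerate(references):
        # DOI is required: a missing or blank value fails validation.
        if not str(ref.get("doi") or "").strip():
            errors.append(f"references[{i}]: missing required 'doi'")
        # Citation is treated as required here as well.
        if not str(ref.get("citation") or "").strip():
            errors.append(f"references[{i}]: missing required 'citation'")
        # PMID is optional, but if present it should be numeric.
        pmid = ref.get("pmid")
        if pmid is not None and not str(pmid).isdigit():
            errors.append(f"references[{i}]: 'pmid' must be numeric if given")
    return errors


with_pmid = [{"citation": "Herrbach, Chirinos, et al., 2017",
              "doi": "10.1093/jxb/erw474", "pmid": 28073951}]
without_doi = [{"citation": "Herrbach, Chirinos, et al., 2017",
                "pmid": 28073951}]

print(validate_references(with_pmid))
print(validate_references(without_doi))
```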

StevenCannon-USDA · 2023-05-04T17:58:29Z

The journal I come across frequently that lacks PMID is Crop Science. But I'm fine with requiring DOI and making PMID optional.

StevenCannon-USDA · 2023-06-05T13:51:16Z

I'd like to add an optional key, "phenotype_description", to hold a free-text brief description of the phenotype described by the gene_function record. Examples:

phenotype_description: fragrant seeds
phenotype_description: Red-brown seed coat color
phenotype_description: Early flowering
phenotype_description: photoperiod insensitivity to short day conditions

sammyjava · 2023-06-05T14:16:42Z

So those are in addition to, but not linked to in any way, the ontology terms. I'd argue that any specific "phenotype description" should be associated with an ontology term, such as:

  - entity_name: flowering time
    entity: TO:0002616
    phenotype_description: Early flowering
  - entity_name: days to maturity
    entity: TO:0000469
    phenotype_description: Days from planting to 10 inch seedling height
  - entity_name: seed coat color
    entity: TO:0000190
    phenotype_description: Red-brown seed coat color

Otherwise, they're just orphaned text attributes that don't link to anything higher up.

(And, reminder, the spec needs to be updated to put relations with the entities that they refer to. Order doesn't have meaning in YAML.)

StevenCannon-USDA · 2023-06-05T14:25:47Z

A single "phenotype_description" key-value pair, to hold the human-readable gestalt description. These may sometimes be fairly complex, whereas the ontology terms are "pointillistic" and often difficult to select appropriately. The phenotype_description would, indeed, be orphaned relative to the atomic ontology terms. Here are some examples from some work-in-progress:

phenotype_description: Small and nonfunctional nodules arrested in growth when both normally spliced and alternatively spliced variants repressed.  When only the alternative spliced form repressed the nodules are small but still fix nitrogen successfully.
traits:
  - entity_name: root nodule morphology trait
    entity: TO:0000898
  - entity_name: root nodule
    entity: PO:0003023
references:
  - citation: Chen, Liu, et al., 2015
    doi: 10.3389/fpls.2015.00575
    pmid: 26284091
  - citation: Oellrich, Walls et al., 2015
    doi: 10.1186/s13007-015-0053-y
    pmid: 25774204

phenotype_description: Doesn't make nodules; infection thread aborts
traits:
  - entity_name: root nodule number
    entity: TO:0000900
  - entity_name: root system
    entity: PO:0025025
  - entity_name: root nodule
    entity: PO:0003023
references:
  - citation: Herrbach, Chirinos, et al., 2017
    doi: 10.1093/jxb/erw474
    pmid: 28073951
  - citation: Oellrich, Walls et al., 2015
    doi: 10.1186/s13007-015-0053-y
    pmid: 25774204

sammyjava · 2023-06-05T14:34:16Z

Ahh, OK, so a single YAML has a single phenotype_description which is therefore associated with all the listed traits. Gotcha. Kinda like a description or summary.

StevenCannon-USDA · 2023-06-05T14:54:56Z

@sammyjava - right. So maybe "phenotype_summary" conveys the idea better.

sammyjava · 2023-06-05T15:02:39Z

Well, sometimes we have a summary "Doesn't make nodules; infection thread aborts" and a longer description that describes the measurement, e.g. "Nodule formation was inspected using a confocal microscope; if fewer than 10 nodules are present on a full root strand then the phenotype is defined as Doesn't make nodules." (I'm sure I got that wrong, but you get the idea.)

Something to consider since you're adding in bespoke trait attributes.

StevenCannon-USDA · 2023-06-05T15:06:39Z

Brevity is a virtue.

StevenCannon-USDA · 2023-06-05T15:35:40Z

Sorry: for continuity with other READMEs, let's make it "phenotype_synopsis" rather than "...description" or "...summary". I'll make it so.
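For records drafted while the key name was still under discussion, a rename along these lines could bring them in line with "phenotype_synopsis". This is a hedged sketch only: the function `normalize_synopsis_key` is hypothetical, and it assumes each record is a dict with at most one of the earlier key names at the top level.

```python
# Key names floated earlier in this thread, before "phenotype_synopsis" was chosen.
LEGACY_KEYS = ("phenotype_description", "phenotype_summary")


def normalize_synopsis_key(card):
    """Return a copy of the card with a legacy key renamed to phenotype_synopsis."""
    result = dict(card)  # shallow copy; the caller's dict is left unchanged
    for old in LEGACY_KEYS:
        if old in result:
            value = result.pop(old)
            # Keep an existing phenotype_synopsis if one is already present.
            result.setdefault("phenotype_synopsis", value)
    return result


draft = {
    "phenotype_description": "Doesn't make nodules; infection thread aborts",
    "traits": [{"entity_name": "root nodule number", "entity": "TO:0000900"}],
}
print(normalize_synopsis_key(draft))
```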

adf-ncgr · 2023-06-05T15:52:16Z

Would it make sense to associate the phenotype in this sense with the reference that described it? Just thinking that the specifics of the phenotype in this sense will depend on the type of mutation of the gene (induced knockout/overexpression/natural variation) in which deviation from wild-type is observed. In any case, presumably such a description is derived from a specific reference, but if it would be a synthesis across several that we don't plan to tie to specific alleles, then top-level as you have suggested is appropriate. Just something to consider.

StevenCannon-USDA · 2023-06-05T16:32:57Z

> would it make sense to associate the phenotype in this sense with the reference that described it

It would - but at the cost of more "method and protocol". We would end up doing it wrong or inconsistently. Overall, my preference is to try to keep things simple where possible.

Somewhat relatedly: one of my take-aways from the pain of this paper ... Oellrich et al., 2015(url) ... is that ontologies are cumbersome and difficult to apply well, difficult to compose into meaningful "sentences," etc. So, I'll encourage focusing on the entities (anatomy or trait terms) and discourage use of relation and quality terms. I am revising the README now, and will write a protocols document.

sammyjava · 2023-06-05T17:04:29Z

Yeah, FWIW we only have regular terms associated with stuff in the mines, not quality or relation terms. The ontologies themselves have their hierarchy, of course, but I just find a term that goes with a trait and if it's up- or down- or whatever I don't add that. Every term is standalone, they are not linked.
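If the preference for entity terms over quality and relation terms were to be checked mechanically, a prefix-based lint might look like the sketch below. The assumption that quality terms come from PATO and relation terms from RO is mine, not something stated in this thread, and the function name `discouraged_terms` is hypothetical.

```python
# Assumed prefixes: PATO (qualities) and RO (relations). Adjust as needed.
DISCOURAGED_PREFIXES = {"PATO", "RO"}


def discouraged_terms(traits):
    """Return the accessions in a traits list whose prefix is discouraged."""
    flagged = []
    for trait in traits:
        prefix = trait["entity"].split(":", 1)[0]
        if prefix in DISCOURAGED_PREFIXES:
            flagged.append(trait["entity"])
    return flagged


# Traits copied from one of the examples earlier in the thread.
traits = [
    {"entity_name": "root nodule morphology trait", "entity": "TO:0000898"},
    {"entity_name": "root nodule", "entity": "PO:0003023"},
]
print(discouraged_terms(traits))

# Illustrative only: a PATO accession added to show what would be flagged.
print(discouraged_terms(traits + [{"entity_name": "quality", "entity": "PATO:0000001"}]))
```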
