OTT id registry proposal 2
(Basically a rewrite of the OTT id registry proposal. I repeat myself.)
Anchor OTT ids using an analog of literature references. Each OTT id, in practice, refers to some taxon record in some version of some source taxonomy. For example, OTT id 244265 refers to record 40674 in the 2013-02-27 version of NCBI Taxonomy. (Call something like "record 40674 in the 2013-02-27 version of NCBI Taxonomy" a "versioned taxon record".)
By anchoring to a particular version of the source taxonomy, instead of just to the source taxonomy series (NCBI, GBIF, etc.), we defend against changes in the source taxonomy between versions, specifically deletions and incompatible changes. (I am not aware of any cases of incompatible changes to ids in sources, but it could happen.)
So the registry is just a four-column write-once table, with columns OTT id, source taxonomy series (e.g. 'ncbi'), source taxonomy date (or other version number - e.g. SILVA has its own numbering), and source taxon record id.
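As a rough illustration of the four-column write-once table, here is a minimal in-memory sketch. The names `Registration`, `Registry`, `register`, and `lookup` are hypothetical and not part of any existing OTT tooling; a real registry would presumably live in a file or database.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Registration:
    """One registry row: the versioned taxon record an OTT id is anchored to."""
    ott_id: int
    series: str      # source taxonomy series, e.g. 'ncbi'
    version: str     # source taxonomy date or other version label
    record_id: str   # taxon record id within that version of the source


class Registry:
    """Write-once table keyed by OTT id (hypothetical API, for illustration)."""

    def __init__(self) -> None:
        self._rows: Dict[int, Registration] = {}

    def register(self, row: Registration) -> None:
        # Write-once: an existing registration is never replaced.
        if row.ott_id in self._rows:
            raise ValueError(f"OTT id {row.ott_id} is already registered")
        self._rows[row.ott_id] = row

    def lookup(self, ott_id: int) -> Registration:
        return self._rows[ott_id]


registry = Registry()
# The example from the text: OTT id 244265 = record 40674 in NCBI 2013-02-27.
registry.register(Registration(244265, "ncbi", "2013-02-27", "40674"))
```

Attempting to register 244265 a second time raises `ValueError`, which is the whole point of the write-once rule.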
To do anything with this, we need a way to find correspondences between OTT ids and versioned source records other than the record the OTT id has as its registration. E.g. consider OTT id 244265 = record 40674 in the 2013-02-27 version of NCBI Taxonomy. How does this compare to:
- A record in a different version of the same source series, e.g. NCBI taxon id 40674 in the 2014-01-06 version of NCBI Taxonomy
- A record in a different source series, e.g. GBIF taxon id 359 in the 2013-07-02 version of the GBIF backbone
Both of these are instances of the 'alignment' problem that has to be solved in order to build OTT, and we have a heuristic solution for that. The alignment policy is independent of the registry or the 'identity' of an OTT taxon, so it can be modified without changing the 'identity' associated with the OTT id.
For expedience one might choose not to align NCBI Taxonomy versions to one another; we could just assume that taxon ids correspond. But to be rigorous, they really ought to be aligned, to make sure nothing funny is going on.
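A cheap sanity check on the "ids correspond" assumption might look like the sketch below. It is not the actual alignment heuristic: it only compares names attached to the same record id in two versions, and the function name, the dict-based input format, and the placeholder taxon names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def check_id_correspondence(
    old: Dict[str, str], new: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Flag record ids that cannot simply be assumed to correspond.

    `old` and `new` map record id -> taxon name for two versions of the
    same source series (hypothetical input format).
    """
    problems = []
    for record_id, old_name in old.items():
        if record_id not in new:
            problems.append((record_id, "deleted"))
        elif new[record_id] != old_name:
            problems.append((record_id, "name changed"))
    return problems


# Made-up placeholder data, not real taxonomy records.
old_version = {"1": "Aus", "2": "Bus", "3": "Cus"}
new_version = {"1": "Aus", "2": "Dus"}

problems = check_id_correspondence(old_version, new_version)
print(problems)  # [('2', 'name changed'), ('3', 'deleted')]
```

Anything this check flags would need to go through real alignment; a clean result is only weak evidence that nothing funny is going on.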
There may be cases where we want to shift from one versioned source id to another for what appears to be the "same" taxon. Perhaps the earlier one is too vague or sparse, or contains an error. In this situation we could mint a new OTT id for the "better" source, and perhaps make a link between the two (representing the heuristic hypothesis that the OTT ids "denote" the "same" "taxon").
But under no circumstances would the registration of an OTT id change (replacing its versioned source id with a different one).
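The "mint a new id and link it" idea can be sketched as follows. Everything here is hypothetical: the function name, the sequential id allocation, and the choice of the 2014-01-06 NCBI record as the "better" source are assumptions for illustration only.

```python
from typing import Dict, Set, Tuple

VersionedRecord = Tuple[str, str, str]  # (series, version, record id)

# Registrations are only ever added, never overwritten.
registrations: Dict[int, VersionedRecord] = {
    244265: ("ncbi", "2013-02-27", "40674"),
}
# Heuristic "same taxon" hypotheses, kept separate from the registrations.
same_taxon_links: Set[Tuple[int, int]] = set()


def mint_for_better_source(old_ott_id: int, better: VersionedRecord) -> int:
    """Register a new OTT id for `better` and link it to `old_ott_id`."""
    new_ott_id = max(registrations) + 1  # assumption: sequential allocation
    registrations[new_ott_id] = better
    same_taxon_links.add((old_ott_id, new_ott_id))
    return new_ott_id


new_id = mint_for_better_source(244265, ("ncbi", "2014-01-06", "40674"))
```

The old registration stays exactly as it was; only the link table records the hypothesis that the two ids denote the same taxon, and that link can be revised without touching either registration.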
Relationships in OTT: It's important to understand that relationships (parent/child) in the Open Tree taxonomy are hypotheses, or heuristic conclusions, based on information provided by the source taxonomies. If OTT says that GBIF v1 taxon X is a child of NCBI v2 taxon Y, that is just a heuristic inference, since neither of the taxonomies directly addresses the question of how X relates to Y. The relationships are not definitive of the OTT id. The relationships can be wrong due to misinterpretation of the source taxonomies, or flaws in the logic or heuristics. So to fully understand the stable identity associated with an OTT id, and its relationships, you have to look at the versioned source taxonomies, not at any OTT version. (Not that any sane person will do this.)
And if you do that, you will find that the source taxonomy doesn't tell you everything that you would like to know, and you will still have to make blind guesses in order to make progress.