OTT id registry proposal

See OTT id registry proposal 2 for a different articulation of the same idea.

Requirements:

Stability
Support provenance tracking (source and version awareness) - distinguish versions of X from ongoing X
support use cases...

Use cases:

Mint OTT ids for user-added (study) "taxa" to put into a nexson file; an operation to be performed by the add-some-taxa user interface.
Patches to OTT independent of any study; either written without benefit of UI, like the current patch systems, or using some as yet undesigned UI.
When preparing a new OTT release, smasher needs a bunch of new OTT ids for "new" "taxa" coming in from new versions of source taxonomies.
OTT ids are stored in nexson files and it would be nice to know what old ones mean and whether 'meaning' has changed over time.

What is in the registry?

At an abstract level, each OTT id is associated in the registry with a pair <capture-spec, export-id>. A 'capture' (defined here by fiat) is a machine-readable, archivable particular snapshot of some resource, where a 'resource' is some potentially changing editable document or database, such as NCBI taxonomy. For example, if the resource is NCBI taxonomy (which changes over time), a capture of it is a particular snapshot or download of the taxdump.tgz file from their web site. (A 'capture-spec' is something that specifies a capture; to be discussed later.)

An export-id is a hook into the content of the capture; for NCBI it would be an NCBI taxon id, but for some resources it might just be a name. Heuristically an export-id applied to two captures of the same resource will probably give you something with similar function (the "same taxon"), but this is not a requirement, and how true it is may depend on which resource you're talking about.

That is, we anchor OTT ids to a particular node in a particular capture of an imported taxonomy, the capture where the "taxon" was first seen by the OTT builder.
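As a minimal sketch of this association (not a settled design), a registry entry can be modelled as a mapping from an OTT id to a (capture-spec, export-id) pair. The names `Registration`, `registry` and `lookup` are hypothetical, and the capture-spec string format is an assumption; the sample entry is the OTT #7338 example used later on this page.

```python
from typing import Dict, NamedTuple


class Registration(NamedTuple):
    capture_spec: str  # durable identifier of a capture (string format is an assumption)
    export_id: str     # hook into the capture, e.g. an NCBI taxon id or just a name


# Each OTT id is anchored to the node in the capture where the "taxon" was first seen.
registry: Dict[int, Registration] = {
    7338: Registration(capture_spec="NCBI.2014-05-02", export_id="2456"),
}


def lookup(ott_id: int) -> Registration:
    """Return the (capture-spec, export-id) pair that anchors an OTT id."""
    return registry[ott_id]
```

Note that nothing here records taxonomic relationships; those live in the captures themselves.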

A salient aspect of this proposal, compared to a previous proposal, is that taxonomic relationships are not in the registry. They are in things referenced by the registry (captures), but not in the registry itself. Any requirements on interpretation of OTT ids as referring to taxa, clades, etc. are outside the scope of the registry.

The intent is that an OTT id refer to an equivalence class of nodes, usually nodes in different captures all considered to have the same interpretation (maybe to "represent the same taxon"). The particular equivalence class would be the one containing the node named in the OTT id's registration. But how those equivalence classes are to be constituted is up to the application layer, not the registry, and unlike the registrations, may change as new information becomes available.

For a study or patch, what is the "resource" and what are its captures?

Define a "taxonomic claim resource" (or just "claim resource") to be a file in any GitHub repo intended to amend or correct OTT. This covers the first two use cases. For the new-taxa-from-study use case, the claim is written (created, captured) during the curation process. Captures are particular versions of the document, and they are selectable by commit sha if the resource path is known.

The contents of a claim are written in the "claim language", yet to be designed. (Probably JSON syntax, if that helps.) The claim language has to be able to

provide an export-id namespace of some kind,
anchor the export-ids in the taxonomic literature, ideally by taxonomic name plus DOI, but also by GenBank id and anything else that might help decide whether another "taxon", from a capture of another resource (taxonomy or claim), is the same as this one,
provide taxonomic relationships expressed in terms of these export-ids (i.e. between the "taxa" represented by the ids)
relate these export-ids to nodes defined outside this claim capture, e.g. to nodes in captures of other resources that already have OTT ids.
permit comparison and analysis of successive captures of the claim resource
support the things people are asking for in regard to OTT: "define new taxa", change relationships, delete from OTT, align or disalign to other resources, define synonyms, change spelling, set flags, etc.

Also see here.

What has to happen for registration

First, avoid minting new OTT ids. Align "taxa" used inside the resource to nodes in OTT (preferably the latest release) or in the registry in general, when possible, using TNRS, taxonomy alignment, resource capture alignment, explicit directives, etc. - every trick available. Otherwise an opportunity for synthesis will be lost. Matches can be made retrospectively (OTT id "synonyms"), but it is better to get it right the first time.

If at least one node has no suitable already-registered id:

Obtain provisional OTT ids from a registry API. Invoke the service passing it the list of export-ids that are in need of OTT ids. The service returns one new OTT id per provided export-id, plus a "password" enabling one to complete the registration by providing the final capture-spec to a second service.
Store OTT ids in resource (resource = study nexson, OTT, etc.)
Write new capture of resource, obtaining the new capture's capture-spec
Invoke second API service passing it the password and the capture-spec. This completes the registration.
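The two-step exchange above can be sketched as an in-memory stand-in for the registry service. This is an illustration under assumptions, not the actual API: the class `Registry` and its methods `request_ids` and `complete` are hypothetical names, and the "password" is modelled as a random token.

```python
import secrets
from typing import Dict, List, Tuple


class Registry:
    """In-memory stand-in for the two registry services (hypothetical API)."""

    def __init__(self, first_id: int = 1) -> None:
        self._next_id = first_id
        # Completed registrations: ott_id -> (capture_spec, export_id)
        self.nodes: Dict[int, Tuple[str, str]] = {}
        # Provisional ids awaiting a capture-spec: token -> {export_id: ott_id}
        self._pending: Dict[str, Dict[str, int]] = {}

    def request_ids(self, export_ids: List[str]) -> Tuple[Dict[str, int], str]:
        """Service 1: one provisional OTT id per export-id, plus a token."""
        assigned: Dict[str, int] = {}
        for export_id in export_ids:
            assigned[export_id] = self._next_id
            self._next_id += 1
        token = secrets.token_hex(16)
        self._pending[token] = assigned
        return assigned, token

    def complete(self, token: str, capture_spec: str) -> None:
        """Service 2: bind the provisional ids to the final capture-spec."""
        # pop() raises KeyError if the token is unknown or was already used.
        assigned = self._pending.pop(token)
        for export_id, ott_id in assigned.items():
            self.nodes[ott_id] = (capture_spec, export_id)
```

A client would call `request_ids`, store the returned ids in the resource, write the new capture, and then call `complete` with the capture-spec it obtained. Until then the ids remain provisional and do not appear in `nodes`.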

Resources and captures

The registry will have to refer to captures, i.e. it will contain capture-specs. Capture-specs will have to be as durable as any id in the registry.

One way to do this is to register capture-specs for captures, using almost the same apparatus as for nodes. A capture registration will need a resource-spec of some kind and some key that selects that capture of the resource from the ones available. A selector could be a date, hash, version number, or anything else that can be generally understood (intentionally being vague given the wide variety of resources that need to be handled).

It's important that capture-specs be short since they are often repeated millions of times.

Similarly, we'll also need durable resource-specs. A resource registration should include information adequate to access the current capture (today), with the understanding that the access method may need to change in the future. It should therefore have some amount of accompanying description and documentation, although that needn't be baked into the registration.

Implementation

The registry as described above could be stored in a set of CSV files, and/or in a database as simple tables.

A resource table would associate a resource-spec with a resource description (TBD).
A capture table would associate a capture-spec with a (resource-spec, selector) pair. The association is 1-1 - no two captures can have the same combination of resource and selector.
A node table would associate an OTT id with a (capture-spec, export-id) pair, similarly.
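As a rough sketch of the CSV option, the capture and node tables might be loaded and joined as follows. The column names, the short capture-spec `c1`, the resource-spec `ncbi` and the helper functions are all assumptions made for illustration; the resource table is left out because its description format is still TBD.

```python
import csv
import io

# Made-up sample rows; a real registry would read these from files.
CAPTURES = "capture_spec,resource_spec,selector\nc1,ncbi,2014-05-02\n"
NODES = "ott_id,capture_spec,export_id\n7338,c1,2456\n"


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def load_captures(text):
    """capture_spec -> (resource_spec, selector), enforcing the 1-1 association."""
    captures = {}
    for row in read_rows(text):
        pair = (row["resource_spec"], row["selector"])
        if row["capture_spec"] in captures or pair in captures.values():
            raise ValueError(f"duplicate capture registration: {row}")
        captures[row["capture_spec"]] = pair
    return captures


def load_nodes(text):
    """ott_id -> (capture_spec, export_id)."""
    return {int(r["ott_id"]): (r["capture_spec"], r["export_id"])
            for r in read_rows(text)}


def resolve(ott_id, nodes, captures):
    """Expand an OTT id to (resource_spec, selector, export_id)."""
    capture_spec, export_id = nodes[ott_id]
    resource_spec, selector = captures[capture_spec]
    return resource_spec, selector, export_id


captures = load_captures(CAPTURES)
nodes = load_nodes(NODES)
print(resolve(7338, nodes, captures))  # ('ncbi', '2014-05-02', '2456')
```

Keeping the capture-spec as a short key in the node table means the longer (resource-spec, selector) pair is written once per capture rather than once per node.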

There are many other design options. For example we could just use (resource-spec, selector) pairs for capture-specs, making the node table four columns instead of three. This could be seen as a simplification, but it would multiply the size of the registry files many times over.

Doesn't this mean you need copies of all versions of all resources (source taxonomies) on hand in order to use the registry?

We use new OTT ids to 'refer' to old "taxa" in NCBI taxonomy, but how does the association get made? Typically we're only looking at a new capture of NCBI taxonomy.

We can analyze the correspondence between old and new taxonomy captures offline, propose alignments, and store them for use by clients. Typically I expect the alignment to simply match up nodes that have the same id, after confirming that the ids haven't been repurposed - I don't think any of our resources do such a thing, but it should probably be checked. So what we compute are the exceptions to alignment by id.

So, we can take what we have in the registry, e.g. OTT #7338 = NCBI.2014-05-02 #2456 (old), and use stored information that NCBI.2014-05-02 #2456 (old) ~= NCBI.2016-04-12 #2456 (new), to obtain something we can use, which is OTT #7338 ~= NCBI.2016-04-12 #2456 (new).
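The composition above can be sketched as follows, assuming the stored alignment between two captures is represented only by its exceptions to alignment by id. The function `align` and the shape of the `exceptions` table are hypothetical; the table is taken to cover one particular (old capture, new capture) pair.

```python
from typing import Dict, Optional, Tuple

Node = Tuple[str, str]  # (capture-spec, export-id)


def align(node: Node, new_capture: str,
          exceptions: Dict[Node, Optional[str]]) -> Optional[Node]:
    """Map a node of an old capture to the corresponding node of new_capture.

    By default the export-id carries over unchanged. `exceptions` lists old
    nodes whose id changed (value is the new id) or that have no counterpart
    in the new capture (value is None).
    """
    if node in exceptions:
        new_id = exceptions[node]
        return None if new_id is None else (new_capture, new_id)
    return (new_capture, node[1])


# Registry entry from the example: OTT #7338 = NCBI.2014-05-02 #2456
registry = {7338: ("NCBI.2014-05-02", "2456")}

# No exceptions recorded, so the id is assumed to carry over.
print(align(registry[7338], "NCBI.2016-04-12", {}))
# ('NCBI.2016-04-12', '2456')
```

Because only exceptions are stored, the alignment data stays small when a resource keeps its ids stable between captures.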

I think the typical user would consult OTT directly, rather than look at the registry.

Status

The idea is very similar to what smasher does now without a registry, but codified to enable reasoning. It is much simpler to implement than it is to explain.

I have a list of resources and captures used so far in creating all of the OTT builds, with associations for all captures of OTT that got deployed (i.e. which version of NCBI, GBIF, etc. was used to make each version of OTT). And I have all the captures. From this I can create registrations for all OTT ids, even those that have been discontinued.

I have drafts of the resource and capture tables adequate for OTT's (smasher's) purposes.

The claim language needs to be designed and implemented, as a successor to patch language version 3 (slideshare).
