Overhaul taxon subsets #3363

Merged
merged 8 commits into from
Oct 9, 2024

gouttegd
Collaborator

@gouttegd gouttegd commented Sep 18, 2024

This PR updates the way we are building “taxon subsets”.

As explained in #3362, we currently have, for reasons unknown (to me at least), two slightly different methods to create taxon subsets: one producing the -view subsets (human-view, mouse-view, xenopus-view) and one producing the -basic subsets (amniote-basic, euarchontoglires-basic). Both methods rely on the use of OWLTools.

This PR replaces both methods with a single one (so that all taxon subsets are produced in the same way) that relies on a new command in Uberon’s custom ROBOT plugin.

The PR does not change which taxon subsets are produced and released by default (the five aforementioned subsets: human, mouse, xenopus, amniote, and euarchontoglires).

More subsets can be produced on demand: all that is needed is to define a TAXON_ID_subsetname Make variable pointing to the desired NCBITaxon ID.

For example, to create a subset for, say, insects, one can do:

sh run.sh make subsets/insect-view.owl TAXON_ID_insect=NCBITaxon:50557
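
The naming convention used here (a `subsets/NAME-view.owl` target paired with a `TAXON_ID_NAME` variable) can be wrapped in a small shell helper. This is only a sketch: the function name `taxon_subset_cmd` is hypothetical and not part of the Uberon repository, and it prints the command instead of running it.

```shell
# Hypothetical helper (not part of Uberon): print the command that would
# build an on-demand taxon subset for a given name and NCBITaxon ID.
taxon_subset_cmd() {
    name=$1     # subset name, e.g. "insect"
    taxon=$2    # taxon CURIE, e.g. "NCBITaxon:50557"

    # Reject anything that does not look like an NCBITaxon CURIE.
    case $taxon in
        NCBITaxon:[0-9]*) ;;
        *) echo "error: expected NCBITaxon:<digits>, got '$taxon'" >&2
           return 1 ;;
    esac

    # The subset name appears twice: in the target and in the variable name.
    printf 'sh run.sh make subsets/%s-view.owl TAXON_ID_%s=%s\n' \
        "$name" "$name" "$taxon"
}

taxon_subset_cmd insect NCBITaxon:50557
```

The printed line is the same command as the one shown for insects, so the helper only saves typing the subset name twice.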

The PR also adds the ability to create, not a subset directly, but a small component containing only oboInOwl:inSubset annotations to “tag” classes that belong to a taxon subset. For example:

sh run.sh make subsets/human-tags.ofn

would create a human-tags.ofn component containing, for all Uberon classes that belong to the human subset, oboInOwl:inSubset <http://purl.obolibrary.org/obo/uberon/core#human_subset> annotation assertion axioms. Such a component can then be merged with the main ontology for downstream use (e.g., extracting all the classes of the subset).
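
As an illustration of the downstream use mentioned here, the sketch below merges the tag component with the main ontology and lists the tagged classes with a SPARQL query. It assumes a standard ROBOT installation (`merge` and `query` are regular ROBOT commands); the output file name `human-classes.tsv` and the location of `uberon.owl` in the current directory are assumptions made for the example.

```shell
# Sketch only: output file names and input locations are assumptions.
subset_iri='http://purl.obolibrary.org/obo/uberon/core#human_subset'
query_file=$(mktemp)

# SPARQL query selecting every class tagged as belonging to the subset.
cat > "$query_file" <<EOF
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
SELECT ?cls WHERE {
  ?cls oboInOwl:inSubset <$subset_iri> .
}
EOF

# Run ROBOT only if it is installed and both input files are present.
if command -v robot >/dev/null 2>&1 \
   && [ -f uberon.owl ] && [ -f subsets/human-tags.ofn ]; then
    # Merge the tags into the ontology, then query the merged result.
    robot merge --input uberon.owl --input subsets/human-tags.ofn \
          query --query "$query_file" human-classes.tsv
else
    echo "robot or input files not found; query written to $query_file"
fi
```

The same query works for any other taxon subset by changing the subset IRI and the tag file.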

closes #3362

Use latest version (0.3.1) of the Uberon-specific ROBOT plugin, which
provides a new command to facilitate the creation of taxon subsets.

The custom Makefile contains two sets of rules to create taxon subsets
in two different ways:

* one set using OWLTools' `--make-species-subset` command (resulting in
  the `*-basic.owl` subsets);
* one set using the files in `src/ontology/contexts` to do basically the
  same thing as `--make-species-subset`, merely in a slightly different
  way (resulting in the `*-view.owl` subsets).

This commit replaces both sets of rules with a single rule that relies on
the newly available `create-species-subset` command in the Uberon ROBOT
plugin.

In addition, a new rule is added to allow the creation of a component
file that contains `oboInOwl:inSubset` annotations to tag all the
classes that belong to a given subset. That rule is currently unused,
but the expectation is that it could be used by downstream applications
to facilitate the use of taxon-specific subsets.

There is no reason to have two different naming conventions for the
taxon-specific subsets (-view and -basic). Let's settle for -view.

This may require updating the PURL configuration for Uberon, if there
are people out there that are using the euarchontoglires-basic.owl
and/or amniote-basic.owl artifacts (though GitHub download stats suggest
nobody ever downloaded them).
@gouttegd gouttegd self-assigned this Sep 18, 2024
@gouttegd gouttegd requested a review from dosumis September 18, 2024 10:59
matentzn
matentzn previously approved these changes Oct 3, 2024
Contributor

@matentzn matentzn left a comment

I have reviewed the code changes and they look great. I would like to offer a single word of caution: renaming files, even subset files, may break existing pipelines somewhere on the deep web unless we also add a PURL redirect to the OBO PURL config.

src/ontology/uberon.Makefile
--reasoner ELK \
$(foreach root,$(TAXON_SUBSET_ROOTS),--root $(root)) \
reason --reasoner ELK --equivalent-classes-allowed all \
--exclude-tautologies structural \
relax \
remove --axioms equivalent \
relax \
Contributor

I can't imagine what this second relax does, but well, since it's there...

@gouttegd
Collaborator Author

gouttegd commented Oct 3, 2024

renaming files, even subset files, may break existing pipelines somewhere on the deep web unless we also add a purl redirect to the OBO purl config.

I know. But this is precisely what should be handled at the PURL level, which exists exactly for this purpose. There is no reason for us to refrain from renaming files if it brings better consistency (in this case, by having all taxon subsets consistently named something-view.owl instead of some being named something-basic).

@dosumis
Contributor

dosumis commented Oct 3, 2024

Hi @gouttegd - great to see this.

I'm especially excited to see this:

The PR also adds a possibility to create, not a subset directly, but a small component containing only oboInOwl:inSubset annotations to “tag” classes that belong to a taxon subset. For example:

sh run.sh make subsets/human-tags.ofn

would create a human-tags.ofn component containing, for all Uberon classes that belong to the human subset, oboInOwl:inSubset http://purl.obolibrary.org/obo/uberon/core#human_subset annotation assertion axioms. Such a component can then be merged with the main ontology for downstream use (e.g., extracting all the classes of the subset).

I would really like the tags to be incorporated into the release files. We can use them straight away in our autosuggest pipelines to boost species-relevant terms.

@gouttegd
Collaborator Author

gouttegd commented Oct 3, 2024

I would really like the tags to be incorporated into the release files.

I wasn’t sure whether this was your preferred option, so for now the idea was to merely produce the -tags file and let downstream users of Uberon merge it with uberon.owl if they wanted.

But it certainly can be done directly upstream if preferred.

Do we want all release artefacts to include those tags (e.g. uberon.owl, uberon-basic.owl, but also all the organ-specific subsets such as nephron-minimal, sensory-minimal, etc.)? Or just one supplementary artefact containing the taxon subset tags (e.g. something like uberon-with-subset-tags.owl)?

Based on your comment I assume the former (tags included in all release artefacts), in which case we’ll need a new intermediate file (upstream of uberon.owl) from which the subset tags can be generated before we produce the final uberon.owl with the subset tags included.

In addition to producing taxon subset files, we also want to include,
directly into the release products, the oboInOwl:inSubset annotations
that mark all terms that belong to the taxon subsets.

There is a bit of a conundrum here as the taxon subsets are computed on
the main uberon.owl product, but we want to include them in that very
product. To solve this, we introduce a "near-final" intermediate product
(tmp/uberon.owl, referenced by the new Make variable POSTPROCESS_SRC).
The pipeline that previously produced the final uberon.owl now produces that
intermediate tmp/uberon.owl, from which the taxon subsets are derived.
The new final uberon.owl is produced simply by merging the intermediate
tmp/uberon.owl with the taxon subset tag files.

Some of the other parts of the Uberon pipeline that were using the final
uberon.owl are now using the intermediate tmp/uberon.owl, because the
taxon subset annotations are not needed for those steps. This notably
concerns the bridge checks and the building of Composite Metazoan.

The rule that produces the `uberon.json.gz` file has no place in
the main "BUILDING UBERON ITSELF" pipeline, and it is in fact doubtful
that this rule is even useful, so we move it to the purgatory.

The unfortunate use of $^ in the rule that produces `uberon.owl` leads
to `uberon-full.owl` being forcefully injected into `uberon.owl`,
because `uberon-full.owl` is declared (in the ODK-generated Makefile) as
a dependency of `uberon.owl`. We must avoid using that variable and only
use the dependencies that are explicitly listed in uberon.Makefile.
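
The `$^` pitfall described in this commit message can be reproduced in isolation. The sketch below assumes GNU Make is installed; the file names (`generated.mk`, `out.txt`, `src.txt`, `tags.txt`, `full.txt`) are made up for the demonstration and merely stand in for the ODK-generated Makefile, `uberon.owl`, and its prerequisites.

```shell
# Sketch, assuming GNU Make: shows that $^ also picks up prerequisites
# declared for the same target in another (included) makefile.
demo_dir=$(mktemp -d)
touch "$demo_dir/src.txt" "$demo_dir/tags.txt" "$demo_dir/full.txt"

# Stand-in for the generated Makefile: it only adds a prerequisite.
printf 'out.txt: full.txt\n' > "$demo_dir/generated.mk"

# Stand-in for the custom Makefile: the recipe prints $^ next to the
# prerequisites that are explicitly listed in this rule.
printf 'include generated.mk\nout.txt: src.txt tags.txt\n\t@echo "via caret: $^"\n\t@echo "explicit: src.txt tags.txt"\n' \
    > "$demo_dir/Makefile"

if make --version 2>/dev/null | grep -q 'GNU Make'; then
    demo_output=$(cd "$demo_dir" && make -s out.txt)
    echo "$demo_output"
else
    echo "GNU Make not found; skipping demonstration"
fi
```

In the "via caret" line, `full.txt` shows up even though the rule with the recipe never mentions it, which is the situation the commit avoids by naming its inputs explicitly.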
@gouttegd
Collaborator Author

gouttegd commented Oct 8, 2024

PR updated to include the oboInOwl:inSubset tags directly into the release products.

For now, we include only the tags for the human subset and the mouse subset. To add tags for another subset, it would simply be a matter of adding another tag file to the POSTPROCESS_ADDITIONS Make variable, for example:

POSTPROCESS_ADDITIONS = subsets/human-tags.ofn \
                        subsets/mouse-tags.ofn \
                        subsets/drosophila-tags.ofn
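
Conceptually, the final uberon.owl is the intermediate tmp/uberon.owl merged with every file listed in POSTPROCESS_ADDITIONS. The sketch below prints an equivalent ROBOT `merge` command for a given list of additions. The function name `postprocess_merge_cmd` is hypothetical, and the actual rule in the custom Makefile may pass further options or steps.

```shell
# Hypothetical helper: print a ROBOT merge command combining the
# near-final intermediate ontology with a list of tag files.
postprocess_merge_cmd() {
    src=$1      # intermediate product, e.g. tmp/uberon.owl
    out=$2      # final product, e.g. uberon.owl
    shift 2     # remaining arguments are the tag files to add

    cmd="robot merge --input $src"
    for addition in "$@"; do
        cmd="$cmd --input $addition"
    done
    printf '%s --output %s\n' "$cmd" "$out"
}

postprocess_merge_cmd tmp/uberon.owl uberon.owl \
    subsets/human-tags.ofn subsets/mouse-tags.ofn
```

Adding a third tag file to the argument list simply adds one more `--input` option, which mirrors adding one line to POSTPROCESS_ADDITIONS.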

Some considerations:

  1. The tags are added as part of the release pipeline, and are not imported from the -edit file. This means the tags will not be visible when working on the ontology from Protégé. This is on purpose. The tags are automatically computed and are not supposed to be manually edited, but there would be no way to enforce that if they were visible from Protégé.

  2. The tags are added to the main uberon.owl artefact, and from there are propagated to all files derived from this artefact. They are, however, explicitly excluded from Composite-Metazoan (and its progenitor Collected-Metazoan). Including them in Composite-Metazoan would produce a somewhat weird result: Uberon terms would be tagged as belonging to a given taxon subset, but terms coming from the taxon-specific anatomy ontologies would not be. For example, UBERON’s artery would be tagged as belonging to the mouse subset, but not EMAPA’s superior vesical artery, even though that term is (obviously, as coming from EMAPA) definitely applicable to mice. I believe this would be confusing.

  3. Adding the mouse and human tags increases the size of the main uberon.owl file by about 3 MB (from 88 MB to 91 MB). Not sure whether this is a big concern or not, but I think it should be noted, especially if we want to add more taxon subset tags to the main release file.

Contributor

@matentzn matentzn left a comment

Assuming that uberon.owl was swapped out for POSTPROCESS_SRC in all the relevant places in the Makefile, this looks good to me!

@gouttegd
Collaborator Author

gouttegd commented Oct 9, 2024

assuming that uberon.owl was swapped out for POSTPROCESS_SRC in all the relevant places in the Makefile

Only two places, in fact:

  1. As the input of the Composite-Metazoan pipeline, because as I have said in my last comment, we (at least I) do not want the taxon subset tags to be included in Composite-Metazoan, so it’s fine to build Composite-Metazoan from the POSTPROCESS_SRC intermediate file (which is the same thing as the final uberon.owl minus the taxon subset tags);

  2. As the input of the “bridge checks”, because the taxon subset tags are completely unnecessary for those checks (they would cause no harm, but would serve no purpose either).

All other custom pipelines that were dependent on uberon.owl are still dependent on uberon.owl, meaning those custom pipelines now work on a version of the ontology that includes the taxon subset tags. Notably, this means that the “system subsets” (e.g. circulatory-minimal, digestive-minimal, etc.) do contain those tags.

@gouttegd
Collaborator Author

gouttegd commented Oct 9, 2024

Merging here.

@dosumis: (1) If you want more taxon subset tags than just human and mouse to be included in the main product, just say so and we can add new subsets anytime; it’s a one-line addition to the custom Makefile.

(2) I assume you want to see the same thing in CL?

@gouttegd gouttegd merged commit 3a11216 into master Oct 9, 2024
1 check passed
@gouttegd gouttegd deleted the overhaul-taxon-subsets branch October 9, 2024 09:51
Successfully merging this pull request may close these issues.

Overhauling “taxon subsets”
4 participants