IDR software upgrade (OMERO/Bio-Formats) #389

Closed
sbesson opened this issue Oct 27, 2022 · 4 comments

@sbesson
Member

sbesson commented Oct 27, 2022

Since its inception, the IDR project has been a driver for technological improvements of the OME software. To meet the deadlines and requirements of the resource, the software stack diverged early from the mainline OMERO/Bio-Formats. Although several features have been backported and IDR is effectively compatible with the OMERO API, the resource is currently deployed on a custom OMERO, and maintaining this fork comes with its own set of challenges, especially when it comes to upgrades.

As the IDR team looks into integrating support for OME-NGFF, this issue attempts to capture the state of the divergence and the various options for the software upgrade.

OMERO

IDR drove several OMERO functionalities, including the ability to create read-only OMERO instances as well as several improvements to the Python stack. Most of this work has been ported back to the mainline, with OMERO 5.4.6 notably including the read-only functionality.

Due to the usage of a custom Bio-Formats version (see below), custom OMERO builds were created and deployed. Since OMERO 5.5, the deployment playbooks have been updated to consume the mainline OMERO artifacts directly and override the Bio-Formats JARs - see #168.

Currently, IDR tracks the OMERO 5.6.x development line and uses version 5.6.0 (idr_omero_server_release: 5.6.0). Attempts to upgrade to more recent versions of OMERO 5.6.x have highlighted some issues with the restrictions introduced in OMERO 5.6.1 to address 2016-SV1, which are incompatible with some of the historical IDR imports - see #229 and #327 for more details.

To allow imports to be functional in a modern environment (JDK 11), the deployment now also needs to override some of the server JARs - see #331. An upgrade to a recent OMERO 5.6.x release would make these overrides unnecessary.

The latest OMERO 5.6.5 release containing ZarrReader is currently deployed on several pilot IDR servers (see #380) to test submissions where the raw data is available using the OME-NGFF format. There is still ongoing feedback and potential development to make this production ready but assuming validation, either the deployment will need to be upgraded to track OMERO 5.6.5 or the reader will need to be injected at deployment time like above. In both cases, some changes will need to happen at the IDR Bio-Formats level to allow the new reader to be detected.

Bio-Formats

The custom modifications of Bio-Formats were made to support the publication of several historical studies. Unlike OMERO, there are still several features only present in the IDR fork of OME Bio-Formats. Managing and reducing this divergence has been discussed several times around three classes of solutions:

  • Backporting the IDR fixes to Bio-Formats mainline. Similar to above, some of the modifications have already been made upstream (ome/bioformats#2830, ome/bioformats#2562, ome/bioformats#2900, ome/bioformats#2912). However, several readers and reader fixes are still only in IDR, including FlexReader, ScreenReader, FilePatternReader as well as the Memoizer propagation.
  • Converting studies that require custom IDR readers into open formats, e.g. OME-TIFF or OME-NGFF. The primary advantage would be to use the mainline OMERO/Bio-Formats software directly. The main challenges would be the computational cost as well as working out the OMERO workflow to point at a new Fileset.
  • Maintaining an IDR Bio-Formats fork and periodically synchronizing it with the mainline Bio-Formats. This is the current status quo and is primarily driven by new features.

The last synchronization was made in IDR/bioformats#12 to include Bio-Formats 6.x. The current IDR Bio-Formats 0.6.x line is effectively derived from the mainline Bio-Formats 6.2.x and all the relevant changes in the following 0.6.x patch releases should have been backported and released in the mainline Bio-Formats (IDR/bioformats#19, IDR/bioformats#21, IDR/bioformats#22, IDR/bioformats#25, IDR/bioformats#26, IDR/bioformats#27).

In the context of adding ZarrReader to the IDR software, the current version shipped by OMERO 5.6.5 is Bio-Formats 6.10.0. Although it is probably desirable, merging the mainline Bio-Formats (currently 6.11.x) might not be a strict requirement. At minimum, formats-api will need to be modified so that readers.txt declares the new reader. Note also that several dependencies have been upgraded in the Bio-Formats stack, some of them in coordination with ZarrReader, e.g. ome/bioformats#3788. Although the API should remain compatible, it is possible that making only the minimal set of changes could result in runtime issues.
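
As a rough illustration of the minimal change, declaring a reader amounts to adding its class name to readers.txt. The sketch below is not taken from the IDR tooling: it assumes the file lists one fully qualified class name per line, optionally followed by bracketed options and a `#` comment, and it assumes the class name `loci.formats.in.ZarrReader`. Both assumptions should be checked against the actual formats-api resource. The helper name `declare_reader` is hypothetical.

```python
def declare_reader(readers_txt: str, class_name: str) -> str:
    """Return readers.txt content with class_name declared exactly once.

    Assumed line format: "<fully.qualified.Class>[options]  # comment".
    """
    declared = set()
    for line in readers_txt.splitlines():
        # Drop trailing comments, then any bracketed options.
        entry = line.split("#", 1)[0].strip()
        entry = entry.split("[", 1)[0].strip()
        if entry:
            declared.add(entry)
    if class_name in declared:
        return readers_txt
    # Append on its own line, making sure the previous line is terminated.
    if readers_txt and not readers_txt.endswith("\n"):
        readers_txt += "\n"
    return readers_txt + class_name + "\n"


if __name__ == "__main__":
    current = "loci.formats.in.FlexReader  # flex\n"
    # The ZarrReader class name below is an assumption for illustration.
    print(declare_reader(current, "loci.formats.in.ZarrReader"), end="")
```

The sketch simply appends the entry. The position of a reader in the file may affect which reader is tried first, so the right location would need to be decided when editing the real file.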

Deployment & testing

While deploying new versions of the software is relatively straightforward, the biggest challenge is that it invalidates the Bio-Formats memo files. As IDR strives to be a high-availability service, these memo files need to be regenerated for the 1-10M filesets imported into IDR.

https://github.com/IDR/deployment/blob/master/docs/operating-procedures.md#bio-formats-cache-regeneration describes the last used procedure for regenerating these memo files which largely involves deleting the old cache and calling setId on every single fileset using GNU parallel. This took several days to complete last time it was executed so at least a similar time should be expected.

If merging the mainline Bio-Formats is in scope, it is important to ensure the synchronization does not affect the custom modifications and that historical studies remain available. Historically, software upgrades necessitated several rounds of software changes, deployment, testing and feedback.

To allow quick iterations, the initial phase of testing focused on regenerating the memo files for only the first plate/dataset of each screen/project. Testing was executed only against these plates/datasets for reporting issues and making fixes. Once a satisfactory version had been validated, all memo files could be regenerated for a final assessment.
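
The selection step in that first testing phase can be sketched as follows. This is only an illustration: the function name, the input mapping and the study and plate names are hypothetical, and "first" is assumed here to mean first when sorted by name, which may differ from the ordering actually used.

```python
from typing import Dict, List


def first_containers(studies: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each screen/project to its first plate/dataset, sorted by name."""
    return {
        study: sorted(containers)[0]
        for study, containers in studies.items()
        if containers  # skip studies with nothing imported
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    example = {
        "screenA": ["plate2", "plate1"],
        "projectB": ["dataset1"],
    }
    print(first_containers(example))
```

Memo files would then be regenerated only for the returned plates/datasets before committing to a full regeneration.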

/cc @jmarie @francesw

@will-moore
Member

Notes from meeting just now...

Testing sheet: https://docs.google.com/spreadsheets/d/11Eg8JzY7dqRUMGVAjGIfTcGDs0BitJQJQ20BVoaiZOg/edit#gid=1516505897

Josh: vote for converting data that uses custom readers to NGFF as it boosts the amount of sample data and we can get rid of custom fork.
Also avoid costly memo-file regeneration on upgrades.

What's not supported by latest BioFormats?
@sbesson Anything before idr0043 is possible? SPW and pattern-file e.g. idr0026
Steps: pilot with latest BioFormats - check each study and see what opens

Run bioformats2raw on data. Only use e.g. omero-cli-zarr if needed.
Copy data where? E.g. BIA submission gets the data on the right storage.
E.g. HPA uploads each get their own BIA ID now.
Need to contact Matthew - Need to know how much data we're talking about

Re-import NGFF data? Josh: no
bioformats2raw data should preserve series order. OMERO should be able to switch without issue. Doesn't care what reader it is using. Need auto tests to check lists of images, dimensions etc.
Will need something to update the "used files" in the database.
Hopefully no metadata work (Image IDs should stay the same).
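
A minimal sketch of the kind of automated check mentioned above, comparing the ordered list of images and their dimensions before and after conversion. It assumes the records have already been collected by some other means; the record layout (name plus sizeX, sizeY, sizeZ, sizeC, sizeT) and the function name are assumptions for illustration.

```python
from typing import List, Tuple

# (sizeX, sizeY, sizeZ, sizeC, sizeT)
Dims = Tuple[int, int, int, int, int]
Series = Tuple[str, Dims]


def compare_series(before: List[Series], after: List[Series]) -> List[str]:
    """Return a list of human-readable differences; empty if identical."""
    problems = []
    if len(before) != len(after):
        problems.append(f"series count differs: {len(before)} != {len(after)}")
    # Compare position by position so that a change in series order is caught.
    for index, ((name_a, dims_a), (name_b, dims_b)) in enumerate(zip(before, after)):
        if name_a != name_b:
            problems.append(f"series {index}: name {name_a!r} != {name_b!r}")
        if dims_a != dims_b:
            problems.append(f"series {index}: dimensions {dims_a} != {dims_b}")
    return problems


if __name__ == "__main__":
    original = [("A1", (1024, 1024, 1, 3, 1)), ("A2", (1024, 1024, 1, 3, 1))]
    converted = [("A2", (1024, 1024, 1, 3, 1)), ("A1", (1024, 1024, 1, 3, 1))]
    for problem in compare_series(original, converted):
        print(problem)
```

Comparing by position rather than by name is deliberate: the concern in the notes is that series order is preserved, so a reordering should be reported even when the set of images is unchanged.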

JM: these formats that need a custom version of BioFormats will also need a custom bioformats2raw to do the NGFF conversion?
Seb: yes, shouldn't be too hard, but will need testing.

Josh: How many pilots do we need? Workflow above can proceed in parallel (for green columns on spreadsheet).
Converge on idr-next (blue columns) will interrupt release cycle.

Testing: assume all plates in each study are same format: if one fails - all will fail.
But good to test multiple Datasets.

@will-moore
Member

will-moore commented Feb 14, 2023

Meeting today:

Seb will open bioformats2raw PR to update reader with production IDR bioformats, which can read pattern files etc.

Use this on a pilot to read directly from /filesets/ to convert original data.
e.g. pilot-idr0125 which has latest OMERO.
Test convert an image / plate and import into OMERO. Keep an eye on disk space! df -h
Create a top-level dir, e.g.:

$ sudo mkdir /ngff
$ sudo chmod 777 /ngff

Should be able to import into OMERO from there.

Good to compare disk usage before and after conversion.
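
One way to make that comparison, sketched with the Python standard library only. It sums apparent file sizes, which can differ from the block usage that `du` or `df -h` report, and the function names are hypothetical.

```python
import os


def tree_size(root: str) -> int:
    """Total size in bytes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def size_ratio(original_root: str, converted_root: str) -> float:
    """Converted size divided by original size (nan if the original is empty)."""
    original = tree_size(original_root)
    if original == 0:
        return float("nan")
    return tree_size(converted_root) / original
```

Calling `size_ratio` with the directory holding the original data and the directory holding its converted copy gives a single number per study that can be recorded alongside the conversion notes.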

NB: idr0054 data is on BioStudies - hold off for now

Will to have a first go and make notes...

@will-moore
Member

will-moore commented Feb 22, 2023

Meeting 22nd Feb:

Some studies have issues reading with bioformats2raw due to some XML chars IDR/bioformats#29
But many others convert successfully with IDR/bioformats2raw#1

Next steps:

  • Created a project at https://github.com/orgs/IDR/projects/19/views/1?layout=board
  • Will create an issue for each study to track progress [Done]
  • Need to import the converted images/plates to check if they look OK in OMERO
  • Ideally, we will eventually want to replace the existing Filesets in IDR with NGFF data, and NOT replace existing Images. This will maintain Image IDs and avoid re-annotation, but is an unknown process
  • Starting point could be to use code from https://gitlab.com/openmicroscopy/incubator/omero-python-importer/-/blob/master/import.py to create a Fileset - but not to initiate import (don't create images etc).
  • Then, re-link existing images with the new NGFF Fileset.
  • Test this workflow first with an Image, then a Plate. Can use merge-ci as a test server since this logic doesn't depend on IDR data

@sbesson
Member Author

sbesson commented Jul 4, 2024

Now done with the release of prod122

@sbesson sbesson closed this as completed Jul 4, 2024