IDR software upgrade (OMERO/Bio-Formats) #389
Comments
Notes from meeting just now...
Testing sheet: https://docs.google.com/spreadsheets/d/11Eg8JzY7dqRUMGVAjGIfTcGDs0BitJQJQ20BVoaiZOg/edit#gid=1516505897
Josh: vote for converting data that uses custom readers to NGFF as it boosts the amount of sample data and we can get rid of custom fork.
What's not supported by latest BioFormats? Run bioformats2raw on data. Only use e.g. omero-cli-zarr if needed.
Re-import NGFF data? Josh: no
JM: these formats that need a custom version of BioFormats will also need a custom bioformats2raw to do the NGFF conversion?
Josh: How many pilots do we need? Workflow above can proceed in parallel (for green columns on spreadsheet).
Testing: assume all plates in each study are same format: if one fails - all will fail.
Meeting today: Seb will open bioformats2raw PR to update reader with production IDR bioformats, which can read pattern files etc. Use this on a pilot to read directly from /filesets/ to convert original data.
Should be able to OMERO from there. Good to compare disk usage before and after conversion.
NB: idr0054 data is on BioStudies - hold off for now.
Will to have a first go and make notes...
Meeting 22nd Feb: Some studies have issues reading with bioformats2raw due to some XML chars (see IDR/bioformats#29). Next steps:
Now done with the release of
Since its inception the IDR project has been a driver for technological improvements of the OME software. To meet the deadlines and requirements of the resource, the software stack diverged early on from the mainline OMERO/Bio-Formats. Although several features have been backported and IDR is effectively compatible with the OMERO API, the resource is currently deployed on a custom OMERO, and maintaining this fork comes with its own set of challenges, especially when it comes to upgrades.
As the IDR team looks into integrating support for OME-NGFF, this issue attempts to capture the state of the divergence and the various options for the software upgrade.
OMERO
IDR drove several OMERO functionalities, including the ability to create read-only OMERO instances as well as several improvements to the Python stack. Most of this work, notably the read-only functionality, has been ported back to the mainline with OMERO 5.4.6.
Due to the usage of a custom Bio-Formats version (see below), custom OMERO builds were created and deployed. Since OMERO 5.5, the deployment playbooks have been updated to consume the mainline OMERO artifacts directly and override the Bio-Formats JARs - see #168.
Currently, IDR tracks the OMERO 5.6.x development line and uses version 5.6.0 (see deployment/ansible/group_vars/omero-hosts.yml, line 11 in a673021).
To allow imports to be functional in a modern environment (JDK 11), the deployment now also needs to override some of the server JARs - see #331. An upgrade to a recent OMERO 5.6.x release would make these overrides unnecessary.
The latest OMERO 5.6.5 release containing ZarrReader is currently deployed on several pilot IDR servers (see #380) to test submissions where the raw data is available using the OME-NGFF format. There is still ongoing feedback and potential development to make this production ready but assuming validation, either the deployment will need to be upgraded to track OMERO 5.6.5 or the reader will need to be injected at deployment time like above. In both cases, some changes will need to happen at the IDR Bio-Formats level to allow the new reader to be detected.
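Whether a reader can be detected comes down to whether its class is listed in the reader list that Bio-Formats loads (readers.txt in formats-api). A minimal Python sketch of such a check follows. It assumes one class name per line, with optional `#` comments and optional bracketed options after the class name; the sample content is illustrative only and is not the real IDR readers.txt, and `declared_readers` and `is_declared` are hypothetical helper names.

```python
from __future__ import annotations


def declared_readers(readers_txt: str) -> list[str]:
    """Return the reader class names declared in a readers.txt-style listing."""
    classes = []
    for raw in readers_txt.splitlines():
        # Drop trailing comments, then any bracketed options after the class name.
        entry = raw.split("#", 1)[0].split("[", 1)[0].strip()
        if entry:
            classes.append(entry)
    return classes


def is_declared(readers_txt: str, class_name: str) -> bool:
    """True if class_name appears as a declared reader."""
    return class_name in declared_readers(readers_txt)


# Illustrative content only (assumption), not the real IDR readers.txt.
SAMPLE = """\
# Hypothetical excerpt
loci.formats.in.FilePatternReader  # pattern
loci.formats.in.ZarrReader         # zarr
"""
```

Running `is_declared` against the readers.txt packaged in the deployed formats-api JAR would be one way to confirm, before deployment, that the new reader has been declared.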
Bio-Formats
The custom modifications of Bio-Formats were made to support the publication of several historical studies. Unlike OMERO, there are still several features only present in the IDR fork of OME Bio-Formats. Managing and reducing this divergence has been discussed several times around three classes of solutions:
Backporting the IDR fixes to Bio-Formats mainline. Similar to above, some of the modifications have already been made upstream (ome/bioformats#2830, ome/bioformats#2562, ome/bioformats#2900, ome/bioformats#2912). However, several readers and reader fixes are still only in IDR, including FlexReader, ScreenReader and FilePatternReader, as well as the Memoizer propagation.
Converting studies necessitating custom IDR readers into open formats, e.g. OME-TIFF or OME-NGFF. The primary advantage would be to use the mainline OMERO/Bio-Formats software directly. The main challenges would be computational, as well as working on the OMERO workflow to point at a new Fileset.
Maintaining an IDR Bio-Formats fork and periodically synchronizing it with the mainline Bio-Formats. This is the current status quo and is primarily driven by new features.
The last synchronization was made in IDR/bioformats#12 to include Bio-Formats 6.x. The current IDR Bio-Formats 0.6.x line is effectively derived from the mainline Bio-Formats 6.2.x and all the relevant changes in the following 0.6.x patch releases should have been backported and released in the mainline Bio-Formats (IDR/bioformats#19, IDR/bioformats#21, IDR/bioformats#22, IDR/bioformats#25, IDR/bioformats#26, IDR/bioformats#27).
In the context of adding ZarrReader to the IDR software, the current version shipped by OMERO 5.6.5 is Bio-Formats 6.10.0. Although it is probably desirable, merging the mainline Bio-Formats (currently 6.11.x) might not be a strict requirement. At minimum, formats-api will need to be modified so that readers.txt declares the new reader. Note also that several dependencies have been upgraded in the Bio-Formats stack, some of them in coordination with ZarrReader, e.g. ome/bioformats#3788. Although the API should remain compatible, it is possible that making only the minimal set of changes could result in runtime issues.
Deployment & testing
While deploying new versions of the software is relatively straightforward, the biggest challenge is that it invalidates the Bio-Formats memo files. As IDR strives to be a high-availability service, these memo files need to be regenerated for the 1-10M filesets imported into IDR.
https://github.com/IDR/deployment/blob/master/docs/operating-procedures.md#bio-formats-cache-regeneration describes the last used procedure for regenerating these memo files, which largely involves deleting the old cache and calling setId on every single fileset using GNU parallel. This took several days to complete the last time it was executed, so at least a similar time should be expected.
If merging the mainline Bio-Formats is in scope, it is important to ensure the synchronization does not affect the custom modifications and that historical studies remain available. Historically, software upgrades necessitated several rounds of software changes, deployment, testing and feedback.
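The shape of the regeneration step (one setId call per fileset, fanned out over several workers, with failures collected for the next round of fixes) can be sketched in Python as below. This is only an illustration of the pattern, not the documented procedure, which uses GNU parallel. `regenerate_one` is a hypothetical caller-supplied callable standing in for whatever triggers setId on a fileset, and `regenerate_all` and `jobs` are likewise names invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def regenerate_all(fileset_ids: Iterable[int],
                   regenerate_one: Callable[[int], None],
                   jobs: int = 4) -> Dict[int, str]:
    """Call regenerate_one for every fileset using `jobs` workers.

    Returns a mapping of fileset ID to error message for the filesets
    that failed, so they can be reported and retried after a fix.
    """
    def attempt(fileset_id: int):
        try:
            regenerate_one(fileset_id)  # hypothetical: triggers setId
            return fileset_id, None
        except Exception as exc:  # keep going; one bad fileset must not stop the run
            return fileset_id, str(exc)

    failures: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for fileset_id, error in pool.map(attempt, fileset_ids):
            if error is not None:
                failures[fileset_id] = error
    return failures
```

Collecting failures instead of aborting on the first error matters here because a run over all filesets takes days, and restarting it for each unreadable fileset would be costly.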
To allow quick iterations, the initial phase of testing focused on regenerating the memo files for only the first plate/dataset of each screen/project. Testing was executed only against these plates/datasets for reporting issues and making fixes. Once a satisfactory version had been validated, all memo files could be regenerated for a final assessment.
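Picking the test subset amounts to keeping the first plate/dataset seen for each screen/project. A minimal Python sketch, assuming the input is a list of (container, child) name pairs already in the desired order; the names used in the usage note are made up and `first_child_per_container` is a hypothetical helper:

```python
from typing import Dict, Iterable, Tuple


def first_child_per_container(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only the first child (plate/dataset) seen for each container (screen/project)."""
    first: Dict[str, str] = {}
    for container, child in rows:
        # setdefault leaves an existing entry untouched, so later children are ignored.
        first.setdefault(container, child)
    return first
```

For example, given `[("screenA", "plate1"), ("screenA", "plate2"), ("projectB", "dataset1")]`, only `plate1` and `dataset1` would be selected for the quick regeneration round.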
/cc @jmarie @francesw