
Improve fileset distribution during the memo file regeneration #150

Merged

Conversation


@sbesson sbesson commented Sep 10, 2024

Fixes #148

This implements the strategy suggested in the issue above to improve both the predictability and the performance of the memo file regeneration.

The SQL query generating the list of inputs has been updated with a new column containing the initial reader initialization time per fileset, and the results are now sorted by these values in descending order. The initialization times are computed in a fashion very similar to omero fs importtime, by querying and taking the difference between the image creation timestamp and the fileset upload job timestamp. As with omero fs importtime, these values reflect the server-side reader initialization at import time and are unaware of any change since then, whether on the software side (e.g. Bio-Formats reader performance improvements) or on the storage side (e.g. moving the data to tiered storage). Nevertheless, this should be a good initial proxy measure allowing the different filesets imported into OMERO Plus to be classified, and should significantly help distribute the regeneration tasks in parallel environments.

The regen-memo-files.sh script has been updated to divide the SQL results into chunks using the split command. The command should produce a number of input files equal to the number of parallel jobs to be started via parallel. The splitting uses round-robin distribution so that the total initialization time is distributed as homogeneously as possible between the input files.
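As a sketch of the splitting step (file names and the $JOBS value are illustrative; the actual regen-memo-files.sh invocation may differ), GNU split's round-robin mode deals lines out one per file in turn:

```shell
#!/bin/sh
# Build a sample CSV of 10 filesets (IDs only; the real input has more columns).
seq 1 10 > memo_input.csv

# GNU split: --number=r/N distributes lines round robin into N files, so
# consecutive entries (sorted by initialization time) land in different files.
JOBS=4
split --number=r/$JOBS memo_input.csv memo_chunk.

# Each of the 4 output files now holds every 4th line of the input.
wc -l memo_chunk.*
```

Because the SQL results are sorted by initialization time, each output file receives a comparable mix of slow and fast filesets.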

Finally, a few additional cleanups were made to the regen-memo-files.sh script to remove the CentOS 6 handling and the --batch-size option, now superseded by the chunk split.

In terms of functional testing, the best environment would be an OMERO instance containing a reasonable number of imported filesets with heterogeneous initialization times, i.e. from small individual fake files to large plates. The execution time of regen-memo-files.sh should be compared with and without these changes. The outcome of the memo file regeneration itself (successes/failures/skips) should be unchanged. With these changes, the overall execution time should be reduced, and the parallel jobs should start and finish at roughly the same time.

Similarly to what omero fs importtime does, this uses the difference
between the image creation timestamp and the end of the upload job
associated with the fileset to estimate the server-side time spent
initializing the reader.
The SQL results are sorted in decreasing order of initialization time.
Using the chunks option with round-robin distribution should create as
many input files as there will be jobs and ensure the projected
regeneration times are as equally distributed as possible.

stick commented Sep 12, 2024

The regen-memo-files.sh script has been updated to divide the SQL results into chunks using the split command. The command should produce a number of input files equal to the number of parallel jobs to be started via parallel. The splitting uses round-robin distribution so that the total initialization time is distributed as homogeneously as possible between the input files.

So if I understand this correctly, --jobs 4 for example results in 4 files with a round-robin distribution of images to process, and thereby 4 running processes, each processing its list serially.

Originally we had concerns over garbage collection with long running java processes, as well as startup cost of each process, which is why we're not using parallel to start a process for each individual image.

If the intent is to have only one large list per processor, there's very little utility in using parallel at all, since we're not managing a list that's larger than the number of available processes.
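The flow described above, one long-lived worker consuming each pre-split list serially, can be sketched in plain shell (the process_chunk function is a hypothetical stand-in for the actual memo-regeneration command; the real script drives this via GNU parallel):

```shell
#!/bin/sh
# Sample split files; in practice these come from the split step.
printf '1\n5\n9\n' > memo_chunk.aa
printf '2\n6\n10\n' > memo_chunk.ab

# Hypothetical stand-in for the per-fileset regeneration command.
process_chunk() {
    # One worker walks its own list serially, so only as many JVMs
    # run at once as there are workers, amortizing startup cost.
    while read -r fileset_id; do
        echo "regenerating memo for fileset $fileset_id"
    done < "$1"
}

# Start one background worker per split file, then wait for all of them.
for chunk in memo_chunk.aa memo_chunk.ab; do
    process_chunk "$chunk" > "$chunk.log" &
done
wait
```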


sbesson commented Sep 13, 2024

So if I understand this correctly, --jobs 4 for example results in 4 files with a round-robin distribution of images to process, and thereby 4 running processes, each processing its list serially.

Yes, that's the current implementation.

Originally we had concerns over garbage collection with long running java processes, as well as startup cost of each process, which is why we're not using parallel to start a process for each individual image.

In the IDR process for regenerating memo cache files, which still uses a combination of GNU parallel + omero render test, starting one process per fileset was certainly found to be a non-starter for the ~6M individual TIFF files from the HPA project. The IDR strategy includes the extra overhead of initializing the OMERO client, which does not apply to the micro-service implementation, but I concur that startup costs are very real.
I had not considered garbage collection; it would be something worth testing.

If this is the intent to only have large lists for each processor there's very little utility in using parallel at all since we're not managing a list that's larger than available processes.

A possible compromise would be to split the CSV using round-robin distribution into N x $JOBS input files, where the default value of N is either a fixed value or auto-computed from the number of rows in the CSV and the value of $JOBS, to ensure that the default input files have a maximum number of entries (assuming we keep the existing default of 500).

@sbesson sbesson force-pushed the memoregenerator_setId_order branch from ce85355 to 8fb7fb4 Compare September 19, 2024 07:46
@sbesson sbesson force-pushed the memoregenerator_setId_order branch from 8fb7fb4 to 08b1ad2 Compare September 19, 2024 07:52

sbesson commented Sep 19, 2024

The last commits, and especially 7e22384, should address the concerns raised in #150 (comment) around the potential for growing input files.

The utility now splits the original CSV into N * jobs files using round-robin distribution. N is computed from the value of --batch-size (500 by default), which is now interpreted as the maximum number of lines in each split file. For instance:

  • a CSV with 1000 entries should be split into 4 files of 250 lines with --jobs 4
  • a CSV with 1000 entries should be split into 8 files of 125 lines with --jobs 4 --batch-size 200
  • a CSV with 2000 entries should be split into 4 files of 500 lines with --jobs 4
  • a CSV with 2400 entries should be split into 8 files of 300 lines with --jobs 4
  • a CSV with 2400 entries should be split into 8 files of 300 lines with --jobs 8
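The arithmetic behind these examples can be sketched as follows (the function name is illustrative; the actual script's implementation may differ). N is the smallest integer such that N * jobs files of at most batch-size lines can hold all rows:

```shell
#!/bin/sh
# Number of split files = N * jobs, where
# N = ceil(rows / (jobs * batch_size)), using integer arithmetic.
num_files() {
    rows=$1; jobs=$2; batch=$3
    echo $(( jobs * ( (rows + jobs * batch - 1) / (jobs * batch) ) ))
}

num_files 1000 4 500   # 4 files (250 lines each)
num_files 1000 4 200   # 8 files (125 lines each)
num_files 2400 4 500   # 8 files (300 lines each)
num_files 2400 8 500   # 8 files (300 lines each)
```

The file count is always a multiple of jobs, so every worker receives the same number of lists while no list exceeds the batch-size cap.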


sbesson commented Jan 8, 2025

@chris-allan have you had a chance to look at these changes?

@chris-allan chris-allan merged commit aa533d6 into glencoesoftware:master Jan 17, 2025
3 checks passed
@sbesson sbesson deleted the memoregenerator_setId_order branch January 18, 2025 07:37

sbesson commented Jan 20, 2025

The impact of this change was tested on a development instance with 4 CPUs / 8 GB of RAM against a database including 3797 memo files to regenerate (3686 ok, 1 fail, 110 skipped). The memo regeneration utility was executed 3 times in each of 3 multi-threaded conditions: --jobs 2, --jobs 3 and no job option (i.e. 4 CPUs), cleaning the cache between each iteration.

         without #150             with #150
2 CPUs   47min 6s +/- 59s         41min 1s +/- 26s
3 CPUs   45min 27s +/- 1min 19s   32min 21s +/- 43s
4 CPUs   48min 7s +/- 2min 8s     DNF

DNF means that the memo regeneration did not run to completion in any of the three executions, as the process ran out of memory after ~12min and was terminated by the OOM killer.

In summary, there are 2 conflicting effects:

  • the usage of round-robin distribution allows a better distribution of the computational tasks and improves the overall performance of the regeneration job by taking advantage of the multiple threads
  • the changes create the conditions for putting the system under heavy memory load and increase the chances of running out of memory for high numbers of threads

My assumption is that the latter observation is a consequence of ordering the filesets to regenerate in decreasing order of initialization time. This means the longest (and possibly most memory- and/or I/O-intensive) regeneration processes are scheduled to start concurrently at the onset of the memo regeneration process.

/cc @stick @chris-allan @melissalinkert


sbesson commented Jan 23, 2025

Good discussion with @stick and @chris-allan regarding #150 (comment).

As per

export JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=rslt.${DATESTR} -Xmx2g -Dlogback.configurationFile=${MEMOIZER_HOME}/logback-memoizer.xml -Dprocessname=memoizer"

The individual memo regeneration processes are each memory-constrained to 2 GB. For 4 concurrent jobs, this means the total memory that might be allocated by Java is 8 GB, which is exactly the total available memory of the system. Considering overhead and other running processes, the fact that the execution leads to an OOM with --jobs 4 is not a surprise with this particular configuration. In fact, we should treat it as luck that the previous regeneration process with the same amount of concurrency did not run out of memory.
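A rough pre-flight check along these lines could compute the worst-case heap budget before launching (a sketch; the 2 GB figure comes from the -Xmx2g setting quoted above, and the variable names are illustrative):

```shell
#!/bin/sh
# Worst case: every worker grows to its full -Xmx heap at the same time.
JOBS=4
HEAP_GB_PER_JOB=2          # matches -Xmx2g in JAVA_OPTS
TOTAL_RAM_GB=8             # total memory of the test instance

max_heap=$(( JOBS * HEAP_GB_PER_JOB ))
echo "maximum combined Java heap: ${max_heap} GB of ${TOTAL_RAM_GB} GB RAM"

# With no headroom left for the OS, the OOM killer is a matter of timing.
if [ "$max_heap" -ge "$TOTAL_RAM_GB" ]; then
    echo "warning: combined -Xmx meets or exceeds available RAM" >&2
fi
```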

Overall, the outstanding next step is probably to compute and expose the maximum amount of memory that would be allocated by the command, and to review our internal documentation of the different options and their trade-offs.

Successfully merging this pull request may close these issues.

Predictability during memo file regeneration and general evolution