
Improve fileset distribution during the memo file regeneration #150

Merged

Conversation


@sbesson sbesson commented Sep 10, 2024

Fixes #148

This implements the strategy suggested in the issue above to improve both the predictability and the performance of the memo file regeneration.

The SQL query generating the list of inputs has been updated with a new column containing the initial reader initialization time per fileset, and the results are now sorted by these values in descending order. The initialization times are computed in a fashion very similar to omero fs importtime, by querying and taking the difference between the image creation timestamp and the fileset upload job timestamp. As with omero fs importtime, these values reflect the server-side reader initialization at import time and are unaware of any change since then, whether on the software side (e.g. Bio-Formats reader performance improvements) or on the storage side (e.g. moving the data to tiered storage). Nevertheless, this should be a good initial proxy measure allowing the different filesets imported into OMERO Plus to be classified, and should significantly help distribute the regeneration tasks in parallel environments.

The regen-memo-files.sh script has been updated to divide the SQL results into chunks using the split command. The command should produce a number of input files equal to the number of parallel jobs to be started via parallel. The splitting uses round-robin distribution so that the total initialization time is distributed as homogeneously as possible between the input files.
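As a sketch of the splitting step (file names and the $JOBS value are illustrative; the actual regen-memo-files.sh invocation may differ), GNU split's round-robin mode deals lines out one per file in turn:

```shell
#!/bin/sh
# Build a sample CSV of 10 filesets (IDs only; the real input has more columns).
seq 1 10 > memo_input.csv

# GNU split: --number=r/N distributes lines round robin into N files, so
# consecutive entries (sorted by initialization time) land in different files.
JOBS=4
split --number=r/$JOBS memo_input.csv memo_chunk.

# Each of the 4 output files now holds every 4th line of the input.
wc -l memo_chunk.*
```

Because the SQL results are sorted by initialization time, each output file receives a comparable mix of slow and fast filesets.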

Finally, a few additional cleanups were made to the regen-memo-files.sh script to remove the CentOS 6 handling and the --batch-size option, now superseded by the chunk split.

In terms of functional testing, the best environment would be an OMERO instance containing a reasonable number of imported filesets with heterogeneous initialization times, i.e. from small individual fake files to large plates. The execution time of regen-memo-files.sh should be compared with and without these changes. The outcome of the memo file regeneration itself (successes/failures/skips) should be unchanged. With these changes, the overall execution time should be reduced, and the parallel jobs should start and finish at roughly the same time.

Similarly to what omero fs importtime does, this uses the difference
between the image creation timestamp and the end of the upload job
associated with the fileset to estimate the server-side time spent
initializing the reader.
The SQL results are sorted in decreasing order of initialization time.
Using the chunks option with round-robin distribution should create as
many input files as there will be jobs and ensure the projected
regeneration times are as equally distributed as possible.

stick commented Sep 12, 2024

The regen-memo-files.sh script has been updated to divide the SQL results into chunks using the split command. The command should produce a number of input files equal to the number of parallel jobs to be started via parallel. The splitting uses round-robin distribution so that the total initialization time is distributed as homogeneously as possible between the input files.

So if I understand this correctly, --jobs 4 for example results in 4 files with a round-robin distribution of images to process, and thereby 4 running processes, each processing its list serially.

Originally we had concerns over garbage collection with long running java processes, as well as startup cost of each process, which is why we're not using parallel to start a process for each individual image.

If the intent is to have only one large list per processor, there's very little utility in using parallel at all, since we're not managing a list that's larger than the number of available processes.
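The flow described above, one long-lived worker consuming each pre-split list serially, can be sketched in plain shell (the process_chunk function is a hypothetical stand-in for the actual memo-regeneration command; the real script drives this via GNU parallel):

```shell
#!/bin/sh
# Sample split files; in practice these come from the split step.
printf '1\n5\n9\n' > memo_chunk.aa
printf '2\n6\n10\n' > memo_chunk.ab

# Hypothetical stand-in for the per-fileset regeneration command.
process_chunk() {
    # One worker walks its own list serially, so only as many JVMs
    # run at once as there are workers, amortizing startup cost.
    while read -r fileset_id; do
        echo "regenerating memo for fileset $fileset_id"
    done < "$1"
}

# Start one background worker per split file, then wait for all of them.
for chunk in memo_chunk.aa memo_chunk.ab; do
    process_chunk "$chunk" > "$chunk.log" &
done
wait
```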


sbesson commented Sep 13, 2024

So if I understand this correctly, --jobs 4 for example results in 4 files with a round-robin distribution of images to process, and thereby 4 running processes, each processing its list serially.

Yes, that's the current implementation.

Originally we had concerns over garbage collection with long running java processes, as well as startup cost of each process, which is why we're not using parallel to start a process for each individual image.

In the IDR process for regenerating memo cache files, which still uses a combination of GNU parallel + omero render test, starting one process per fileset was certainly found to be a non-starter for the ~6M individual TIFF files from the HPA project. The IDR strategy includes the extra overhead of initializing the OMERO client, which does not apply to the micro-service implementation, but I concur that startup costs are very real.
I had not considered garbage collection; it would be something worth testing.

If this is the intent to only have large lists for each processor there's very little utility in using parallel at all since we're not managing a list that's larger than available processes.

A possible compromise would be to split the CSV using round-robin distribution into N x $JOBS input files, where the default value of N is either a fixed value or auto-computed from the number of rows in the CSV and the value of $JOBS, to ensure that the default input files have a maximum number of entries (assuming we keep the existing default of 500).

@sbesson sbesson force-pushed the memoregenerator_setId_order branch from ce85355 to 8fb7fb4 Compare September 19, 2024 07:46
@sbesson sbesson force-pushed the memoregenerator_setId_order branch from 8fb7fb4 to 08b1ad2 Compare September 19, 2024 07:52

sbesson commented Sep 19, 2024

The last commits, and especially 7e22384, should address the concerns raised in #150 (comment) around the potential for growing input files.

The utility now splits the original CSV into N * jobs files using round-robin distribution. N is computed from the value of --batch-size (500 by default), which is now interpreted as the maximum number of lines in each split file. For instance:

  • a CSV with 1000 entries should be split into 4 files of 250 lines with --jobs 4
  • a CSV with 1000 entries should be split into 8 files of 125 lines with --jobs 4 --batch-size 200
  • a CSV with 2000 entries should be split into 4 files of 500 lines with --jobs 4
  • a CSV with 2400 entries should be split into 8 files of 300 lines with --jobs 4
  • a CSV with 2400 entries should be split into 8 files of 300 lines with --jobs 8
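The arithmetic behind these examples can be sketched as follows (the function name is illustrative; the actual script's implementation may differ). N is the smallest integer such that N * jobs files of at most batch-size lines can hold all rows:

```shell
#!/bin/sh
# Number of split files = N * jobs, where
# N = ceil(rows / (jobs * batch_size)), using integer arithmetic.
num_files() {
    rows=$1; jobs=$2; batch=$3
    echo $(( jobs * ( (rows + jobs * batch - 1) / (jobs * batch) ) ))
}

num_files 1000 4 500   # 4 files (250 lines each)
num_files 1000 4 200   # 8 files (125 lines each)
num_files 2400 4 500   # 8 files (300 lines each)
num_files 2400 8 500   # 8 files (300 lines each)
```

The file count is always a multiple of jobs, so every worker receives the same number of lists while no list exceeds the batch-size cap.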


sbesson commented Jan 8, 2025

@chris-allan have you had a chance to look at these changes?

@chris-allan chris-allan merged commit aa533d6 into glencoesoftware:master Jan 17, 2025
3 checks passed
@sbesson sbesson deleted the memoregenerator_setId_order branch January 18, 2025 07:37

sbesson commented Jan 20, 2025

The impact of this change was tested on a development instance with 4 CPUs / 8 GB of RAM against a database including 3797 memo files to regenerate (3686 ok, 1 fail, 110 skipped). The memo regeneration utility was executed 3 times in each of 3 multi-threaded conditions: --jobs 2, --jobs 3 and no job option (i.e. 4 CPUs), cleaning the cache between each iteration.

         without #150             with #150
2 CPUs   47min 6s +/- 59s         41min 1s +/- 26s
3 CPUs   45min 27s +/- 1min 19s   32min 21s +/- 43s
4 CPUs   48min 7s +/- 2min 8s     DNF

DNF means that the memo regeneration did not run to completion in any of the three executions, as the process ran out of memory after ~12min and was terminated by the OOM killer.

In summary, there are 2 conflicting effects:

  • the usage of round-robin distribution allows a better distribution of the computational tasks and improves the overall performance of the regeneration job by taking advantage of the multiple threads
  • the changes create the conditions for putting the system under heavy memory load and increase the chances of running out of memory for high numbers of threads

My assumption is that the latter observation is a consequence of ordering the filesets to regenerate in decreasing order of initialization time. This means the longest (and possibly most memory- and/or I/O-intensive) regeneration processes are scheduled to start concurrently at the onset of the memo regeneration process.

/cc @stick @chris-allan @melissalinkert


sbesson commented Jan 23, 2025

Good discussion with @stick and @chris-allan regarding #150 (comment).

As per

export JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=rslt.${DATESTR} -Xmx2g -Dlogback.configurationFile=${MEMOIZER_HOME}/logback-memoizer.xml -Dprocessname=memoizer"

The individual memo regeneration processes are each memory-constrained to 2 GB. For 4 concurrent jobs, this means the total memory that might be allocated by Java is 8 GB, which is exactly the total available memory of the system. Considering overhead and other running processes, the fact that the execution leads to an OOM with --jobs 4 is not a surprise with this particular configuration. In fact, we should treat it as luck that the previous regeneration process with the same amount of concurrency did not run out of memory.
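A rough pre-flight check along these lines could compute the worst-case heap budget before launching (a sketch; the 2 GB figure comes from the -Xmx2g setting quoted above, and the variable names are illustrative):

```shell
#!/bin/sh
# Worst case: every worker grows to its full -Xmx heap at the same time.
JOBS=4
HEAP_GB_PER_JOB=2          # matches -Xmx2g in JAVA_OPTS
TOTAL_RAM_GB=8             # total memory of the test instance

max_heap=$(( JOBS * HEAP_GB_PER_JOB ))
echo "maximum combined Java heap: ${max_heap} GB of ${TOTAL_RAM_GB} GB RAM"

# With no headroom left for the OS, the OOM killer is a matter of timing.
if [ "$max_heap" -ge "$TOTAL_RAM_GB" ]; then
    echo "warning: combined -Xmx meets or exceeds available RAM" >&2
fi
```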

Overall, the outstanding next step is probably to compute and expose the maximum amount of memory that would be allocated by the command, and to review our internal documentation of the different options and their trade-offs.

Successfully merging this pull request may close these issues.

Predictability during memo file regeneration and general evolution