Recommended way of running ocrmypdf with memory limits #1386
Replies: 1 comment
-
Memory limits do not ask a process to limit its memory usage; they ask the operating system to intervene when a process exceeds the limits you specify.

Limiting the number of jobs may help. On an N-core system, you usually get the best results by running sqrt(N) parallel ocrmypdf processes (or containers), each with `--jobs` set to roughly sqrt(N).

Another option is a retry mechanism. If a process is killed by the OOM killer, its return code will be 137. Retry with a lower `--jobs` setting (a sketch of such a wrapper follows below).

It's possible that the recently reported quadratic regression on large page count files (#1378) also has quadratic memory usage; I have not investigated that.

Without specifics (e.g., "this test file (attached) has n pages and is x MB, and uses y GB of RAM and z GB of temporary storage on my system, with configuration C"), no specific answers can be given. If the file contains pages with very large images, you may need to use some of ocrmypdf's tools for managing very large images.
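A minimal sketch of such a retry wrapper, assuming the real `ocrmypdf` CLI and its `--jobs` option; the function name, file paths, and job counts are illustrative placeholders:

```python
import subprocess

def ocr_with_retry(input_pdf, output_pdf, jobs=4, min_jobs=1):
    """Run ocrmypdf, halving --jobs and retrying whenever the run is OOM-killed."""
    while True:
        proc = subprocess.run(["ocrmypdf", "--jobs", str(jobs), input_pdf, output_pdf])
        # A shell or container reports an OOM kill as exit code 137 (128 + SIGKILL);
        # Python's subprocess reports a child killed by SIGKILL as -9.
        oom_killed = proc.returncode in (137, -9)
        if not oom_killed or jobs <= min_jobs:
            # Success, a non-OOM failure, or OOM even at the minimum job count:
            # report the outcome as-is instead of retrying further.
            return proc.returncode
        jobs = max(min_jobs, jobs // 2)  # OOM-killed: retry with fewer parallel workers
```

Each retry halves the parallelism, trading throughput for a better chance of staying under the memory limit.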
-
Hello!
I've run into some OOM scenarios and I'm wondering what the recommended way of running ocrmypdf is to avoid them. Currently I execute it via a subprocess, but from what I'm seeing, ocrmypdf may spin up its own subprocesses that disregard the configured memory limits?
Essentially I want to treat OOMs as a result on the same level as successful runs and save them to my database, but wrapping ocrmypdf in a subprocess does not seem to help. Any recommendations?
The following does not seem to limit the memory usage and eventually causes my container to OOM:
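For illustration, a wrapper of roughly this shape (a minimal sketch assuming Python's subprocess and resource modules; the paths and limit values are placeholders, not the actual snippet):

```python
import resource
import subprocess

MEM_LIMIT_BYTES = 2 * 1024**3  # placeholder: 2 GiB address-space limit

def limit_memory():
    # Applied in the child just before exec. The limit is per process and is
    # inherited by every process ocrmypdf spawns (tesseract, ghostscript, ...),
    # so each of them is capped individually, but their combined usage is not.
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT_BYTES, MEM_LIMIT_BYTES))

proc = subprocess.run(
    ["ocrmypdf", "input.pdf", "output.pdf"],  # placeholder paths
    preexec_fn=limit_memory,
)
print("exit code:", proc.returncode)  # a negative value (-9) means the child was killed by a signal
```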