Having trouble merging PDFs with PyMuPDF #3477

Evan0000000000 · 2024-05-14T15:56:15Z

Hi all, I'm new to PyMuPDF and got some confusing results compared to another Python library. I'm hoping to find some answers, or at least a lead to investigate. This is related to my work, so unfortunately I can't provide samples of the PDFs, but I can try my best to answer questions about how they're built.

Scenario: I have 12 PDFs ranging from 75 to 105 MB, each in the neighborhood of 100k pages of text and graphics, with subsetted fonts and many elements (seemingly as many as possible) stored in the crossref table. No TOC, annotations, etc. are present that I know of. These PDFs are created outside my control, so I can try to answer further questions about their construction, but I have limited insight.

Goal: Merging them into one PDF; order is unimportant as long as each source PDF's pages are contiguous in the output.

Results: Writing a small script to use PyPDF's PdfWriter.append() functionality ran in 35.5 minutes and produced a PDF of 1.2GB.

Using PyMuPDF from the command line ran for over 8 hours and eventually failed to produce a PDF.

command: python -m fitz join -output combined.pdf pdf1, pdf2,[...]pdf12

I'm rerunning the PyMuPDF command to validate what went on there, but I'm unclear what would cause such a difference in processing time regardless. I would welcome any thoughts or suggestions.

Additionally, I've noticed that using PyMuPDF at the command line to optimize/garbage collect these input PDFs also takes quite a long time. I'm unsure whether that is relevant to what we're seeing here.

JorjMcKie · 2024-05-15T07:07:08Z

Maintainer

The batch module indeed uses some standard parameters and makes some assumptions, not all of which are exposed, in order not to overload the API.
If you are willing to invest in a "small script" as you do for other packages, then things are easy to adjust, and PyMuPDF can be much faster for this task than anyone else. For example:

do not copy the TOC, annotations or links
do not compress the output PDF on save

import fitz  # you can use pymupdf now too!
pdfs=("pdf1","pdf2","pdf3","pdf4","pdf5","pdf6","pdf7","pdf8","pdf9","pdf10","pdf11","pdf12")
out = fitz.open()
for filename in pdfs:
    pdf = fitz.open(filename)
    out.insert_pdf(pdf, annots=False, links=False)
    pdf.close()
out.save("output.pdf")

Evan0000000000 May 15, 2024
Author

Thank you Jorj. I admittedly didn't think to check whether there were additional options not exposed by the command line. Let me verify and post the time, and I'll mark your response as the answer.

Compression is an obvious performance point, but I'm guessing that maintaining annotations and links through the merge is also always a performance hit, and especially so with the size of files I'm dealing with?

JorjMcKie May 15, 2024
Maintainer

Absolutely! Links and annotations are performance hogs in this context.
And of course compression in the end. Especially so, because the method variant doc.ez_save() is being used in the batch module.
Since a version or two, this performs extra compression by also compressing object definitions.
So definitely don't do it with these crazy amounts of pages.

Evan0000000000 · 2024-05-16T17:05:24Z

Author

Apologies for the delay getting back @JorjMcKie. With your notes, here's what I found for two PDFs out of my sample set (two simply for timeliness' sake):

Environment is:
VM w/ RHEL 7.9
Python 3.7.4
PyPDF2 3.0.1
PyMuPDF 1.22.5

PyPDF: 384.17 seconds total; 268.13 seconds inserting, 115.81 seconds saving; output is 193,704,304 bytes
PyMuPDF: 3820.88 seconds total; 3818.89 seconds inserting, 1.99 seconds saving; output is 188,846,915 bytes

Some observations:

I didn't realize earlier, but Python 3.7 does of course mean I'm several releases behind. I skimmed the changelog and it seems like much has been done since then, so I'm guessing that may be relevant.
With show_progress on, I can see that the rate of page insertion seems consistent through both files.
I imagine show_progress has a performance penalty; I'm currently doing another run without it to see, but I would assume it's not a whole order of magnitude.

That said, if the answer is likely to be "your PDFs are probably kind of screwed up and way too large to be in scope for what we intend" then I accept that (and tend to agree) but I'd be willing to help if there's further information I could provide to see what's going on.

Here's the code I used to arrive at this:

from PyPDF2 import PdfWriter
from pathlib import Path
import fitz
from time import perf_counter

working_path = Path.cwd() / 'in'

pdfs = [working_path / '1.pdf',
        working_path / '2.pdf'
        ]

out = PdfWriter()
print('starting pypdf run')
pypdf_insert_start = perf_counter()
for f in pdfs:
    print(f'Appending {f.name}')
    out.append(f)
pypdf_save_start = perf_counter()
out.write(working_path / 'pypdf_combined.pdf')
pypdf_save_end = perf_counter()

pypdf_time = round(pypdf_save_end - pypdf_insert_start, 2)
pypdf_insert_time = round(pypdf_save_start - pypdf_insert_start, 2)
pypdf_save_time = round(pypdf_save_end - pypdf_save_start, 2)

print(f'PyPDF finished in {pypdf_time} seconds')
print(f'Insert time: {pypdf_insert_time}')
print(f'Save time: {pypdf_save_time}')
input('...')

out = fitz.open()
print('starting mupdf run')
mupdf_insert_start = perf_counter()
for f in pdfs:
    print(f'Appending {f.name}')
    pdf = fitz.open(f)
    out.insert_pdf(pdf, annots=0, links=0, show_progress=1)
    pdf.close()
mupdf_save_start = perf_counter()
out.save(working_path / "pymupdf_combined.pdf", deflate=0)
mupdf_save_end = perf_counter()

mupdf_time = round(mupdf_save_end - mupdf_insert_start, 2)
mupdf_insert_time = round(mupdf_save_start - mupdf_insert_start, 2)
mupdf_save_time = round(mupdf_save_end - mupdf_save_start, 2)

print(f'pymupdf finished in {mupdf_time} seconds')
print(f'Insert time: {mupdf_insert_time}')
print(f'Save time: {mupdf_save_time}')
print('Done')
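
The repeated `perf_counter` bookkeeping in the script above could be folded into a small context manager, which makes it harder to mix up start and end markers between the two runs. This is only a sketch using the standard library; `timed` and `timings` are hypothetical names, and the `sum(...)` calls are stand-ins for the real insert and save steps.

```python
from contextlib import contextmanager
from time import perf_counter

timings = {}  # label -> elapsed seconds, rounded to 2 decimals


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        timings[label] = round(perf_counter() - start, 2)


# Stand-in workloads; in the benchmark these would be the insert loop
# and the save call of each library.
with timed("insert"):
    sum(range(100_000))
with timed("save"):
    sum(range(1_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds} seconds")
```

Because the elapsed time is stored in the `finally` clause, a run that raises partway through still records how long it took before failing.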
