Having trouble merging PDFs with PyMuPDF #3477
The batch module does use some standard parameters and makes some assumptions, not all of which are exposed, so as not to overload the API:
import fitz  # you can use pymupdf now too!

pdfs = ("pdf1", "pdf2", "pdf3", "pdf4", "pdf5", "pdf6", "pdf7", "pdf8", "pdf9", "pdf10", "pdf11", "pdf12")
out = fitz.open()  # new, empty output PDF
for filename in pdfs:
    pdf = fitz.open(filename)
    # skip copying annotations and links
    out.insert_pdf(pdf, annots=False, links=False)
    pdf.close()
out.save("output.pdf")
Apologies for the delay in getting back, @JorjMcKie. With your notes, here's what I found for two PDFs out of my sample set (two simply for timeliness' sake). Environment is: PyPDF 384.17 seconds; 268.13 seconds inserting, 115.81 seconds saving, output is 193,704,304 bytes. Some observations:
That said, if the answer is likely to be "your PDFs are probably kind of screwed up and way too large to be in scope for what we intend", then I accept that (and tend to agree), but I'd be willing to help if there's further information I could provide to see what's going on. Here's the code I used to arrive at this:
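The original script is not reproduced here. As a minimal sketch only, this is one way the insert and save phases could be timed separately, assuming the same `insert_pdf` loop suggested earlier in the thread. The names `timed_merge`, `Timings`, and the `open_doc` parameter are hypothetical, introduced purely for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class Timings:
    """Wall-clock seconds spent in each phase of the merge."""
    insert_seconds: float = 0.0
    save_seconds: float = 0.0

    @property
    def total_seconds(self) -> float:
        return self.insert_seconds + self.save_seconds


def timed_merge(paths, out_path, open_doc=None):
    """Merge `paths` into `out_path`, timing inserting and saving separately.

    `open_doc` is a hypothetical hook: it defaults to PyMuPDF's `fitz.open`
    and exists only so the helper can be exercised without real PDFs.
    """
    if open_doc is None:
        import fitz  # PyMuPDF, imported lazily on first real use
        open_doc = fitz.open

    timings = Timings()
    out = open_doc()  # new, empty output document

    start = time.perf_counter()
    for path in paths:
        src = open_doc(path)
        # Same call as in the suggested script: no annotations, no links.
        out.insert_pdf(src, annots=False, links=False)
        src.close()
    timings.insert_seconds = time.perf_counter() - start

    start = time.perf_counter()
    out.save(out_path)
    timings.save_seconds = time.perf_counter() - start

    out.close()
    return timings
```

Calling `timed_merge(["pdf1", "pdf2"], "output.pdf")` would return a `Timings` object whose `insert_seconds`, `save_seconds`, and `total_seconds` can be printed alongside `os.path.getsize("output.pdf")` for the output size.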
Hi all, I'm new to PyMuPDF and found some confusing results compared to another Python library; I'm hoping to find some answers or at least a lead to investigate. This is related to my work, so unfortunately I can't provide samples of the PDFs, but I can try my best to answer questions about how they're built.
Scenario: I have 12 PDFs ranging from 75 to 105 MB, each in the neighborhood of 100k pages of text and graphics, with subsetted fonts and many elements (seemingly as many as possible) stored in the crossref table. No TOC, annotations, etc. are present that I know of. These PDFs are created outside my control, so I can try to answer further questions about their construction, but I have limited insight.
Goal: Merge them into one PDF; order is unimportant as long as each source PDF's pages are contiguous in the output.
Results: A small script using PyPDF's PdfWriter.append() ran in 35.5 minutes and produced a PDF of 1.2 GB.
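For reference, a script of that kind might look roughly like the sketch below. It is an illustration, not the script that was actually run: it assumes the current `pypdf` package (the older PyPDF2 differs in places), and the `merge_with_pypdf` name and `writer_factory` parameter are hypothetical.

```python
def merge_with_pypdf(paths, out_path, writer_factory=None):
    """Append each PDF in `paths`, in order, and write the result to `out_path`.

    `writer_factory` is a hypothetical hook that defaults to pypdf's
    PdfWriter; it exists only so the helper can be exercised without pypdf.
    """
    if writer_factory is None:
        # Assumption: the modern `pypdf` package, imported lazily.
        from pypdf import PdfWriter
        writer_factory = PdfWriter

    writer = writer_factory()
    try:
        for path in paths:
            # append() adds all pages of one source as a contiguous run,
            # which satisfies the "contiguous per source" goal.
            writer.append(path)
        writer.write(out_path)
    finally:
        writer.close()
    return out_path
```

With the twelve inputs from this post, the call would be `merge_with_pypdf([f"pdf{i}" for i in range(1, 13)], "combined.pdf")`.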
Using PyMuPDF from the command line, the merge ran for over 8 hours and eventually failed to produce a PDF.
command:
python -m fitz join -output combined.pdf pdf1, pdf2,[...]pdf12
I'm re-running the PyMuPDF command to validate what went on there, but regardless, I'm unclear what would cause such a difference in processing time and would welcome any thoughts or suggestions.
Additionally, I've noticed that using PyMuPDF at the command line to optimize/garbage-collect these input PDFs also takes quite a long time; I'm unsure if that might be relevant to what we're seeing here.