processed PDF is 15x slower to search in qpdfview #1419

fenugrec · 2024-11-04T01:56:44Z

Something is strange and I'm not sure what the problem could possibly be.

I'm processing files with the --redo-ocr and --sidecar options. A typical file has 1300 pages and is 54MB before processing, 59MB after. The sidecar is a 1.9MB txt file.

The original PDF does have some OCR already; if I extract the text with pdf2txt.py from pdfminer, it produces a 1.7MB text file. So the amount of text is roughly similar, by that metric.
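
The size comparison above can be reproduced roughly as follows. This is a minimal sketch, assuming pdfminer.six is installed; `extract_text` is its high-level API and may not give byte-identical output to the pdf2txt.py script, and the helper names are hypothetical.

```python
def extracted_text_bytes(pdf_path):
    """Return the UTF-8 size of the text layer pdfminer finds in pdf_path."""
    # Imported lazily so the arithmetic helper below works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return len(extract_text(pdf_path).encode("utf-8"))


def relative_difference(reference_bytes, other_bytes):
    """Fractional size difference of other_bytes relative to reference_bytes."""
    return (other_bytes - reference_bytes) / reference_bytes


# Figures from the post: 1.7MB extracted from the original, 1.9MB sidecar.
print(f"{relative_difference(1_700_000, 1_900_000):.1%}")
```

With the numbers from the post, the sidecar is about 12% larger than the text extracted from the original, which supports the "roughly similar" reading.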

When I open the original PDF in qpdfview, a search takes less than 2 seconds. In the processed PDF, searching for the same text takes 28-30 seconds!
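
To put a number on this outside the viewer, one option is to time poppler's own text extraction on both files. This is only a sketch: it assumes the `pdftotext` tool from poppler-utils is on the PATH, and command-line extraction is a proxy that may not follow the same code path as an interactive search in qpdfview. The helper names are hypothetical.

```python
import subprocess
import time


def time_call(fn, *args, **kwargs):
    """Run fn and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def extract_text_with_pdftotext(pdf_path):
    """Extract text with poppler's pdftotext; '-' sends the text to stdout."""
    completed = subprocess.run(
        ["pdftotext", pdf_path, "-"], check=True, capture_output=True
    )
    return completed.stdout


def slowdown(baseline_seconds, processed_seconds):
    """How many times slower the processed file is than the baseline."""
    return processed_seconds / baseline_seconds


# Timings reported above: under 2 s for the original, up to 30 s processed.
print(slowdown(2.0, 30.0))
```

Since the original takes less than 2 seconds, the ratio computed from 2 seconds is a lower bound on the real slowdown. Calling `time_call(extract_text_with_pdftotext, path)` on each file would give comparable figures for the extraction step alone.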

Is there something in the PDF/A structure that would explain this? And is there some way to improve the situation? I saw this effect in both 16.4.2 and 16.6.0, so it is probably not a random bug.

fenugrec · 2024-11-04T03:14:20Z

fenugrec
Nov 4, 2024
Author

Additional data point: with --output-type pdf, text search now takes 6 seconds.
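
For anyone repeating the comparison, the two invocations can be generated side by side. This is a sketch only: the file names are placeholders, and the only flags used are the ones mentioned in this thread (--redo-ocr, --sidecar, --output-type).

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, output_type=None):
    """Build an ocrmypdf argument list; output_type=None keeps the default."""
    cmd = ["ocrmypdf", "--redo-ocr", "--sidecar", sidecar_txt]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    cmd += [input_pdf, output_pdf]
    return cmd


# Placeholder file names, not the ones used in the original test.
default_cmd = build_ocrmypdf_command("in.pdf", "out_default.pdf", "out.txt")
plain_cmd = build_ocrmypdf_command(
    "in.pdf", "out_plain.pdf", "out.txt", output_type="pdf"
)

print(shlex.join(default_cmd))
print(shlex.join(plain_cmd))
```

Either list can be passed to `subprocess.run(cmd, check=True)` to produce the two outputs for timing.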

jbarlow83 · 2024-11-04T03:25:14Z

jbarlow83
Nov 4, 2024
Maintainer

I'd need to see evidence that this affects all viewers, not just qpdfview.

3 replies

fenugrec Nov 4, 2024
Author

Agreed. Further investigation probably points to the 'poppler' backend, used by qpdfview and Evince. Found a few relevant old bug reports.

It's interesting that PDF/A would be slower to parse, but I think this may also be the reader's fault. I will run a few more tests in the next few days before closing this.

jbarlow83 Nov 4, 2024
Maintainer

Probably not PDF/A but some other aspect of text handling if I had to guess.

fenugrec Nov 5, 2024
Author

I made a half-baked attempt at profiling evince and qpdfview (both based on poppler). I think the difference I'm seeing is due to inexplicably repeated calls to color-management functions (cmsReverseToneCurveEx, although I may not have enough debug symbols to refine this hypothesis).
I'm guessing the speed difference between the original document and the re-OCRed one is also related to color handling? A sysprof run doesn't even show calls to cmsReverseToneCurveEx with the original file. I haven't seen color-management options in ocrmypdf, but maybe its backend tools have some default behaviour I could look at? I would prefer a solution on the generation side (I can even do some local patching) rather than attempting to fix poppler.
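
One way to probe the color-handling guess from the generation side is to check whether each file declares an output intent in its document catalog, since that is where a PDF/A file would normally reference an ICC profile. This is a sketch under assumptions: it uses pikepdf, the helper names are hypothetical, and nothing in this thread confirms that the output intent is what triggers the cmsReverseToneCurveEx calls.

```python
def count_output_intents(root):
    """Count /OutputIntents entries in a catalog-like mapping (e.g. pdf.Root)."""
    intents = root.get("/OutputIntents")
    return 0 if intents is None else len(intents)


def count_output_intents_in_file(pdf_path):
    """Open pdf_path with pikepdf and count its output intents."""
    # Imported lazily so the helper above can be tried without pikepdf.
    import pikepdf

    with pikepdf.open(pdf_path) as pdf:
        return count_output_intents(pdf.Root)


# Stand-in dictionaries for a catalog without and with an output intent.
print(count_output_intents({}))
print(count_output_intents({"/OutputIntents": [{"/S": "/GTS_PDFA1"}]}))
```

If the original file and the --output-type pdf result both report zero while the default output reports one, that would be consistent with the hypothesis, though it would not prove it.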

References to other tickets, in case someone else hits this and also wrongly suspects ocrmypdf:

https://gitlab.freedesktop.org/poppler/poppler/-/issues/1472
https://gitlab.freedesktop.org/poppler/poppler/-/issues/58
