How To OCR #4200

xiaolibuzai-ovo · 2025-01-06T12:19:36Z

get_textpage_ocr uses Tesseract. How does it perform OCR on a PDF page? Does it convert the page to an image first?

JorjMcKie · 2025-01-06T12:27:01Z

Maintainer

Sure it does. It calls page.get_pixmap(dpi=DPI) with a DPI value that you provide. It then internally uses Pixmap.pdfocr_tobytes(), which creates an in-memory 1-page PDF containing the OCR-ed text layer. From this, a normal TextPage is populated.
This TextPage can then be used for all the usual text extraction variants.
We do not offer TextPage creation as an option directly in get_text() (which would have been possible), because OCR is a slow process. So we require you to create a separate TextPage, which can then be reused multiple times.
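
To illustrate the reuse described above, here is a minimal sketch: it runs OCR once via get_textpage_ocr and passes the resulting TextPage to several get_text() calls. The helper names extract_with_ocr and demo, the file name "scan.pdf", and the 300 DPI / "eng" defaults are assumptions made for this example, not part of PyMuPDF. It also assumes Tesseract's language data is installed and discoverable by PyMuPDF.

```python
def extract_with_ocr(page, dpi=300, language="eng"):
    """Run OCR once on `page`, then reuse the TextPage for several extractions.

    `page` is expected to be a PyMuPDF Page. The dpi and language defaults
    are illustrative choices, not library recommendations.
    """
    # The slow step: renders the page to an image and OCRs it.
    # full=True asks for OCR of the whole page rather than only its images.
    textpage = page.get_textpage_ocr(dpi=dpi, language=language, full=True)

    # The fast steps: every extraction variant reads from the same TextPage,
    # so Tesseract is not invoked again.
    return {
        "text": page.get_text("text", textpage=textpage),
        "words": page.get_text("words", textpage=textpage),
        "blocks": page.get_text("blocks", textpage=textpage),
    }


def demo(path="scan.pdf"):
    """Hypothetical usage: print the OCR-ed plain text of the first page."""
    import pymupdf  # imported here so the helper above has no hard dependency

    doc = pymupdf.open(path)
    try:
        result = extract_with_ocr(doc[0])
        print(result["text"])
    finally:
        doc.close()
```

Calling demo("scan.pdf") on a scanned document should pay the OCR cost once, however many extraction formats are requested afterwards.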
