How To OCR #4200
get_textpage_ocr uses Tesseract. How does it perform OCR on a PDF? Does it convert it to an image?
Replies: 1 comment
Sure it does. It uses page.get_pixmap(dpi=DPI) with a DPI value provided by you. Then it internally uses Pixmap.pdfocr_tobytes(), which creates an in-memory 1-page PDF that contains the OCR-ed text layer. From this, a normal TextPage is populated. This TextPage can then be used for all the usual text extraction variants.
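For illustration, the steps described above could be reproduced by hand roughly as follows. This is only a sketch, not the library's internal code: the function name ocr_page_text and the default values dpi=300 and language="eng" are assumptions made for this example, and it requires a working Tesseract installation. It returns plain text only, so it ignores any mapping of coordinates back to the original page.

```python
def ocr_page_text(page, dpi=300, language="eng"):
    """Return OCR-ed plain text for one page (hypothetical helper)."""
    import pymupdf  # imported here so the module is only needed when OCR runs

    # 1. Render the page to an image at the requested resolution.
    pix = page.get_pixmap(dpi=dpi)

    # 2. Let Tesseract build an in-memory 1-page PDF with an OCR text layer.
    pdf_bytes = pix.pdfocr_tobytes(language=language)

    # 3. Open that PDF from memory and extract text from its only page.
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as ocr_doc:
        return ocr_doc[0].get_text()
```

Typical use would be something like ocr_page_text(doc[0], dpi=300) for a page of an already opened document.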
We do not offer TextPage creation as an option directly in get_text() (which would have been possible), because OCR is a long-running process. Instead, we require you to create a separate TextPage, which can then be re-used multiple times.
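A minimal sketch of that re-use pattern: the OCR TextPage is created once and then passed to several get_text() calls through the textpage argument. The helper name extract_variants and the defaults dpi=300 and language="eng" are assumptions made for this example; full=True asks for OCR of the whole page.

```python
def extract_variants(page, dpi=300, language="eng"):
    """Run OCR once, then extract several text variants (hypothetical helper)."""
    # The expensive step: Tesseract runs here, exactly once per page.
    tp = page.get_textpage_ocr(dpi=dpi, language=language, full=True)

    # The cheap steps: each call re-uses the TextPage instead of repeating OCR.
    return {
        "text": page.get_text("text", textpage=tp),
        "words": page.get_text("words", textpage=tp),
        "blocks": page.get_text("blocks", textpage=tp),
    }
```

Here page would be a page of a document opened with pymupdf.open(); the returned dictionary holds the plain text, the word list and the block list of the same OCR result.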