PyMuPDF-1.22.2: extractText show unreadible output from specific pdf #2378

zdenop · 2023-04-28T06:44:03Z

Describe the bug (mandatory)

I am not able to extract text from this specific PDF. The PDF is displayed correctly by Adobe Reader (and also in Chrome and Firefox).
AD-3.HANO.pdf

To Reproduce (mandatory)

import fitz  # PyMuPDF

fname = "AD-3.HANO.pdf"

with fitz.open(fname) as doc:
    for page_id in range(doc.page_count):
        page = doc.load_page(page_id)
        content = page.get_textpage()
        # Report 1-based page numbers
        print(f"Page {page_id + 1}:")
        print(content.extractText())

Expected behavior (optional)

Readable text is extracted, as with other PDF files.

Screenshots (optional)

Your configuration (mandatory)

Windows 10 64bit
Python 3.9.13 64bit, Python 3.11.2
pymupdf-1.22.2, installation method: wheel.

Answered by JorjMcKie

Apr 28, 2023

JorjMcKie (Maintainer) · 2023-04-28T07:49:17Z

This is not a bug but a missing feature of the font, whether intended or not.
Moving this to "Discussions" in case you have more questions.

JorjMcKie (Maintainer) · 2023-04-28T11:52:01Z

The PDF uses a non-standard encoding, which makes it impossible to extract text, not only for (Py)MuPDF but also for Adobe Acrobat, Nitro 5, and other PDF viewers.
You can confirm this by selecting some text in a viewer and pasting it into a word processor document.

Please be aware that displaying text and extracting it are features that are not necessarily connected, as is the case here.
So the only option is OCR.
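
For anyone landing here with a similar file, here is a minimal sketch of the OCR route using PyMuPDF's `Page.get_textpage_ocr()`. It assumes Tesseract's language data is installed and discoverable (for example via the `TESSDATA_PREFIX` environment variable). The `looks_garbled` helper, its threshold, and the function names are my own illustration, not part of PyMuPDF. It only catches the case where the broken encoding surfaces as replacement, private-use, or control characters. If the font maps to ordinary but wrong letters, you would have to decide to OCR by inspection.

```python
import unicodedata


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic (not a PyMuPDF feature): treat text as garbled
    when a large share of its non-whitespace characters are U+FFFD,
    private-use, control, or unassigned code points."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True  # nothing extractable at all, so OCR is worth a try
    suspicious = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cc", "Cn")
    )
    return suspicious / len(chars) >= threshold


def extract_text_with_ocr_fallback(path: str, language: str = "eng", dpi: int = 300) -> list:
    """Return one string per page, using OCR where normal extraction looks garbled."""
    import fitz  # PyMuPDF; imported here so looks_garbled works without it

    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if looks_garbled(text):
                # full=True renders the whole page and OCRs it,
                # ignoring the unusable embedded text layer.
                textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
                text = page.get_text(textpage=textpage)
            texts.append(text)
    return texts
```

Calling `extract_text_with_ocr_fallback("AD-3.HANO.pdf")` would then return one string per page. OCR is much slower than normal extraction and its accuracy depends on the rendering resolution, so the `dpi` value of 300 here is only a starting point to tune.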
