PDFs with bad font definitions #1380
-
When I convert this file to HTML, all the text on the first page comes out as �. Is this caused by the font information or by the encoding of the system? Is there a way to retrieve the text as it looks, for example by downloading the font files? Thanks!
Replies: 3 comments 1 reply
-
did you try to output normal text via |
-
Hm, this file has severe internal problems. XPDF cannot convert it to text either.
-
Do you know the new OCR interface of PyMuPDF v1.19.1?
tp = page.get_textpage_ocr(dpi=300, full=True)  # OCR the full page at a decent resolution
html = page.get_text("html", textpage=tp)  # use the OCRed version
out = open("page.html", "w")  # the original line was missing the open( call
out.write(html)
out.close()
This delivers a quite decent HTML page.
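If only some pages are affected, it may be worth running OCR only where normal extraction fails. The following is a minimal sketch of that idea, not something proposed in this thread: the helper names `replacement_ratio` and `page_html` and the 10% threshold are assumptions chosen for illustration, and `page` is expected to be a PyMuPDF page object. OCR additionally requires a working Tesseract installation.

```python
REPLACEMENT_CHAR = "\ufffd"  # the "�" emitted when a glyph cannot be mapped to Unicode


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def page_html(page, max_ratio: float = 0.1, dpi: int = 300) -> str:
    """Return HTML for a page, falling back to OCR if extraction looks broken.

    `max_ratio` is a hypothetical threshold; tune it for your documents.
    """
    # First try the regular text layer and measure how much of it is unreadable.
    if replacement_ratio(page.get_text("text")) <= max_ratio:
        return page.get_text("html")
    # Too many replacement characters: OCR the full page, as in the snippet above.
    tp = page.get_textpage_ocr(dpi=dpi, full=True)
    return page.get_text("html", textpage=tp)
```

With a document already opened as `doc` (a hypothetical variable), `page_html(doc[0])` would return HTML for the first page. Pages with a usable text layer skip OCR entirely, which keeps the conversion fast and preserves the original text where it is intact.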
Hm, this file has severe internal problems. XPDF cannot convert it to text either.
Throw it away 😎
Or OCR it.