PDFs with bad font definitions #1380

Yichen-fqyd · 2021-11-08T16:11:39Z

Yichen-fqyd
Nov 8, 2021

When I convert this file to HTML, all the text on the first page comes out as �. Is this caused by the font information or by the system's encoding? Is there a way to retrieve the text as it appears on the page, for example by downloading the font files?

Thanks!
MLIL1370ML_A_9_2_29+(2).pdf

Answered by JorjMcKie


JorjMcKie · 2021-11-08T19:19:56Z

JorjMcKie
Nov 8, 2021
Maintainer

Did you try to output plain text via mutool draw -o file.txt file.pdf?
The � characters indicate that MuPDF cannot interpret the Unicode delivered by the font. Nothing can be done about that.
You could OCR the page or the problematic spans, e.g. as shown in this Jupyter notebook.
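The "OCR only when needed" idea can be sketched roughly as follows. This is a minimal illustration and not the notebook's code: the helper names `garbled_ratio`, `needs_ocr` and `extract_text`, and the 20% threshold, are assumptions made up for this example. It assumes `page` is a PyMuPDF page object, that PyMuPDF is v1.19.1 or later, and that Tesseract is installed and configured for `get_textpage_ocr` to work.

```python
REPLACEMENT_CHAR = "\ufffd"  # emitted when a glyph has no usable Unicode mapping


def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in visible) / len(visible)


def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Hypothetical heuristic: treat the text as unusable above the threshold."""
    return garbled_ratio(text) > threshold


def extract_text(page, threshold: float = 0.2) -> str:
    """Extract plain text from a PyMuPDF page, falling back to full-page OCR.

    Requires Tesseract to be available whenever the OCR branch is taken.
    """
    text = page.get_text("text")
    if needs_ocr(text, threshold):
        # OCR the whole page, then extract from the OCRed text page instead
        tp = page.get_textpage_ocr(dpi=300, full=True)
        text = page.get_text("text", textpage=tp)
    return text
```

Pages whose fonts extract cleanly skip the comparatively slow OCR step; only pages dominated by � are re-read through Tesseract.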

1 reply

Yichen-fqyd Nov 8, 2021
Author

Another interesting thing about this file: when it is converted to an image using PyMuPDF, the result looks fine, but when it is converted through other means (screenshot, pdf2image), the image doesn't make sense. Attached are two converted images from the exact same file.
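For reference, rendering a page to an image with PyMuPDF might look like the sketch below. The file names, the 150 dpi default and the helper names `zoom_for_dpi` and `render_page_png` are placeholders chosen for this example, not details taken from the thread.

```python
def zoom_for_dpi(dpi: float) -> float:
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_page_png(pdf_path: str, page_number: int = 0,
                    dpi: int = 150, out_path: str = "page.png") -> str:
    """Render one page of a PDF to a PNG file and return the output path."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        # Scale the page by the same factor in both directions
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_path)
    finally:
        doc.close()
    return out_path
```

Calling `render_page_png("file.pdf")` would write the first page to `page.png`.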

JorjMcKie · 2021-11-08T19:40:48Z

JorjMcKie
Nov 8, 2021
Maintainer

Hm, this file has severe internal problems. XPDF cannot convert it to text either.
Throw it away 😎
Or OCR it.


JorjMcKie · 2021-11-08T22:03:26Z

JorjMcKie
Nov 8, 2021
Maintainer

Do you know the new OCR interface of PyMuPDF v1.19.1?
The above page can be converted to HTML, at least to some extent, by OCRing it before extracting the HTML text:

import fitz  # PyMuPDF

doc = fitz.open("file.pdf")  # placeholder file name
page = doc[0]  # the first page, where the text is garbled
tp = page.get_textpage_ocr(dpi=300, full=True)  # OCR the full page at a decent resolution
html = page.get_text("html", textpage=tp)  # use the OCRed version
with open("page.html", "w", encoding="utf-8") as out:  # closes the file automatically
    out.write(html)

This delivers quite a decent HTML page.

