Page.get_text() returns messy text. #2134

Answered by JorjMcKie
buptyyf asked this question in Q&A

As background reading, I recommend this article on the Artifex blog.

A PDF creator may choose fonts that carry no information for translating the visual appearance of a character back to the Unicode value it came from.
This information, the so-called CMAP (Character Map), may be missing, either by mistake or on purpose.
When it is missing, the only way out (see the article) is to OCR the page, or parts of it, which seems advisable in your case. Take a look at this demo script.
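
To make the OCR suggestion concrete, here is a minimal sketch of a per-page fallback. It is not the demo script mentioned above. It rests on several assumptions: that unmapped characters show up in the extracted text as the Unicode replacement character U+FFFD, that a 5% share of such characters is a sensible trigger, and that Tesseract and its language data are installed so that `Page.get_textpage_ocr()` can run. The helper names `unmapped_ratio`, `needs_ocr` and `extract_text_with_ocr_fallback` are made up for illustration. Text that is garbled into wrong but valid characters will not be caught by this check and would need OCR unconditionally.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: unmapped glyphs are reported as U+FFFD


def unmapped_ratio(text: str) -> float:
    """Share of non-whitespace characters that are U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in chars) / len(chars)


def needs_ocr(text: str, threshold: float = 0.05) -> bool:
    """Heuristic only: the 5% default threshold is an arbitrary assumption."""
    return unmapped_ratio(text) > threshold


def extract_text_with_ocr_fallback(pdf_path: str, language: str = "eng", dpi: int = 300) -> list:
    """Return one string per page, OCRing pages whose text looks unmapped."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    results = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # full=True renders and OCRs the whole page (requires Tesseract)
                textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
                text = page.get_text(textpage=textpage)
            results.append(text)
    return results
```

Calling `extract_text_with_ocr_fallback("input.pdf")` (the file name is a placeholder) would leave pages with usable text untouched and only pay the OCR cost for pages that trip the heuristic.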

Answer selected by JorjMcKie

This discussion was converted from issue #2133 on December 16, 2022 10:10.