get_text() returns incorrect text #2447

kopugara · 2023-06-02T11:28:10Z

Hello.
I am trying to parse a large PDF document into a single Excel table for further processing.
A page from one of those documents is attached: fragment.pdf
The minimal code I use to extract text from a single page:

import fitz

doc = fitz.open('fragment.pdf')
page = doc[0]
text = page.get_text()

print(text)

and then I get this:

eaben
poll
sorgang
saluta
Buchung
H OTIRM
pbmA bcÜtzeitüÄerweisìnÖ von
Aajfo AopiAklsfC rka wfkhA AopiAklsfC
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
ONTOUTMV
H OTIRM
pbmA ÜÄerweisìnÖ von
eans meter pcÜìcÜmann ìnd taätraìd pcÜìcÜmann
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
QVMSKOOMMSNOTMM
ABtA eAkp mbqbo pCerCejAkkI rka
tAiqoAra pCerCejAkk
H OTIRM
pbmA ÜÄerweisìnÖ von
BbvqriiAe hbibp
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
BesteääìnÖsnìmmerW ONSORNMR
H OTIRM
pbmA ÜÄerweisìnÖ von
crancesca jaiìri
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
ONSSMQRQ
H OTIRM
pbmA ÜÄerweisìnÖ von
cranz jìndäe ìnd jartina oìcâertJjìndäe
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
ob QVMSKOOMMRVSNPV ONSSOSOU
ABtA jrkaibI coAkw
H OTIRM
pbmA ÜÄerweisìnÖ von
aawid gan pzycÜ
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
aeìtscÜe Banâ
H OTIRM
pbmA ÜÄerweisìnÖ von
Carmen kìÖäiscÜ
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
BesteääìnÖsnìmmer ONSSQRQP
H OTIRM
pbmA ÜÄerweisìnÖ von
qesfasäassie teädeÖerÖsÜ deÄreÖersÜ
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
ONTOOQNM
ABtA teädeÖerÖsÜ deÄreÖersÜ qesfasäassie
H OTIRM
pbmA ÜÄerweisìnÖ von
ropriA bokA eAkkCebk iAjj
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
BesteäänrKONTPNTQM
jlBKNUNKrbKRPPPNU
H OTIRM
pbmA ÜÄerweisìnÖ von
ceääiscÜI ptepÜanie
MNKMTK
MNKMTK
serwendìnÖszwecâL hìndenreferenz
BesteääìnÖsnìmmer ONSMUMQU
IBAk
von
peite
Auszug
abPV PMMT MMOQ MNNT MPUM MM
RTS
O
Q
MMMMMMMMMP L MSRMMVMM L OMOOMTPM

The problem is that the extracted text does not match the text in the PDF. The output does not contain placeholders such as ? or tofu symbols; the characters are simply wrong.
Why does this happen, and is there a way to fix it?

JorjMcKie (Maintainer) · 2023-06-02T11:52:52Z

The problem here is that the font uses a non-standard encoding, which leads to an incorrect back-translation of the glyphs (the character appearance in viewers) to the originating Unicode characters.
This happens quite often, and is sometimes done on purpose by the PDF creator.
Unfortunately, the only solution is to OCR the page.
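
For readers who want to try the OCR route, here is a minimal sketch using PyMuPDF's Tesseract integration. It assumes that Tesseract and its language data are installed and discoverable (for example via the TESSDATA_PREFIX environment variable) and that the page is in German, hence language="deu". The helper name ocr_page_text and the 300 dpi setting are illustrative choices, not something taken from the report above.

```python
def ocr_page_text(page, language="deu", dpi=300):
    """Return the text of a PyMuPDF page as recognized by OCR."""
    # full=True asks PyMuPDF to render the whole page to an image and OCR it,
    # so the text comes from the rendered glyph shapes rather than from the
    # font's character mapping.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    # Reuse the OCR text page for plain-text extraction.
    return page.get_text(textpage=textpage)


if __name__ == "__main__":
    import fitz  # PyMuPDF; OCR additionally needs a Tesseract installation

    doc = fitz.open("fragment.pdf")
    print(ocr_page_text(doc[0]))
```

OCR quality depends on the resolution and the language data, so the result may still need a manual check before it goes into a spreadsheet.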
