Function recover_quad() gives incoherent result with OCRed text #2178

Answered by JorjMcKie
OrianeN asked this question in Q&A

This is not a bug; it goes directly back to peculiarities of the font used by TesseractOCR (GlyphLessFont).
On top of that, there is a fundamental problem tied to the OCR process as such:

  • The bbox of a recognized word closely wraps the word.
  • The font size is roughly equal to the bbox height. This alone gives imprecise information! E.g., if one of two consecutive words (with originally identical font properties) contains characters using the space below the baseline (e.g., "y") and the other word does not, then OCR will report different font sizes for them! And because you do not know the original font, you have no way to correct this.
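
To make the second point concrete, here is a minimal sketch in plain Python (no PyMuPDF involved). It assumes a tight bbox and an OCR font size equal to the bbox height. The vertical extents in `EXTENTS` and the helper names `tight_bbox_height` and `ocr_fontsize_estimate` are made up for illustration; they are not metrics of GlyphLessFont or of any real font, and this is not how Tesseract or `recover_quad` actually compute anything.

```python
# Hypothetical vertical glyph extents as fractions of the true font size.
# Illustrative numbers only, not taken from any real font.
EXTENTS = {
    "ascender": 0.72,   # height of tall letters (b, d, f, h, k, l, t) and capitals
    "x_height": 0.48,   # height of small letters (a, c, e, n, o, x, ...)
    "descender": 0.21,  # depth below the baseline (g, j, p, q, y)
}
TALL = set("bdfhklt")
DEEP = set("gjpqy")


def tight_bbox_height(word: str, fontsize: float) -> float:
    """Height of a bbox that closely wraps `word` set at `fontsize`."""
    has_tall = any(c in TALL or c.isupper() for c in word)
    has_deep = any(c in DEEP for c in word)
    top = EXTENTS["ascender"] if has_tall else EXTENTS["x_height"]
    bottom = EXTENTS["descender"] if has_deep else 0.0
    return (top + bottom) * fontsize


def ocr_fontsize_estimate(word: str, true_fontsize: float) -> float:
    """Assumption for this sketch: reported font size == tight bbox height."""
    return tight_bbox_height(word, true_fontsize)


if __name__ == "__main__":
    # Two neighbouring words that were originally set in the same 12 pt font.
    for word in ("text", "type"):
        print(word, round(ocr_fontsize_estimate(word, 12.0), 2))
```

With these assumed extents, "type" (which contains "y" and "p") gets a noticeably larger estimate than "text", although both were set at 12 pt. Without the original font's metrics there is nothing to normalize against.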

The code behind recover_quad has to make assumptions to …

Replies: 3 comments

Answer selected by JorjMcKie
Category: Q&A
Labels: not a bug (not a bug / user error / unable to reproduce)
2 participants
This discussion was converted from issue #2176 on January 18, 2023 12:24.