Function recover_quad() gives incoherent result with OCRed text #2178

OrianeN · 2023-01-17T14:39:34Z

Describe the bug (mandatory)

I'm trying to extract the quad corresponding to an OCRed span in an image, but the result seems inconsistent with both the OCR dict values (.extractDICT()) and what I can see in the image.

To Reproduce (mandatory)

Here's the image I'm working on:

I extracted the text information in a dict with:

import fitz
doc = fitz.Document(filename)
ocred_page = doc[0].get_textpage_ocr(language="eng", full=True)
ocr_dict = ocred_page.extractDICT()

The Quad I would like to get is the one corresponding to the word "SHARING", in the 2nd line:

ocr_dict["blocks"][0]["lines"][1]

{'spans': [{'size': 39.067378997802734,
   'flags': 12,
   'font': 'GlyphLessFont',
   'color': 0,
   'ascender': 1.0,
   'descender': -0.00048828125,
   'text': 'SHARING',
   'origin': (15.0, 107.0),
   'bbox': (15.0, 40.0, 94.72999572753906, 107.03271484375)},
  {'size': 25.890541076660156,
   'flags': 12,
   'font': 'GlyphLessFont',
   'color': 0,
   'ascender': 1.0,
   'descender': -0.00048828125,
   'text': ' &',
   'origin': (94.72999572753906, 107.0),
   'bbox': (94.72999572753906, 65.0, 112.9800033569336, 116.0205078125)},
  {'size': 29.698484420776367,
   'flags': 12,
   'font': 'GlyphLessFont',
   'color': 0,
   'ascender': 1.0,
   'descender': -0.00048828125,
   'text': ' SKILLS',
   'origin': (112.9800033569336, 116.0),
   'bbox': (112.9800033569336, 74.0, 186.0, 116.0205078125)}],
 'wmode': 0,
 'dir': (1.0, 0.0),
 'bbox': (15.0, 40.0, 186.0, 116.0205078125)}

In order to get the corresponding Quad, I use the function fitz.recover_quad():

span = ocr_dict["blocks"][0]["lines"][1]["spans"][0]
quad = fitz.recover_quad(line_dir=ocr_dict["blocks"][0]["lines"][1]["dir"], span=span)

Quad(Point(15.0, 67.946259977296), Point(94.72999572753906, 40.0), Point(15.0, 107.03271484375), Point(94.72999572753906, 79.086454866454))

Here's a plot of the Quad:

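from shapely.geometry import Polygon
import geopandas as gpd
import matplotlib.pyplot as plt

# A Quad iterates as ul, ur, ll, lr; a polygon outline needs ul, ur, lr, ll,
# otherwise the shape is drawn self-intersecting.
poly1 = Polygon([(p.x, p.y) for p in (quad.ul, quad.ur, quad.lr, quad.ll)])

myPoly = gpd.GeoSeries(poly1)
myPoly.plot()
plt.show()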

Expected behavior (optional)

I find it strange that the quad is not straight, as the line_dir (1.0, 0.0) corresponds to a 0° angle to the x-axis.

I looked into the code of the function fitz.recover_bbox_quad() and the documentation of Rect, and I wonder whether something is wrong here:

PyMuPDF/fitz/utils.py, lines 4899 to 4903 in 5948fb4:

if hc >= 0 and hs <= 0:  # quadrant 1
    ul = bbox.bl - (0, hc)
    ur = bbox.tr + (hs, 0)
    ll = bbox.bl - (hs, 0)
    lr = bbox.tr + (0, hc)

The upper-left point ends up lower than both the bottom-left point of the bbox and the lower-left point of the quad. Is that expected?
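For reference, the reported quad can be reproduced by hand from the quoted branch. This is a minimal sketch in plain Python, not the library code. It assumes that hc and hs are the text height fontsize * (ascender - descender) multiplied by the cosine and sine of the line direction; that assumption is inferred from the numbers in this post. The helper name quadrant1_quad is hypothetical.

```python
def quadrant1_quad(size, ascender, descender, bbox, line_dir):
    """Apply the quoted 'quadrant 1' formulas to plain tuples.

    Assumption (not taken from the library source): the text height is
    size * (ascender - descender), and hc / hs are its cosine / sine parts.
    """
    x0, y0, x1, y1 = bbox          # bbox.bl is (x0, y1), bbox.tr is (x1, y0)
    cos, sin = line_dir
    height = size * (ascender - descender)
    hc, hs = height * cos, height * sin
    if not (hc >= 0 and hs <= 0):
        raise ValueError("only the quoted quadrant-1 branch is sketched here")
    ul = (x0, y1 - hc)             # bbox.bl - (0, hc)
    ur = (x1 + hs, y0)             # bbox.tr + (hs, 0)
    ll = (x0 - hs, y1)             # bbox.bl - (hs, 0)
    lr = (x1, y0 + hc)             # bbox.tr + (0, hc)
    return ul, ur, ll, lr


# Values of the "SHARING" span and its line, copied from the dict above.
ul, ur, ll, lr = quadrant1_quad(
    size=39.067378997802734,
    ascender=1.0,
    descender=-0.00048828125,
    bbox=(15.0, 40.0, 94.72999572753906, 107.03271484375),
    line_dir=(1.0, 0.0),
)
print(ul, ur, ll, lr)
```

Under that assumption the corners match the Quad printed above: the upper-left corner lands at y ≈ 67.95, between the bbox top (40.0) and the bbox bottom (107.03), because the assumed text height (about 39.09) is much smaller than the bbox height (about 67.03).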

Your configuration (mandatory)

Operating system, potentially version and bitness: Windows 10 64-bit, version 22H2
Python version, bitness: Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32
PyMuPDF version, installation method (wheel or generated from source): installed via pip install PyMuPDF - version 1.21.1

JorjMcKie · 2023-01-18T00:19:39Z

Maintainer

This is not a bug. It goes directly back to peculiarities of the font used by Tesseract OCR (GlyphLessFont), and to a fundamental problem inherent in the OCR process as such:

The bbox of a recognized word closely wraps the word.
The fontsize is roughly equal to the bbox height. This alone gives imprecise information! E.g. if, for two consecutive words (with originally identical font properties), one word contains characters using the space below the baseline (e.g., "y"), and the other word does not, then different font sizes will be reported by OCR! And because you do not know the original font, you have no way to correct this.

The code behind recover_quad has to make an assumption to be successful:

bbox.height is equal to fontsize * (asc - dsc), where asc and dsc are the font properties ascender and descender.
These values are not reported correctly for GlyphLessFont.
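How far this assumption is off for the spans in the question can be checked with a few lines of plain Python. This is only an illustrative sketch using the values posted above; the helper name height_check is hypothetical and not part of PyMuPDF.

```python
# (size, ascender, descender, bbox) for the three spans in the posted OCR dict.
spans = {
    "SHARING": (39.067378997802734, 1.0, -0.00048828125,
                (15.0, 40.0, 94.72999572753906, 107.03271484375)),
    " &": (25.890541076660156, 1.0, -0.00048828125,
           (94.72999572753906, 65.0, 112.9800033569336, 116.0205078125)),
    " SKILLS": (29.698484420776367, 1.0, -0.00048828125,
                (112.9800033569336, 74.0, 186.0, 116.0205078125)),
}


def height_check(size, ascender, descender, bbox):
    """Return (actual bbox height, height predicted by the assumption)."""
    actual = bbox[3] - bbox[1]
    assumed = size * (ascender - descender)
    return actual, assumed


for text, values in spans.items():
    actual, assumed = height_check(*values)
    print(f"{text!r}: bbox height {actual:.2f}, assumed height {assumed:.2f}")
```

For all three spans the bbox is taller than the assumed text height by more than 10 points, so any quad derived from the assumption cannot coincide with the bbox.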

So, bottom line: always use fitz.Tools().set_small_glyph_heights(True) in these circumstances.
This will avoid many cases of overly large or inconsistent bbox heights.
If possible, avoid the recover quad functions altogether.
In your case, with line["dir"] = (1, 0), the quad is directly available as fitz.Rect(bbox).quad. The same holds for other multiples of 90°.
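For a horizontal line such as line["dir"] = (1, 0), the quad is just the four corners of the span bbox. The sketch below spells this out in plain Python so that it runs without PyMuPDF. SimpleQuad, bbox_to_quad and outline are hypothetical stand-ins; with PyMuPDF installed, fitz.Rect(span["bbox"]).quad gives the same corners.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]


class SimpleQuad(NamedTuple):
    """Stand-in for a quad, with corners named as in PyMuPDF's Quad."""
    ul: Point
    ur: Point
    ll: Point
    lr: Point


def bbox_to_quad(bbox: Tuple[float, float, float, float]) -> SimpleQuad:
    """Corners of an axis-aligned bbox (x0, y0, x1, y1); y grows downwards."""
    x0, y0, x1, y1 = bbox
    return SimpleQuad(ul=(x0, y0), ur=(x1, y0), ll=(x0, y1), lr=(x1, y1))


def outline(quad: SimpleQuad) -> List[Point]:
    """Corners in drawing order (ul, ur, lr, ll), e.g. for a polygon plot."""
    return [quad.ul, quad.ur, quad.lr, quad.ll]


# bbox of the "SHARING" span from the question
quad = bbox_to_quad((15.0, 40.0, 94.72999572753906, 107.03271484375))
print(outline(quad))
```

The drawing order matters when plotting: passing the corners in their stored order (ul, ur, ll, lr) to a polygon constructor produces a self-intersecting shape.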

I tried your example with setting small glyph height to true, and things worked more or less correctly.


OrianeN · 2023-01-18T10:00:29Z

Author

Thank you very much for this very rapid and detailed answer!


JorjMcKie · 2023-01-18T12:22:40Z

Maintainer

Thank you for that nice feedback!
Assuming your permission, I am going to convert your post to a discussion item, so others can find and profit from it.
You made me think about including appropriate caveat comments in the documentation ...
Thanks for such a well-prepared post!

