Describe the bug (mandatory)
I'm trying to extract the quad corresponding to an OCRed span in an image, but the result seems inconsistent with the OCR dict values.

To Reproduce (mandatory)
Here's the image I'm working on: [image]

I extracted the text information into a dict with:

import fitz
doc = fitz.Document(filename)
ocred_page = doc[0].get_textpage_ocr(language="eng", full=True)
ocr_dict = ocred_page.extractDICT()

The quad I would like to get is the one corresponding to the word "SHARING", in the 2nd line: ocr_dict["blocks"][0]["lines"][1]

To get the corresponding quad, I use the function recover_quad:

span = ocr_dict["blocks"][0]["lines"][1]["spans"][0]
quad = fitz.recover_quad(line_dir=ocr_dict["blocks"][0]["lines"][1]["dir"], span=span)

Here's a plot of the quad:

from shapely.geometry import Polygon
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
poly1 = Polygon( [(p.x, p.y) for p in quad] )
myPoly = gpd.GeoSeries(poly1)
myPoly.plot()
plt.show()

Expected behavior (optional)
I find it strange that the quad is not straight, since the line_dir (1.0, 0.0) corresponds to a 0° angle to the x-axis. I looked into the code of the function (lines 4899 to 4903 in 5948fb4). The upper-left point ends up lower than both the bottom-left point of the bbox and the lower-left point of the quad. Is that expected?
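
The consistency check described in this question can be sketched in plain Python. This is only an illustration, not PyMuPDF code: `line_angle_degrees` and `quad_is_consistent` are hypothetical helpers, corners are plain (x, y) tuples, and the sketch assumes horizontal text with y growing downward (as in PyMuPDF page coordinates), so "lower" means a larger y value.

```python
import math


def line_angle_degrees(line_dir):
    """Angle of the writing direction (a unit vector) relative to the x-axis."""
    dx, dy = line_dir
    return math.degrees(math.atan2(dy, dx))


def quad_is_consistent(ul, ur, ll, lr, bbox, tol=1e-6):
    """Check a quad against its span bbox for horizontal text.

    Returns True only if both top corners lie above their bottom corners
    (smaller y) and all four corners fall inside the bbox (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = bbox
    tops_above_bottoms = ul[1] < ll[1] and ur[1] < lr[1]
    corners_inside = all(
        x0 - tol <= x <= x1 + tol and y0 - tol <= y <= y1 + tol
        for x, y in (ul, ur, ll, lr)
    )
    return tops_above_bottoms and corners_inside
```

With a real quad, the corners could be passed as `tuple(quad.ul)`, `tuple(quad.ur)`, `tuple(quad.ll)`, `tuple(quad.lr)` together with `span["bbox"]`. A `False` result for a line with direction (1.0, 0.0) would correspond to the situation reported here.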
Replies: 3 comments
This is not a bug, but goes directly back to peculiarities of the font used by TesseractOCR (GlyphLessFont).
And then there is the principal problem connected with the OCR process as such:
The code behind recover_quad has to make assumptions to …
So bottom line: always use … I tried your example with small glyph heights set to true, and things worked more or less correctly.
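
For readers who want to try the same experiment, here is a minimal sketch. It assumes that "small glyph height" refers to PyMuPDF's global switch `fitz.TOOLS.set_small_glyph_heights`; the reply does not name the call, so that mapping is an assumption. `recover_span_quad` is a hypothetical helper name, and the block/line/span indices default to the ones used in the question.

```python
def recover_span_quad(filename, block=0, line=1, span=0):
    """Recover the quad of one OCRed span with small glyph heights enabled."""
    import fitz  # PyMuPDF, imported lazily; Tesseract must be installed for OCR

    # Assumption: this is the "small glyph height" setting mentioned in the reply.
    previous = fitz.TOOLS.set_small_glyph_heights()  # no argument: query only
    fitz.TOOLS.set_small_glyph_heights(True)
    try:
        doc = fitz.Document(filename)
        textpage = doc[0].get_textpage_ocr(language="eng", full=True)
        line_dict = textpage.extractDICT()["blocks"][block]["lines"][line]
        return fitz.recover_quad(line_dict["dir"], line_dict["spans"][span])
    finally:
        # Restore the global setting so other code is not affected.
        fitz.TOOLS.set_small_glyph_heights(previous)
```

The setting is global, which is why the sketch restores the previous value afterwards. Whether the resulting quad matches the span bbox closely enough still depends on the OCR output, as the reply notes ("more or less correctly").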
Thank you very much for this very rapid and detailed answer!
Thank you for that nice feedback!