extract highlighted from overlapping text #2669

Answered by JorjMcKie
StevenSong asked this question in Q&A

Here is an approach that works and should at least demonstrate my point:

import fitz  # PyMuPDF

# make a list of sub-rectangles of the annotation `a`
# (a.vertices holds four points per highlighted quad)
a_rects = []
for i in range(0, len(a.vertices), 4):
    points = a.vertices[i:i + 4]
    rect = fitz.Quad(points).rect
    a_rects.append(rect)

# now extract the text and only include spans contained in one of the
# annot sub-rectangles (for at least 80% of their area):
for b in page.get_text("dict")["blocks"]:
    for line in b.get("lines", []):  # image blocks have no "lines"
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"])
            area = abs(bbox) * 0.8  # 80% of the span's area
            for r in a_rects:
                if abs(bbox & r) >= area:  # if there is sufficient overlap:
         …
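
To make the overlap criterion easier to test in isolation, here is a minimal sketch of the same idea in plain Python, without PyMuPDF. Everything in it is an assumption made for illustration: the helper names (`quads_to_boxes`, `is_highlighted`), the tuple layout `(x0, y0, x1, y1)`, and the sample coordinates are hypothetical and not part of the answer above. It only mirrors the two steps: group the vertices four at a time into bounding rectangles, then accept a span if at least 80% of its area lies inside one of them.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


def quads_to_boxes(vertices: Sequence[Point]) -> List[Box]:
    """Group vertices four at a time and return each quad's bounding box."""
    boxes = []
    for i in range(0, len(vertices) - 3, 4):
        quad = vertices[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def area(box: Box) -> float:
    """Area of a box; empty or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a: Box, b: Box) -> Box:
    """Intersection of two boxes (may be inverted if they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def is_highlighted(span_box: Box, boxes: Sequence[Box],
                   min_fraction: float = 0.8) -> bool:
    """True if at least min_fraction of the span's area lies in one box."""
    needed = area(span_box) * min_fraction
    if needed <= 0:
        # Assumption: zero-area spans are never treated as highlighted.
        return False
    return any(area(intersection(span_box, r)) >= needed for r in boxes)


# Made-up sample: a highlight covering two lines (two quads, eight vertices).
sample_vertices = [
    (10, 10), (110, 10), (10, 20), (110, 20),
    (10, 25), (60, 25), (10, 35), (60, 35),
]
sample_boxes = quads_to_boxes(sample_vertices)
```

With the sample above, a span box such as `(20, 10, 100, 20)` sits entirely inside the first rectangle and is accepted, while `(50, 25, 100, 35)` overlaps the second rectangle by only a fifth of its area and is rejected. The 0.8 threshold is the one from the answer; it is a tolerance, and other values may suit documents with tighter or looser line spacing.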

Answer selected by StevenSong