extract highlighted from overlapping text #2669

StevenSong · 2023-09-13T18:07:26Z

I've got some PDFs with overlapping text. It looks like this example I made:

As it says, I only want the text that is highlighted and not the overlaying text. This seems like it should be possible since when I go to manually highlight the text, it follows the flow of the paragraph underneath.

I've tried various things inspired by some older threads in this repo, like extracting using the vertices of the annotation but no luck. Here's the code I've tried so far:

import fitz
doc = fitz.open('test.pdf')
page = doc[0]
annots = list(page.annots())
highlight = annots[0]

# attempt 1 using the whole annot rect
words = page.get_text("words", clip=highlight.rect)
print(" ".join(w[4] for w in words))
# From this document I want to only extract this paragraph which is highlighted ot want this overlaying text which is not
# above output contains part of overlaying text

# attempt 2 using the annot vertices
words = page.get_text("words")
extracted = []
for i in range(0, len(highlight.vertices), 4):
    r = fitz.Quad(highlight.vertices[i:i+4]).rect
    for w in words:
        if fitz.Rect(w[:4]) in r:
            extracted.append(w[4])
print(" ".join(w for w in extracted))
# this document I to only extract paragraph want this overlaying text which is
# oddly missing bits and still includes overlaying text

Is there a way to accomplish what I'm looking for?

JorjMcKie · 2023-09-13T20:26:50Z

Maintainer

That looks pretty messed up indeed.
The primary problem is that the highlighted text and the highlight annotation(s) (there may be more than one) have no "physical" relationship: the highlights could be exactly the same if there were no text at all!
So the annot knows nothing about what it may or may not highlight.

In addition, highlighting software is sometimes stingy when it comes to defining the area it covers. Therefore, better make sure to use fitz.TOOLS.set_small_glyph_heights(True) when extracting text covered by the annot rect.
If there is other text also overlapping the annot rect ... that is too bad.
One way to separate it out is by looking at text properties: font name, font size, etc.
It also looks like the highlighted text spans in your example are always contained in some annot rect,
whereas the wild text is not fully contained in any highlighted area ... another option to differentiate.
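To illustrate the "text properties" option, here is a minimal sketch that filters spans by font name and size. It works on plain dictionaries shaped like the spans returned by page.get_text("dict") (keys "text", "font", "size"). The helper name spans_matching_style, the tolerance, and the font names and sizes in the sample data are assumptions made for illustration, not values taken from the example PDF.

```python
def spans_matching_style(spans, font, size, size_tol=0.5):
    """Keep spans whose font name matches and whose size is within size_tol."""
    return [
        s for s in spans
        if s["font"] == font and abs(s["size"] - size) <= size_tol
    ]

# Hand-made stand-ins for spans from page.get_text("dict");
# the fonts and sizes here are hypothetical.
spans = [
    {"text": "From this document I", "font": "Helvetica", "size": 11.0},
    {"text": "overlaying text", "font": "Times-Roman", "size": 14.0},
    {"text": "want to only extract", "font": "Helvetica", "size": 11.0},
]

kept = spans_matching_style(spans, font="Helvetica", size=11.0)
print(" ".join(s["text"] for s in kept))
```

This only helps when the highlighted paragraph and the overlaying text really differ in font or size; if they share both, the containment test is the better discriminator.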

2 replies

JorjMcKie Sep 13, 2023
Maintainer

Here is a successful approach that at least should demo my point:

# `a` is the highlight annotation and `page` is its page
# (`highlight` and `page` in the question's code)

# make a list of sub-rectangles of the annotation (4 vertices per quad)
a_rects = []
for i in range(0, len(a.vertices), 4):
    points = a.vertices[i:i+4]
    rect = fitz.Quad(points).rect
    a_rects.append(rect)

# now extract the text and only include spans contained in one of the
# annot sub rectangles (for 80% at least):
for b in page.get_text("dict")["blocks"]:
    for line in b.get("lines", []):  # image blocks have no "lines"
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"])
            area = abs(bbox) * 0.8  # compute span area size (80% of it)
            for r in a_rects:
                if abs(bbox & r) >= area:  # if there is sufficient overlap:
                    print(span["text"])  # accept that text
                    break  # do not print the same span twice

                    
From this document I 
want to only extract 
this paragraph which 
is highlighted
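A small dependency-free sketch may help show why strict containment (as in attempt 2 of the question) can drop words while the 80% overlap rule keeps them. Rectangles are plain (x0, y0, x1, y1) tuples; the coordinates and helper names are made up for illustration and do not come from the example PDF.

```python
def area(r):
    """Area of an (x0, y0, x1, y1) rectangle; 0 if it is empty."""
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def intersect(a, b):
    """Intersection of two rectangles (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def contains(outer, inner):
    """Strict containment: inner lies completely inside outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def covered_fraction(span, rect):
    """Fraction of the span's area that lies inside rect."""
    a = area(span)
    return area(intersect(span, rect)) / a if a else 0.0

# Hypothetical coordinates: the highlight is slightly shorter than the text line.
highlight = (100, 102, 300, 112)
span_in = (100, 101, 220, 113)   # highlighted line, bbox a bit taller
wild = (250, 96, 400, 120)       # overlaying text crossing the highlight

for name, box in (("span_in", span_in), ("wild", wild)):
    print(name, contains(highlight, box), round(covered_fraction(box, highlight), 2))
```

With these numbers the highlighted span fails strict containment because its bbox is taller than the highlight, yet more than 80% of its area is covered; the overlaying span is covered far less and is rejected by the threshold.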

Answer selected by StevenSong

StevenSong Sep 14, 2023
Author

the suggestion to only take spans which are almost entirely contained works perfectly for my use case, thank you!
