extract highlighted from overlapping text #2669
I've got some PDFs with overlapping text. It looks like this example I made:
As it says, I only want the text that is highlighted and not the overlaying text. This seems like it should be possible since when I go to manually highlight the text, it follows the flow of the paragraph underneath. I've tried various things inspired by some older threads in this repo, like extracting using the vertices of the annotation but no luck. Here's the code I've tried so far:

import fitz
doc = fitz.open('test.pdf')
page = doc[0]
annots = list(page.annots())
highlight = annots[0]
# attempt 1 using the whole annot rect
words = page.get_text("words", clip=highlight.rect)
print(" ".join(w[4] for w in words))
# From this document I want to only extract this paragraph which is highlighted ot want this overlaying text which is not
# above output contains part of overlaying text
# attempt 2 using the annot vertices
words = page.get_text("words")
extracted = []
for i in range(0, len(highlight.vertices), 4):
    r = fitz.Quad(highlight.vertices[i:i+4]).rect
    for w in words:
        if fitz.Rect(w[:4]) in r:
            extracted.append(w[4])
print(" ".join(extracted))
# this document I to only extract paragraph want this overlaying text which is
# oddly missing bits and still includes overlaying text

Is there a way to accomplish what I'm looking for?
Replies: 1 comment 2 replies
That looks pretty messed up indeed. In addition, highlighting software is sometimes stingy when it comes to defining the area it covers. Therefore, better make sure to use
Here is a successful approach that at least should demo my point:
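The code from the reply is not preserved in this copy of the thread. As a stand-in, and explicitly not the reply author's code, here is a minimal sketch of one way to act on the "stingy highlight" point: instead of requiring each word box to lie fully inside a highlight rectangle, accept a word when enough of its area overlaps. The helper names `overlap_ratio` and `highlighted_words`, the `min_overlap` threshold, and the "keep only the dominant block" heuristic are all assumptions made for illustration. The word tuples mimic the shape returned by `page.get_text("words")`, so the sketch runs without a PDF.

```python
from collections import Counter


def overlap_ratio(word_box, area):
    """Fraction of word_box's area lying inside area; both are (x0, y0, x1, y1)."""
    wx0, wy0, wx1, wy1 = word_box
    ax0, ay0, ax1, ay1 = area
    width = min(wx1, ax1) - max(wx0, ax0)
    height = min(wy1, ay1) - max(wy0, ay0)
    word_area = (wx1 - wx0) * (wy1 - wy0)
    if width <= 0 or height <= 0 or word_area <= 0:
        return 0.0
    return (width * height) / word_area


def highlighted_words(words, quad_rects, min_overlap=0.5):
    """Pick words that overlap any highlight rectangle by at least min_overlap.

    words: tuples shaped like (x0, y0, x1, y1, text, block_no, line_no, word_no).
    quad_rects: one (x0, y0, x1, y1) rectangle per highlighted line.
    """
    hits = [
        w for w in words
        if any(overlap_ratio(w[:4], r) >= min_overlap for r in quad_rects)
    ]
    if not hits:
        return []
    # Assumption: the highlighted paragraph and the overlay sit in different
    # text blocks, and the paragraph contributes the most hits.
    main_block = Counter(w[5] for w in hits).most_common(1)[0][0]
    return [w[4] for w in hits if w[5] == main_block]


# Made-up sample data: the highlight is slightly smaller than the word boxes,
# so a strict containment test would reject every word.
words = [
    (10, 10, 40, 20, "only", 0, 0, 0),
    (45, 10, 90, 20, "this", 0, 0, 1),
    (30, 12, 80, 24, "overlay", 1, 0, 0),
    (10, 40, 40, 50, "outside", 0, 2, 0),
]
quad_rects = [(11, 11, 89, 19)]
print(" ".join(highlighted_words(words, quad_rects)))
```

With a real document, `quad_rects` could be built from `highlight.vertices` in groups of four, as in attempt 2 above, and `words` would come from `page.get_text("words")`. Whether the overlay text actually lands in a separate block depends on how the PDF was produced, so the block filter is only a heuristic and may need replacing with another criterion.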