How to extract texts between two coordinates in a page? #3959

StephenZKCurry · 2024-10-17T02:08:39Z

I want to extract the text between two coordinates on a page, using the PDF's underlying flow of characters as the guide for ordering and segmenting the words rather than pre-sorting the characters by x/y position. This should mimic how dragging a cursor highlights text in a PDF. How can I do that?

JorjMcKie (Maintainer) · 2024-10-17T10:46:14Z

You can supply an arbitrary rectangle ("clip") inside which your desired text lives. If you only have top and bottom values, make a rectangle clip = pymupdf.Rect(0, top, page.rect.width, bottom).
Then execute text = page.get_text(sort=True, clip=clip).
This will (pymupdf v1.24.11+) extract the text in reading order.
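
The steps above can be sketched as a small helper. This is only an illustration: the name `text_between` is hypothetical, `page` is assumed to be a PyMuPDF page, and the sketch assumes that `clip` accepts a plain 4-tuple in place of a `pymupdf.Rect` (a rect-like value).

```python
def text_between(page, top, bottom):
    """Return the text of `page` inside the horizontal band top..bottom.

    `page` is assumed to be a PyMuPDF Page. The function name is made up
    for this sketch; only page.rect and page.get_text come from PyMuPDF.
    """
    # Accept the two y-coordinates in either order.
    y0, y1 = sorted((top, bottom))
    # Full-width band. The rect-like tuple (x0, y0, x1, y1) stands in
    # for pymupdf.Rect(0, y0, page.rect.width, y1).
    clip = (0, y0, page.rect.width, y1)
    # sort=True requests reading order (per the note above, v1.24.11+).
    return page.get_text("text", clip=clip, sort=True)
```

With a document opened as `doc = pymupdf.open("input.pdf")` (file name chosen for illustration), a call would look like `text_between(doc[0], 100, 300)`.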

patsnap-liujin · 2024-12-16T02:29:00Z

> clip = pymupdf.Rect(0, top, page.rect.width, bottom)
> text = page.get_text(sort=True, clip=clip).
> This will (pymupdf v1.24.11+) extract the text in reading order.

Thanks for the answer, @JorjMcKie! I have a follow-up question 😄. page.get_text returns the text intersecting the rect, but is there a built-in method or function that returns only the text fully contained in the rect? I have read textbox-extract-1.py and wrote the example below for my use case, but I'm looking for a more efficient and standard approach. Do you have any suggestions?

# page = doc[page_num]
# my_rect = fitz.Rect(bbox)
current_textpage = page.get_textpage()
# Key point: passing `textpage` reduces execution time significantly, but `clip`
# is then ignored, so `words` covers the whole page instead of only my_rect.
words = page.get_text("words", clip=my_rect, textpage=current_textpage)
# Keep only the words whose bounding box lies fully inside my_rect.
# With many rects, this filter has to run once per rect.
fully_contained_words = [w for w in words if fitz.Rect(w[:4]) in my_rect]
# make_text is a function from textbox-extract-1.py
fully_contained_text = make_text(fully_contained_words)
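
One possible way to handle many rectangles is to read the word list once and filter it per rectangle in plain Python. The sketch below is not a PyMuPDF API: the helper names are hypothetical, the word tuples are assumed to have the layout returned by `page.get_text("words")`, namely `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and `join_words` is a simplified stand-in for `make_text`.

```python
def fully_inside(word, rect):
    """True if the word's bbox lies entirely within rect (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = word[:4]
    rx0, ry0, rx1, ry1 = rect
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def words_per_rect(words, rects):
    """Return one list of fully contained words for each rect, in order."""
    return [[w for w in words if fully_inside(w, r)] for r in rects]


def join_words(words):
    """Join words into lines, grouping by (block_no, line_no)."""
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    lines = {}
    for w in ordered:
        lines.setdefault((w[5], w[6]), []).append(w[4])
    return "\n".join(" ".join(ws) for ws in lines.values())


# Hand-made sample data standing in for page.get_text("words").
sample_words = [
    (10, 10, 40, 20, "Hello", 0, 0, 0),
    (45, 10, 80, 20, "world", 0, 0, 1),
    (95, 10, 130, 20, "edge", 0, 0, 2),   # crosses the right border
    (10, 30, 60, 40, "second", 0, 1, 0),
]
sample_rects = [(0, 0, 100, 25), (0, 25, 100, 50)]

for rect, found in zip(sample_rects, words_per_rect(sample_words, sample_rects)):
    print(rect, repr(join_words(found)))
```

The page is extracted only once here, so the per-rectangle cost is a list comprehension over the words. For very large numbers of rectangles a spatial index might pay off, but that is beyond this sketch.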

patsnap-liujin Dec 16, 2024

After reading the source code of Page.get_textbox, I found that the function JM_rects_overlap decides whether two rectangles overlap. I changed it from

def JM_rects_overlap(a, b):
    if (0
            or a.x0 >= b.x1
            or a.y0 >= b.y1
            or a.x1 <= b.x0
            or a.y1 <= b.y0
            ):
        return 0
    return 1

to

def JM_rects_overlap(a, b):
    if (a.x0 < (b.x1+b.x0)/2 < a.x1
        and a.y0 < (b.y1+b.y0)/2 < a.y1):
        return 1
    return 0

and it works as I want. The check now tests whether the center point of b lies inside a, instead of comparing the edges.
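
To make the difference between the criteria concrete, here is a small self-contained sketch that uses plain `(x0, y0, x1, y1)` tuples instead of PyMuPDF rectangles. `overlaps` and `center_inside` mirror the two versions of `JM_rects_overlap` above. `contains_fully` is added only for comparison and is not taken from the library. All three names are hypothetical.

```python
def overlaps(a, b):
    """Edge-based test: True if a and b share any interior area."""
    return not (a[0] >= b[2] or a[1] >= b[3] or a[2] <= b[0] or a[3] <= b[1])


def center_inside(a, b):
    """Center-based test: True if the center of b is strictly inside a."""
    cx = (b[0] + b[2]) / 2
    cy = (b[1] + b[3]) / 2
    return a[0] < cx < a[2] and a[1] < cy < a[3]


def contains_fully(a, b):
    """Strictest test: True if b lies entirely within a."""
    return a[0] <= b[0] and a[1] <= b[1] and b[2] <= a[2] and b[3] <= a[3]


clip = (0, 0, 100, 100)
boxes = {
    "inside": (10, 10, 40, 20),
    "mostly inside": (90, 10, 108, 20),   # center x = 99
    "mostly outside": (96, 10, 120, 20),  # center x = 108
}
for name, box in boxes.items():
    print(name, overlaps(clip, box), center_inside(clip, box),
          contains_fully(clip, box))
```

The center-based test sits between the other two: it drops boxes that only graze the clip, but still accepts boxes that stick out slightly.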
