How to extract texts between two coordinates in a page? #3959
Replies: 2 comments 1 reply
-
You can supply an arbitrary rectangle ("clip") inside which your desired text lives. If you only have top and bottom values, make a rectangle
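As a minimal sketch of that idea, assuming the band should span the full page width: `band_rect` and `text_in_band` are hypothetical helper names, not part of PyMuPDF, and `page` is assumed to be a `fitz.Page`.

```python
def band_rect(page_width, top, bottom):
    """Return (x0, y0, x1, y1) covering the full page width between two y values."""
    y0, y1 = sorted((float(top), float(bottom)))  # tolerate swapped inputs
    return (0.0, y0, float(page_width), y1)


def text_in_band(page, top, bottom):
    """Extract plain text from a horizontal band of a page.

    Assumption: `page` is a fitz.Page, whose rectangle starts at (0, 0),
    so the band runs from x = 0 to x = page.rect.width.
    """
    clip = band_rect(page.rect.width, top, bottom)
    return page.get_text("text", clip=clip)
```

For example, `text_in_band(doc[0], 100, 200)` would ask for the text between y = 100 and y = 200 on the first page.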
-
Thanks for the answer, @JorjMcKie! I have some further questions 😄. The code:
# page = doc[page_num]
# my_rect = fitz.Rect(bbox)
current_textpage = page.get_textpage()
# Here is the point: passing `textpage` reduces execution time significantly,
# but then `clip` is ignored and `words` contains the whole page.
# I want only the content inside my_rect.
words = page.get_text("words", clip=my_rect, textpage=current_textpage)
# If I have many rects, this loop runs once per rect.
fully_contained_words = [w for w in words if fitz.Rect(w[:4]) in my_rect]
# make_text is a function from textbox-extract-1.py
fully_contained_text = make_text(fully_contained_words)
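One way to handle many rectangles with a single text page might look like the sketch below: extract the words once per page, then filter that list for each rectangle in plain Python. The tuple layout `(x0, y0, x1, y1, text, block_no, line_no, word_no)` is what `page.get_text("words")` returns; `contains`, `words_per_rect` and `join_words` are hypothetical names, and `join_words` is only a simplified stand-in for `make_text`, not the function from textbox-extract-1.py.

```python
def contains(rect, box):
    """True if box (x0, y0, x1, y1) lies fully inside rect (same layout)."""
    return (rect[0] <= box[0] and rect[1] <= box[1]
            and box[2] <= rect[2] and box[3] <= rect[3])


def words_per_rect(words, rects):
    """Return one list of fully contained words per rectangle, in input order.

    Plain tuple comparisons stand in for `fitz.Rect(w[:4]) in my_rect`.
    """
    return [[w for w in words if contains(r, w[:4])] for r in rects]


def join_words(words):
    """Hypothetical stand-in for make_text: one output line per (block_no, line_no)."""
    lines = {}
    for w in words:
        lines.setdefault((w[5], w[6]), []).append(w[4])
    return "\n".join(" ".join(parts) for parts in lines.values())


# Intended use (assumes `page`, `current_textpage` and a list `my_rects` exist):
# words = page.get_text("words", textpage=current_textpage)  # once per page
# texts = [join_words(ws) for ws in words_per_rect(words, my_rects)]
```

The filtering still touches every word once per rectangle, but that is cheap list work compared with building a new text page for each rectangle.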
-
I want to extract the text between two coordinates on a page, using the PDF's underlying flow of characters as a guide for ordering and segmenting the words rather than presorting the characters by x/y position. This would mimic how dragging a cursor highlights text in a PDF. How can I do that?
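A rough sketch of what such a selection could look like, under stated assumptions: the word list is kept in the order the extractor emitted it (for PyMuPDF, `page.get_text("words")` without `sort=True`), and that order is assumed to follow the document's own flow closely enough. The selection then runs from the word nearest the start point to the word nearest the end point. All function names here are hypothetical.

```python
def squared_distance(point, box):
    """Squared distance from (x, y) to box (x0, y0, x1, y1); 0 if the point is inside."""
    x, y = point
    dx = max(box[0] - x, 0.0, x - box[2])
    dy = max(box[1] - y, 0.0, y - box[3])
    return dx * dx + dy * dy


def nearest_word_index(words, point):
    """Index of the word whose bounding box is closest to `point`."""
    return min(range(len(words)),
               key=lambda i: squared_distance(point, words[i][:4]))


def select_between(words, start, end):
    """Words from the one nearest `start` to the one nearest `end`.

    The emitted order of `words` is preserved; nothing is sorted by x/y.
    """
    if not words:
        return []
    i = nearest_word_index(words, start)
    j = nearest_word_index(words, end)
    if i > j:  # allow dragging "backwards"
        i, j = j, i
    return words[i:j + 1]
```

On a two-column page, a selection that starts in the left column and ends in the right one would pick up the rest of the left column first, which is roughly what a cursor drag does. Whether the emitted order really matches the visual reading order depends on how the PDF was produced.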