-
You may want to use the table finder:
In [1]: import fitz
In [2]: doc=fitz.open("Presentation1.pdf")
In [3]: page = doc[0]
In [4]: tabs = page.find_tables() # detect the tables
In [5]: tab = tabs[0] # get first table on page
In [6]: tab.extract() # read all cell content
Out[6]:
[['丕', '石', '右', '布', '夯', '龙', '戊'],
['丕', '石', '右', '布', '夯', '龙', '戊'],
['', None, None, None, None, None, None]]
In [7]: # also works as a pandas DataFrame:
In [8]: df = tab.to_pandas()
In [9]: df
Out[9]:
丕 石 右 布 夯 龙 戊
0 丕 石 右 布 夯 龙 戊
1 None None None None None None
In [10]: # you can also extract row and cell coordinates
In [11]: row = tab.rows[0]
In [12]: row.bbox
Out[12]: (0.0, 0.0, 369.3599853515625, 50.4000244140625)
In [14]: row.cells[0]
Out[14]: (0.0, 0.0, 52.768001556396484, 50.4000244140625)
For background on PyMuPDF's table finder, see blog1 and blog2.
-
Thank you.
-
There is an error. Also, how can I crop the image of a cell?
File "F:/pycharm2020.2/RapidStructure-0.0.0/exportLine_txt_png_tables.py", line 20, in
-
I cannot see a bbox property on the cell.
-
I cannot get the bboxes for the cells.
-
Thank you.
-
OK. Thank you.
-
How do I read a table's cells by rows and columns?
-
Try

tabs = page.find_tables()
tab = tabs[0]
for e in tab.extract():
    print(e)
['。作者', None, None, '件事”。“只要', '看20多份报']
['pulseawry', '认为,', '。冰壶运动需要', None, None]
[',还有师父咸鱼', '564.77', '8506.00', '', '32.20']
['Ssubs', '', '', '', '4862.66']
[None, '7', '119.57', '满了湿润和温', '68']
tab.row_count, tab.col_count
(5, 5)

For each of these cell content items you will find the corresponding cell coordinates like this:

for row in tab.rows:
    print(row.cells)  # a list of rectangle tuples

When encountering joined cells, the table finder will dissolve them and always return the maximum values for rows / columns. It will use None for the positions that a joined cell covers beyond its first one:

for row in tab.rows:
    print([tuple(fitz.IRect(cell)) if cell is not None else None for cell in row.cells])
[(66, 243, 562, 271), None, None, (562, 243, 728, 300), (728, 243, 894, 300)]
[(66, 271, 231, 300), (231, 271, 397, 300), (397, 271, 562, 300), None, None]
[(66, 300, 231, 329), (231, 300, 397, 329), (397, 300, 562, 329), (562, 300, 728, 329), (728, 300, 894, 329)]
[(66, 329, 231, 387), (231, 329, 397, 358), (397, 329, 562, 358), (562, 329, 728, 358), (728, 329, 894, 358)]
[None, (231, 358, 397, 387), (397, 358, 562, 387), (562, 358, 728, 387), (728, 358, 894, 387)]

If you compare the text and the cell output you will see the logic used to dissolve the joined cells.
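The correspondence between the extracted text and the cell rectangles can be sketched as a small helper. This is only an illustration and not part of PyMuPDF: `pair_cells` is a hypothetical name, and it assumes that the two nested lists have the same shape, as they do in the output above.

```python
def pair_cells(texts, cell_rows):
    """Yield (row, col, text, bbox) for every position that has a rectangle.

    texts     -- list of rows of cell strings, e.g. from tab.extract()
    cell_rows -- list of rows of bbox tuples or None, e.g.
                 [row.cells for row in tab.rows]
    """
    for r, (text_row, bbox_row) in enumerate(zip(texts, cell_rows)):
        for c, (text, bbox) in enumerate(zip(text_row, bbox_row)):
            if bbox is None:  # position covered by a joined cell
                continue
            yield r, c, text, bbox
```

With a real table this could be called as `pair_cells(tab.extract(), [row.cells for row in tab.rows])`, which gives the row index, column index, text and rectangle of each cell in reading order.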
-
textbox.pdf
Presentation1.pdf
For the first line, I put a table row or a text box below the image, but when I use the code below to extract the characters, I can only get an image of the whole line. How do I split it into cells?
'''
import fitz
from PIL import Image
import cv2
import numpy

path = r'L:/jp5/Presentation1.pdf'
path = r'L:/jp5/textbox.pdf'
doc = fitz.open(path)
j = 0
for i in range(0, doc.page_count):
    page = doc[i]
    for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
        for line in block["lines"]:
            j = j + 1
            bbox = line["bbox"]  # the line bbox
            text = " ".join([span["text"] for span in line["spans"]])  # text in line
            if len(text.strip()) > 2:
                print(bbox)
                print(text)
                pix = page.get_pixmap(clip=bbox, dpi=300)  # pixmap of line bbox
                # pix.save(...)"page%s-%s.png" % (page, xref)
                # pix.save(r'F:\0mupdflines\' + "page-%i_%i.png" % (page.number, j))  # store image as a PNG
doc.close()
'''
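One possible way to get one image per cell instead of one per line is to clip the pixmap to each cell rectangle reported by the table finder discussed earlier in this thread. The following is a minimal sketch under assumptions: `save_cell_images` and its file naming scheme are hypothetical, and it only helps if `page.find_tables()` actually detects the row as a table, which may not be the case for a plain text box.

```python
from pathlib import Path


def save_cell_images(page, tab, out_dir, dpi=300):
    """Render every detected cell of `tab` on `page` to its own PNG file.

    page -- a PyMuPDF page
    tab  -- one table returned by page.find_tables()
    Returns the list of written file paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for r, row in enumerate(tab.rows):
        for c, cell in enumerate(row.cells):
            if cell is None:  # position covered by a joined cell
                continue
            # clip the rendering to the cell rectangle
            pix = page.get_pixmap(clip=cell, dpi=dpi)
            target = out_dir / f"page{page.number}-r{r}-c{c}.png"
            pix.save(str(target))
            saved.append(target)
    return saved
```

Inside the page loop this could be used as `for tab in page.find_tables(): save_cell_images(page, tab, "cells")`, where `"cells"` is an arbitrary output folder.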