table lines and cells being extracted in a weird way #1831
I was trying to detect and extract tables from the drawings of a page, but sometimes fitz extracts the table lines (the ones that make up the table cells) as many rectangles with a really small area, so not the cell as a rectangle, and it does that without an obvious pattern. Sometimes the extracted rectangle isn't even a line but just a point. I created this notebook; here the coordinates of the rectangle indicate that this is a line or a point. I was trying to use lines or rectangles to extract cells.
Replies: 2 comments 3 replies
Was just kidding, no worries. Then convert x_values and y_values back to lists:

    # assumes x_values and y_values are sorted in ascending order
    cells = []
    for i in range(len(y_values) - 1):  # one row per pair of neighboring y values
        line = []
        for j in range(len(x_values) - 1):  # one cell per pair of neighboring x values
            cell = fitz.Rect(x_values[j], y_values[i], x_values[j + 1], y_values[i + 1])
            line.append(cell)
        cells.append(line)

If I didn't mess up something too badly, you should then be able to access each cell by its row and column index.

But here is a little present that handles the mess of your example table quite well 😎:
This depends on the PDF creator software:
E.g. Word always exports lines as thin rectangles, and so do other office products.
You must use some logic that treats rectangles thinner than x points as lines.
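A minimal sketch of such logic, using plain `(x0, y0, x1, y1)` tuples so it runs without PyMuPDF. The function names and the 2-point default threshold are assumptions chosen for illustration, not anything PyMuPDF prescribes; pick a threshold that fits your PDFs.

```python
def classify_rect(rect, max_thickness=2.0):
    """Classify an (x0, y0, x1, y1) rectangle as 'point', 'hline', 'vline' or 'box'.

    max_thickness is a hypothetical threshold in points: an extent at or
    below it counts as "thin".
    """
    x0, y0, x1, y1 = rect
    thin_w = abs(x1 - x0) <= max_thickness
    thin_h = abs(y1 - y0) <= max_thickness
    if thin_w and thin_h:
        return "point"   # tiny in both directions
    if thin_h:
        return "hline"   # wide but flat: a horizontal line
    if thin_w:
        return "vline"   # tall but narrow: a vertical line
    return "box"         # a real rectangle, possibly a cell border


def rect_to_line(rect, max_thickness=2.0):
    """Collapse a thin rectangle to its center line, or return None."""
    kind = classify_rect(rect, max_thickness)
    x0, y0, x1, y1 = rect
    if kind == "hline":
        y = (y0 + y1) / 2
        return (min(x0, x1), y, max(x0, x1), y)
    if kind == "vline":
        x = (x0 + x1) / 2
        return (x, min(y0, y1), x, max(y0, y1))
    return None
```

With real data you would feed in the rectangles of the "re" items found in the drawings of a page, converted to tuples.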
Also please read the documentation again: for several PyMuPDF versions now, multiple "re" items may occur within the same path.
You also cannot rely on the assumption that tables with grid lines enclose each (or any) of their cells in a rectangle. There may just be lines (or, as mentioned, those thin rectangles). Or there may be a mixture of both. Whatever you can imagine will be found somewhere.
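One way to cope with such a mixture is to reduce everything to the coordinates of the grid lines. The sketch below is an illustration under assumptions, not a PyMuPDF API: items are simplified to `("l", (x0, y0), (x1, y1))` and `("re", (x0, y0, x1, y1))` tuples, and `merge_close`, `grid_coordinates`, `max_thickness` and `tol` are hypothetical names and thresholds.

```python
def merge_close(values, tol=1.0):
    """Sort values and merge neighbors into one value (the group mean).

    A value joins the current group if it is within tol of the previous one.
    """
    merged, group = [], []
    for v in sorted(values):
        if group and v - group[-1] > tol:
            merged.append(sum(group) / len(group))
            group = []
        group.append(v)
    if group:
        merged.append(sum(group) / len(group))
    return merged


def grid_coordinates(items, max_thickness=2.0, tol=1.0):
    """Return (x_values, y_values) of grid lines from mixed line/rect items."""
    xs, ys = [], []
    for item in items:
        if item[0] == "l":
            (x0, y0), (x1, y1) = item[1], item[2]
        elif item[0] == "re":
            x0, y0, x1, y1 = item[1]
        else:
            continue  # curves etc. are ignored in this sketch
        wide = abs(x1 - x0) > max_thickness
        tall = abs(y1 - y0) > max_thickness
        if wide and not tall:
            ys.append((y0 + y1) / 2)   # horizontal line or thin rectangle
        elif tall and not wide:
            xs.append((x0 + x1) / 2)   # vertical line or thin rectangle
        elif wide and tall and item[0] == "re":
            xs.extend([x0, x1])        # real box: all four edges count
            ys.extend([y0, y1])
        # points and diagonal lines contribute nothing
    return merge_close(xs, tol), merge_close(ys, tol)


items = [
    ("l", (0, 0), (100, 0)),         # horizontal line at y = 0
    ("re", (0, 49.5, 100, 50.5)),    # thin rectangle acting as line at y = 50
    ("l", (0, 0), (0, 50)),          # vertical line at x = 0
    ("re", (99.5, 0, 100.5, 50)),    # thin rectangle acting as line at x = 100
    ("re", (0, 0, 50, 50)),          # a genuine cell rectangle
    ("re", (10, 10, 10.1, 10.1)),    # a "point", ignored
]
x_values, y_values = grid_coordinates(items)
```

The two returned lists are sorted, so they could serve as the `x_values` and `y_values` from which cell rectangles are built. Note that this treats every grid line as spanning the whole table, so merged cells would need extra handling.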
So to find the cell rectangles you must compute the crossing points of horizontal and vertical (p…