
table lines and cells being extracted in a weird way #1831

Answered by JorjMcKie
AttentiveNader asked this question in Q&A

This depends on the software that created the PDF.
For example, Word always exports lines as thin rectangles, and so do other office products.
You need some logic that treats rectangles thinner than x points as lines.
Also, please read the documentation again: in recent PyMuPDF versions, multiple "re" items may occur within the same path.
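
As a rough illustration of the "thin rectangle means line" rule, here is a minimal sketch. It works on plain `(x0, y0, x1, y1)` tuples so it runs without PyMuPDF; in practice the rectangles would come from the "re" items of each path returned by `page.get_drawings()`. The names `Segment`, `rect_to_segment` and `max_thickness`, and the 2-point default threshold, are assumptions made for this example and not part of PyMuPDF.

```python
from typing import NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    orientation: str  # "h" for horizontal, "v" for vertical
    pos: float        # y of a horizontal line, x of a vertical line
    start: float      # smaller coordinate along the line
    end: float        # larger coordinate along the line


def rect_to_segment(
    rect: Tuple[float, float, float, float],
    max_thickness: float = 2.0,  # assumed threshold in points; tune per document
) -> Optional[Segment]:
    """Return a Segment if the rectangle is thin enough to be a line, else None."""
    x0, y0, x1, y1 = rect
    width, height = abs(x1 - x0), abs(y1 - y0)
    if height <= max_thickness and width > height:
        # Flat and wide: treat as a horizontal line through the rectangle's middle.
        return Segment("h", (y0 + y1) / 2, min(x0, x1), max(x0, x1))
    if width <= max_thickness and height > width:
        # Narrow and tall: treat as a vertical line through the rectangle's middle.
        return Segment("v", (x0 + x1) / 2, min(y0, y1), max(y0, y1))
    return None  # a real rectangle (or a tiny square), not a line
```

Since one path may hold several "re" items, this check should be applied to every item of every path, not just the first one.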

You also cannot rely on the assumption that tables with grid lines enclose each (or any) of their cells in a rectangle. There may be only lines (or, as mentioned, those thin rectangles), or a mixture of both. Whatever you can imagine will be found somewhere.
So to find the cell rectangles you must compute the crossing points of horizontal and vertical (p…
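
One way to picture the crossing-point idea is the sketch below. It is not the method used inside PyMuPDF. It assumes the lines have already been reduced to horizontal `(y, x0, x1)` and vertical `(x, y0, y1)` tuples whose coordinates were snapped to common values, and it only reports cells whose four corners are all crossing points, so merged cells are not detected. The function names and the `tol` parameter are hypothetical.

```python
def crossing_points(h_lines, v_lines, tol=1.0):
    """Collect (x, y) points where a horizontal and a vertical line cross.

    h_lines: iterable of (y, x0, x1) with x0 <= x1
    v_lines: iterable of (x, y0, y1) with y0 <= y1
    tol: assumed slack in points for lines that stop slightly short
    """
    points = set()
    for y, hx0, hx1 in h_lines:
        for x, vy0, vy1 in v_lines:
            if hx0 - tol <= x <= hx1 + tol and vy0 - tol <= y <= vy1 + tol:
                points.add((x, y))
    return points


def cells_from_points(points):
    """Build (x0, y0, x1, y1) cells from neighbouring grid coordinates."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            corners = {(left, top), (right, top), (left, bottom), (right, bottom)}
            if corners <= points:  # keep the cell only if all four corners exist
                cells.append((left, top, right, bottom))
    return cells
```

For a grid of three horizontal and three vertical lines this yields four cells, listed row by row from the top left.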

Answer selected by AttentiveNader
Category
Q&A
Labels
not a bug (not a bug / user error / unable to reproduce)
2 participants

This discussion was converted from issue #1830 on July 24, 2022 13:28.