PyMuPDF has Added Table Recognition! #2600
JorjMcKie
started this conversation in
Announcements
The newest PyMuPDF version, 1.23.0, includes a new feature that automatically identifies tables on `Document` pages.

The feature is implemented as a single new `Page` method, `find_tables()`. It returns the detected tables as `Table` objects. For every table, its overall bounding box (bbox), header, rows, columns, and the associated cell text and cell bounding boxes can be extracted.

The feature is fully integrated and does not introduce dependencies on other packages or external software. Tables and their content and attributes are standard Python objects like lists or tuples.
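A minimal sketch of reading these attributes could look like the following. The helper names `summarize_tables` and `summarize_first_page` are made up for illustration. The sketch also assumes, based on the PyMuPDF documentation rather than this post, that `find_tables()` hands back a `TableFinder` whose `tables` attribute is the list of `Table` objects, and that each `Table` offers `bbox`, `header.names`, `row_count`, `col_count` and `extract()`.

```python
def summarize_tables(page):
    """Return one plain-Python summary dict per table detected on `page`."""
    finder = page.find_tables()  # assumed: a TableFinder object
    summaries = []
    for table in finder.tables:  # assumed: list of Table objects
        summaries.append({
            "bbox": tuple(table.bbox),                     # overall bounding box
            "header": list(table.header.names),            # column names
            "shape": (table.row_count, table.col_count),   # rows x columns
            "rows": table.extract(),                       # cell text, list of lists
        })
    return summaries


def summarize_first_page(path):
    """Hypothetical wrapper: open the file at `path` and summarize its first page."""
    import fitz  # PyMuPDF; imported here so summarize_tables() itself needs no import

    with fitz.open(path) as doc:
        return summarize_tables(doc[0])
```

Calling `summarize_first_page("input.pdf")` (the file name is a placeholder) would return only lists, tuples, dicts and strings, with no extra packages involved.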
In contrast to similar solutions in other Python packages, we do not require huge packages like pandas. But we do offer a `Table` method, `to_pandas()`, that exports the table to a pandas DataFrame. In the end, you will in many cases want to use DataFrames to produce other formats like Excel, CSV, JSON and many more.

Table recognition is a new feature. It was developed as an extended port of similar solutions in other products. Therefore, this first version is already very mature.
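As a rough sketch of the `to_pandas()` export described above, the helper below turns every table on a page into CSV text. The name `tables_as_csv` is hypothetical, pandas must be installed for `to_pandas()` to work, and the `.tables` attribute on the result of `find_tables()` is an assumption taken from the PyMuPDF documentation rather than from this post.

```python
def tables_as_csv(page):
    """Return one CSV string per table detected on `page` (requires pandas)."""
    csv_texts = []
    for table in page.find_tables().tables:  # assumed: list of Table objects
        df = table.to_pandas()               # pandas DataFrame
        # With no path argument, DataFrame.to_csv() returns the CSV as a string.
        csv_texts.append(df.to_csv(index=False))
    return csv_texts
```

The same DataFrame could instead be passed to `to_excel()` or `to_json()` if one of those formats is preferred.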
Nevertheless, we strive to further enhance it in future versions. Although not probable, this may entail minor changes to the API (e.g. method `.find_tables()`). We therefore recommend viewing the feature as still being somewhat "experimental".

As always, we value your feedback as a user and encourage you to try out table recognition and extraction.
Please do have a look at the new example scripts in this repository folder. So far, we have been providing them as Jupyter notebooks, because we felt this to be the most appropriate, user-friendly way to explain this rather complex topic.
Please also be aware that table detection and extraction work for all supported document types, not just for PDF, and with no changes. So whether you have tables on PDF, XPS, EPUB or MOBI pages: the same API will work.
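To illustrate that nothing in the calling code depends on the file type, here is a small sketch that counts tables per page for any document PyMuPDF can open. The function names `tables_per_page` and `count_tables_in_file` are invented for this example, and the `.tables` attribute is again an assumption based on the PyMuPDF documentation.

```python
def tables_per_page(doc):
    """Map each page number of an open document to its number of detected tables."""
    return {page.number: len(page.find_tables().tables) for page in doc}


def count_tables_in_file(path):
    """Hypothetical wrapper: works the same for PDF, XPS, EPUB or MOBI paths."""
    import fitz  # PyMuPDF; the file type is detected when the document is opened

    with fitz.open(path) as doc:
        return tables_per_page(doc)
```

Passing `"report.pdf"` or `"book.epub"` (placeholder file names) goes through exactly the same code path.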
We have also published an article, "Table Recognition and Extraction With PyMuPDF", on the Artifex blog: https://artifex.com/blog/table-recognition-extraction-from-pdfs-pymupdf-python