PyMuPDF has Added Table Recognition! #2600

JorjMcKie · 2023-08-20T08:55:00Z

JorjMcKie
Aug 20, 2023
Maintainer

The newest PyMuPDF version 1.23.0 includes the new feature to automatically identify tables on Document pages.

The feature is implemented as one new Page method find_tables(). It returns a list of detected Table objects. For every table, its overall boundary box (bbox), header, rows, columns and associated cell text and cell boundary boxes can be extracted.
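As a minimal sketch of how these attributes might be read, the helper names `describe_table` and `tables_on_page` below are hypothetical and not part of PyMuPDF. The sketch assumes the `Table` attributes `bbox`, `header.names`, `row_count`, `col_count` and the method `extract()`. It also allows for `find_tables()` returning either a plain list or a finder object that holds the list in a `tables` attribute.

```python
try:
    import fitz  # PyMuPDF, version 1.23.0 or later
except ImportError:  # the pure helper below still works without PyMuPDF
    fitz = None


def describe_table(table):
    """Collect the basic attributes of one detected table in a plain dict."""
    return {
        "bbox": tuple(table.bbox),           # overall boundary box (x0, y0, x1, y1)
        "header": list(table.header.names),  # column names of the header row
        "rows": table.row_count,
        "cols": table.col_count,
        "cells": table.extract(),            # cell text as a list of row lists
    }


def tables_on_page(path, page_number=0):
    """Return one describe_table() dict per table found on the given page."""
    if fitz is None:
        raise RuntimeError("PyMuPDF is not installed")
    with fitz.open(path) as doc:
        page = doc[page_number]
        found = page.find_tables()
        # Assumption: accept either a list of tables or an object
        # exposing the list as its "tables" attribute.
        tables = getattr(found, "tables", found)
        return [describe_table(t) for t in tables]
```

Calling `tables_on_page("input.pdf")` (the file name is a placeholder) would then give one dictionary per detected table on the first page.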

The feature is fully integrated and does not introduce dependencies to other packages or external software. Tables and their content and attributes are standard Python objects like lists or tuples.
In contrast to similar solutions of other Python packages, we do not require huge packages like pandas. But we do offer a Table method to_pandas() that exports the table to a pandas DataFrame. In the end, you will in many cases want to use DataFrames to produce other formats like Excel, CSV, JSON and many more.
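Because the extracted content is plain Python data, simple exports do not need pandas at all. The following sketch assumes the cell text is available as a list of row lists, as `Table.extract()` is expected to return, with `None` for empty cells. The function name `rows_to_csv` is hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of row lists (cell strings or None) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells may come back as None; write them as empty strings.
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()


# With PyMuPDF installed, the rows would come from a detected table, e.g.
#   rows = table.extract()
# and the pandas route mentioned above would be
#   df = table.to_pandas()
```

For anything beyond flat text output, such as Excel or JSON, going through `to_pandas()` and the DataFrame writers is likely the more convenient path.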

Table recognition is a new feature. It was developed as an extended port of similar solutions in other products. Therefore, this first version is already very mature.
Nevertheless, we strive to further enhance it in future versions. Although not probable, this may entail minor changes to the API (e.g. method .find_tables()). We therefore recommend viewing the feature as still somewhat "experimental".

As always, we value your feedback as a user and encourage you to try out Table recognition and extraction.

Please do have a look at new example scripts in this repository folder. So far, we have been making Jupyter notebooks, because we felt this to be the most appropriate, user-friendly way to explain this rather complex topic.

Please also be aware that table detection and extraction work for all supported document types, not just for PDF, and with no changes. So whether you have tables on PDF, XPS, EPUB or MOBI pages: the same API will work.

We have also published an article "Table Recognition and Extraction With PyMuPDF" on Artifex' blog website https://artifex.com/blog/table-recognition-extraction-from-pdfs-pymupdf-python.
