Expose PDF objects as Python objects? #2022

yossizahn · 2022-11-05T22:19:57Z

Hi,
Would it make sense as an enhancement to expose PDF objects as Python objects?
Currently the interface for accessing PDF objects seems somewhat clunky and incomplete.
The current API is composed of:

doc.xref_object, which returns a string (string manipulation or custom parsing is needed to make it useful).
doc.xref_get_key, which thankfully supports nested dicts but doesn't provide a way to get fields from dicts contained in arrays (e.g. doc.xref_get_key(n, '/KeyOfArrayVal/[0]/Key') is not supported).
It returns a tuple of (type, val), where val is again a plain string.
doc.xref_get_keys, which allows discovering the keys of a dict but unfortunately doesn't support nested dicts. I don't currently know how to enumerate the keys of a nested dict.

As an improvement, I would like to suggest exposing the objects as Python objects, using a class for each object type and allowing indexing on non-primitives to access nested values.

Answered by JorjMcKie

JorjMcKie · 2022-11-06T11:35:10Z

Maintainer

Thanks for submitting this!

There are a lot of analogies between Python's object model and what is called an "object" in PDF. So your idea does have tempting aspects.

There are however good reasons why we don't want to do this:

MuPDF supports a range of different document types. If you count them and include images, you will end up with more than a dozen. This set of types is not set in stone: support for the MOBI e-book format, for example, was added in version 1.21.0.
PDF is just one type among many others. The overall strategy is to abstract from the differences between these document types and to keep a large set of common, universally applicable code. Text extraction and searching is a good example: it works for any type, without code change.
PDF surely is the most popular format, so it does deserve extra treatment to some extent. But we also want to keep this under control. I believe we have already gone very far with PyMuPDF's special support for PDF, further than most packages that support PDF and nothing else.
The PDF file format is basically ASCII strings (optionally with interspersed binary content) that follow some protocol as laid out in the PDF specification. But there is no technical enforcement of its internal integrity. It is fair to say that all sorts of things can be wrong inside a PDF, and most problems are not detectable until one actually tries to access the relevant file portion: successfully opening a PDF does not mean it is a valid document.
This is not just a theoretical risk: PDF is sadly famous for its many examples with invalid structures, whether because of creation against the specs, incomplete downloads, circular references or whatnot.
Mapping the "dirty" internal PDF file structure to the very clean Python object model would have to cope with all these risks - and we haven't yet talked about fundamental compatibility problems, which do exist and would require a deep analysis.
Some "economic" considerations: What would be achieved beyond the currently possible things? Who is this for - how frequent are situations like the one you describe?
While you make a good point about the inability to enumerate sub-dictionary keys, implementing this sort of thing would quickly lead to an explosion of consequential efforts. The parent of a sub-dictionary is not always another dictionary: items of an array may also be dictionaries, and those are present as strings and are thus not interpretable using MuPDF's C code. Python code would have to be created to step down into a potentially endless chain of other sub-dictionaries. And excluding this would be ugly, wouldn't it?
Today, what you can access via doc.xref_get_key() is also updateable via doc.xref_set_key(). Assuming we do have a Python structure resembling some PDF object structure: the way back to the PDF file after updates to the Python data is impossible - at least by any reasonable effort.
PyMuPDF internally uses xref_get_key() / xref_set_key() in several situations to avoid specialized C code. Because of the way these methods are implemented, very high performance can still be maintained. Extensions like the ideas discussed here would inevitably introduce incalculable slowdowns.

There are still a few selected options for experts who know what they are doing:
If you really need to unfold a dictionary's sub-structure, create a new xref using the string returned by doc.xref_get_key(). Then access the list of its top level keys via doc.xref_get_keys(temp_xref).
But do not save the PDF afterwards and also do not update that temp_xref ...
You can of course use a thus returned key to step down in the original dictionary.

7 replies

JorjMcKie Nov 6, 2022
Maintainer

I know pikepdf. Looking at its response time opening a large PDF underpins my previous comments: this is what happens if you dive into object structures and turn them into Python ...

But there is more help available if you invest a little in some home-grown functions. Suppose you have received a response ("dict", dstring) from doc.xref_get_key(xref, "some/key/chain"). Then you can do this:

t, dstring = doc.xref_get_key(page.xref, "Resources")

print(dstring)
<</Font<</R8 12 0 R/R10 13 0 R/R12 14 0 R/R14 15 0 R/R17 16 0 R/R20 17 0 R/R23 18 0 R/R27 19 0 R>>
/ProcSet[/PDF/Text]/ExtGState<</R7 20 0 R>>>>

txref = doc.get_new_xref()  # temp xref
doc.update_object(txref, dstring)  # copy sub-dict into it
detail_keys = doc.xref_get_keys(txref)  # extract its keys
doc._deleteObject(txref)  # remove temp object again
print(detail_keys)  # demo output
('Font', 'ProcSet', 'ExtGState')

# now "Resources/Font" etc. can be further inspected

JorjMcKie Nov 6, 2022
Maintainer

To avoid unnecessary growth in the number of temporary xrefs, you can reuse the first one over and over again: its content will be completely overwritten by each update_object().
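
The recipe and the reuse advice could be wrapped in a small home-grown helper. What follows is only a minimal sketch, not part of PyMuPDF: the class name SubDictKeys and its method are hypothetical, and it relies solely on the Document methods already shown in this thread (xref_get_key, get_new_xref, update_object, xref_get_keys). It handles only values that come back with type "dict". As warned earlier, the temporary object stays in the document, so the PDF should not be saved afterwards.

```python
class SubDictKeys:
    """Hypothetical helper: list the keys of nested PDF dictionaries.

    One temporary xref is allocated on first use and then reused,
    because each update_object() call overwrites its content.
    """

    def __init__(self, doc):
        self.doc = doc          # an open PDF document (PyMuPDF Document)
        self.temp_xref = None   # allocated lazily on the first lookup

    def keys(self, xref, key_path):
        """Return the top-level keys of the dict found at key_path."""
        kind, value = self.doc.xref_get_key(xref, key_path)
        if kind != "dict":
            raise ValueError(f"{key_path!r} is of type {kind!r}, not a dict")
        if self.temp_xref is None:
            self.temp_xref = self.doc.get_new_xref()
        # copy the sub-dictionary source into the temporary object
        self.doc.update_object(self.temp_xref, value)
        return self.doc.xref_get_keys(self.temp_xref)
```

With an open document, usage might look like helper = SubDictKeys(doc) followed by helper.keys(page.xref, "Resources") and then helper.keys(page.xref, "Resources/Font"), each call reusing the same temporary xref.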

JorjMcKie Nov 6, 2022
Maintainer

Things are a little more complicated if you have an array containing direct dict objects: [item 1 <<dict>> item3 ...].
Note that <<dict>> in turn may be nested, etc.

JorjMcKie Nov 6, 2022
Maintainer

But the above recipe is also usable here if you correctly locate the dict delimiters.
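
Locating the dict delimiters can be done with a small scanner in plain Python. The sketch below is an illustration under stated assumptions, not PyMuPDF functionality: the function name top_level_dicts is hypothetical, and it assumes the input is an object source string like those returned by doc.xref_get_key() or doc.xref_object(). It tracks the nesting of << and >>, and skips literal strings in parentheses and hex strings in single angle brackets so that their content is not mistaken for delimiters. PDF comments (introduced by %) are not handled.

```python
def top_level_dicts(text):
    """Return the outermost <<...>> substrings found in a PDF object string."""
    found = []
    depth = 0    # current nesting level of << ... >>
    start = -1   # index where the current outermost dict began
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "(":
            # literal string: skip to the matching ")", honoring
            # backslash escapes and balanced nested parentheses
            level = 1
            i += 1
            while i < n and level > 0:
                if text[i] == "\\":
                    i += 1  # skip the escaped character
                elif text[i] == "(":
                    level += 1
                elif text[i] == ")":
                    level -= 1
                i += 1
            continue
        if text.startswith("<<", i):
            if depth == 0:
                start = i
            depth += 1
            i += 2
            continue
        if ch == "<":
            # hex string <...>: skip to its closing ">"
            end = text.find(">", i)
            i = n if end == -1 else end + 1
            continue
        if depth > 0 and text.startswith(">>", i):
            depth -= 1
            i += 2
            if depth == 0:
                found.append(text[start:i])
            continue
        i += 1
    return found
```

Each returned string is a complete dictionary source, so it can be copied into the temporary xref with update_object() and inspected with xref_get_keys() as in the recipe above.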

yossizahn Nov 7, 2022
Author

Thanks for your answers!
