How to rebuild a page instance and put it back to the pdf with the results of page.get_text("json")? #1461

lidh15 · 2021-12-15T16:18:55Z

lidh15
Dec 15, 2021

My use case is simple: I need to remove some blocks from some pages of a PDF file. It's not difficult to get the JSON string and manipulate the blocks in the resulting dict, but afterwards I want to recreate the PDF file. Is there an API available for that?

Answered by JorjMcKie

Dec 15, 2021

JorjMcKie · 2021-12-15T17:36:42Z

JorjMcKie
Dec 15, 2021
Maintainer

Don't take the JSON string, take the "dict" format instead. JSON is derived from it.

You cannot write back to the same page. Instead, make a new page in a new PDF and write (a subset of) the contents of the dictionary to it.
There is an example set of scripts for font replacement, from which you can probably learn a lot.
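
A minimal sketch of that approach, not taken from the font replacement scripts: it reads the "dict" output, skips unwanted blocks, and writes the rest to a new page. The helper names (srgb_to_rgb, rebuild_page, keep_block) are made up for illustration, and using the built-in "helv" font for all text is an assumption, since the original fonts are generally not available by name. Rotated or vertical text is not handled.

```python
def srgb_to_rgb(srgb):
    """Convert an sRGB integer (as in span["color"]) to an (r, g, b) float tuple."""
    return ((srgb >> 16) & 255) / 255, ((srgb >> 8) & 255) / 255, (srgb & 255) / 255


def rebuild_page(src_page, out_doc, keep_block):
    """Copy the blocks accepted by keep_block(block) onto a new page of out_doc."""
    page_dict = src_page.get_text("dict")
    new_page = out_doc.new_page(width=page_dict["width"], height=page_dict["height"])
    for block in page_dict["blocks"]:
        if not keep_block(block):
            continue  # this block is dropped from the rebuilt page
        if block["type"] == 1:
            # Image block: the image file content is in block["image"].
            new_page.insert_image(block["bbox"], stream=block["image"])
            continue
        # Text block: re-insert every span at its original insertion point.
        for line in block["lines"]:
            for span in line["spans"]:
                new_page.insert_text(
                    span["origin"],
                    span["text"],
                    fontsize=span["size"],
                    fontname="helv",  # assumption: fallback font, not the original
                    color=srgb_to_rgb(span["color"]),
                )
    return new_page


# Usage sketch (requires PyMuPDF; the file names are hypothetical):
# import fitz
# src = fitz.open("input.pdf")
# out = fitz.open()  # new, empty PDF
# for page in src:
#     rebuild_page(page, out, lambda b: b["number"] != 0)  # drop the first block
# out.save("output.pdf")
```

Vector drawings are not part of the "dict" output, so a page rebuilt this way contains only the text and images that were kept.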

1 reply

lidh15 Dec 17, 2021
Author

Another question: how do I find which block is on top? I was able to remove a block, but I need to know which block is on top, and is therefore the one actually shown, so that I can remove the others.

lidh15 · 2021-12-17T04:20:55Z

lidh15
Dec 17, 2021
Author

I have another question out of curiosity: how can I convert an image block into a PIL image instance, like the one returned by PIL.Image.open?

12 replies

JorjMcKie Dec 20, 2021
Maintainer

If you modify the JSON output of font replacement like this, the resulting PDF will look good ... and exhibit text that was hidden under those inline images 😎. Here "cjk" is the built-in universal font supporting Chinese.

[
  {
    "oldfont": [
      "\u008b\u00cc\u00e5,Bold",
      "\ufffd\ufffd\ufffd,Bold"
    ],
    "newfont": "cjk",
    "info": "Not embedded!"
  },
  {
    "oldfont": [
      "\u00bf\u00ac\u00cc\u00e5_GB2312,Bold",
      "\ufffd\ufffd\ufffd\ufffd_GB2312,Bold"
    ],
    "newfont": "cjk",
    "info": "Not embedded!"
  },
  {
    "oldfont": [
      "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd,Bold",
      "\u00bb\u00aa\u0087\u00bf\u00ac\u00cc\u00e5,Bold"
    ],
    "newfont": "cjk",
    "info": "Not embedded!"
  }
]
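
If the list is long, the same change can be scripted instead of edited by hand. The following is only a sketch: the function names and the selection rule (every entry whose "info" says it is not embedded) are assumptions for illustration, and the file layout is assumed to be exactly the one shown above.

```python
import json


def use_cjk_for_unembedded(entries, newfont="cjk"):
    """Return a copy of the font list with `newfont` set on non-embedded fonts."""
    result = []
    for entry in entries:
        entry = dict(entry)  # shallow copy, so the input list is left untouched
        # Assumption: non-embedded fonts are recognised by their "info" text.
        if "not embedded" in entry.get("info", "").lower():
            entry["newfont"] = newfont
        result.append(entry)
    return result


def rewrite_font_file(path, newfont="cjk"):
    """Load the JSON font list at `path`, update it, and write it back in place."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(use_cjk_for_unembedded(entries, newfont), f, indent=2)
```

Note that the file must be plain JSON to be loaded this way, so it cannot contain comments.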

lidh15 Dec 20, 2021
Author

Thank you, I followed this and it worked, but I can only open the modified PDF with Chrome. When I tried to open it with Adobe PDF Reader, it said that there were errors on the page and that Adobe Acrobat could not render the page correctly. The reader did somehow open it, but the text written by "fitz.TextWriter" and the images inserted by "page.insert_image" were not visible. Are there any clues as to why this happened?

JorjMcKie Dec 20, 2021
Maintainer

Try doing it again with the non-hacky method (using redactions). It is "official" and less error-prone.
My Adobe has no complaints, so you probably made some kind of error.
Apart from that, I know that Adobe is very picky.
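
For reference, the redaction route could look roughly like the sketch below. The helper name remove_blocks and its should_remove callback are hypothetical; add_redact_annot and apply_redactions are the PyMuPDF page methods, used here with their default settings. Be aware that a redaction removes whatever lies inside the rectangle, not just the one block it was derived from.

```python
def remove_blocks(page, should_remove):
    """Redact the bbox of every block for which should_remove(block) is true.

    Returns the number of redacted blocks.
    """
    blocks = page.get_text("dict")["blocks"]
    rects = [block["bbox"] for block in blocks if should_remove(block)]
    for rect in rects:
        page.add_redact_annot(rect)  # only marks the area for removal
    if rects:
        page.apply_redactions()  # actually removes the marked content
    return len(rects)


# Usage sketch (requires PyMuPDF; the file names are hypothetical):
# import fitz
# doc = fitz.open("input.pdf")
# for page in doc:
#     remove_blocks(page, lambda b: b["type"] == 0 and b["number"] == 0)
# doc.save("output.pdf", garbage=3, deflate=True)
```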

lidh15 Dec 21, 2021
Author

Is that related to the arguments of doc.save()?
I keep using the settings from repl-font.py, and I'm wondering what difference it makes when I use different values for "deflate" and "garbage". What combination of arguments gives the smallest file size?

JorjMcKie Dec 21, 2021
Maintainer

What combination of arguments gives the smallest file size?

There is a save alias called "easy save", ez_save(), with different default settings: garbage=3, deflate=True. This is a good balance between size reduction and save speed. Use garbage=4 to achieve the highest compression (with MuPDF).
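
To see what these settings do for a particular file, the sizes can simply be measured. This is a sketch under assumptions: compare_save_sizes and the open_doc callback are made-up names, and the document is reopened for every variant so that one save cannot influence the next.

```python
import os


def compare_save_sizes(open_doc, stem):
    """Save a fresh copy of the document per setting; return {label: bytes}.

    open_doc is a callable returning a newly opened document,
    e.g. lambda: fitz.open("input.pdf").
    """
    variants = {
        "plain": {},
        "like_ez_save": {"garbage": 3, "deflate": True},
        "garbage_4": {"garbage": 4, "deflate": True},
    }
    sizes = {}
    for label, kwargs in variants.items():
        path = f"{stem}-{label}.pdf"
        doc = open_doc()
        doc.save(path, **kwargs)
        doc.close()
        sizes[label] = os.path.getsize(path)
    return sizes


# Usage sketch (requires PyMuPDF; the file names are hypothetical):
# import fitz
# print(compare_save_sizes(lambda: fitz.open("input.pdf"), "output"))
```

How much each option saves depends on the file, so the numbers are only meaningful for the document being tested.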

JorjMcKie · 2021-12-17T04:25:29Z

JorjMcKie
Dec 17, 2021
Maintainer

how do I find which block is on top?

When MuPDF builds its TextPage object (which I wrap with the same-named Python object), it applies some heuristics when it creates its block-line-span-character hierarchy.
This process leaves the original sequence of the underlying PDF painting instructions intact - at least roughly.

Some minor deviations may happen depending on the flags parameter, which may cause space characters to be generated or general whitespace to be replaced by one or more spaces.

So, in the generated dictionary get_text("dict", ...), the sequence of the text pieces should reflect the sequence of the corresponding original painting commands. If text bboxes overlap, one might conclude that the later one is painted on top of the earlier one.

I also have page.get_bboxlog(), which returns a list of rectangles, each flagged with its "type" (text, drawing, image). Their sequence is indeed the painting sequence.
This gives a more complete picture, i.e. not only regarding text, but all sorts of objects potentially overlapping each other.
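
A small sketch of how that list could be used to find covered items. It assumes each entry has the form (type, (x0, y0, x1, y1)), which is what get_bboxlog() returns by default; the helper names are made up, and overlapping bboxes only suggest, but do not prove, that something is visually hidden.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), share some area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def covering_items(bboxlog):
    """Map each item index to the indices of later-painted items overlapping it.

    Items that nothing is painted over do not appear in the result.
    """
    result = {}
    for i, item in enumerate(bboxlog):
        later = [
            j
            for j in range(i + 1, len(bboxlog))
            if intersects(item[1], bboxlog[j][1])
        ]
        if later:
            result[i] = later
    return result


# Usage sketch (requires PyMuPDF):
# log = page.get_bboxlog()
# for i, on_top in covering_items(log).items():
#     print(log[i][0], "is overlapped by", [log[j][0] for j in on_top])
```

The same intersection test works on the "bbox" values of the blocks from get_text("dict"), with a later block taken to lie on top of an earlier one.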

0 replies
