How to remove the contents of the images? #1346

yusheng0104 · 2021-10-27T20:12:16Z

yusheng0104
Oct 27, 2021

Hi,

Is there a method that could remove all the contents of the images in a PDF file? PyMuPDF is sometimes too powerful and picks up some words from images, and my goal is to analyze only the text that is not in an image.

Thanks


JorjMcKie · 2021-10-27T23:36:12Z

JorjMcKie
Oct 27, 2021
Maintainer

I don't think I understand your problem.
There are several ways to ignore images and extract only text: the page method get_text takes an option string as its first argument and a keyword parameter flags, which together control this:

option = "text" extracts only text
switching off TEXT_PRESERVE_IMAGES in flags extracts only text for the other option values ("blocks", "dict", etc.)

Please be more specific.
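
A minimal sketch of the two approaches described above. The helper names are hypothetical, and the constant is defined locally on the assumption that it mirrors fitz.TEXT_PRESERVE_IMAGES (value 4); page is expected to be a PyMuPDF page object or anything else with a compatible get_text method.

```python
# Assumed to mirror fitz.TEXT_PRESERVE_IMAGES (value 4 in PyMuPDF).
TEXT_PRESERVE_IMAGES = 4


def plain_text(page):
    """Option "text": returns a plain string, so no image data is included."""
    return page.get_text("text")


def text_blocks(page, flags):
    """Option "dict" with the image bit cleared from the given flags."""
    data = page.get_text("dict", flags=flags & ~TEXT_PRESERVE_IMAGES)
    # Block type 0 is text, type 1 is image; filter defensively anyway.
    return [block for block in data["blocks"] if block["type"] == 0]
```

With PyMuPDF installed, page could come from something like fitz.open("input.pdf")[0], where the file name is only a placeholder.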

3 replies

yusheng0104 Oct 27, 2021
Author

Thanks. You've answered my question. I was bothered by some weird text from the images.

JorjMcKie Oct 27, 2021
Maintainer

Ok, good! That was easy then 😎

yusheng0104 Oct 28, 2021
Author

Could I also exclude the metadata of the images? The documentation says that TEXT_PRESERVE_IMAGES will return metadata.
Thanks,

JorjMcKie · 2021-10-28T01:36:16Z

JorjMcKie
Oct 28, 2021
Maintainer

If the TEXT_PRESERVE_IMAGES bit is switched off, no image information at all will be included in any output type, and no meta information either.
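
One way to check this on your own file is to count the image blocks in a "dict" style result. This is only a sketch: count_image_blocks is a hypothetical helper, and it assumes the usual layout of that output, where each block carries a "type" key and type 1 marks an image.

```python
def count_image_blocks(page_dict):
    """Count blocks of type 1 (image) in a get_text("dict") style result."""
    blocks = page_dict.get("blocks", [])
    return sum(1 for block in blocks if block.get("type") == 1)
```

Going by the statement above, count_image_blocks(page.get_text("dict", flags=flags)) should come back as 0 whenever flags does not contain the TEXT_PRESERVE_IMAGES bit.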

4 replies

yusheng0104 Oct 28, 2021
Author

Thanks a lot. Could you please teach me how to switch TEXT_PRESERVE_IMAGES off?

JorjMcKie Oct 28, 2021
Maintainer

Use Python's bitwise "OR" operator "|" to combine single bits for the flags argument.
Each text method uses its own default combination if you do not specify flags - see here.
Otherwise use this list to determine the bits you need:

import fitz  # PyMuPDF

# TEXT_PRESERVE_IMAGES is deliberately left out
flags = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE

Then use flags in your text extraction or search function. The above example will ignore all images.
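
The bit arithmetic itself can be tried without PyMuPDF. In this sketch the three constants are defined locally, on the assumption that their values match the PyMuPDF constants of the same names; it shows both building a flags value with "|" and clearing only the image bit from an existing combination with "& ~".

```python
# Values assumed to mirror the PyMuPDF constants of the same names.
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4

# Build a flags value from scratch with OR; the image bit is simply left out.
flags = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE

# Or start from an existing combination and clear only the image bit.
with_images = flags | TEXT_PRESERVE_IMAGES
without_images = with_images & ~TEXT_PRESERVE_IMAGES

print(flags, with_images, without_images)  # prints: 3 7 3
```

Clearing a bit this way is handy when you want to keep a method's default flags and drop only the images.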

yusheng0104 Oct 28, 2021
Author

Thanks. I tried get_text(options = "text", flags =3), but still got text from the images.

JorjMcKie Oct 28, 2021
Maintainer

I am sure there is some misconception: this extraction format never accesses image information!
Can you give me the file?
If you have confidentiality concerns, you may use my private e-mail [email protected].
