PDF type: scanned or digitally created? #1853

Amitdedhia6 · 2022-08-03T05:21:42Z


Hi

This is actually a query. I am new to the PDF world. Is there a way to tell whether a PDF document is a scanned document (i.e. it was created by scanning a physical paper document), a digitally created document, or an OCR document (i.e. a scanned document that also contains text generated by an OCR process)?

Thanks
Amit

JorjMcKie · 2022-08-03T07:42:14Z

Maintainer

This is detectable, sometimes with uncertainties, though.

You can have pages with regular text and other pages that are scanned images and may or may not contain text generated by some OCR engine.
You can have PDFs consisting exclusively of image pages - no text whatsoever.
You can have pages with regular text plus images on the page for which there also exists OCRed text.

Text generated by OCR engines is stored so that it is invisible. Again, there are several options:

store the text "underneath" the scanned image with otherwise regular properties, i.e. a regular font like Helvetica and color black
store the text "hidden" using a special PDF attribute, "text rendering mode 3" - command 3 Tr.

Some scanners (hardware) have a built-in OCR feature and thus produce an OCR PDF. In this case the PDF metadata fields may contain relevant information (e.g. the name of the producer).
OCR software may also proudly record itself in the metadata.

If there is hidden text (3 Tr), it usually means the text was generated by some OCR engine. But the PDF creator may also have wanted to store "secret" / unreadable information, for whatever reason.

Presumably you are sufficiently confused by now.

There are the following checks you can make:

In [4]: import fitz  # PyMuPDF
In [5]: doc=fitz.open("ocr.pdf")  # open your PDF
In [6]: doc.metadata  # look at the metadata
Out[6]:
{'format': 'PDF 1.7',
 'title': 'Untitled',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'ocrmypdf 9.6.0+dfsg / Tesseract OCR-PDF 4.1.1',  # <== OCRed with Tesseract!
 'producer': 'pikepdf 1.10.3+dfsg',
 'creationDate': "20220803030920-04'00'",
 'modDate': "20220803070920+00'00'",
 'trapped': '',
 'encryption': None}
In [7]: page=doc[0]  # look at some page
In [8]: page.get_fonts()  # check the text fonts used by it
Out[8]: [(8, 'ttf', 'Type0', 'VMIDCL+GlyphLessFont', 'R9', 'Identity-H')]
In [9]: # Tesseract uses this special font: "GlyphLessFont"
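
The two manual checks above (metadata and font names) could be wrapped in a small helper. This is only a sketch: the function name `ocr_hints` and the marker lists are assumptions made for illustration, not part of PyMuPDF, and plain substring matching is crude (for example, "ocr" also matches unrelated words). It works on plain data shaped like `doc.metadata` and `page.get_fonts()`, so it can be tried without a PDF at hand.

```python
# Hypothetical marker lists: illustrative assumptions, far from exhaustive.
OCR_TOOL_MARKERS = ("ocr", "tesseract")
OCR_FONT_MARKERS = ("glyphlessfont",)


def ocr_hints(metadata, fonts):
    """Collect hints that OCR software was involved.

    metadata: dict shaped like doc.metadata
    fonts:    list of tuples shaped like page.get_fonts(); item 3 is the font name
    """
    hints = []
    for key in ("creator", "producer"):
        value = (metadata.get(key) or "").lower()  # fields may be missing or None
        if any(marker in value for marker in OCR_TOOL_MARKERS):
            hints.append(f"{key}: {metadata[key]}")
    for font in fonts:
        name = font[3]
        if any(marker in name.lower() for marker in OCR_FONT_MARKERS):
            hints.append(f"font: {name}")
    return hints


# Values copied from the session above.
metadata = {
    "creator": "ocrmypdf 9.6.0+dfsg / Tesseract OCR-PDF 4.1.1",
    "producer": "pikepdf 1.10.3+dfsg",
}
fonts = [(8, "ttf", "Type0", "VMIDCL+GlyphLessFont", "R9", "Identity-H")]
for hint in ocr_hints(metadata, fonts):
    print(hint)
```

An empty result proves nothing: metadata can be missing or rewritten, so this is best treated as one signal among several.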

You can check whether the page is (almost) completely covered by some image:

In [10]: # look at the page rectangle:
In [11]: page.rect
Out[11]: Rect(0.0, 0.0, 595.0, 842.0)
In [12]: # check which images are on that page:
In [13]: page.get_images()
Out[13]: [(15, 0, 2481, 3508, 8, 'DeviceRGB', '', 'R12', 'FlateDecode')]
In [14]: # check area covered by image at xref 15:
In [15]: page.get_image_rects(15)
Out[15]: [Rect(0.0, 0.35089111328125, 595.0, 841.64892578125)]

Here we see that the image at xref 15 covers more or less the full page.
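
To turn "covers more or less the full page" into a number, one could compare areas. The sketch below is an assumption-laden illustration: `coverage_ratio` is a hypothetical helper, and it takes rectangles as plain `(x0, y0, x1, y1)` tuples, such as `tuple(page.rect)` and the rectangles returned by `page.get_image_rects(xref)` converted with `tuple()`.

```python
def coverage_ratio(page_rect, image_rects):
    """Fraction of the page area covered by the largest single image (0.0 to 1.0)."""
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        return 0.0
    best = 0.0
    for x0, y0, x1, y1 in image_rects:
        # Clip the image rectangle to the page before measuring it.
        width = min(x1, px1) - max(x0, px0)
        height = min(y1, py1) - max(y0, py0)
        if width > 0 and height > 0:
            best = max(best, width * height / page_area)
    return best


# Values copied from the session above.
page_rect = (0.0, 0.0, 595.0, 842.0)
image_rects = [(0.0, 0.35089111328125, 595.0, 841.64892578125)]
ratio = coverage_ratio(page_rect, image_rects)
print(f"largest image covers {ratio:.1%} of the page")
```

Which ratio counts as "full page" is a judgment call; a threshold such as 0.9 is a guess rather than a documented value.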

We can also check whether there exists hidden text on the page and inside which rectangles:

In [16]: # check which objects on the page cover which areas:
In [17]: page.get_bboxlog()
Out[17]:
[('ignore-text',
  (69.7477035522461, 36.211856842041016, 525.49169921875, 768.096923828125)),
 ('fill-image', (0.0, 0.35089111328125, 595.0, 841.64892578125))]

This page has 2 objects:

'ignore-text' - indicates hidden text (3 Tr), a strong indicator for OCRed text
'fill-image' - an image, covering (almost) the full page

The order of these items also means that the text is painted first, then the image. Because the image's bounding box (bbox) contains the text bbox, the image covers the text and makes it invisible, which gives us a second strong indicator for OCR.
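
Both indicators (hidden text, and text covered by an image that is painted later) can be checked mechanically. A minimal sketch, assuming the input is a list shaped like the output of `page.get_bboxlog()`; the helper names `contains` and `text_hidden_by_images` and the 1-point tolerance are made up for this example.

```python
def contains(outer, inner, tol=1.0):
    """True if rectangle `inner` lies inside `outer`, allowing `tol` points of slack."""
    return (outer[0] - tol <= inner[0] and outer[1] - tol <= inner[1]
            and inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)


def text_hidden_by_images(bboxlog):
    """Return (type, bbox, covered) for text that is hidden (3 Tr) or covered.

    `covered` is True when an image painted later contains the text bbox.
    """
    found = []
    for i, (kind, bbox) in enumerate(bboxlog):
        if not kind.endswith("-text"):
            continue
        # Only entries after the text in the list are painted on top of it.
        later_images = [b for k, b in bboxlog[i + 1:] if k == "fill-image"]
        covered = any(contains(image, bbox) for image in later_images)
        if kind == "ignore-text" or covered:
            found.append((kind, bbox, covered))
    return found


# Values copied from the session above.
bboxlog = [
    ("ignore-text",
     (69.7477035522461, 36.211856842041016, 525.49169921875, 768.096923828125)),
    ("fill-image", (0.0, 0.35089111328125, 595.0, 841.64892578125)),
]
for kind, bbox, covered in text_hidden_by_images(bboxlog):
    print(kind, "covered by a later image:", covered)
```

Checking coverage separately from the 'ignore-text' type also catches the other storage option mentioned earlier, where OCR text with regular properties sits underneath the scanned image.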

For comparison, look at the following example, which contains regular text and an image that (practically) do not overlap:

In [21]: page.get_bboxlog()
Out[21]:
[('fill-text',
  (102.67399597167969,
   211.9237823486328,
   113.07799530029297,
   300.5159912109375)),
 ('fill-image', (300.0, 300.0, 400.0, 400.0))]

"fill-text" means regular, visible text. For a full list of bbox types please look at the documentation of Page.get_bboxlog().
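
Putting the pieces together, a rough per-page answer to the original question might look like the sketch below. Everything in it is a heuristic assumption: the function name `classify_page`, the labels, and the 0.9 coverage threshold are invented for illustration, and any text type other than 'ignore-text' is simply treated as visible. As said at the start, some uncertainty always remains, hence the "unclear" label.

```python
def classify_page(bboxlog, page_rect, min_cover=0.9):
    """Guess a page type from data shaped like page.get_bboxlog() and tuple(page.rect).

    Returns "ocr", "scanned", "digital" or "unclear".
    """
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)

    def covers_page(bbox):
        x0, y0, x1, y1 = bbox
        return page_area > 0 and (x1 - x0) * (y1 - y0) / page_area >= min_cover

    has_hidden_text = any(kind == "ignore-text" for kind, _ in bboxlog)
    has_visible_text = any(
        kind.endswith("-text") and kind != "ignore-text" for kind, _ in bboxlog
    )
    full_page_image = any(
        kind == "fill-image" and covers_page(bbox) for kind, bbox in bboxlog
    )

    if full_page_image and has_hidden_text:
        return "ocr"       # scanned image plus hidden (3 Tr) text
    if full_page_image and not has_visible_text:
        return "scanned"   # image only, no text at all
    if has_visible_text and not full_page_image:
        return "digital"   # regular text, no page-sized image
    return "unclear"       # e.g. visible text together with a page-sized image


page_rect = (0.0, 0.0, 595.0, 842.0)
digital_example = [
    ("fill-text",
     (102.67399597167969, 211.9237823486328, 113.07799530029297, 300.5159912109375)),
    ("fill-image", (300.0, 300.0, 400.0, 400.0)),
]
print(classify_page(digital_example, page_rect))
```

A document-level verdict would need this applied to every page, since a single PDF can mix all of these page types.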
