PDF type: scanned or digitally created? #1853
-
Hi, this is actually a query. I am new to the PDF world. Is there a way to know whether a PDF document is a scanned document (i.e. it was created by scanning a physical paper document), a digitally created document, or an OCR document (i.e. a scanned document containing text that was generated by an OCR process)? Thanks
Replies: 1 comment
-
This is detectable, though sometimes only with some uncertainty.
Text generated by OCR engines is stored as hidden text (PDF text rendering mode 3, written "3 Tr" in the page's content stream). Again, there are several possibilities:
Some scanners (hardware) have a built-in OCR feature and thus produce an OCR PDF. In this case, the PDF metadata fields may contain a hint (name of the producer).
If there exists hidden text (
Presumably you are sufficiently confused by now. Here are the checks you can make:
In [5]: doc=fitz.open("ocr.pdf") # open your PDF
In [6]: doc.metadata # look at the metadata
Out[6]:
{'format': 'PDF 1.7',
 'title': 'Untitled',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'ocrmypdf 9.6.0+dfsg / Tesseract OCR-PDF 4.1.1', # <== OCRed with Tesseract!
 'producer': 'pikepdf 1.10.3+dfsg',
 'creationDate': "20220803030920-04'00'",
 'modDate': "20220803070920+00'00'",
 'trapped': '',
 'encryption': None}
In [7]: page=doc[0] # look at some page
In [8]: page.get_fonts() # check the text fonts used by it
Out[8]: [(8, 'ttf', 'Type0', 'VMIDCL+GlyphLessFont', 'R9', 'Identity-H')]
In [9]: # Tesseract uses this special font: "GlyphLessFont"
You can check whether the page is (almost) completely covered by some image:
In [10]: # look at the page rectangle:
In [11]: page.rect
Out[11]: Rect(0.0, 0.0, 595.0, 842.0)
In [12]: # check which images are on that page:
In [13]: page.get_images()
Out[13]: [(15, 0, 2481, 3508, 8, 'DeviceRGB', '', 'R12', 'FlateDecode')]
In [14]: # check area covered by image at xref 15:
In [15]: page.get_image_rects(15)
Out[15]: [Rect(0.0, 0.35089111328125, 595.0, 841.64892578125)]
Here we see that the image at xref 15 covers more or less the full page. We can also check whether there is hidden text on the page, and inside which rectangles:
In [16]: # check which objects on the page cover which areas:
In [17]: page.get_bboxlog()
Out[17]:
[('ignore-text',
  (69.7477035522461, 36.211856842041016, 525.49169921875, 768.096923828125)),
 ('fill-image', (0.0, 0.35089111328125, 595.0, 841.64892578125))]
This page has 2 objects: hidden text ("ignore-text") and an image ("fill-image") that covers more or less the full page.
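The "covers more or less the full page" judgment can also be made numerically. The following is only a minimal sketch: the helper names rect_area and image_coverage and the 0.95 threshold are assumptions for illustration, not part of PyMuPDF. It works on plain (x0, y0, x1, y1) tuples, using the values from the session above.

```python
def rect_area(rect):
    """Area of an (x0, y0, x1, y1) rectangle; empty or inverted ones count as 0."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def image_coverage(page_rect, image_rect):
    """Fraction of the page area covered by the image rectangle (0.0 to 1.0)."""
    px0, py0, px1, py1 = page_rect
    ix0, iy0, ix1, iy1 = image_rect
    # Clip the image rectangle to the page before measuring it.
    clipped = (max(px0, ix0), max(py0, iy0), min(px1, ix1), min(py1, iy1))
    page_area = rect_area(page_rect)
    return rect_area(clipped) / page_area if page_area else 0.0


# Values copied from the session above (page.rect and page.get_image_rects(15)[0]).
page_rect = (0.0, 0.0, 595.0, 842.0)
image_rect = (0.0, 0.35089111328125, 595.0, 841.64892578125)

ratio = image_coverage(page_rect, image_rect)
is_full_page_image = ratio >= 0.95  # assumed threshold, tune it for your documents
print(round(ratio, 4), is_full_page_image)
```

With a real page, tuple(page.rect) and tuple(r) for each r in page.get_image_rects(xref) should give the inputs, since PyMuPDF rectangles behave like 4-item sequences.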
For comparison, look at the following example, which contains regular text and an image that (practically) do not overlap:
In [21]: page.get_bboxlog()
Out[21]:
[('fill-text',
  (102.67399597167969,
   211.9237823486328,
   113.07799530029297,
   300.5159912109375)),
 ('fill-image', (300.0, 300.0, 400.0, 400.0))]
"fill-text" means regular, visible text. For a full list of bbox types, please look at the documentation.
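The checks in this answer could be combined into a rough per-page guess. The sketch below is an illustration under stated assumptions, not a definitive classifier: the function names, the OCR_TOOL_HINTS list, and the returned labels are all hypothetical, and only the three bbox types shown above are considered. It takes the plain Python values returned by doc.metadata, page.get_fonts() and page.get_bboxlog().

```python
OCR_TOOL_HINTS = ("tesseract", "ocrmypdf")  # assumed, non-exhaustive list


def has_ocr_hints(metadata, fonts):
    """True if the metadata or the font names point to a known OCR tool."""
    creator = (metadata.get("creator") or "").lower()
    producer = (metadata.get("producer") or "").lower()
    if any(hint in creator or hint in producer for hint in OCR_TOOL_HINTS):
        return True
    # Items from page.get_fonts() carry the font name at index 3.
    return any("glyphlessfont" in str(font[3]).lower() for font in fonts)


def classify_page(bboxlog):
    """Rough guess from the bbox types reported by page.get_bboxlog()."""
    kinds = {entry[0] for entry in bboxlog}
    has_hidden_text = "ignore-text" in kinds
    has_visible_text = "fill-text" in kinds
    has_image = "fill-image" in kinds
    if has_visible_text:
        return "digital"
    if has_image and has_hidden_text:
        return "ocr"
    if has_image:
        return "scanned"
    return "unknown"


# Sample data copied from the two sessions above.
ocr_log = [
    ("ignore-text", (69.7477035522461, 36.211856842041016, 525.49169921875, 768.096923828125)),
    ("fill-image", (0.0, 0.35089111328125, 595.0, 841.64892578125)),
]
digital_log = [
    ("fill-text", (102.67399597167969, 211.9237823486328, 113.07799530029297, 300.5159912109375)),
    ("fill-image", (300.0, 300.0, 400.0, 400.0)),
]
fonts = [(8, "ttf", "Type0", "VMIDCL+GlyphLessFont", "R9", "Identity-H")]

print(classify_page(ocr_log))      # ocr
print(classify_page(digital_log))  # digital
print(has_ocr_hints({"creator": "", "producer": ""}, fonts))  # True
```

A "scanned" or "ocr" result is only a hint: it should be combined with the check that the image actually covers (almost) the whole page, and a document can mix page types, so each page needs its own check.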