Page.get_text() get messy text. #2134

buptyyf · 2022-12-16T10:02:18Z

Describe the bug (mandatory)

I extracted the text of this test PDF, but some of the text shows up as �. How can I resolve this? Thanks in advance.

>>> doc[0].get_text()
'JANUARY/FEBRUARY 2023 \n����� ���  \n� �������  \n�������\n���� \n������\n������ \n�����\n���  \n������\n�e  \nWorld \nPutin \nMade\n'

JorjMcKie · 2022-12-16T10:09:55Z

Maintainer

A typical "Discussions" post. Let me convert this first.

0 replies

JorjMcKie · 2022-12-16T10:17:35Z

Maintainer

As background reading, I recommend this article on Artifex's blog.

A PDF creator may choose fonts that contain no information about how to translate the visual appearance of characters (glyphs) back to their originating Unicode values.
This so-called CMAP (Character Map) may be missing, either by error or on purpose.
The only way out (see the article) is OCRing the page, or parts of it, which seems advisable in your case. Take a look at this demo script.
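
Characters that cannot be mapped back to Unicode appear in the extracted text as U+FFFD, the replacement character shown as � in the question. As a minimal sketch of how one might decide whether OCR is worth trying, the snippet below measures how much of a string consists of that character. The helper names `replacement_ratio` and `needs_ocr` and the 5% threshold are assumptions chosen for illustration; they are not part of PyMuPDF.

```python
REPLACEMENT_CHAR = "\ufffd"  # same character as chr(65533)


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for c in visible if c == REPLACEMENT_CHAR)
    return bad / len(visible)


def needs_ocr(text: str, threshold: float = 0.05) -> bool:
    """Guess whether the text is damaged enough to justify OCR.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return replacement_ratio(text) > threshold


# Shortened stand-in for the output shown in the question.
sample = "JANUARY/FEBRUARY 2023 \n\ufffd\ufffd\ufffd \nWorld \n"
print(needs_ocr(sample))
```

In a real script, `text` would be the string returned by `page.get_text()`.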

2 replies

buptyyf Dec 16, 2022
Author

I tried this demo script and set the environment variable "TESSDATA_PREFIX", but got an error. I used your method as a reference.

import fitz
import time
from os import environ

environ["TESSDATA_PREFIX"] = "/Users/buptyyf/Documents/workspace/read/wrpypipeline/wrpdfpipeline/tests/ocr-tessdata"

mat = fitz.Matrix(5, 5)  # high resolution matrix
ocr_time = 0
pix_time = 0


def get_tessocr(page, bbox):
    """Return OCR-ed span text using Tesseract.
    Args:
        page: fitz.Page
        bbox: fitz.Rect or its tuple
    Returns:
        The OCR-ed text of the bbox.
    """
    global ocr_time, pix_time  # timing counters updated below; mat is only read
    # Step 1: Make a high-resolution image of the bbox.
    t0 = time.perf_counter()
    pix = page.get_pixmap(
        matrix=mat,
        clip=bbox,
    )
    t1 = time.perf_counter()
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    text = ocrpage.get_text()
    if text.endswith("\n"):
        text = text[:-1]
    t2 = time.perf_counter()
    ocr_time += t2 - t1
    pix_time += t1 - t0
    return text


doc = fitz.open("./testdata/FA 2023 Jan-Feb.pdf")
pages = [page for page in doc][0:30]
ocr_count = 0
for page in doc:
    blocks = page.get_text("dict", flags=0)["blocks"]
    for b in blocks:
        for l in b["lines"]:
            for s in l["spans"]:
                text = s["text"]
                if chr(65533) in text:  # invalid characters encountered!
                    # invoke OCR
                    ocr_count += 1
                    print("before: '%s'" % text)
                    text1 = text.lstrip()
                    sb = " " * (len(text) - len(text1))  # leading spaces
                    text1 = text.rstrip()
                    sa = " " * (len(text) - len(text1))  # trailing spaces
                    new_text = sb + get_tessocr(page, s["bbox"]) + sa
                    print(" after: '%s'" % new_text)

print("-------------------------")
print("OCR invocations: %i." % ocr_count)
print(
    "Pixmap time: %g (avg %g) seconds."
    % (round(pix_time, 5), round(pix_time / ocr_count, 5))
)
print(
    "OCR time: %g (avg %g) seconds."
    % (round(ocr_time, 5), round(ocr_time / ocr_count, 5))
)

/Users/buptyyf/Documents/workspace/read/wrpypipeline/wrpdfpipeline/tests/test_reflow.py in line 29, in get_tessocr(page, bbox)
     245 pix = page.get_pixmap(
     246     matrix=mat,
     247     clip=bbox,
     248 )
     249 t1 = time.perf_counter()
---> 250 ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
     251 ocrpage = ocrpdf[0]
     252 text = ocrpage.get_text()

File ~/.pyenv/versions/3.10.7/envs/wrpdf_proj/lib/python3.10/site-packages/fitz/fitz.py:6933, in Pixmap.pdfocr_tobytes(self, compress, language)
   6921 """Save pixmap as an OCR-ed PDF page.
   6922 
   6923 Args:
   (...)
   6930     Tesseract's language support data.
...
-> 6933     raise RuntimeError("No OCR support: TESSDATA_PREFIX not set")
   6934 EnsureOwnership(self)
   6935 from io import BytesIO

RuntimeError: No OCR support: TESSDATA_PREFIX not set

JorjMcKie Dec 16, 2022
Maintainer

You did not read the documentation!
You cannot successfully set TESSDATA_PREFIX inside the script; it must be done at the OS level.
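
A small pre-flight check can make this failure easier to diagnose. The sketch below assumes the variable was set in the shell that starts Python, for example with `export TESSDATA_PREFIX=/path/to/tessdata` on Linux or macOS (the path is a placeholder). The function name `tessdata_dir` is hypothetical; it only inspects the environment and does not configure anything in PyMuPDF.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def tessdata_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the folder named by TESSDATA_PREFIX, or raise a clear error.

    Hypothetical helper: it reads the given mapping (os.environ by default)
    and checks that the folder exists. It does not talk to PyMuPDF.
    """
    env = os.environ if env is None else env
    value = env.get("TESSDATA_PREFIX")
    if not value:
        raise RuntimeError(
            "TESSDATA_PREFIX is not set; set it at OS level before starting Python"
        )
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX points to a missing folder: {path}")
    return path


if __name__ == "__main__":
    try:
        print("Tesseract language data:", tessdata_dir())
    except RuntimeError as exc:
        print("OCR not available:", exc)
```

This check would also pass if the variable were assigned inside the script, which, according to the answer above, is not enough. It is only useful for confirming that the OS-level setting reached the Python process.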
