Extracting Chinese some texts are not right. #2367

buptyyf · 2023-04-23T06:34:39Z

Please provide all mandatory information!

Describe the bug (mandatory)

In one PDF, PyMuPDF extracts some Chinese characters incorrectly.

Code:

import fitz  # PyMuPDF

doc = fitz.open(pdf_file_path)
print(doc[1].get_text())  # text of the second page (index 1)

Many of the extracted characters are wrong.

How can I recognize and fix it?

JorjMcKie · 2023-04-23T06:39:40Z

Maintainer

This is a "Discussions" item, so I will first convert it.

0 replies

JorjMcKie · 2023-04-23T07:33:20Z

Maintainer

This problem is caused by the fonts themselves. Text extraction software depends on the font's back-translation information, which is either contained in the /ToUnicode table or given directly as a standard character encoding scheme, e.g. /WinAnsiEncoding.
Wrong information there cannot be detected as such.
The only thing you can try is OCR, sorry.
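
To make the mechanism concrete, here is a toy model in plain Python. It is only a sketch and not how PyMuPDF or any PDF library stores fonts: the names `glyph_shapes`, `to_unicode_ok`, `to_unicode_bad`, `render` and `extract` are made up for illustration. The point is that drawing a page uses the glyph shapes, while extraction uses only the back-translation table, so a wrong table gives wrong text on a page that still looks correct.

```python
# Toy model, not PyMuPDF internals: a font maps character codes to glyph
# shapes (used for drawing) and, separately, to Unicode (used for extraction).
glyph_shapes = {1: "中", 2: "文"}    # stands in for what is drawn on the page
to_unicode_ok = {1: "中", 2: "文"}   # correct back-translation table
to_unicode_bad = {1: "丑", 2: "乂"}  # wrong table, but still valid Unicode


def render(codes):
    """What a viewer shows: glyph shapes only, the Unicode table is not used."""
    return "".join(glyph_shapes[c] for c in codes)


def extract(codes, to_unicode):
    """What text extraction returns: only the back-translation table is used."""
    # Codes without any mapping become U+FFFD (the replacement character).
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


page_codes = [1, 2]
print(render(page_codes))                    # looks right on screen
print(extract(page_codes, to_unicode_ok))    # right text
print(extract(page_codes, to_unicode_bad))   # wrong text, nothing flags it
```

Nothing in the output of the last call marks it as wrong, which is why such errors cannot be detected from the extracted text alone.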

3 replies

buptyyf Apr 23, 2023
Author

So I think there is no solution: I have no way to detect this problem, so I cannot tell when I should use OCR to fix it.
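
A partial workaround can be sketched, with clear limits. The heuristic below only catches characters that have no usable mapping at all, which typically show up as U+FFFD or as Private Use Area code points. It cannot catch a table that maps to plausible but wrong Chinese characters, as the maintainer points out. The helper names `suspicious_ratio` and `text_with_ocr_fallback` and the threshold of 0.1 are assumptions for illustration. The OCR step assumes a PyMuPDF version that provides `Page.get_textpage_ocr`, plus a Tesseract installation with the `chi_sim` language data.

```python
def suspicious_ratio(text):
    """Share of non-whitespace characters that are U+FFFD or Private Use Area."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars if c == "\ufffd" or 0xE000 <= ord(c) <= 0xF8FF)
    return bad / len(chars)


def text_with_ocr_fallback(page, threshold=0.1, language="chi_sim"):
    """Return the page text, re-reading the page via OCR if it looks broken.

    `page` is expected to be a PyMuPDF page object; `threshold` is an
    arbitrary starting value, not a recommended setting.
    """
    text = page.get_text()
    if suspicious_ratio(text) <= threshold:
        return text
    # OCR reads the rendered glyphs, so it bypasses the font's Unicode mapping.
    textpage = page.get_textpage_ocr(language=language, dpi=300, full=True)
    return page.get_text(textpage=textpage)
```

With the code from the question this would be called as `text_with_ocr_fallback(doc[1])`. For fonts whose mapping is wrong but complete, the ratio stays at zero, and the only option is to run OCR on every page.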

buptyyf Apr 23, 2023
Author

I have another question. Why can the PDF readers on Mac or in Chrome display this PDF correctly?

JorjMcKie Apr 23, 2023
Maintainer

I have another question. Why can the PDF readers on Mac or in Chrome display this PDF correctly?

Displaying this PDF is not the problem. You asked about text extraction. If you create a page pixmap with PyMuPDF, you will get the correct picture.
But if you use e.g. Adobe Acrobat or any other PDF viewer and select the text with the cursor, you will get the same wrong result.

Answer selected by JorjMcKie
