Strange character converted as space #1329
-
I have a sample PDF containing a special character that looks very much like "ti". When I extract the text in JSON format, only a space character appears at the corresponding position. Is there a way to identify this special character? From my experiments, it does not appear to be a drawing either. Alternatively, is there any encoding information I could obtain to distinguish this PDF file from others?
-
This is a so-called ligature. These are frequent character combinations which are encoded as one single glyph in fonts which support that.
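To illustrate what a ligature looks like on the text side, here is a small sketch using only Python's standard library. It lists the Latin ligatures that have their own Unicode code points (U+FB00 to U+FB06) and expands them into their component letters. The helper name `expand_ligatures` is made up for this example. Note that Unicode has no dedicated code point for a "ti" ligature, so a font glyph like that has no standard single character to map to.

```python
import unicodedata

# Latin ligatures in the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = [chr(cp) for cp in range(0xFB00, 0xFB07)]


def expand_ligatures(text):
    """Replace single-character Latin ligatures with their component letters."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )


# Show each ligature code point, its Unicode name and its expansion.
for lig in LIGATURES:
    print(f"U+{ord(lig):04X} {unicodedata.name(lig)} -> {expand_ligatures(lig)}")
```

Only the ligature characters themselves are normalized, so other text passes through unchanged.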
-
If you want, you can send me the file - or even better just the page, and I will have a look.
-
@Yichen-fqyd - I saw you are using JSON as the output option. Any specific reason for this? You know that this actually is DICT plus conversion to JSON format using the
-
@Yichen-fqyd - I had a look at the file: strange indeed!
import fitz  # PyMuPDF

# `page` is the fitz.Page showing the problem
# choose a high enough dpi, e.g. dpi = 300
dpi = 300
zoom = dpi / 72  # zoom factor, 72 dpi is standard!
matrix = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=matrix)  # render the page as an image
pdfbytes = pix.pdfocr_tobytes()  # OCR the image (needs a Tesseract installation)
ocr_pdf = fitz.open("pdf", pdfbytes)  # this is a 1-page PDF with OCR-ed text
ocr_page = ocr_pdf[0]  # read that page
print(ocr_page.get_text())  # this contains the original text including all the "ti" correctly
This is a so-called ligature. These are frequent character combinations which are encoded as one single glyph in fonts which support that.
As usual: not all fonts support ligatures, and those that do may not support all of them.
I know that MuPDF supports 6 ligatures: fi, fl, ffi, ffl, ff, st. I am afraid there is no support for the "ti" ligature in MuPDF.
But you can try to let MuPDF decompose ligatures. This is controlled by one of the option bits in the flags integer of the text extraction methods: switch off "TEXT_PRESERVE_LIGATURES" in those flags, e.g. set flags=0, and see what happens.