Some text characters in the content stream are present in a different format when the content stream is extracted using doc.xref_stream().decode() #1368

meghanaviyyapu · 2021-11-04T08:17:30Z

meghanaviyyapu
Nov 4, 2021

Instead of "fi" I observed /037. Can you please let me know why some characters appear in a different format in the stream?

JorjMcKie · 2021-11-04T08:36:02Z

JorjMcKie
Nov 4, 2021
Maintainer

Some fonts support so-called ligatures. These are single glyphs that represent more than one character. MuPDF supports these 7:

>>> for i in range(7):
...     print(chr(0xfb00 + i))
...
ﬀ
ﬁ
ﬂ
ﬃ
ﬄ
ﬅ
ﬆ
>>>

So in your font, the ligature "ﬁ" (Unicode 0xFB01) is represented by the glyph id 0o37 (an octal number, decimal 31).
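To make this concrete, here is a minimal sketch of how such a code could be turned back into readable text. The names `CODE_TO_UNICODE`, `decode_codes` and `expand_ligatures` are hypothetical and not part of PyMuPDF, and the single mapping entry is an assumption that only reflects the font discussed here. In a real PDF the mapping would come from the font's encoding or its ToUnicode CMap.

```python
import unicodedata

# Hypothetical mapping for illustration only: in the font discussed here,
# character code 0o37 (decimal 31) selects the "fi" ligature, U+FB01.
# Other fonts may use different codes.
CODE_TO_UNICODE = {0o37: "\ufb01"}


def decode_codes(codes: bytes) -> str:
    """Map single-byte character codes to text, falling back to Latin-1."""
    return "".join(CODE_TO_UNICODE.get(b, chr(b)) for b in codes)


def expand_ligatures(text: str) -> str:
    """Split the ligatures U+FB00..U+FB06 into their component letters."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


raw = b"\037nd"  # the bytes of a string operand such as (\037nd)
text = decode_codes(raw)
print(repr(text))              # the ligature character followed by "nd"
print(expand_ligatures(text))  # the same word with the ligature split up
```

The expansion relies on Unicode compatibility decomposition, so it is limited here to the seven ligature code points to avoid altering other characters.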

1 reply

meghanaviyyapu Nov 4, 2021
Author

Okay, thank you for the explanation. Can you please let me know if there are any ligatures apart from the above seven?

JorjMcKie · 2021-11-04T09:21:08Z

JorjMcKie
Nov 4, 2021
Maintainer

Strictly speaking, many languages have ligatures as part of their normal alphabet, like the German "umlauts" ä, ö, ü, etc., which came from ae, oe, ue, or ß => ss, or think of the many special Scandinavian letters.
The Devanagari script used in the Indian subcontinent has lots and lots of them.
So there is no final answer to your question, but the above seven, or even just the first 6, are the usual ones in the extended Latin alphabet.

4 replies

meghanaviyyapu Nov 4, 2021
Author

Okay, thank you.

JorjMcKie Nov 4, 2021
Maintainer

What most people do not know: "w" originated as a ligature of "v + v", and "&" is a ligature made from "et" (Latin for "and").

meghanaviyyapu Nov 4, 2021
Author

Using doc.xref_set_key(), I am trying to add tags to a PDF. Can you let me know what value should be provided for /K for text elements like links and headings?

JorjMcKie Nov 4, 2021
Maintainer

Sorry, I have no experience in this area.
