Fully embedded font is extracted only partially if it occupies more than one objects #2111

sedimentation-fault · 2022-12-08T19:23:45Z

sedimentation-fault
Dec 8, 2022

Description

I have a PDF in which, according to my PDF reader, some fonts are fully embedded, so I decided to test my freshly installed PyMuPDF 1.21.0-rc2 on it. It reported that it "saved 16 fonts to" my working directory, but I found only 11 files there. The font list from pdffonts made the reason clear: some of the "fully" embedded fonts use more than one object, each storing a piece of the font. A check with mutool confirmed that PyMuPDF extracted only the last object to font-name.cff, probably because the output file (font-name.cff) is the same for all pieces/objects of the font named font-name.

How To Reproduce

You must have one of those PDFs that embed a font fully by storing pieces of it in multiple objects. Let's say x.pdf is one of them. pdffonts lists the fonts of x.pdf as follows:

pdffonts x.pdf

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Times-Roman                          Type 1C           MacRoman         yes no  no     567  0
CAJCNM+intiri                        Type 1C           Custom           yes yes yes    593  0
CAJCMM+intirr                        Type 1C           Custom           yes yes yes    595  0
CAJCON+intirsc                       Type 1C           Custom           yes yes yes    594  0
intirsc                              Type 1C           WinAnsi          yes no  no     596  0
DFMPPD+Springnew-Regular             Type 1C           Custom           yes yes yes    576  0
Times-Roman                          Type 1C           Custom           yes no  no     569  0
Times-Italic                         Type 1C           Custom           yes no  no     568  0
Times-Bold                           Type 1C           WinAnsi          yes no  no     321  0
Times-Roman                          Type 1C           Custom           yes no  no     306  0
Times-Italic                         Type 1C           Custom           yes no  no     315  0
Times-BoldItalic                     Type 1C           WinAnsi          yes no  no     350  0
MT2SYS                               Type 1C           Custom           yes no  yes    379  0
MT2MIT                               Type 1C           MacRoman         yes no  no     392  0
MT2MIS                               Type 1C           WinAnsi          yes no  no     332  0
Times-Bold                           Type 1C           WinAnsi          yes no  no     311  0

You can see that, for example, Times-Roman has 'no' in the "sub" (subsetted) column, meaning it is NOT subsetted - therefore it is fully embedded. You can also see that it occupied objects with numbers 567, 569 and 306.
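To see at a glance which names are shared by several objects, one could group the listing by font name. This is only a sketch: the (name, object number) pairs are copied by hand from the pdffonts output above, and `FONTS`, `objects_by_name` and `shared` are illustrative names, not part of any tool.

```python
from collections import defaultdict

# (font name, object number) pairs copied from the pdffonts listing above.
FONTS = [
    ("Times-Roman", 567), ("CAJCNM+intiri", 593), ("CAJCMM+intirr", 595),
    ("CAJCON+intirsc", 594), ("intirsc", 596),
    ("DFMPPD+Springnew-Regular", 576), ("Times-Roman", 569),
    ("Times-Italic", 568), ("Times-Bold", 321), ("Times-Roman", 306),
    ("Times-Italic", 315), ("Times-BoldItalic", 350), ("MT2SYS", 379),
    ("MT2MIT", 392), ("MT2MIS", 332), ("Times-Bold", 311),
]

# Group the object numbers by font name.
objects_by_name = defaultdict(list)
for name, objnum in FONTS:
    objects_by_name[name].append(objnum)

# A name used by more than one object collides if the name alone
# is used as the output filename.
shared = {n: nums for n, nums in objects_by_name.items() if len(nums) > 1}
for name, nums in shared.items():
    print(name, nums)
```

For this listing the shared names are Times-Roman, Times-Italic and Times-Bold.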

mutool shows practically the same with its 'info' command:

mutool info -F x.pdf

Fonts (16):
        2       (820 0 R):      Type1 'Times-Roman' MacRomanEncoding (567 0 R)
        3       (819 0 R):      Type1 'CAJCNM+intiri' (593 0 R)
        3       (819 0 R):      Type1 'CAJCMM+intirr' (595 0 R)
        3       (819 0 R):      Type1 'CAJCON+intirsc' (594 0 R)
        3       (819 0 R):      Type1 'intirsc' WinAnsiEncoding (596 0 R)
        4       (301 0 R):      Type1 'DFMPPD+Springnew-Regular' (576 0 R)
        5       (298 0 R):      Type1 'Times-Roman' WinAnsiEncoding (569 0 R)
        5       (298 0 R):      Type1 'Times-Italic' WinAnsiEncoding (568 0 R)
        6       (295 0 R):      Type1 'Times-Bold' WinAnsiEncoding (321 0 R)
        6       (295 0 R):      Type1 'Times-Roman' WinAnsiEncoding (306 0 R)
        6       (295 0 R):      Type1 'Times-Italic' WinAnsiEncoding (315 0 R)
        10      (292 0 R):      Type1 'Times-BoldItalic' WinAnsiEncoding (350 0 R)
        22      (47 0 R):       Type1 'MT2SYS' (379 0 R)
        30      (798 0 R):      Type1 'MT2MIT' MacRomanEncoding (392 0 R)
        207     (620 0 R):      Type1 'MT2MIS' WinAnsiEncoding (332 0 R)
        220     (134 0 R):      Type1 'Times-Bold' WinAnsiEncoding (311 0 R)

We could use mutool extract x.pdf, but that is not user-friendly, as it a) extracts both all fonts and all images and b) it extracts fonts as font-XXXX.cff (or, possibly, font-XXXX.ttf), where XXXX has no relation to the font object number (contrary to what its documentation claims), so you practically don't know which file is which font, unless you open each one of them, or at least read its metadata somehow.

Enter PyMuPDF, which promises to a) extract all fonts and b) give the extracted files sensible names. Alas, trying it on our x.pdf results in just 11 fonts, contrary to the claimed 16:

python -m fitz extract -fonts x.pdf

saved 16 fonts to ...

ls -l | awk -e '{print $5,$9}'

24759 CAJCMM+intirr.cff
25860 CAJCNM+intiri.cff
23898 CAJCON+intirsc.cff
1295 DFMPPD+Springnew-Regular.cff
217 MT2MIS.cff
506 MT2MIT.cff
286 MT2SYS.cff
17076 Times-Bold.cff
18332 Times-BoldItalic.cff
18302 Times-Italic.cff
24847 Times-Roman.cff

(first column is file size)

What has happened? Comparing this to the output of mutool extract

mutool extract x.pdf
...
ls -l font-* | awk -e '{print $5, $9}'

217 font-0330.cff
506 font-0522.cff
286 font-0534.cff
18332 font-0549.cff
18302 font-0557.cff
17076 font-0558.cff
25078 font-0561.cff
26087 font-0564.cff
1295 font-0574.cff
23496 font-0579.cff
24759 font-0583.cff
23898 font-0587.cff
25860 font-0591.cff
24847 font-0599.cff

and looking carefully at the file sizes (first column), we see that Times-Roman.cff (the Times-Roman font as extracted by PyMuPDF) is exactly font-0599.cff (a font extracted by mutool, whose object number is NOT 599 (there is no font object with such a number in x.pdf)) - but this is only one of the three pieces (objects) that store parts of Times-Roman!

Your configuration (mandatory)

Operating system: Gentoo
Python version: 3.10
PyMuPDF version: 1.21.0-rc2, installation method: generated from source, using installed mupdf 1.21.0.

More precisely:

python -c 'import sys; import fitz; print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)'

3.10.0 (default, Feb 11 2022, 00:50:04) [GCC 11.2.0] 
 linux 
 
PyMuPDF 1.21.0rc2: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-07 00:00:01.
Built for Python 3.10 on linux (64-bit).

JorjMcKie · 2022-12-08T19:41:02Z

JorjMcKie
Dec 8, 2022
Maintainer

This seems to be a duplicate of #2109 - however I am not really sure.
As I wrote there:
Please provide a reproducer file and a clear description of result vs. your expectation.

0 replies

sedimentation-fault · 2022-12-08T20:06:14Z

sedimentation-fault
Dec 8, 2022
Author

Please delete 2109 - I tried to correct it while it was on its way to the server and didn't realize it had already been created. Sorry about that.

I had also written down my expectation, but it somehow got deleted while I was writing...

My expectation is that PyMuPDF assembles the three pieces into one. If that's too difficult (or not possible due to limitations of mupdf/mutool), at the very minimum use suffixes to write the various pieces in their own files, e.g. Times-Roman-1.cff, Times-Roman-2.cff, Times-Roman-3.cff.

I cannot provide a clearer description of the result than the one I have already given.
To put it in other words: the Times-Roman font consists of three objects. PyMuPDF does this:

extract Times Roman object 1 to Times-Roman.cff
extract Times Roman object 2 to Times-Roman.cff - at this point Times-Roman.cff is overwritten by the contents of object 2
extract Times Roman object 3 to Times-Roman.cff - at this point Times-Roman.cff is overwritten by the contents of object 3

Then it tells the user that "all 16" fonts have been extracted, but the user sees only, say, 11 - because some .cff files were overwritten in the extraction process.
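The overwrite sequence described above can be reproduced without any PDF at all. The following is a minimal sketch, assuming three placeholder byte strings stand in for the three Times-Roman objects (they are not real CFF data), and using the object number as one possible suffix:

```python
import os
import tempfile

# Stand-ins for three font objects that share one name. The byte strings
# are placeholders, not real CFF data.
pieces = [(567, b"object-567"), (569, b"object-569"), (306, b"object-306")]

def write_by_name(out_dir):
    """Name files by font name only: each write replaces the previous one."""
    for _xref, data in pieces:
        with open(os.path.join(out_dir, "Times-Roman.cff"), "wb") as f:
            f.write(data)
    return sorted(os.listdir(out_dir))

def write_by_name_and_xref(out_dir):
    """Append the object number: every object keeps its own file."""
    for xref, data in pieces:
        with open(os.path.join(out_dir, f"Times-Roman-{xref}.cff"), "wb") as f:
            f.write(data)
    return sorted(os.listdir(out_dir))

with tempfile.TemporaryDirectory() as d:
    by_name = write_by_name(d)
with tempfile.TemporaryDirectory() as d:
    by_xref = write_by_name_and_xref(d)

print(len(by_name), by_name)   # one file survives three writes
print(len(by_xref), by_xref)   # three writes, three files
```

A counter that increments on every write would report three "saved" fonts in both cases, which is how the reported total and the number of files on disk can diverge.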

I will try to send you the PDF to your outlook account, as I cannot put it publicly here.

0 replies

sedimentation-fault · 2022-12-08T20:10:25Z

sedimentation-fault
Dec 8, 2022
Author

BTW, Times Roman object 1/2/3 are not fully embedded Times-Roman fonts. They are parts of the font that together form a fully embedded font. So overwriting Times-Roman.cff each time with the next object that happens to say "I keep data for 'Times-Roman' font" destroys parts of the font that we want to extract.

0 replies

JorjMcKie · 2022-12-08T20:32:08Z

JorjMcKie
Dec 8, 2022
Maintainer

My expectation is that PyMuPDF assembles the three pieces into one. If that's too difficult (or not possible due to limitations of mupdf/mutool), at the very minimum use suffixes to write the various pieces in their own files, e.g. Times-Roman-1.cff, Times-Roman-2.cff, Times-Roman-3.cff.

This is either not possible or clearly beyond the intended scope of PyMuPDF. Features like this one should be looked for in dedicated font packages like fontTools.
But in the general case, your expectation goes beyond what is possible - it is not a "restriction" of whatever tool.

What I suspect is really your problem instead: You extract font names without their subset identifier ABCDEF+. So your script treats different subsets of the same font as one and thus overwrites a previously extracted other subset font.
So why don't you do fitz.TOOLS.set_subset_fontnames(True) before the first text extraction - if this is what you were doing: unfortunately you didn't mention that, so I am forced to guess.

For the time being, I will convert this post from an issue to a "Discussions" item.

0 replies

sedimentation-fault · 2022-12-08T23:05:26Z

sedimentation-fault
Dec 8, 2022
Author

I did not write any script, I just used fitz the way it is supposed to be run:

python -m fitz extract -fonts x.pdf

I wrote this command in my post above, together with its output. I would expect this command to extract the fonts correctly, that's all.

0 replies

sedimentation-fault · 2022-12-08T23:32:48Z

sedimentation-fault
Dec 8, 2022
Author

And, in doing the above, I was just following the instructions at
Module fitz — PyMuPDF 1.21.0 documentation - Read the Docs
in section "Extracting Fonts and Images", where it even goes further to say that

Except for output directory creation, this feature[1] is functionally equivalent to and obsoletes this script.

This suggests that now you don't even have to write a script to extract fonts (or images); it is all done by the extract command of the fitz module. Well, if it really is so, then I do expect it to extract fonts correctly, to not overwrite any .cff, and to present me the (fully embedded) font as a complete file and not as some part of it that happened to be the last one it encountered.
This IS a bug!

[1] i.e. the command python -m fitz extract .

1 reply

JorjMcKie Dec 8, 2022
Maintainer

I finally got it.
The real problem seems to be that your PDF includes different fonts under the same name but under different xrefs.
To cope with this, that extraction method must include the xref in the name as a differentiator.
So please do this (always required for a bug submission):
Enter a new (bug) issue named something like "'fitz' module: Include xref in font name", and supply an example PDF to reproduce the problem.

FYI: Fonts cannot be distributed across different PDF objects. Every font item is a font in its own right - even if it is a subset font, which means no more and no less than that it contains a subset of the characters of its original.

sedimentation-fault · 2022-12-09T10:08:11Z

sedimentation-fault
Dec 9, 2022
Author

and supply an example PDF to reproduce the problem.

I have sent you an example PDF to your outlook address already - didn't you get it?

FYI Fonts cannot be distributed across different PDF objects. Every font item is a font in its own right - even if it is a subset font, which means no more no less that it contains a subset of characters of its original.

So assuming fitz adds those xrefs to the filenames, so that it outputs

Times-Roman-XYZ.cff
Times-Roman-YZX.cff
Times-Roman-ZXY.cff

You say that it will not be possible for fitz to combine them into one and just output their union in a Times-Roman.cff?

When I open such a "partial" Times-Roman-ZXY.cff file in fontforge, it shows me the glyphs in their correct places, while the other places remain empty. When I open, say, Times-Roman-YZX.cff, it shows me some other glyphs in their correct places and the rest empty. It should be possible to combine the two into one, under a new "encoding" that would be the union of the two encodings. Even if the two subfonts contain common glyphs, it should be possible to tell it to "keep the glyphs from the first font and only add glyphs from the second in the empty 'seats'" (read: code points).

2 replies

JorjMcKie Dec 9, 2022
Maintainer

First of all: yes I have made that change and the fonts will be saved using this scheme:

Times-Roman-XYZ.cff
Times-Roman-YZX.cff
Times-Roman-ZXY.cff

Also verified its workings with the test file you submitted - thanks!
In my e-mail response I also explained why merging subset fonts together again may or may not be possible. You sketched a situation where this indeed may be possible.
In any case, this is not within (Py-) MuPDF's scope! Check out specialized software - a good point to start with may be FontForge, but I am really not sure.

JorjMcKie Dec 9, 2022
Maintainer

The change within __main__.py is a trivial one. Please drop me a note if you want it. You can manually copy it over to your installation to try out without regenerating ...

sedimentation-fault · 2022-12-09T11:28:22Z

sedimentation-fault
Dec 9, 2022
Author

Got your personal reply, thanks!

Just post the output of diff -u here and I will patch my version of __main__.py with it. I guess it's the same code that adds the xrefs to the img-XXX filenames when one extracts an image object instead of a font object. That would at least output exactly as many fonts as there are font objects (right now, on the test file, it says it extracted 16 fonts, but we see only 11...).

Thank you for all clarifications.

3 replies

JorjMcKie Dec 9, 2022
Maintainer

main.zip
Here is the update ...

JorjMcKie Dec 9, 2022
Maintainer

it does extract 16 fonts

sedimentation-fault Dec 9, 2022
Author

Thanks!

sedimentation-fault · 2022-12-11T11:03:56Z

sedimentation-fault
Dec 11, 2022
Author

I can confirm that the change in __main__.py

--- ./fitz/__main__.py.orig     2022-11-07 19:21:52.000000000 +0100
+++ ./fitz/__main__.py  2022-12-08 19:53:12.000000000 +0100
@@ -512,7 +512,7 @@
                     if ext == "n/a" or not buffer:
                         continue
                     outname = os.path.join(
-                        out_dir, fontname.replace(" ", "-") + "." + ext
+                        out_dir, f"{fontname.replace(' ', '-')}-{xref}.{ext}"
                     )
                     outfile = open(outname, "wb")
                     outfile.write(buffer)

solves the problem of overwriting previously extracted font files, since now each font filename contains a suffix that is its object number (xref).
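For readers who prefer a script over patching the installed module, the same naming scheme can be sketched with PyMuPDF's `Document.get_page_fonts()` and `Document.extract_font()`. This is an illustration modeled on the patched lines, not the module's actual code; `font_filename` and `extract_fonts` are hypothetical helper names.

```python
import os

def font_filename(fontname, xref, ext):
    """Build an output name that stays unique per font object (xref)."""
    return f"{fontname.replace(' ', '-')}-{xref}.{ext}"

def extract_fonts(pdf_path, out_dir):
    """Write every extractable font of the PDF to out_dir; return the paths."""
    import fitz  # PyMuPDF; imported here so font_filename works without it

    saved = []
    seen = set()  # a font object may be referenced from several pages
    doc = fitz.open(pdf_path)
    for pno in range(doc.page_count):
        for item in doc.get_page_fonts(pno):
            xref = item[0]  # first entry of each item is the font's xref
            if xref in seen:
                continue
            seen.add(xref)
            fontname, ext, _, buffer = doc.extract_font(xref)
            if ext == "n/a" or not buffer:
                continue  # same skip condition as in the patched code
            path = os.path.join(out_dir, font_filename(fontname, xref, ext))
            with open(path, "wb") as f:
                f.write(buffer)
            saved.append(path)
    doc.close()
    return saved
```

Calling `extract_fonts("x.pdf", ".")` would then be expected to produce names such as Times-Roman-567.cff, one file per font object.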

I would have preferred to mark Jorj's answer as the answer to this question, but somehow I could only mark my own answers...

To anybody who is interested in how one can continue from here:

One could use the merge functionality of fonttools:

fonttools merge Times-Roman-*.

HOWEVER...fonttools will complain that the fonts are neither TTF, nor OTF... :-( Therefore one has to first convert those .cff files

Times-Roman-306.cff
Times-Roman-567.cff
Times-Roman-569.cff

to OTF (I read that OTF was specifically designed to accommodate CFF files and it should be "just" a matter of writing an "OTF wrapper" around them...), then merge them to one with

fonttools merge Times-Roman-*.otf

I have not found any command-line tool to accomplish the CFF-to-OTF conversion, although the Internet is full of online converters that claim to be able to do so. Just as a proof-of-concept, I tried fontforge:

I opened each one of the CFFs in fontforge and told it to create a font out of it (File --> Create Font, choose "OpenType (CFF)" as the font type, accept all default settings). It did it, but not before it warned me that they contained errors:

For example, for Times-Roman-567.cff, as extracted by PyMuPDF:

The font contains errors.

Self Intersecting

Missing Points at Extrema

Glyph contains overlapped hints (in the same hintmask)

Bad Private Dictionary

Would you like to review the errors or save the font anyway?

I chose to save them and passed the .otf files to fonttools to merge them:

fonttools merge Times-Roman-*.otf

fonttools did it, but warned me that it dropped all CMAPs from the three OTFs:

WARNING: Dropped cmap subtable from font '0':   format  0, platformID  1, platEncID  0
WARNING: Dropped cmap subtable from font '1':   format  0, platformID  1, platEncID  0
WARNING: Dropped cmap subtable from font '2':   format  0, platformID  1, platEncID  0

At this point, I have no idea whether the created merged.ttf is "better" than the originals (in the sense that it "contains them all and is usable"), or it misses vital components. But I guess this could be the way to go, if one chose to merge extracted fonts of the same family (e.g. Times-Roman) and different object numbers (xrefs).

2 replies

JorjMcKie Dec 11, 2022
Maintainer

Just as an aside:
Using fonts without a CMAP to write to e.g. a PDF will result in text that is not extractable: instead of the usual unicodes, the invalid unicode 0xFFFD (�) will be extracted.

sedimentation-fault Dec 11, 2022
Author

Background reading, for those interested in the Why:
A bit of font generation

sedimentation-fault · 2022-12-11T22:53:11Z

sedimentation-fault
Dec 11, 2022
Author

I was wondering why fonttools merge was giving me those warnings about dropped CMAPs:

WARNING: Dropped cmap subtable from font '0':   format  0, platformID  1, platEncID  0
...

Looking at the source of Lib/fontTools/merge/cmap.py from fonttools, the answer is there, even in the comments:

`# Only merge format 4 and 12 Unicode subtables, ignores all other subtables`

and in the code:

		for subtable in table.tables:
			properties = (subtable.format, subtable.platformID, subtable.platEncID)
			if properties in _CmapUnicodePlatEncodings.BMP:
				format4 = subtable
			elif properties in _CmapUnicodePlatEncodings.FullRepertoire:
				format12 = subtable
			else:
				log.warning(
					"Dropped cmap subtable from font '%s':\t"
					"format %2s, platformID %2s, platEncID %2s",
					fontIdx, subtable.format, subtable.platformID, subtable.platEncID
				)

which, in plain English, says:

If the triple (format, platform ID, encoding ID) looks like 'Unicode BMP-only', the CMAP becomes its 'format4' table, if it looks like 'Unicode Full Repertoire', it becomes its 'format12' table - otherwise the CMAP is dropped and the above warning is logged.
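The decision described above can be mimicked in a few lines of standalone Python. This is a sketch only: the two sets are my assumption of what `_CmapUnicodePlatEncodings` contains and should be checked against the installed fonttools version; `classify` is a hypothetical helper, not a fonttools function.

```python
# ASSUMPTION: these triples mirror fontTools' _CmapUnicodePlatEncodings;
# verify them against Lib/fontTools/merge/cmap.py of your version.
BMP = {(4, 3, 1), (4, 0, 3), (4, 0, 4), (4, 0, 6)}
FULL_REPERTOIRE = {(12, 3, 10), (12, 0, 4), (12, 0, 6)}

def classify(fmt, platform_id, plat_enc_id):
    """Return what the merger would do with one cmap subtable."""
    key = (fmt, platform_id, plat_enc_id)
    if key in BMP:
        return "format4"
    if key in FULL_REPERTOIRE:
        return "format12"
    return "dropped"

# The triple from the warnings above: format 0, platformID 1, platEncID 0.
print(classify(0, 1, 0))
# A Windows Unicode BMP subtable, for comparison.
print(classify(4, 3, 1))
```

Whatever the exact contents of the two sets, a format 0 subtable can match neither of them, so it always ends up in the "dropped" branch.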

In view of this information, I see that my three CFF fonts are 'format 0' (that was in the Warning messages already, but it is only now that I see it, after looking at the format (sic!) string of the warning in the code above!). 'format0' means 'old style 256-character font' (as explained in the background reading). So fonttools will not merge the CMAPs of such 'old style' fonts, either because it can't, or because it's illogical/impossible (not sure here about the true reason).

It seems that, in the case of my example file, you cannot go any further than extracting the fonts with PyMuPDF and being happy about it! That's probably all that can be done in this case...

1 reply

JorjMcKie Dec 11, 2022
Maintainer

You are providing interesting information here, I must say - thank you!

JorjMcKie · 2022-12-13T01:05:24Z

JorjMcKie
Dec 13, 2022
Maintainer

A new version, pre-release 1.21.1rc1 has just been published.
It contains the respective fix.

0 replies
