Fully embedded font is extracted only partially if it occupies more than one object #2111
-
Description
I have a PDF for which the PDF reader told me that some fonts were fully embedded, so I decided to test my freshly installed PyMuPDF 1.21.0-rc2 on it. It told me that it "saved 16 fonts to" my working directory, but looking there revealed only 11 of them. The list of fonts from pdffonts made it clear why: some fonts that were "fully" embedded used more than one object to store partial pieces of them. A check with mupdf confirmed that PyMuPDF extracted only the last object into font-name.cff, probably because the output file (font-name.cff) was the same for all pieces/objects of the font named font-name.

How To Reproduce
You must have one of those PDFs that embed a font fully by storing pieces of it in multiple objects. Let's say x.pdf is one of them. pdffonts lists the fonts of x.pdf as follows:
You can see that, for example, Times-Roman has 'no' in the "sub" (subsetted) column, meaning it is NOT subsetted - therefore it is fully embedded. You can also see that it occupied objects with numbers 567, 569 and 306. mutool shows practically the same with its 'info' command:
We could use mutool extract x.pdf, but that is not user-friendly, as it a) extracts all fonts and all images together and b) extracts fonts as font-XXXX.cff (or, possibly, font-XXXX.ttf), where XXXX has no relation to the font object number (contrary to what its documentation claims). So you practically don't know which file is which font unless you open each one of them, or at least read its metadata somehow. Enter PyMuPDF, which promises to a) extract all fonts and b) give the extracted files sensible names. Alas, trying it on our x.pdf results in just 11 fonts - contrary to the claimed 16:
(first column is file size) What has happened? Comparing this to the output of mutool extract
and looking carefully at the file sizes (first column), we see that Times-Roman.cff (the Times-Roman font as extracted by PyMuPDF) is exactly font-0599.cff (a font extracted by mutool, whose object number is NOT 599, as there is no font object with such a number in x.pdf) - but this is only one of the three pieces (objects) that store parts of Times-Roman!

Your configuration (mandatory)
More precisely:
-
This seems to be a duplicate of #2109 - however I am not really sure.
-
Please delete #2109 - I tried to correct it while it was on its way to the server and didn't realize it had already been created. Sorry about that. I had also written my expectation, but it got deleted somehow while I was writing. My expectation is that PyMuPDF assembles the three pieces into one. If that's too difficult (or not possible due to limitations of mupdf/mutool), it should at the very minimum use suffixes to write the various pieces to their own files, e.g. Times-Roman-1.cff, Times-Roman-2.cff, Times-Roman-3.cff. I cannot provide a clearer description of the result than the one I have already given.
Then it tells the user that "all 16" fonts have been extracted, but the user sees only, say, 11 - because some .cff files were overwritten in the extraction process. I will try to send the PDF to your Outlook account, as I cannot post it publicly here.
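The overwrite described here, and the suffix workaround, can be illustrated with a minimal sketch. It does not call PyMuPDF at all: the helper name `font_filename` and the sample records are hypothetical, and only the Times-Roman object numbers 567, 569 and 306 are taken from the pdffonts listing discussed above.

```python
def font_filename(basename, ext, xref=None):
    """Build an output filename; adding the xref makes it unique per object."""
    if xref is None:
        return f"{basename}.{ext}"
    return f"{basename}-{xref}.{ext}"


# (basename, extension, object number); the last record is made up for contrast.
fonts = [
    ("Times-Roman", "cff", 567),
    ("Times-Roman", "cff", 569),
    ("Times-Roman", "cff", 306),
    ("Some-Other-Font", "cff", 999),  # hypothetical entry
]

# Without a suffix, every Times-Roman piece maps to the same file,
# so each write replaces the previous one.
plain_names = {font_filename(base, ext) for base, ext, _ in fonts}

# With the object number as suffix, each piece gets its own file.
suffixed_names = {font_filename(base, ext, xref) for base, ext, xref in fonts}

print(len(fonts), "objects ->", len(plain_names), "files without suffix")
print(len(fonts), "objects ->", len(suffixed_names), "files with suffix")
```

The gap between the number of objects and the number of distinct plain names is the same kind of mismatch as "saved 16 fonts" versus 11 files on disk.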
-
BTW, Times Roman object 1/2/3 are not fully embedded Times-Roman fonts. They are parts of the font that together form a fully embedded font. So overwriting Times-Roman.cff each time with the next object that happens to say "I keep data for 'Times-Roman' font" destroys parts of the font that we want to extract.
-
This is either not possible or clearly beyond the intended scope of PyMuPDF. Features like this one should be looked for in dedicated font packages like fontTools. What I suspect is really your problem instead: you extract font names without their subset identifier. For the time being, I will convert this post from an issue to a "Discussions" item.
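For readers unfamiliar with the subset identifier mentioned here: subsetted fonts in a PDF carry a tag of six uppercase letters followed by a plus sign in front of the base font name. The following is only a sketch of how such a tag could be separated from the name; `split_subset_tag` is a hypothetical helper, not a PyMuPDF function, and the tagged sample name is invented.

```python
import re

# Six uppercase letters, a plus sign, then the actual font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(fontname):
    """Return (tag, basename); tag is None when the name has no subset prefix."""
    match = SUBSET_TAG.match(fontname)
    if match:
        return match.group(1), match.group(2)
    return None, fontname


print(split_subset_tag("ABCDEF+Times-Roman"))  # invented subset name
print(split_subset_tag("Times-Roman"))         # no tag: not marked as a subset
```

Dropping the tag when building a filename makes differently tagged subsets of the same font share one output name, which is the collision suspected in this reply.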
-
I did not write any script, I just used fitz the way it is supposed to be run:
I wrote this command in my post above, together with its output. I would expect this command to extract the fonts correctly, that's all.
-
And, in doing the above, I was just following the instructions at
This suggests that now you don't even have to write a script to extract fonts (or images); it is all done by the extract command of the fitz module. Well, if it is really so, then I do expect it to extract fonts correctly, to not overwrite any .cff and to present me the (fully embedded) font as a complete file and not as some part of it that happened to be the last one it encountered. [1] i.e. the command
-
I have sent you an example PDF to your outlook address already - didn't you get it?
So assuming fitz adds those xrefs to the filenames, so that it outputs
You say that it will not be possible for fitz to combine them into one and just output their union in a Times-Roman.cff? When I open such a "partial" Times-Roman-ZXY.cff file in fontforge, it shows me the glyphs in their correct places, while the other places remain empty. When I open, say, Times-Roman-YZX.cff, it shows me some other glyphs in their correct places and the rest empty. It should be possible to combine the two into one, under a new "encoding" that would be the union of the two encodings. Even if the two subfonts contain common glyphs, it should be possible to tell it to "keep the glyphs from the first font and only add glyphs from the second in the empty 'seats'" (read: code points).
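The "first font wins, second fills the empty seats" rule proposed here can be sketched on plain dictionaries. This is an illustration of the merge rule only, under the assumption that each partial font is reduced to a map from code point to glyph name; it does not touch glyph outlines or real CFF data, and `merge_encodings` and the sample maps are hypothetical.

```python
def merge_encodings(first, second):
    """Union of two code-point -> glyph-name maps; entries in `first` win."""
    merged = dict(first)
    for code, glyph in second.items():
        # Only fill code points that the first font left empty.
        merged.setdefault(code, glyph)
    return merged


# Invented partial encodings that overlap at code point 66.
part_a = {65: "A", 66: "B"}
part_b = {66: "B.alt", 67: "C"}

merged = merge_encodings(part_a, part_b)
print(merged)
```

Code point 66 keeps the glyph from the first map, while 67 is taken from the second, which is the behaviour asked for above.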
-
Got your personal reply, thanks! Just post the output of Thank you for all clarifications.
-
I can confirm that the change in main.py
solves the problem of overwriting previously extracted font files, since now each font filename contains a suffix that is its object number (xref). I would have preferred to mark Jorj's answer as the answer to this question, but somehow I could only mark my own answers... To anybody who is interested in how one can continue from here: One could use the merge functionality of fonttools:
HOWEVER...fonttools will complain that the fonts are neither TTF, nor OTF... :-( Therefore one has to first convert those .cff files
to OTF (I read that OTF was specifically designed to accommodate CFF files and it should be "just" a matter of writing an "OTF wrapper" around them...), then merge them to one with
I have not found any command-line tool to accomplish the CFF-to-OTF conversion, although the Internet is full of online converters that claim to be able to do so. Just as a proof-of-concept, I tried fontforge: I opened each one of the CFFs in fontforge and told it to create a font out of it (File --> Create Font, choose "OpenType (CFF)" as the font type, accept all default settings). It did it, but not before it warned me that they contained errors. For example, for Times-Roman-567.cff, as extracted by PyMuPDF:
I chose to save them and passed the .otf files to fonttools to merge them:
fonttools did it, but warned me that it dropped all CMAPs from the three OTFs:
At this point, I have no idea whether the created merged.ttf is "better" than the originals (in the sense that it "contains them all and is usable"), or it misses vital components. But I guess this could be the way to go, if one chose to merge extracted fonts of the same family (e.g. Times-Roman) and different object numbers (xrefs).
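The merge step described above can also be driven from Python instead of the command line. The sketch below assumes the .otf files produced by the fontforge conversion are named Family-xref.otf (for example Times-Roman-567.otf); that naming, the helper names `group_by_family` and `merge_family`, and the sample file list are assumptions. `Merger` is the class in `fontTools.merge`; as noted above, it may still drop the cmaps of such fonts.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "<family>-<xref>.otf"
NAME_PATTERN = re.compile(r"^(?P<family>.+)-(?P<xref>\d+)\.otf$")


def group_by_family(filenames):
    """Group converted font files by family name, ignoring non-matching names."""
    groups = defaultdict(list)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match:
            groups[match.group("family")].append(name)
    return dict(groups)


def merge_family(paths, out_path):
    """Merge the given font files into one file using fontTools."""
    # Imported here so that the grouping helper works without fontTools installed.
    from fontTools.merge import Merger

    merged_font = Merger().merge(paths)
    merged_font.save(out_path)


# Hypothetical directory listing; no files are read at this point.
files = ["Times-Roman-567.otf", "Times-Roman-569.otf", "Times-Roman-306.otf", "notes.txt"]
groups = group_by_family(files)
print(groups)
# To actually merge (requires the files to exist):
# merge_family(groups["Times-Roman"], "merged.ttf")
```

Grouping first keeps the merge limited to pieces of one family, so unrelated fonts extracted from the same PDF are never combined by accident.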
-
I was wondering why
Looking at the source of Lib/fontTools/merge/cmap.py from fonttools, the answer is there, even in the comments:
and in the code:
which, in plain english says:
In view of this information, I see that my three CFF fonts are 'format 0' (that was in the warning messages already, but I only noticed it now, after looking at the format (sic!) string of the warning in the code above). 'format 0' means 'old style 256-character font' (as explained in the background reading). So fonttools will not merge the CMAPs of such 'old style' fonts, either because it can't, or because it's illogical/impossible (I am not sure about the true reason). It seems that, in the case of my example file, you cannot go any further than extracting the fonts with PyMuPDF and being happy about it! That's probably all that can be done in this case...
-
A new version, pre-release 1.21.1rc1, has just been published.