How to Extract Images from a PDF (v1.9.2)

New methods in v1.9.2 let you extract all images from a PDF, page by page, and save them as PNG files with the short script below. Images with a CMYK colorspace are converted to RGB first.

import fitz                     # PyMuPDF
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

The script runs very fast: on a fast desktop PC, it takes less than 2 seconds to extract the 180 images of Adobe's manual. As a reminder, this PDF has 1310 pages, is over 30 MB in size and contains more than 330,000 PDF objects.
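
Because the output file name contains the page number, the script writes an image once for every page that references it. If one copy per image is enough, the xref numbers can be de-duplicated before any pixmap is created. The following is only a sketch of that idea: `unique_images` is a hypothetical helper, not part of PyMuPDF, and it assumes only what the script already relies on, namely that the first item of each image list entry is the xref.

```python
def unique_images(page_image_lists):
    """Yield (page_number, xref) once per distinct xref.

    page_image_lists: an iterable with one image list per page; the first
    item of each entry is taken to be the image's xref number.
    (Hypothetical helper for illustration, not a PyMuPDF function.)
    """
    seen = set()                          # xrefs already yielded
    for pno, images in enumerate(page_image_lists):
        for img in images:
            xref = img[0]
            if xref in seen:              # same image, referenced again
                continue
            seen.add(xref)
            yield pno, xref

# Possible use with the script's own calls:
#   lists = (doc.getPageImageList(i) for i in range(len(doc)))
#   for pno, xref in unique_images(lists):
#       pix = fitz.Pixmap(doc, xref)
#       ...save as before...
```

Each image is then reported with the first page on which it appears, so the file names keep the same pattern.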

The script is also included in the demo directory, together with another version that extracts all images from a PDF, whether they are referenced by a page or not.