How to Extract Images from a PDF (v1.9.2)
New methods in v1.9.2 let you extract every image of a PDF, page by page, and save each one as a PNG file with the short script below. An image with a CMYK colorspace is converted to RGB first, because PNG cannot store CMYK data.
import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]                       # xref number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:                       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:                               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None                     # release the converted pixmap
        pix = None                          # release the original pixmap
The script runs very fast: on a fast desktop PC it extracts the 180 images of Adobe's manual in less than 2 seconds. As a reminder, that PDF has 1310 pages, is over 30 MB in size and contains more than 330,000 PDF objects.
The script is also included in the demo directory, together with another version that extracts all images from a PDF, whether or not they are referenced by a page.
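For readers on a newer PyMuPDF release, the same page-by-page loop might look roughly like the sketch below. It is not the demo script: it assumes the later snake_case names `get_page_images` and `Pixmap.save`, and it assumes that `pix.n` no longer always includes an alpha channel, so the CMYK test subtracts `pix.alpha`. The helpers `image_filename`, `needs_rgb_conversion` and `extract_page_images` are hypothetical names introduced only for this illustration.

```python
import os


def image_filename(page_index, xref):
    """Build the same "p<page>-<xref>.png" name the script above uses."""
    return "p%s-%s.png" % (page_index, xref)


def needs_rgb_conversion(n, alpha):
    """True if a pixmap has 4 or more color components (i.e. CMYK).

    n is the total number of components per pixel, alpha is 1 (or True)
    when one of them is an alpha channel. Assumption: newer PyMuPDF.
    """
    return n - alpha >= 4


def extract_page_images(pdf_path, out_dir="."):
    """Save every page-referenced image as PNG; return the written paths."""
    import fitz  # PyMuPDF, imported lazily so the helpers work without it

    doc = fitz.open(pdf_path)
    saved = []
    for i in range(len(doc)):
        for img in doc.get_page_images(i):  # assumed newer method name
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            if needs_rgb_conversion(pix.n, pix.alpha):
                pix = fitz.Pixmap(fitz.csRGB, pix)  # CMYK: convert to RGB
            target = os.path.join(out_dir, image_filename(i, xref))
            pix.save(target)  # output format is taken from the extension
            saved.append(target)
            pix = None  # release the pixmap
    doc.close()
    return saved
```

Calling `extract_page_images("file.pdf")` would write the PNG files to the current directory. An image referenced by several pages is written once per page, as in the original script.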