How to extract all the images from a PDF page while ignoring the header and footer logo images? #136

thangarajdeivasikamani · 2024-06-03T18:15:08Z


Hello Team,

Consider a PDF datasheet like https://www.st.com/resource/en/datasheet/stm32f205rb.pdf.

I used the code below to extract its images, but I am not getting the images that are actually present in the PDF.

    import pymupdf  # import the bindings

    fname = "stm32f103c8.pdf"  # PDF file to process
    doc = pymupdf.open(fname)  # open document

    # iterate over the pages
    for page in doc:
        img_number = 0  # for enumerating images per page
        # iterate over the image blocks
        for block in page.get_text("dict")["blocks"]:
            # skip if no image block
            if block["type"] != 1:
                continue
            # build filename, like 'img17-3.jpg'
            name = f"img{page.number}-{img_number}.{block['ext']}"
            with open(name, "wb") as out:
                out.write(block["image"])  # write the binary content
            img_number += 1  # increase image counter

Sometimes only the repeated footer logo and side logo are treated as images and extracted, while the actual images on the page are missed.

I also tried the script below, but it does not extract the required images either:

https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-images/extract-from-xref.py
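
For the repeated footer and side logos described above, one possible approach is to count on how many pages each embedded image appears and skip the ones that recur on most pages. The following is only a minimal sketch under some assumptions: the logos are stored as one shared image object (same xref) reused across pages, and the helper names `repeated_keys` and `extract_non_repeated_images` and the `max_share` threshold are hypothetical, chosen for illustration. It would not recover figures that are stored as vector drawings rather than embedded images, which may be the case for the diagrams in a datasheet.

```python
from collections import Counter
from pathlib import Path


def repeated_keys(keys_per_page, max_share=0.5):
    """Return the keys that occur on more than max_share of all pages."""
    page_count = len(keys_per_page)
    if page_count < 2:
        return set()  # a single page cannot tell logos from content
    counts = Counter()
    for keys in keys_per_page:
        counts.update(set(keys))  # count each key once per page
    return {key for key, n in counts.items() if n / page_count > max_share}


def extract_non_repeated_images(pdf_path, out_dir="images", max_share=0.5):
    """Save embedded images, skipping those reused on most pages (hypothetical helper)."""
    # Imported here so repeated_keys() can be used without PyMuPDF installed.
    import pymupdf

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with pymupdf.open(pdf_path) as doc:
        # The first item of each get_images() entry is the image xref.
        xrefs_per_page = [
            [img[0] for img in page.get_images(full=True)] for page in doc
        ]
        # Assumption: a logo is one image object shared by many pages.
        logos = repeated_keys(xrefs_per_page, max_share)
        for page_no, xrefs in enumerate(xrefs_per_page):
            unique = [x for x in dict.fromkeys(xrefs) if x not in logos]
            for n, xref in enumerate(unique):
                info = doc.extract_image(xref)  # dict with "ext" and "image"
                name = out / f"img{page_no}-{n}.{info['ext']}"
                name.write_bytes(info["image"])
                saved.append(name)
    return saved
```

Calling `extract_non_repeated_images("stm32f103c8.pdf")` would write the remaining images into an `images` folder. If a logo is embedded as a separate object on every page, the xref comparison would not catch it; comparing a hash of the image bytes instead of the xref might work in that case.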


Replies: 0 comments
