It's my understanding that `pdfimages -all` extracts images from PDFs in their native formats.
Therefore, I expected the JPG (lossy) images extracted by that command to have the same pixel information as the .ppm and .pbm files produced without the `-all` option, and as the PNG (lossless) files created when I right-click and save the image in Evince.
However, ImageMagick's `compare` command reports differences between the images in the JPG files and the images produced by the other methods above.
To reproduce, download the PDF at this link (https://fccid.io/document.php?id=2149405), pass it to `pdfimages` and to `pdfimages -all`, and then pass the first .ppm file and the first .jpg file to `compare`. When I do this, it produces an image file containing red to indicate a difference in the images.
Is there something that I don't understand? Is `pdfimages` adding pixel information by default when it creates .ppm and .pbm files?
Best Answer
`pdfimages -all` returns the exact file that was stored in the PDF.

We can test this by doing a round trip: starting with a JPG image, we add it to a PDF using LaTeX, extract it using `pdfimages -all`, and then compare it to the original. (The reason for using LaTeX is explained in the footnote below.)

I took the first JPG image extracted from your link and named it `device.jpg`. I put it in a PDF file using LaTeX, extracted it again using `pdfimages -all`, and compared the result with the original. The extracted JPG is byte-for-byte identical to the original.
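The exact commands used for this round trip did not survive in this copy of the answer, so here is a minimal Python sketch of the same idea. It assumes `pdflatex` and poppler's `pdfimages` are on the PATH and that `device.jpg` is in the working directory. The names `roundtrip.tex`, the output prefix `extracted`, and the helper functions are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path

# Minimal LaTeX document that embeds device.jpg (assumed to sit next to the .tex file).
TEX = r"""\documentclass{article}
\usepackage{graphicx}
\begin{document}
\includegraphics{device.jpg}
\end{document}
"""


def files_identical(a, b):
    """Return True if the two files have exactly the same bytes."""
    return Path(a).read_bytes() == Path(b).read_bytes()


def round_trip(workdir="."):
    """Embed device.jpg with pdflatex, extract it with pdfimages -all, compare.

    Hypothetical helper: requires pdflatex and pdfimages to be installed.
    """
    work = Path(workdir)
    (work / "roundtrip.tex").write_text(TEX)
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "roundtrip.tex"],
        cwd=work, check=True,
    )
    subprocess.run(
        ["pdfimages", "-all", "roundtrip.pdf", "extracted"],
        cwd=work, check=True,
    )
    # pdfimages names its output <prefix>-NNN.<ext>; this assumes the PDF
    # contains a single JPEG, so the file of interest is extracted-000.jpg.
    return files_identical(work / "device.jpg", work / "extracted-000.jpg")
```

Calling `round_trip()` in a directory containing `device.jpg` should return `True` if the PDF creator stored the file untouched. A byte comparison is deliberately stricter than a pixel comparison: it only passes when the JPEG was never re-encoded.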
Footnote: the reason for using LaTeX
The above test cannot be done using just any PDF creator, because not all PDF creators place images into a PDF unmodified. For example, when I tried ImageMagick's `convert`, it re-sampled the image to a smaller size before placing it in the PDF.

Image accuracy was part of pdflatex's design goals. Other PDF creation software may, by default, "optimize" images before placing them in the PDF.
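One quick way to spot this kind of re-sampling is to compare the pixel dimensions recorded in the original and extracted JPEG headers. The following is a rough sketch using only the Python standard library. `jpeg_size` and `was_resampled` are hypothetical helper names, and the parser handles only the common case of a well-formed file whose frame header comes before the scan data.

```python
import struct

# Start-of-frame markers carry the image size. 0xC4 (DHT), 0xC8 (JPG) and
# 0xCC (DAC) fall in the same numeric range but are not frame headers.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_size(data):
    """Return (width, height) from the first SOF segment of JPEG bytes."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte before the real marker: skip it.
            pos += 1
            continue
        if marker in SOF_MARKERS:
            if pos + 9 > len(data):
                raise ValueError("truncated SOF segment")
            # Layout after the marker: length(2) precision(1) height(2) width(2)
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        # Any other segment: the length field counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    raise ValueError("no SOF segment found")


def was_resampled(original, extracted):
    """True if the two JPEG byte strings report different dimensions."""
    return jpeg_size(original) != jpeg_size(extracted)
```

Matching dimensions do not prove the image was stored untouched, since a PDF creator could re-encode at the same size. For that, a byte-for-byte comparison is still needed.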
Update: ShreevatsaR points out that the img2pdf utility also provides a lossless method to convert images to PDF. Non-TeX users will also likely find it much simpler to use.