Bash – detect if PDF file is made from images

bash pdf shell-script

I'm trying to pre-process a huge number of PDF files, many of which contain scanned images rather than actual text, so that I can move those to a separate location for OCR processing.

The problem is that I've tried to detect whether a PDF is image-based before running OCR, with no success so far.
Running "pdffonts filename" is supposedly the correct approach, but image-only PDFs list fonts too!

Best Answer

pdfimages -list filename.pdf

Should do the trick. This gives you a list of images contained in the PDF file.
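To turn that listing into a yes/no decision, one possible heuristic is to compare the number of pages that carry an image with the total page count. The sketch below assumes pdfimages and pdfinfo from poppler-utils are installed; the function names and the to_ocr/ directory are made up for illustration. It will also flag text PDFs that happen to have an image on every page, and scans that already carry a hidden OCR text layer, so treat it as a first-pass filter only.

```shell
# Count the distinct pages that carry at least one image, reading the
# output of `pdfimages -list` on stdin. The first two lines are the
# column header and a dashed separator, so they are skipped.
pages_with_images() {
    awk 'NR > 2 && !seen[$1]++ { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed when every page of "$1" carries an image.
# Assumes pdfimages and pdfinfo (poppler-utils) are on the PATH.
looks_scanned() {
    local pdf=$1 total with_img
    total=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
    with_img=$(pdfimages -list "$pdf" | pages_with_images)
    [ "${total:-0}" -gt 0 ] && [ "$with_img" -ge "$total" ]
}

# Possible usage (to_ocr/ is an assumed target directory):
#   for f in *.pdf; do looks_scanned "$f" && mv "$f" to_ocr/; done
```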

Related Solutions

Combine multiple PDF files into one (arranged in a matrix)

You could use the utility program pdfnup from the pdfjam suite.

pdfnup in.pdf --nup 3x3

should output the file in-nup.pdf, in which the pages of in.pdf are arranged in a 3x3 matrix on each output page.

You should first merge all of your PDF files into a single file. You will probably also want to specify a paper size for the output file; see the pdfjam docs for the details.
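Since pdfnup is a wrapper around pdfjam, and pdfjam accepts several input files at once, merging and tiling can probably be done in a single call. The wrapper below is only a sketch: tile_pdfs and the PDFJAM override are names invented here, and a4paper is just an example paper size.

```shell
# Hypothetical wrapper: tile all input PDFs onto 3x3 sheets in one pass.
# Usage: tile_pdfs OUTPUT.pdf INPUT.pdf [INPUT.pdf ...]
# PDFJAM may be overridden (e.g. PDFJAM=echo) to preview the arguments
# without running pdfjam.
tile_pdfs() {
    local out=$1
    shift
    "${PDFJAM:-pdfjam}" --nup 3x3 --paper a4paper --outfile "$out" "$@"
}
```

For example, `tile_pdfs sheet.pdf a.pdf b.pdf c.pdf` would write the combined 3x3 layout to sheet.pdf.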

How to reverse-engineer a CUPS printer/print job

I'm answering a part of your questions only, because you seem to employ a sharp mind who only needs to be shown some hidden hooks to climb up the wall:

"Which PPD will it use?"

If a print queue printername is locally installed (and it is not a 'raw' queue), it will use the PPD /etc/cups/ppd/printername.ppd.

"Does it detect the format? How?"

Yes, it does. When you have debug logging enabled (line LogLevel debug in /etc/cups/cupsd.conf), you will see a line in the error_log reading "Auto-typing file...". (There will be no auto-typing, if the job already states a mime type, like in lp -d printername -o document-format=application/pdf my.pdf.)

The rules for classifying various MIME types are defined in /usr/share/cups/mime.types and in all other files which may be in the same directory with the suffix *.types. (You could put your own rules there too, to define your own custom MIME types which should be processed by your own custom filters...)
"What other decisions is the pipeline taking and what conversions?"
1. If the PPD doesn't have any line starting with one of the *cupsFilter: or *cupsFilter2: keywords, then it assumes the final print device to be a PostScript printer. Hence it converts to PostScript every job that was not already submitted as PostScript.
2. If there is one or more lines starting with the keyword *cupsFilter: or *cupsFilter2: it will read from these lines which MIME type the print device can consume and it will employ an appropriate filter chain to generate the respective MIME type.
3. The filters which can process certain MIME types are listed in /usr/share/cups/mime.convs and in all other files which may be in the same directory with the suffix *.convs. (You could put your own custom filters there for any MIME type you want to be processed by these filters...)
4. The *.convs files name the input as well as the output MIME types the respective filter can consume and produce, and what virtual "cost" (just an integer number) such a conversion will cause. When faced with different possible filtering chains which CUPS could construct to go from application/alpha to application/zeta it picks the one with the lowest total cost.
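The points above can be checked by hand with two small helpers. This is a rough sketch, not part of CUPS: the function names are invented here, and the second helper assumes each rule line in a *.convs file has the form "source/type destination/type cost filter". It lists only single-step conversions; it does not search for the cheapest multi-step chain the way CUPS does.

```shell
# Print the *cupsFilter / *cupsFilter2 lines of a PPD read from stdin.
# No output means CUPS will treat the device as a PostScript printer.
ppd_filter_lines() {
    grep -E '^\*cupsFilter2?:'
}

# List the conversions a *.convs file (read from stdin) offers for one
# input MIME type, cheapest first. Output columns: cost destination filter
conversions_from() {
    awk -v src="$1" '!/^#/ && $1 == src { print $3, $2, $4 }' | sort -n
}

# Possible usage:
#   ppd_filter_lines < /etc/cups/ppd/printername.ppd
#   conversions_from application/pdf < /usr/share/cups/mime.convs
```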
"Will it reconvert to PDF again?"

Most likely not, unless you asked for a print option on the original PDF that requires it: printing only a range of pages; printing 2 or more pages on one sheet of paper; scaling; reshuffling pages for booklet printing, etc. In that case a pdftopdf filter may be applied, which converts application/pdf to application/vnd.cups-pdf.
"What did CUPS detect?"

See above: search for the string "Auto-typing file" in /var/log/cups/error_log:
```
sudo grep -A 2 "Auto-typing file" /var/log/cups/error_log
```

"What conversions did it do?"

Look in error_log again and search for lines containing Started filter:
```
sudo grep "Started filter" /var/log/cups/error_log
```
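If only the filter names are of interest, the grep output can be trimmed further. The helper below is a sketch with an invented name; it assumes the filter name is the first word after the literal text "Started filter" on each matching line, so check it against your own log before relying on it.

```shell
# List the filters named in "Started filter" log lines read from stdin,
# in the order they appear. Lines without that text are ignored.
started_filters() {
    sed -n 's/.*Started filter \([^ ]*\).*/\1/p'
}
```

Piping the output of the grep command above into started_filters should then give one filter per line, in the order they were started.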
"Where can I fetch the intermediate outputs generated?"

You cannot do this directly. You'd have to manipulate each and every filter of CUPS to write out the intermediary format. (I can do it, I have a ready-made recipe for this, but you'd have to pay me to apply it.)

Since fetching the intermediate outputs is probably out of scope for you, you can do something different: simulate the filtering chain CUPS would employ for any job.

You can discover how to do this by reading the man page of cupsfilter. You can also just list the filters CUPS would employ for any of the print queues:
```
cupsfilter           \
    --list-filters    \
    -d <printername>   \
    -i <inputmime/type> \
    -m <outputmime/type> \
    -o "number-up=4 page-ranges=3-5,7,11" \
     <filename>
```
