PDF Search – How to Find PDFs in Multiple Zip Files Containing a Search Term

Tags: find, pdf, search, zip

Given:

  • A directory with 1..n ZIP files with arbitrary names (all ending with .zip)
    • Each ZIP file contains 1..n PDF files with arbitrary names (all ending with .pdf)
    • All PDFs come from the same source and are, to some extent, formatted comparably.
    • The PDFs are not prose but invoices, inventory lists, etc. (i.e. forms and tables). They are searchable when I open them in a PDF viewer.
  • A search term, e.g. a stock item number or an invoice number

Wanted:

  • A way to find/list all the PDFs that contain the given search term.
  • Preferably with existing Linux tools.

Best Answer

You can convert each PDF to text with pdftotext (part of poppler-utils on many distributions) and then run grep on that text:

#!/bin/bash
# Run in the directory that holds the ZIP files; pass the search term as the first argument.
term=${1:?usage: $0 SEARCH_TERM}
for z in *.zip
do
  zipinfo -1 "$z" |  # Get the list of filenames in the zip file
    while IFS= read -r f
    do
      unzip -p "$z" "$f" | # Extract each PDF to standard output instead of a file
        pdftotext - - | # Then convert it to text, reading from stdin, writing to stdout
        grep -qF -- "$term" && echo "$z -> $f" # And finally grep the text for the literal term
    done
done
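
If you need this more than once, the same pipeline can be wrapped in a function. The following is only a sketch under a few assumptions: find_in_zipped_pdfs is a hypothetical name, the optional second argument (the directory to scan) is my addition, entries that do not end in .pdf are skipped, and the term is matched as a literal string with grep -F. Since the PDFs are mostly tables, it also passes -layout to pdftotext so that columns stay on the same line; drop that flag if it does not help with your files.

```shell
#!/bin/bash
# Sketch: print "zipfile -> pdfname" for every PDF inside the ZIP files of a
# directory whose extracted text contains the search term.
# find_in_zipped_pdfs is a hypothetical helper name, not part of any tool.
find_in_zipped_pdfs() {
  local term=$1 dir=${2:-.} z f
  for z in "$dir"/*.zip; do
    [ -e "$z" ] || continue        # no ZIP files: skip the unexpanded pattern
    zipinfo -1 "$z" |              # one archive member name per line
      while IFS= read -r f; do
        case $f in
          *.pdf|*.PDF) ;;          # only PDF members are converted
          *) continue ;;
        esac
        # Extract to stdout, convert to text (-layout keeps table columns
        # aligned), then test for the literal term without printing it.
        if unzip -p "$z" "$f" | pdftotext -layout - - 2>/dev/null |
             grep -qF -- "$term"; then
          printf '%s -> %s\n' "$z" "$f"
        fi
      done
  done
}
```

After sourcing the function, a call such as find_in_zipped_pdfs 1234 /path/to/zips (the path is a placeholder) lists the matching PDFs, one per line.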