Linux – How to recursively identify non-searchable PDFs and copy them to a folder

Tags: batch, linux, pdf, windows-7

Further to an earlier post which provided a script solution:

From my question it may be possible to tell that I am a computer user and have no programming knowledge.

I have hundreds of searchable and unsearchable pdfs in various folders and subfolders on an external hard drive.

I have computers running Windows 7 and Ubuntu 14.04.

How could I modify this script to specify the parent folder and also search subfolders, then generate a report identifying filenames and locations?

If wishes came true then this would be contained in a GUI and copy the text-less files into a common folder from where Abbyy Pro could batch OCR.

Best Answer

You probably should have posted this as a comment on the other question but, then again, you would have needed more reputation to do that.

@davidgo's script is already recursive (it will go through folders and subfolders). You would only have to modify echo "$each NOT searchable" to change what it does upon finding a non-searchable file. This should do the trick:


Edit: There were some issues with how the script handled spaces in filenames and some other problems here and there. I decided to overhaul davidgo's original script so you will see a few more changes than I said above.


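#!/bin/bash

# Copy every PDF that contains no fonts (image-only, hence not searchable)
# from PDFDirectory and all of its subfolders into TARGETDirectory.

if [[ "$#" -ne 2 ]]; then
    echo "Usage: $0 /path/to/PDFDirectory /path/to/TARGETDirectory"
    exit 1
fi

PDFDIRECTORY="$1"
TARGETDIR="$2"
READ_ERROR=0

# Create the target folder if it does not exist yet.
mkdir -p "$TARGETDIR" || exit 1

# find -print0 with read -d '' keeps filenames containing spaces intact.
while IFS= read -r -d '' FILE; do
    # pdffonts returns a non-zero status when it cannot read the file.
    if ! PDFFONTS_OUT="$(pdffonts "$FILE" 2>/dev/null)"; then
        READ_ERROR=1
        echo "Error while reading $FILE. Skipping..."
        continue
    fi
    # pdffonts prints two header lines, then one line per font.
    FONTS="$(( $(echo "$PDFFONTS_OUT" | wc -l) - 2 ))"
    if [[ "$FONTS" -eq 0 ]]; then
        echo "NOT SEARCHABLE: $FILE -- Copying to $TARGETDIR."
        # Note: files with the same name from different subfolders overwrite each other.
        cp -v "$FILE" "$TARGETDIR/${FILE##*/}"
    else
        echo "SEARCHABLE: $FILE"
    fi
done < <(find "$PDFDIRECTORY" -type f -iname '*.pdf' -print0)

echo "Done."
if [[ "$READ_ERROR" -eq 1 ]]; then
    echo "There were some errors."
fi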

Save this script in a new empty file, name it something like copy_image_pdf and make it executable via the file properties (I am assuming you would do this on Ubuntu).
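
If you prefer the terminal to the file properties dialog, the same step can be done with chmod. The sketch below wraps it in a small helper, check_setup (a hypothetical name, not part of the original script), which also warns when pdffonts is missing. On Ubuntu that tool is part of the poppler-utils package.

```shell
#!/bin/bash

# check_setup SCRIPT (hypothetical helper): make SCRIPT executable for the
# current user and warn if the pdffonts command cannot be found.
check_setup() {
    local script="$1"
    if [ ! -f "$script" ]; then
        echo "No such file: $script" >&2
        return 1
    fi
    # Same effect as ticking "allow executing file as program" in the file properties.
    chmod u+x "$script" || return 1
    if ! command -v pdffonts >/dev/null 2>&1; then
        echo "pdffonts not found. On Ubuntu it is part of poppler-utils:" >&2
        echo "  sudo apt-get install poppler-utils" >&2
    fi
    return 0
}
```

For example, assuming the script was saved in your home folder, you would call check_setup ~/copy_image_pdf once before the first run.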

Then run it from a terminal in the folder where you saved it, providing the PDF directory and the target directory to which the image-only PDF files should be copied, e.g.:

./copy_image_pdf /media/data/pdffiles /media/data/pdffiles-to-be-ocred
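
The question also asked for a report of filenames and locations. Here is a minimal sketch of how that could look, reusing the same pdffonts check without copying anything. The helper names count_fonts and report_pdfs and the output file pdf_report.tsv are illustrative assumptions, not part of the script above.

```shell
#!/bin/bash

# count_fonts: reads pdffonts output on stdin and prints the number of fonts.
# pdffonts prints two header lines, then one line per font.
count_fonts() {
    awk 'NR > 2 { n++ } END { print n + 0 }'
}

# report_pdfs DIR: prints one "STATUS<TAB>path" line for every PDF under DIR,
# including subfolders. STATUS is SEARCHABLE, NOT_SEARCHABLE or ERROR.
report_pdfs() {
    local dir="$1" file out fonts
    find "$dir" -type f -iname '*.pdf' -print0 |
    while IFS= read -r -d '' file; do
        # A non-zero exit status means pdffonts could not read the file.
        if ! out="$(pdffonts "$file" 2>/dev/null)"; then
            printf 'ERROR\t%s\n' "$file"
            continue
        fi
        fonts="$(printf '%s\n' "$out" | count_fonts)"
        if [ "$fonts" -eq 0 ]; then
            printf 'NOT_SEARCHABLE\t%s\n' "$file"
        else
            printf 'SEARCHABLE\t%s\n' "$file"
        fi
    done
}
```

Running report_pdfs /media/data/pdffiles > pdf_report.tsv would write a tab-separated file that opens in a spreadsheet program, with the status in the first column and the full path in the second.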