Linux – How to recursively identify non-searchable PDFs and copy them to a folder

batch, linux, pdf, windows 7

Further to an earlier post which provided a script solution:

As my question probably makes clear, I am an ordinary computer user with no programming knowledge.

I have hundreds of searchable and unsearchable PDFs in various folders and subfolders on an external hard drive.

I have computers running Windows 7 and Ubuntu 14.04.

How could I modify this script to specify the parent folder and also search subfolders, then generate a report identifying filenames and locations?

Ideally, this would be wrapped in a GUI and would copy the text-less files into a common folder from which Abbyy Pro could batch-OCR them.

Best Answer

You probably should have posted this as a comment on the other question, but then again, you would have needed more reputation to do that.

@davidgo's script is already recursive (it will go through folders and subfolders). You would only have to modify echo "$each NOT searchable" to change what it does upon finding a non-searchable file. This should do the trick:

Edit: There were some issues with how the script handled spaces in filenames, and a few other problems here and there. I decided to overhaul davidgo's original script, so you will see a few more changes than I described above.

#! /bin/bash

if [[ ! "$#" = "2" ]]
  then
      echo "Usage: $0 /path/to/PDFDirectory /path/to/TARGETDirectory"
      exit 1
fi

PDFDIRECTORY="$1"
TARGETDIR="$2"

while IFS= read -r -d $'\0' FILE; do
    PDFFONTS_OUT="$(pdffonts "$FILE" 2>/dev/null)"
    RET_PDFFONTS="$?"
    FONTS="$(( $(echo "$PDFFONTS_OUT" | wc -l) - 2 ))"
    if [[ ! "$RET_PDFFONTS" = "0" ]]
      then
          READ_ERROR=1
          echo "Error while reading $FILE. Skipping..."
          continue
    fi
    if [[ "$FONTS" = "0" ]]
      then
          echo "NOT SEARCHABLE: $FILE -- Copying to $TARGETDIR."
          cp -v "$FILE" "$TARGETDIR/${FILE##*/}"
      else
          echo "SEARCHABLE: $FILE"
    fi
done < <(find "$PDFDIRECTORY" -type f -name '*.pdf' -print0)

echo "Done."
if [[ "$READ_ERROR" = "1" ]]
  then
      echo "There were some errors."
fi

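Save this script in a new empty file, name it something like copy_image_pdf, and make it executable via the file properties or by running chmod +x copy_image_pdf in a terminal (I am assuming you would do this on Ubuntu). The script relies on pdffonts, which on Ubuntu is provided by the poppler-utils package.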

Then run it from a terminal in the directory where you saved it, passing the PDF directory and the target directory to which image-only PDF files should be copied. The target directory must already exist. For example:

./copy_image_pdf /media/data/pdffiles /media/data/pdffiles-to-be-ocred
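
One caveat with the cp line in the script: the target name is just the file's base name, so two PDFs with the same name in different subfolders would land on the same target path, and the later copy would replace the earlier one. Here is a minimal sketch of one way around that, assuming a numeric suffix is acceptable. unique_target is a hypothetical helper, not part of the script above:

```shell
#!/bin/bash
# Sketch: print a target path that does not exist yet, adding _1, _2, ...
# before the extension when needed. unique_target is a made-up helper name.
# Assumes the file name contains a dot (true for everything matched by *.pdf).
unique_target() {
    local target_dir="$1" name="$2" base ext candidate n=1
    base="${name%.*}"     # file name without its extension
    ext="${name##*.}"     # extension without the dot
    candidate="$target_dir/$name"
    while [[ -e "$candidate" ]]; do
        candidate="$target_dir/${base}_$n.$ext"
        n=$(( n + 1 ))
    done
    printf '%s\n' "$candidate"
}
```

In the script, the copy line would then become cp -v "$FILE" "$(unique_target "$TARGETDIR" "${FILE##*/}")".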

Related Solutions

Linux – Batch-OCR many PDFs

I too have looked for a way to batch-OCR many PDFs in an automated manner, without much luck. In the end I have come up with a workable solution similar to yours, using Acrobat with a script as follows:

1. Copy all relevant PDFs to a specific directory.
2. Remove PDFs already containing text (assuming they are already OCRd or already text - not ideal I know, but good enough for now).
3. Use AutoHotKey to automatically run Acrobat, select the specific directory, and OCR all documents, appending "-ocr" to their filename.
4. Move the OCRd PDFs back to their original location, using the presence of a "-ocr.pdf" file to determine whether it was successful.

It is a bit Heath Robinson, but actually works pretty well.
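
For the last step, the check for a "-ocr.pdf" file could look roughly like this in bash. This is only a sketch: it assumes the OCR output sits next to the source file and is named by inserting "-ocr" before the extension, and check_ocr_results is a made-up helper name.

```shell
#!/bin/bash
# Sketch: for each source PDF in a directory, report whether a matching
# "-ocr.pdf" file exists next to it. check_ocr_results is a hypothetical name.
check_ocr_results() {
    local dir="$1" file
    for file in "$dir"/*.pdf; do
        [[ -e "$file" ]] || continue              # the directory has no PDFs
        [[ "$file" == *-ocr.pdf ]] && continue    # skip the OCR outputs themselves
        if [[ -e "${file%.pdf}-ocr.pdf" ]]; then
            echo "OCR OK: $file"
        else
            echo "OCR MISSING: $file"
        fi
    done
}

# Run only when a directory is given on the command line.
if [[ "$#" -eq 1 ]]; then
    check_ocr_results "$1"
fi
```

Files reported as "OCR MISSING" are the ones to look at again before anything is moved back.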

How to automatically find non-searchable PDFs

I'm not sure if this is a 100% solution, but I came up with the following script, which should get you a good part of the way if not the whole way (I have not gone through the spec). Pass it the directory that contains all the PDFs (it will search subdirectories).

#! /bin/bash

if [[ ! "$#" = "1" ]]
  then
      echo "Usage: $0 /path/to/PDFDirectory"
      exit 1
fi

PDFDIRECTORY="$1"

while IFS= read -r -d $'\0' FILE; do
    PDFFONTS_OUT="$(pdffonts "$FILE" 2>/dev/null)"
    RET_PDFFONTS="$?"
    FONTS="$(( $(echo "$PDFFONTS_OUT" | wc -l) - 2 ))"
    if [[ ! "$RET_PDFFONTS" = "0" ]]
      then
          READ_ERROR=1
          echo "Error while reading $FILE. Skipping..."
          continue
    fi
    if [[ "$FONTS" = "0" ]]
      then
          echo "NOT SEARCHABLE: $FILE"
      else
          echo "SEARCHABLE: $FILE"
    fi
done < <(find "$PDFDIRECTORY" -type f -name '*.pdf' -print0)

echo "Done."
if [[ "$READ_ERROR" = "1" ]]
  then
      echo "There were some errors."
fi

It works by counting the fonts specified in each PDF. If a file does not have any fonts, it is assumed to consist only of images. (This might trip up on password-protected files; I have no idea, as I don't have any to test against.) If a file contains some searchable content and some images, this won't work, but it will probably be useful for separating scanned image documents in a PDF container from "real" PDFs.
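
To illustrate the counting step: pdffonts prints two header lines (column names and a dashed rule) followed by one line per font, so anything past line 2 is a font. The sketch below uses hand-written sample text in place of real pdffonts output, and count_fonts is a hypothetical helper name.

```shell
#!/bin/bash
# Count font lines in pdffonts-style output read from stdin.
# Assumes the first two lines are the column header and its dashed rule.
count_fonts() {
    awk 'NR > 2 { n++ } END { print n + 0 }'
}

# Hand-written samples imitating the layout (not captured from a real run).
header='name                 type         emb sub uni
-------------------- ------------ --- --- ---'
scanned="$header"
with_text="$header
ABCDEF+Arial         TrueType     yes yes no"

printf '%s\n' "$scanned"   | count_fonts   # prints 0: treated as not searchable
printf '%s\n' "$with_text" | count_fonts   # prints 1: treated as searchable
```

With the real tool, the input would come from pdffonts "$FILE" instead of the sample variables.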

You can, of course, comment out the branch of the if-then-else statement that does not apply if you only want to print the files that are not searchable.
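
If you would rather keep the results as a file than read them off the terminal, the same loop can write a report instead. Here is a rough sketch, under a few assumptions of mine: the tab-separated layout, the write_report name and the status labels are illustrative, and -iname is used so that files ending in .PDF are also picked up.

```shell
#!/bin/bash
# Sketch: write a tab-separated report with one line per PDF (status, path).
# write_report and the status labels are illustrative, not from the script above.
write_report() {
    local pdf_dir="$1" report="$2" file out fonts
    : > "$report"                              # start with an empty report file
    while IFS= read -r -d '' file; do
        # A non-zero exit status from pdffonts means the file could not be read.
        if ! out="$(pdffonts "$file" 2>/dev/null)"; then
            printf 'ERROR\t%s\n' "$file" >> "$report"
            continue
        fi
        # Every line after the two header lines describes one font.
        fonts="$(printf '%s\n' "$out" | awk 'NR > 2 { n++ } END { print n + 0 }')"
        if [[ "$fonts" -eq 0 ]]; then
            printf 'NOT_SEARCHABLE\t%s\n' "$file" >> "$report"
        else
            printf 'SEARCHABLE\t%s\n' "$file" >> "$report"
        fi
    done < <(find "$pdf_dir" -type f -iname '*.pdf' -print0)
}

# Run only when both arguments are given: PDF directory and report file.
if [[ "$#" -eq 2 ]]; then
    write_report "$1" "$2"
fi
```

Saved as, say, pdf_report (a made-up name), it would be run as ./pdf_report /path/to/PDFDirectory report.tsv, and grep '^NOT_SEARCHABLE' report.tsv then lists only the image-only files with their full paths.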
