Bash – Delete .pdf files only if .xlsx files in directory have same filename

bash directory files rm shell

I have folders with hundreds of pdf and xls(x) files that were mass exported from legal e-discovery systems. The filenames in these exports correspond to Bates numbers, such as ABCD_00000001.pdf, ABCD_00000002.pdf, … , ABCD_00002000.pdf. These mass exports include a blank pdf file for every single xls(x) file, with both having exactly the same filename apart from the extension. E.g., ABCD_00000005.xlsx is the xlsx file that was produced in the e-discovery system and ABCD_00000005.pdf is an extraneous blank pdf file that was created in the mass export.

These extraneous .pdf files probably result from user error on the part of the people running these mass exports, but I don't usually have control over that side of the process. So I wanted to know if there is any relatively straightforward way to delete these extraneous .pdf files without forcing someone to go through them manually.

Best Answer

Loop over the pdf files, use parameter expansion to strip the .pdf extension, and delete the pdf only if an .xls or .xlsx file with the same name exists:

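#!/bin/bash
# Delete each PDF that has a same-named .xls or .xlsx file next to it.
for pdf in *.pdf ; do
    basename=${pdf%.pdf}    # strip the .pdf extension
    if [[ -f $basename.xls || -f $basename.xlsx ]] ; then
        rm -- "$pdf"        # -- guards against names starting with a dash
    fi
done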

Update: I got the logic backwards, should be fixed now.
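
Since rm is irreversible, it may be worth previewing the matches before deleting anything. The following is only a sketch of a dry-run variant of the loop above: the function name list_redundant_pdfs and its optional directory argument are made up for illustration and are not part of the original answer. It prints the pdf files that have a same-named .xls or .xlsx file and deletes nothing.

```bash
#!/bin/bash
# Hypothetical helper (name and argument are illustrative assumptions):
# print every PDF in a directory that has a same-named .xls or .xlsx
# file next to it. Nothing is deleted.
list_redundant_pdfs() {
    local dir="${1:-.}"     # directory to scan, defaults to the current one
    local pdf stem
    for pdf in "$dir"/*.pdf ; do
        [ -e "$pdf" ] || continue    # the glob matched nothing
        stem=${pdf%.pdf}             # strip the .pdf extension
        if [ -f "$stem.xls" ] || [ -f "$stem.xlsx" ] ; then
            printf '%s\n' "$pdf"
        fi
    done
}

# Example call (the path is a placeholder):
#   list_redundant_pdfs /path/to/export
```

If the printed list looks right, the original loop can then be run in that directory to do the actual deletion.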
