How to compare the contents of .pdf files, excluding filenames from comparison

Tags: comparison, diff, winmerge

I usually use WinMerge to view the differences between files, but in this case it doesn't help. The files I'm comparing are known to have different filenames, which creates false positives whenever two files containing the same document have different filenames.

I have a folder full of many directories representing all the vendors my company does business with, and they include many .pdf files of receipts & invoices. It's the master vendor list. The invoices & receipts are named such that the names don't make sense without the surrounding directory structure to provide context. For example here we have "Vendors/Company Foo/Product Bar/Invoice#3.pdf"

Then I have another folder with many receipts & invoices in it. It used to be maintained separately from the master vendor list, and was supposed to include a manually created copy of every receipt & invoice filed under the appropriate entry in the master vendor directory structure. These copies were supposed to be renamed so that the accountant can easily read them & know what they refer to. For example, here we have "Taxes/CompanyFoo ProductBar.pdf".

I've searched for files of type .pdf in the top-level folder of the master vendor list, so that my search results include receipts & invoices from all the vendors in the directory structure. Then I copied these .pdf files to another folder on my Desktop, so I can compare them. I compared those files to the files in the 'taxes' folder using WinMerge to see if any of the files in the 'taxes' folder don't exist in the 'master vendor' directories, and vice-versa.

But WinMerge counts files as different just because their filenames don't match. I need to know whether the file content is different, regardless of the filename.

There are hundreds of these files & if any are in the 'taxes' folder that aren't in their corresponding 'master vendor' directory, I need to rectify that & file them correctly.

Can someone recommend a tool that can do this?

Best Answer

If you have some kind of Unix environment available (if you're on Windows, I suggest Cygwin), you can easily find duplicate files below the current directory with something like this:

find . -type f -exec md5sum '{}' '+' | sort | uniq -D -w 32

The output is the MD5 checksum and name of every file that has at least one duplicate (same checksum). Because the list is sorted by checksum, duplicates appear on consecutive lines; -w 32 makes uniq compare only the first 32 characters of each line, which is the checksum. Replace the . after find with the path you want to search if it's not the current directory.
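
For the two-folder case in the question, the same pipeline can be pointed at both trees at once, so nothing needs to be copied to the Desktop first. This is a minimal sketch, assuming GNU find, md5sum, sort and uniq (as shipped with Cygwin and most Linux systems); the function name pdf_duplicates is hypothetical. Keep in mind that md5sum compares raw bytes, so a PDF that was re-saved or re-exported will not match its original even if it looks the same.

```shell
# Hypothetical helper: list PDFs with byte-identical content anywhere
# under the given directories (assumes GNU find, md5sum, sort, uniq).
pdf_duplicates() {
    find "$@" -type f -iname '*.pdf' -exec md5sum '{}' '+' |
        sort |
        uniq -w 32 --all-repeated=separate   # blank line between groups
}
```

Called as pdf_duplicates Vendors Taxes, it prints each group of identical PDFs together, with a blank line between groups, so a renamed copy in 'taxes' shows up next to its original in the vendor tree.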

Edit:

Conversely, to get the files that have no duplicates, you can use

find . -type f -exec md5sum '{}' '+' | sort | uniq -u -w 32

That prints only the files whose content has no duplicate anywhere below the current directory.
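
To answer the original question more directly (which files in 'taxes' have no counterpart in the 'master vendor' tree, and vice versa), the checksums of one tree can be filtered against those of the other. The following is only a sketch under the same GNU-tools assumption; missing_from is a hypothetical name, and filenames containing newlines or backslashes are not handled.

```shell
# Hypothetical helper: print checksum and name of every file under "$1"
# whose content (by MD5) does not occur anywhere under "$2".
missing_from() {
    {
        # Reference tree first, tagged "R"; then the tree being checked, tagged "S".
        find "$2" -type f -exec md5sum '{}' '+' | sed 's/^/R /'
        find "$1" -type f -exec md5sum '{}' '+' | sed 's/^/S /'
    } | awk '
        $1 == "R" { seen[$2] = 1; next }        # remember reference checksums
        !($2 in seen) { print substr($0, 3) }   # unmatched file: drop the tag
    ' | sort
}
```

Running missing_from Taxes Vendors would list the tax copies that were never filed under a vendor, and missing_from Vendors Taxes the vendor documents that have no copy in 'taxes'.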
