Md5sum on large files

bash find md5

Context:

I have a terabyte drive holding various types of large media files, ISO image files, etc. I would like to verify its contents by running md5sum on only the first megabyte of each file, for speed.

You can create a sum like this:

FILE=four_gig_file.iso
SUM=$(head -c 1M "$FILE" | md5sum)
printf "%s *%s\n" "${SUM%% *}" "$FILE" >>test.md5

How would you verify this, given that the first megabyte's signature is different from the whole file's?

I've seen this done in other languages, but I am wondering how to do it in Bash. I've experimented with various md5sum -c permutations involving pipes and whatnot.


Instead of using md5sum -c, would you have to recompute the hashes into a new file, then 'diff' them?

You can use something like

find /directory/path/ -type f -print0 | xargs -0 md5sum blah blah

to work on a large number of files.

PS: Rsync is not an option

UPDATE 2: As it stands, using head, find, and md5sum, one could create a checksum file from the source directory fairly quickly, compute the same on the destination, and then compare the two with diff. Are there clever one-liners or scripts for this?

Best Answer

Verifying contents by sampling only the first megabyte of each file will likely fail to detect corruption, damage, or alteration in the larger files. The hashing algorithm only sees one megabyte of data, while hundreds of other megabytes could be off. A single flipped bit inside the hashed data gives a different signature, but a flipped bit anywhere past the first megabyte goes unnoticed.
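To make that limitation concrete, here is a minimal sketch, assuming GNU coreutils (head -c 1M, dd with status=none). The file names original.bin and damaged.bin and the helper names partial_sum and full_sum are made up for the demonstration. It damages one byte past the first megabyte and compares both kinds of checksum.

```shell
#!/bin/sh
# Hypothetical demo: a change past the first megabyte is invisible
# to a first-megabyte checksum. Assumes GNU head, dd and md5sum.
set -eu

# MD5 of the first 1 MiB only (helper name is an assumption).
partial_sum() { head -c 1M "$1" | md5sum | cut -d' ' -f1; }
# MD5 of the whole file.
full_sum() { md5sum < "$1" | cut -d' ' -f1; }

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Build a 2 MiB test file and an identical copy.
head -c 2M /dev/zero > "$tmp/original.bin"
cp "$tmp/original.bin" "$tmp/damaged.bin"

# Overwrite a single byte at offset 1.5 MiB, outside the sampled region.
printf 'X' | dd of="$tmp/damaged.bin" bs=1 seek=1572864 conv=notrunc status=none

if [ "$(partial_sum "$tmp/original.bin")" = "$(partial_sum "$tmp/damaged.bin")" ]; then
    echo "first-megabyte sums match"
fi
if [ "$(full_sum "$tmp/original.bin")" != "$(full_sum "$tmp/damaged.bin")" ]; then
    echo "whole-file sums differ"
fi
```

Both messages are printed: the sampled checksum cannot tell the two files apart, while the whole-file checksum can.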

If data integrity is what you want to verify, you're better off with the CRC32 algorithm. It's faster than MD5. Although it is possible to forge or modify a file so that it appears to have the correct CRC32 signature, random corruption is very unlikely to do that.

Update:

Here's a one-liner that computes the MD5 checksum of the first megabyte of every file:

find ./ -type f -print0 | xargs -0 -n1 sh -c 'printf "%s\n" "$1" >> output.md5 && head -c 1M "$1" | md5sum >> output.md5' sh

Replace md5sum with cksum if you feel like it. Notice that I chose to include the filename in the output. That's because md5sum reads from a pipe here and never sees the filename, so it prints - in its place.
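For the compare-with-diff approach raised in the question, one possible sketch follows. It is not the one-liner above: it assumes GNU head and md5sum, writes one "HASH  ./relative/path" line per file instead of two lines, and sorts by path so that the two sides line up regardless of the order find returns. The function name make_manifest and the directory and file names in the usage comments are placeholders. File names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print "HASH  ./relative/path" for every regular
# file under the given directory, hashing only the first 1 MiB of each.
make_manifest() {
    (
        cd "$1" || exit 1
        # -exec ... {} + passes file names as arguments, so spaces and
        # quotes in names are safe; the inner loop hashes each file.
        find . -type f -exec sh -c '
            for f; do
                sum=$(head -c 1M "$f" | md5sum)
                # md5sum prints "HASH  -" for piped input; keep the hash only.
                printf "%s  %s\n" "${sum%% *}" "$f"
            done' sh {} + | LC_ALL=C sort -k 2
    )
}

# Example usage (placeholder paths):
#   make_manifest /path/to/source      > source.md5
#   make_manifest /path/to/destination > destination.md5
#   diff source.md5 destination.md5 && echo "first-megabyte sums match"
```

Because the paths are relative to each directory, the two manifests are directly comparable, and diff reports missing files as well as mismatched sums. Write the manifests outside the directories being scanned so they do not end up listing themselves.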
