How to compare parts of files by hash

bash hashing

I have one successfully downloaded file and another failed download (only the first 100 MB of a large file) which I suspect is the same file.

To verify this, I'd like to check their hashes, but since I only have a part of the unsuccessfully downloaded file, I only want to hash the first few megabytes or so.

How do I do this?

The OS is Windows, but I have Cygwin and MinGW installed.

Best Answer

Creating hashes to compare files makes sense if you compare one file against many, or when comparing many files against each other.

It does not make sense when comparing two files only once: the effort to compute the hashes is at least as high as reading both files and comparing them directly.

An efficient file comparison tool is cmp:

cmp --bytes $((100 * 1024 * 1024)) file1 file2 && echo "File fragments are identical"

You can also combine it with dd to compare arbitrary parts (not necessarily from the beginning) of two files, e.g.:

cmp \
    <(dd if=file1 bs=100M count=1 skip=1 2>/dev/null) \
    <(dd if=file2 bs=100M count=1 skip=1 2>/dev/null) \
&& echo "File fragments are identical"
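If hashes are still wanted (for instance, when the two files sit on different machines and only a short string can be exchanged), the following is a minimal sketch of hashing just the leading bytes. It assumes GNU head and sha256sum as shipped with Cygwin; the helper name hash_prefix and the demo files are hypothetical and created only for illustration.

```shell
#!/usr/bin/env bash
# hash_prefix FILE BYTES: print the SHA-256 of the first BYTES bytes of FILE.
# (hash_prefix is a hypothetical helper name, not a standard tool.)
hash_prefix() {
    head -c "$2" "$1" | sha256sum | cut -d' ' -f1
}

# Throwaway demo files: "partial" is a truncated copy of "full".
tmpdir=$(mktemp -d)
head -c 300000 /dev/urandom > "$tmpdir/full"
head -c 100000 "$tmpdir/full" > "$tmpdir/partial"

# Hash exactly as many bytes as the partial file contains.
size=$(( $(wc -c < "$tmpdir/partial") ))
full_hash=$(hash_prefix "$tmpdir/full" "$size")
partial_hash=$(hash_prefix "$tmpdir/partial" "$size")

if [ "$full_hash" = "$partial_hash" ]; then
    echo "First $size bytes are identical"
else
    echo "First $size bytes differ"
fi

rm -rf "$tmpdir"
```

Taking the byte count from the partial file avoids hard-coding a size and ensures both hashes cover the same range.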

Related Solutions

Md5sum on large files

Verifying contents by sampling only the first megabyte of a file will likely miss corruption, damage, or alteration in the larger files. The hashing algorithm sees only one megabyte of data, while hundreds of other megabytes could be off. Even one bit in the wrong position would give a different signature, but only if that bit lies inside the hashed portion.

If data integrity is what you want to verify, you're better off with the CRC32 algorithm. It's faster than MD5. Although it is possible to forge or modify a file so that it appears to have the correct CRC32 signature, random corruption is very unlikely to do that.
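As a rough illustration of the CRC approach, here is a sketch using POSIX cksum. Note the assumption: cksum computes a 32-bit CRC, but its values are not the same as those of zlib-style CRC32 tools, so only compare cksum output with other cksum output. The demo files are hypothetical.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
printf 'hello world\n' > "$tmpdir/good"
cp "$tmpdir/good" "$tmpdir/copy"
# Same length as "good", but one bit differs ('w' versus 'W').
printf 'hello World\n' > "$tmpdir/damaged"

# Reading from stdin, cksum prints "<CRC> <byte count>"; keep only the CRC.
crc_good=$(cksum < "$tmpdir/good" | cut -d' ' -f1)
crc_copy=$(cksum < "$tmpdir/copy" | cut -d' ' -f1)
crc_damaged=$(cksum < "$tmpdir/damaged" | cut -d' ' -f1)

[ "$crc_good" = "$crc_copy" ] && echo "copy: CRC matches"
[ "$crc_good" != "$crc_damaged" ] && echo "damaged: CRC differs"

rm -rf "$tmpdir"
```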

Update:

Here's a nice one-liner to do the 1 megabyte based md5 checksum on every file:

find . -type f ! -name output.md5 -exec sh -c 'printf "%s\n" "$1" >> output.md5 && head -c 1M "$1" | md5sum >> output.md5' sh {} \;

Replace md5sum with cksum if you feel like it. Notice that I chose to include the filename in the output. That's because md5sum reads the data from standard input here, so it never sees the filename and cannot print it.

Windows – How to make sure a downloaded .iso matches a hash value

If you're using Windows, you can download a utility such as WinHasher, which generates various types of checksums for your file. To verify the integrity of your file, compare the checksum to the one published on the site you downloaded the software from. If they match, you're good; if not, the file is corrupt, has been tampered with, or differs from the published file in some other way.

To get the MD5 sum using native utilities in Linux, use the md5sum command like so: md5sum <liveCDname>.iso and compare the result to the one you found online. Alternatively, if there is a file such as MD5SUMS available for download on the server, you can download it to the same directory as the ISO and run md5sum -c MD5SUMS.
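The md5sum -c workflow can be tried out without a real download. The sketch below assumes GNU md5sum (available under Cygwin); demo.iso is a hypothetical stand-in file, and the MD5SUMS list is generated locally instead of being fetched from a server.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Stand-in for a downloaded image and the checksum list published with it.
printf 'pretend this is an ISO image\n' > demo.iso
md5sum demo.iso > MD5SUMS

# An intact file verifies: prints "demo.iso: OK" and exits with status 0.
if md5sum -c MD5SUMS; then intact_status=0; else intact_status=1; fi

# Simulate corruption by appending a byte; verification now fails.
printf 'x' >> demo.iso
if md5sum -c MD5SUMS 2>/dev/null; then damaged_status=0; else damaged_status=1; fi

echo "intact: $intact_status, damaged: $damaged_status"
```

When a published MD5SUMS file lists images you did not download, recent GNU coreutils versions accept md5sum -c --ignore-missing MD5SUMS to skip the absent entries.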
