How to check if two gzipped files are equal

file-comparison gzip

I am trying to save space while doing a "dumb" backup by simply dumping data into a text file. My backup script is executed daily and looks like this:

Create a directory named after the backup date.
Dump some data into a text file "$name".
If the file is valid, gzip it: gzip "$name". Otherwise, rm "$name".

Now I want to add a step that removes a file if it contains the same data as the previous day's file (and creates a symlink or hard link in its place).

At first I thought of using md5sum "$name", but this does not work because the gzipped file also stores the file name and a timestamp.

Does gzip have an option to compare two gzipped files and tell me whether they are equal or not? If gzip does not have such an option, is there another way to achieve my goal?

Best Answer

@derobert's answer is great, though I want to share some other information that I have found.

gzip -l -v

Gzip-compressed files already contain a checksum (a CRC-32, which is not cryptographically secure; see this SO post):

$ echo something > foo
$ gzip foo
$ gzip -v -l foo.gz 
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 18b1f736 Feb  8 22:34                  34                  10 -20.0% foo

One can combine the CRC and uncompressed size to get a quick fingerprint:

gzip -v -l foo.gz | awk 'NR==2 {print $2, $7}'
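
As a rough sketch, that fingerprint could be used to compare two archives as shown below. The function names gz_fingerprint and gz_same_fingerprint are made up for this example, and it assumes the GNU gzip column layout shown above.

```shell
# Print "<crc> <uncompressed size>" for one gzip file.
# Assumes the `gzip -v -l` layout shown above: data on the second
# line, CRC in field 2 and uncompressed size in field 7.
gz_fingerprint() {
    LC_ALL=C gzip -v -l -- "$1" | awk 'NR==2 {print $2, $7}'
}

# Succeed when both files report the same CRC and uncompressed size.
gz_same_fingerprint() {
    [ "$(gz_fingerprint "$1")" = "$(gz_fingerprint "$2")" ]
}
```

A matching fingerprint only makes equality very likely, since a CRC-32 can collide; a differing fingerprint means the contents differ.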

cmp

To check whether two files are byte-for-byte equal, use cmp file1 file2. A gzipped file consists of a header, the compressed data, and a footer (CRC plus original size). The description of the gzip format shows that the 10-byte header contains a timestamp (the modification time of the original file, or the compression time if the input did not come from a file) and that the file name is stored as a nul-terminated string right after the header.

So, assuming that the file name is constant and the same command (gzip "$name") is used, one can check whether two files differ by using cmp and skipping the first 8 bytes, which include the timestamp:

cmp -i 8 file1 file2

Note: the assumption that the same compression options are used is important; otherwise the command will always report the files as different. This happens because the compression options are stored in the header and may affect the compressed data. cmp just looks at raw bytes and does not interpret them as gzip.
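
A different way to sidestep the header differences, not part of the approach above but a standard gzip option: gzip -n (--no-name) leaves the original file name and timestamp out of the header. Assuming both files are produced by the same gzip version with the same options, identical input should then give byte-identical archives, so plain cmp (or md5sum) is enough. The file names below are placeholders.

```shell
# Work in a scratch directory; the file names are placeholders.
dir=$(mktemp -d)
printf 'same data\n' > "$dir/today.txt"
printf 'same data\n' > "$dir/yesterday.txt"

# -n: do not store the original name or timestamp in the gzip header.
gzip -n "$dir/today.txt" "$dir/yesterday.txt"

# -s: stay silent and report the result only through the exit status.
if cmp -s "$dir/today.txt.gz" "$dir/yesterday.txt.gz"; then
    echo "identical"
else
    echo "different"
fi
```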

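If the file names have the same length, you can calculate the number of bytes to skip so that the comparison starts after the file name: 10 header bytes, plus the length of the name, plus 1 for the terminating nul. When the file names have different lengths, skip a different number of bytes in each file. For example, with stored names of 1 and 2 bytes: cmp <(tail -c +13 file1) <(tail -c +14 file2). Use tail -c rather than cut -b here, because cut works line by line and would mangle binary data.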
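
A minimal sketch of that calculation, assuming both files were produced by plain gzip "$name", so that the header is the 10 fixed bytes followed only by the nul-terminated file name (no extra field, comment, or header CRC). The function name gz_cmp_skip_header is made up for this example, and -i SKIP1:SKIP2 is a GNU cmp option.

```shell
# Compare two gzip files while ignoring each file's header.
# Usage: gz_cmp_skip_header file1.gz name1 file2.gz name2
# name1/name2 are the original base names stored in the headers.
gz_cmp_skip_header() {
    # Header size = 10 fixed bytes + file name bytes + 1 nul terminator.
    # wc -c counts bytes, which is what matters for non-ASCII names.
    skip1=$(( 10 + $(printf '%s' "$2" | wc -c) + 1 ))
    skip2=$(( 10 + $(printf '%s' "$4" | wc -c) + 1 ))
    # -s: silent; -i A:B: skip A bytes of the first file, B of the second.
    cmp -s -i "${skip1}:${skip2}" -- "$1" "$3"
}
```

For example, gz_cmp_skip_header day1/dump.gz dump day2/dump-new.gz dump-new exits with status 0 when the compressed data and footer are identical.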

zcmp

This is definitely the best way to go. It first decompresses the data and then compares the bytes with cmp (really, this is what the zcmp (zdiff) shell script does).

Do not be put off by the following note in the manual page:

When both files must be uncompressed before comparison, the second is uncompressed to /tmp. In all other cases, zdiff and zcmp use only a pipe.

With a sufficiently new Bash, decompression does not use a temporary file, just a pipe. Or, as the zdiff source says:

# Reject Solaris 8's buggy /bin/bash 2.03.
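
Putting this together for the backup script in the question, the extra step could look roughly like the sketch below. The function name dedup_against and its arguments are hypothetical; it assumes zcmp from the gzip package is installed and that both files live on the same filesystem (a requirement for hard links).

```shell
# Replace today's archive with a hard link to yesterday's when the
# uncompressed contents are identical.
# Usage: dedup_against yesterday/name.gz today/name.gz
dedup_against() {
    prev=$1
    cur=$2
    # Nothing to compare against (for example, on the first run).
    [ -f "$prev" ] || return 0
    # zcmp passes -s on to cmp and exits 0 only when the
    # decompressed data is identical.
    if zcmp -s "$prev" "$cur"; then
        # -f removes the existing file and puts the link in its place.
        ln -f -- "$prev" "$cur"
    fi
}
```

Use ln -sf instead if a symlink is preferred, keeping in mind that a relative link target is resolved from the directory containing the link.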

Related Solutions

Bash – How to check if a gzipped file is empty

gzip -l foo.gz | awk 'NR==2 {print $2}' prints the size of the uncompressed data.

if LC_ALL=C gzip -l foo.gz | awk 'NR==2 {exit($2!=0)}'; then
  echo foo is empty
else
  echo foo is not empty
fi

Alternatively you can start uncompressing the data.

if [ -n "$(gunzip <foo.gz | head -c 1 | tr '\0\n' __)" ]; then
    echo "foo is not empty"
else
    echo "foo is empty"
fi

(If your system doesn't have head -c to extract the first byte, use head -n 1 to extract the first line instead.)

How to convert existing gz (gzip) files to rsyncable

#! /bin/bash

set -euo pipefail

##  TOKEN's modification time marks the last recompression run
TOKEN=.lastRecompression   

if [ -f "${TOKEN}" ]
then
    # Only files whose status changed after the token was last touched.
    find . -name '*.gz' -cnewer "${TOKEN}"
else
    # Process all compressed files if there is no token.
    find . -name '*.gz'
fi | while IFS= read -r f
do
    # Do it in two steps
    gunzip < "$f" | gzip --rsyncable > "$f.tmp"

    # Preserve attributes
    cp --attributes-only --preserve=all "$f" "$f.tmp"

    # and rename atomically.
    # set -e ensures that a problem in the previous step 
    # will stop the full script. 
    mv -v "$f.tmp" "$f"
done

# Update the token
touch "${TOKEN}"