I am trying to save space while doing a "dumb" backup by simply dumping data into a text file. My backup script is executed daily and looks like this:
- Create a directory named after the backup date.
- Dump some data into a text file `"$name"`.
- If the file is valid, gzip it: `gzip "$name"`. Otherwise, `rm "$name"`.
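The steps above could be sketched roughly as follows. This is only an illustration: `dump_data` and `is_valid` are hypothetical placeholders for the real dump command and validity check, and the file name `dump.txt` is an assumption.

```shell
# Sketch of the daily backup. dump_data and is_valid are hypothetical
# placeholders that the real script would have to provide.
run_backup() {
    dir=$1/$(date +%Y-%m-%d)    # directory named after the backup date
    name=$dir/dump.txt          # assumed file name for the dump
    mkdir -p -- "$dir" || return 1
    dump_data > "$name"         # dump some data into a text file
    if is_valid "$name"; then
        gzip -- "$name"         # creates "$name.gz" and removes "$name"
    else
        rm -- "$name"
    fi
}
```

It would be called with the backup root as its only argument, for example `run_backup /srv/backup` (a made-up path).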
Now I want to add an additional step that removes a file if the same data was already available the day before (and creates a symlink or hard link instead).
At first I thought of using `md5sum "$name"`, but this does not work because I also store the filename and creation date.
Does `gzip` have an option to compare two gzipped files and tell me whether they are equal or not? If `gzip` does not have such an option, is there another way to achieve my goal?
Best Answer
@deroberts answer is great, though I want to share some other information that I have found.
gzip -l -v
gzip-compressed files already contain a hash (not a secure one, though; see this SO post):
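As an illustration, the stored CRC can be displayed with `gzip -l -v`. The wrapper name `show_gzip_info` is made up for this sketch; the `crc` column of the listing holds the CRC-32 of the uncompressed data.

```shell
# Print gzip's verbose listing for one compressed file. The listing
# has a header line followed by one line per file, including the
# "crc" and "uncompressed" columns.
show_gzip_info() {
    gzip -l -v -- "$1"
}
```

For example, `show_gzip_info backup.gz`, where `backup.gz` stands for any gzip file.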
One can combine the CRC and uncompressed size to get a quick fingerprint:
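A minimal sketch of such a fingerprint, assuming the column layout of GNU gzip's `-l -v` listing, where the CRC is the second field and the uncompressed size is the seventh. The function name `gzip_fingerprint` is hypothetical.

```shell
# Print "<crc> <uncompressed size>" for a single gzip file.
# Assumes the GNU gzip listing layout:
#   method crc month day time compressed uncompressed ratio name
gzip_fingerprint() {
    # Line 1 of the listing is the header; line 2 describes the file.
    gzip -l -v -- "$1" | awk 'NR == 2 { print $2, $7 }'
}
```

Two files with different fingerprints certainly hold different data. Equal fingerprints only make equality likely, since CRC-32 is not collision-resistant.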
cmp
For checking whether two files are byte-for-byte equal, use `cmp file1 file2`. Now, a gzipped file has a header, followed by the data, with a footer (CRC plus original size) appended. The description of the gzip format shows that the header contains a timestamp (normally the modification time of the original file) and that the file name is a nul-terminated string that is appended after the 10-byte header.
So, assuming that the file name is constant and the same command (`gzip "$name"`) is used, one can check whether two files are different by using `cmp` and skipping the first 8 bytes, which include the time.
Note: the assumption that the same compression options are used is important; otherwise the command will always report the files as different. This happens because the compression options are stored in the header and may affect the compressed data.
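The byte skipping described above might look like this with GNU `cmp`, whose `-i` (`--ignore-initial`) option skips a number of leading bytes in both files. The function name `same_gzip_payload` is made up for this sketch, and it relies on the assumptions already stated: the same stored file name and the same gzip options.

```shell
# Succeed if two gzip files are identical apart from their first
# 8 bytes (magic number, method, flags and the 4-byte timestamp).
# Assumes GNU cmp; -s suppresses output, -i 8 skips 8 bytes in both files.
same_gzip_payload() {
    cmp -s -i 8 -- "$1" "$2"
}
```

A call such as `same_gzip_payload yesterday/dump.txt.gz today/dump.txt.gz` (hypothetical paths) returns exit status 0 when the files match after the skipped bytes.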
`cmp` just looks at raw bytes and does not interpret them as gzip.
If you have filenames of the same length, then you could try to calculate the bytes to be skipped after reading the filename. When the filenames are of different lengths, you could run `cmp` after skipping a different number of bytes in each file, like `cmp <(tail -c +9 file1) <(tail -c +10 file2)`.
zcmp
This is definitely the best way to go: it first decompresses the data and then compares the bytes with `cmp` (really, this is what the `zcmp` (`zdiff`) shell script does).
One note: do not be afraid of the note in the manual page about using a temporary file.
When you have a sufficiently new Bash, decompression will not use a temporary file, just a pipe. The `zdiff` source says as much.
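Applied to the backup script from the question, the `zcmp` approach could be sketched like this. The function name `dedup_backup` and its arguments are hypothetical; `-s` is passed through to `cmp` so that only the exit status is used.

```shell
# Replace today's gzip file with a hard link to yesterday's file when
# both decompress to the same data.
#   $1: yesterday's .gz file    $2: today's .gz file
dedup_backup() {
    # zcmp decompresses both files and compares the result with cmp.
    if zcmp -s "$1" "$2"; then
        ln -f -- "$1" "$2"      # -f replaces today's file with the link
    fi
}
```

Because the comparison runs on the decompressed data, differences in the stored timestamp or file name do not matter here. A symlink could be created with `ln -sf` instead, at the cost of today's entry breaking if yesterday's file is ever removed.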