How to get 100% identical compressed files, for source files that only differ in creation date

7-zip, archiving, compression

I want to be able to compress a file losslessly, and if the original file is identical to another user's file, I want both of our compressed files to match, even if the original file dates are different.

I want to use a maximum of 1GB of RAM while compressing. I'm leaning towards an asymmetric algorithm because the files I have are fairly large, and they take at least an hour to compress with LZMA1 "ultra" in 7-zip on a P4 machine with 1GB RAM and nothing else running. I think 7-zip and FreeARC can be used for my purposes. I've tried to find the commands I should be using, but I'm not having much luck.

Edit: 100% identical files should be produced even if the creation dates differ. This should be possible with --nodates in FreeArc, and with ???? in 7-zip. I'm looking for an equivalent command for 7-zip, and a way to standardize compression across multiple computers.

Best Answer

Create a couple of identical files:

$ echo hello > file1.test
$ echo hello > file2.test

gzip them...

$ gzip file1.test
$ gzip file2.test

Observe that the timestamp field (shown below as TIME STMP) is the only difference, apart from the stored file name:

$ hexdump file1.test.gz

0000000 8b1f 0808 TIME STMP 0300 6966 656c 2e31
0000010 6574 7473 cb00 cd48 c9c9 02e7 2000 3a30
0000020 0636 0000 0000                         

For more info on the timestamp (the MTIME field, bytes 4 to 7 of the header), see the gzip RFC, RFC 1952.
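To check the timestamp of a given archive without reading a hexdump, the field can be pulled out directly. This is only a sketch: gz_mtime is a made-up helper name, and it assumes a little-endian machine, because od interprets multi-byte integers in host byte order while gzip stores MTIME little-endian.

```shell
# gz_mtime is a hypothetical helper, not part of gzip.
# Prints the MTIME header field of a gzip file as seconds since the Unix epoch.
gz_mtime() {
    # -j4: skip the 4 bytes before MTIME (magic, method, flags)
    # -N4: read exactly 4 bytes
    # -tu4: print them as one unsigned 32-bit integer (host byte order,
    #       so this assumes a little-endian machine)
    od -An -tu4 -j4 -N4 "$1" | tr -d ' '
}
```

A value of 0 means that no timestamp was stored.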

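Now, you can either take an MD5 that starts after byte 8 (skipping the part of the header that holds the timestamp), zero these four bytes in your files and lose their timestamps, or extract the CRC-32 from those gzips (see the RFC for how to extract it). Keep in mind that gzip also stores the original file name right after the header by default, so archives of differently named files will still differ there.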
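As a rough illustration of the checksum option: per RFC 1952, the last 8 bytes of a single-member gzip file are the CRC-32 of the uncompressed data followed by its size, both little-endian. The helper name gz_crc32 below is made up for this sketch, and it assumes the file has exactly one gzip member.

```shell
# gz_crc32 is a hypothetical helper, not part of gzip.
# Prints the CRC-32 stored in the gzip trailer as 8 hex digits.
gz_crc32() {
    # Last 8 bytes = CRC-32 + ISIZE; keep the first 4 (the CRC-32),
    # dump them as hex bytes, then reverse the little-endian byte order.
    tail -c 8 "$1" | head -c 4 | od -An -tx1 | awk '{print $4 $3 $2 $1}'
}

# Two archives hold the same data if their stored CRC-32 values match
# (with the usual caveat that a 32-bit checksum can collide).
same_payload() {
    [ "$(gz_crc32 "$1")" = "$(gz_crc32 "$2")" ]
}
```

For the hello file in the hexdump above, gz_crc32 should print 363a3020, which corresponds to the bytes 20 30 3a 36 near the end of the dump. Because the CRC-32 covers only the uncompressed data, it is unaffected by the timestamp and the stored file name.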

Or, you could save without the timestamp (gzip -n also leaves out the original file name):

$ echo test > file1.test
$ echo test > file2.test
$ gzip -n file1.test
$ gzip -n file2.test
$ md5sum file1.test.gz
cfe4ddf1c4c3891b4ff4a1269b42db82  file1.test.gz
$ md5sum file2.test.gz
cfe4ddf1c4c3891b4ff4a1269b42db82  file2.test.gz
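To confirm that this holds when the dates really are different, the same experiment can be repeated with modification times forced apart. This is a sketch under a few assumptions: the dates are arbitrary, and both files are compressed by the same gzip build at the same compression level. Across machines, a different gzip version or level may produce different bytes for the same input, so those need to be kept the same as well.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'test\n' > file1.test
printf 'test\n' > file2.test

# Force clearly different modification times on the two inputs
# (arbitrary dates chosen for illustration).
touch -t 200001010000 file1.test
touch -t 201001010000 file2.test

# -n: store neither the original file name nor the timestamp.
gzip -n file1.test
gzip -n file2.test

# cmp -s exits 0 only if the two archives are byte-for-byte identical.
if cmp -s file1.test.gz file2.test.gz; then
    result=identical
else
    result=different
fi
echo "$result"
```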