I am trying to determine whether there are any potential issues with using `bzip2` to compress files that need to be 100% reproducible. Specifically: can metadata (name, inode, last-modified date, etc.) or anything else cause identical file contents to produce a different checksum in the resulting `.bz2` archive?
As an example, `gzip` is not deterministic by default unless `-n` is used.
My crude tests so far suggest that `bzip2` does indeed consistently produce identical files given identical input data (regardless of metadata, platform, filesystem, etc.), but it would be nice to have more than anecdotal evidence.
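For reference, a minimal version of such a test (file names here are purely illustrative) can be run in a shell, comparing `gzip` with and without `-n` against `bzip2`:

```shell
# Two files with identical contents but different names and timestamps.
printf 'identical content\n' > first.txt
printf 'identical content\n' > second.txt
touch -t 202001010000 second.txt

# gzip without -n stores the input file's timestamp (and name) in the
# header, so the checksums differ:
gzip -c first.txt  | sha256sum
gzip -c second.txt | sha256sum

# gzip -n omits the name and timestamp, and the checksums match:
gzip -nc first.txt  | sha256sum
gzip -nc second.txt | sha256sum

# bzip2 matches with no special flags at all:
bzip2 -c first.txt  | sha256sum
bzip2 -c second.txt | sha256sum
```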
Best Answer
`bzip2` files only contain basic format signatures, compressed data, and the information needed to decompress that data. They don't contain any file-specific metadata; instead, they rely on the compressed file's metadata (thus `file.bz2` is uncompressed to `file`, with the timestamps of `file.bz2`, regardless of the original file name and original timestamps).

There is one part of the compression that can vary, the input randomisation; but that has been disabled in practice for a long time, and current versions of `bzip2` don't randomise their input.

As a result, the output of `bzip2` only depends on the input data and the compression level. The output is deterministic.

I'm not sure you'll find an authoritative source for all this; the best evidence I can offer is the absence of any mention of `bzip2` in the Debian reproducible builds notes. `bzip2` is used in Debian, so if it did cause issues it would get a mention, in the same way `gzip` does.
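As a quick illustration of the "input data and compression level" point: repeated runs at the same level are byte-identical, while different levels produce different output, not least because the block size (the `-1` to `-9` level) is recorded in the `BZh` header itself. The file name below is illustrative:

```shell
# Some sample input data.
seq 1000 > sample.txt

# Same input, same level: identical output every time.
bzip2 -9c sample.txt | sha256sum
bzip2 -9c sample.txt | sha256sum

# A different level changes the recorded block size (and possibly the
# compressed stream), so the output differs.
bzip2 -1c sample.txt | sha256sum
```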