checksum bzip2 reproducible-build – Are Files Compressed with bzip2 Deterministic?


I am trying to determine whether there are any potential issues with using bzip2 to compress files that need to be 100% reproducible. Specifically: can metadata (file name, inode, last-modified date, etc.) or anything else cause identical file contents to produce a different checksum on the resulting .bz2 archive?

As an example, gzip is not deterministic by default: it stores the original file name and timestamp in its header unless -n is used.
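To make the gzip behaviour concrete, here is a minimal sketch using Python's standard gzip module rather than the gzip command itself. The payload and the two timestamps are made-up values for illustration. The mtime argument of gzip.compress controls the timestamp field written into the header, so two runs over identical data differ only because of that field.

```python
import gzip
import struct

# Hypothetical payload; any bytes would do.
data = b"identical file contents\n" * 100

# Same payload, two different header timestamps (arbitrary example values).
a = gzip.compress(data, mtime=1_000_000_000)
b = gzip.compress(data, mtime=1_000_000_001)

# Bytes 4..8 of a gzip member hold MTIME as a little-endian 32-bit integer.
mtime_a = struct.unpack("<I", a[4:8])[0]
mtime_b = struct.unpack("<I", b[4:8])[0]

# The archives differ even though the input is identical.
differs = a != b

# Pinning the timestamp (0 means "no timestamp available") makes the
# output repeatable.
pinned_equal = gzip.compress(data, mtime=0) == gzip.compress(data, mtime=0)

print(differs, pinned_equal, mtime_a, mtime_b)
```

Pinning mtime here is roughly what -n achieves for the timestamp on the command line; gzip.compress does not store a file name in the first place.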

My crude tests so far suggest that bzip2 does indeed consistently produce identical files given identical input data (regardless of metadata, platform, filesystem, etc), but it would be nice to have more than anecdotal evidence.

Best Answer

bzip2 files contain only basic format signatures, the compressed data, and the information needed to decompress that data. They don’t contain any file-specific metadata; instead, they rely on the compressed file’s own metadata (thus file.bz2 is decompressed to file, with the timestamps of file.bz2, regardless of the original file name and original timestamps).
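The start of a bzip2 stream can be inspected directly to see how little it carries. The sketch below uses Python's bz2 module (a wrapper around libbz2) on a made-up payload; it is an illustration of the layout, not a full parser.

```python
import bz2

# Hypothetical non-empty payload.
data = b"identical file contents\n" * 100
blob = bz2.compress(data, compresslevel=9)

signature = blob[:3]      # b"BZh": format magic plus the Huffman-coding marker
level_digit = blob[3:4]   # b"1".."9": block size in units of 100 kB
block_magic = blob[4:10]  # 0x314159265359 marks the start of the first block

print(signature, level_digit, block_magic.hex())
```

The stream header is only those four bytes, and the first compressed block follows immediately. There is no field for a file name or a timestamp between them.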

There is one part of the compression that can vary, the input randomisation; but that has been disabled in practice for a long time, and current versions of bzip2 don’t randomise their input.

As a result, for a given bzip2 implementation, the output depends only on the input data and the compression level (which sets the block size). The output is deterministic.
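A quick way to check this yourself is to compress the same bytes repeatedly and compare checksums. The following is a minimal sketch using Python's bz2 module, which wraps libbz2; the helper name bz2_sha256 and the payload are hypothetical, and results from a different implementation (for example a parallel compressor) are not covered by it.

```python
import bz2
import hashlib


def bz2_sha256(data: bytes, level: int = 9) -> str:
    """Return the SHA-256 hex digest of the bzip2-compressed data."""
    return hashlib.sha256(bz2.compress(data, compresslevel=level)).hexdigest()


# Hypothetical payload.
data = b"identical file contents\n" * 1000

# Same input and same level: the compressed bytes, and so the checksum, match.
same_level = bz2_sha256(data, 9) == bz2_sha256(data, 9)

# Same input, different level: the output differs, if only because the
# level digit is recorded in the stream header.
other_level = bz2_sha256(data, 9) == bz2_sha256(data, 1)

print(same_level, other_level)
```

For a reproducible build, this suggests pinning the compression level (and the bzip2 implementation) along with the input.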

I’m not sure you’ll find an authoritative source for all this; the best evidence I can offer is the absence of any mention of bzip2 in the Debian reproducible builds notes. bzip2 is used in Debian, so if it did cause issues it would get a mention, in the same way gzip does.
