Rsync Compression – Is the Rsync Block Compression Dictionary Reset for Each File?

compression rsync

Does rsync -z carry its compression dictionary over from one file to the next, or is the dictionary reset for each file so that every file is compressed independently?

As an example, consider a compressible file one.txt and its identical copy being transferred to a remote server, where neither file yet exists:

cp -p one.txt two.txt
rsync -az one.txt two.txt remote:

Does the zlib compression layer treat one.txt and two.txt independently, or is the data transfer at that level simply a continuous stream, so it will have learned a useful compression dictionary for one.txt that it can apply to two.txt?

Alternatively, have I completely misunderstood the zlib compression algorithm, such that (for example) the dictionary is always reset for each new block?

I've tried looking at the rsync debug output (rsync -avvvvz --debug=IO1,IO2,IO3,IO4 --msgs2stderr), but I can't see anything that specifically relates to the compression layer.

(This is following up a comment thread on an answer of mine on ServerFault.)

Best Answer

rsync uses compression in token.c, and seemingly only there. It maintains deflate stream state in the tx_strm variable. In send_deflated_token, whenever the previous token is -1, it initializes the stream on first use and resets it on every later use:

        if (last_token == -1) {
                /* initialization */
                if (!init_done) {
                        tx_strm.next_in = NULL;
                        tx_strm.zalloc = NULL;
                        tx_strm.zfree = NULL;
                        if (deflateInit2(&tx_strm, compression_level,
                                         Z_DEFLATED, -15, 8,
                                         Z_DEFAULT_STRATEGY) != Z_OK) {
                                rprintf(FERROR, "compression init failed\n");
                                exit_cleanup(RERR_PROTOCOL);
                        }
                        if ((obuf = new_array(char, OBUF_SIZE)) == NULL)
                                out_of_memory("send_deflated_token");
                        init_done = 1;
                } else
                        deflateReset(&tx_strm);

This is used from match.c, via the matched function, which is called by hash_search and match_sums. These functions always ensure that they finish their processing with a call which leaves last_token set to -1, so that the next call will reset the deflate stream. All this is done file-by-file, so the deflate stream is always reset at the start of each file.

This means that the block compression dictionary is guaranteed to be reset for each file; it might be reset more often.
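The effect of such a reset can be illustrated outside rsync. The following is only a rough sketch using Python's zlib module, not rsync's code: the function name deflate_sizes, the payload, and compression level 6 are assumptions made for the illustration, and only the raw-deflate window setting (-15) is taken from the excerpt above. It compresses two identical "files" once with fresh deflate state per file and once as a single continuous stream:

```python
import os
import zlib


def deflate_sizes(files, reset_per_file):
    """Return the compressed size of each file, using raw deflate (wbits=-15).

    Hypothetical helper for illustration; this is not how rsync is structured.
    """
    sizes = []
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    for data in files:
        if reset_per_file:
            # Fresh state for every file: same effect on the history
            # window as calling deflateReset() between files.
            comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
        out = comp.compress(data)
        # Z_SYNC_FLUSH pushes out all pending output without ending the
        # stream, so the history window survives if we do not reset.
        out += comp.flush(zlib.Z_SYNC_FLUSH)
        sizes.append(len(out))
    return sizes


# Assumed test payload: random bytes rendered as hex text. It is
# compressible (16 symbols), has no long internal repeats, and at
# 16 KiB fits inside deflate's 32 KiB window.
payload = os.urandom(8192).hex().encode()
files = [payload, payload]

independent = deflate_sizes(files, reset_per_file=True)
continuous = deflate_sizes(files, reset_per_file=False)

print("reset per file:       ", independent)
print("one continuous stream:", continuous)
```

With a per-file reset, both copies should compress to the same size. In the continuous case the second copy should shrink to a small fraction of that, because it can be encoded as back-references into the first. Per the analysis above, rsync behaves like the first case.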

If rsync were to use data from previous files, it might be more interesting to extend its hash handling across files.

You can verify all this experimentally by syncing multiple copies of compressible files, as you suggest; the stats always show that the transferred size is equal to the compressed size of a single file, multiplied by the number of copies, so there is no de-duplication of any kind across files.