Does rsync -z compress the transfer as one continuous stream, carrying the compression dictionary over from one file to the next, or is the dictionary reset for each file so that each file is handled independently?
As an example, consider a compressible file one.txt and its identical copy being transferred to a remote server, where neither file yet exists:
cp -p one.txt two.txt
rsync -az one.txt two.txt remote:
Does the zlib compression layer treat one.txt and two.txt independently, or is the data transfer at that level simply a continuous stream, so that it will have learned a useful compression dictionary from one.txt that it can apply to two.txt?
Alternatively, have I completely misunderstood the zlib compression algorithm, such that (for example) the dictionary is always reset for each new block?
I've tried looking at the rsync debug output (rsync -avvvvz --debug=IO1,IO2,IO3,IO4 --msgs2stderr) but I can't see anything that specifically relates to the compression layer.
(This is following up a comment thread on an answer of mine on ServerFault.)
Best Answer
rsync uses compression in token.c, and seemingly only there. It maintains deflate stream state in the tx_strm variable, and resets the stream state in send_deflated_token if the previous token is -1.
This is used from match.c, via the match function, used by hash_search and match_sums. These functions always ensure that they finish their processing with a call which leaves last_token set to -1, so that the next call will reset the deflate stream. All this is done file-by-file, so the deflate stream is always reset at the start of each file.
This means that the block compression dictionary is guaranteed to be reset for each file; it might be reset more often.
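
To get a feel for what a per-file reset costs, the following is a rough sketch that uses gzip as a stand-in for a deflate stream. It is not rsync's code path; the generated file contents (seq output) and the temporary directory are assumptions made purely for illustration. It compares compressing one.txt and two.txt separately against compressing them as one continuous stream, where the second copy can be encoded as back-references into the first as long as the file fits inside deflate's 32 KiB window.

```shell
#!/bin/sh
# Illustration only: gzip stands in for a deflate stream; this is not how
# rsync itself is invoked or implemented.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical test data: a compressible file smaller than the 32 KiB
# deflate window, plus an identical copy (as in the question).
seq 1 3000 > one.txt
cp -p one.txt two.txt

# Fresh compressor per file: nothing learned from one.txt helps two.txt.
size_one=$(gzip -c < one.txt | wc -c)
size_two=$(gzip -c < two.txt | wc -c)
per_file=$(( size_one + size_two ))

# One continuous stream: two.txt can reference the data from one.txt.
continuous=$(( $(cat one.txt two.txt | gzip -c | wc -c) ))

echo "per-file reset:    $per_file bytes"
echo "continuous stream: $continuous bytes"
```

The per-file total should come out at roughly twice the size of a single compressed file, which is the pattern the rsync statistics show; the continuous figure is what a shared dictionary could achieve in principle.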
If rsync were to use data from previous files, it might be more interesting to extend its hash handling across files.
You can verify all this experimentally by syncing multiple copies of compressible files, as you suggest; the stats always show that the transferred size is equal to the compressed size of a single file multiplied by the number of copies, so there is no de-duplication of any kind across files.