Why does rsync not do a delta transfer?

rsync scp

I have a binary file of approximately 77 MB:

nupic@nupic-virtualbox:~/VboxSharedFolder/experiments/sync/exp2$ ls -lah src/
total 77M
drwxrwx--- 1 root vboxsf    0 Jun 21 13:31 .
drwxrwx--- 1 root vboxsf 4.0K Jun 21 16:21 ..
-rwxrwx--- 1 root vboxsf  77M May 27  2014 binary.bin

I've been playing with rsync and its delta-transfer algorithm to see how it works. The idea was to make a small change to a binary file and see how much data is transferred by several methods. For that purpose I wrote a very simple script:

#!/bin/bash
# rsync does not transfer deltas for local copies by default
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_rsync_local_default.log rsync -avcz --progress src/ dst/

# rsync --no-W should enable delta transfer regardless of whether the destination is local or remote
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_rsync_local_delta_enabled.log rsync --no-W -avcz --progress src/ dst/

# rsync transfers deltas over the network by default
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_rsync_remote.log rsync -avcz -e "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --progress src/ nupic@localhost:/home/nupic/VboxSharedFolder/experiments/sync/exp2/dst/

# scp should transfer the whole file, not a delta
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_scp.log scp src/binary.bin nupic@localhost:/home/nupic/VboxSharedFolder/experiments/sync/exp2/dst/

# cp always transfers the whole file, not a delta
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_cp.log cp src/binary.bin dst/binary.bin

I then use the following loop to evaluate the results:

for i in *.log; do
  echo "$i"
  grep write "$i" | awk 'BEGIN {FS="="} {sum += $2} END {print sum/1024/1024 "MB"}'
  echo "###########"
done

Here are the results:

rw_cp.log
67.8075MB
###########
rw_rsync_local_default.log
146.697MB
###########
rw_rsync_local_delta_enabled.log
66.8765MB
###########
rw_rsync_remote.log
0.0707941MB
###########
rw_scp.log
136.048MB
###########

Of those five experiments, only two are clear to me:

  1. cp writes approximately as many bytes as the size of the original file (rw_cp.log).
  2. rsync uses the delta algorithm when the destination is remote (over the network) (rw_rsync_remote.log).

And here is what is unclear to me:

  1. Why does invoking rsync with both src and dst on localhost write approximately twice as many bytes as the size of the original file? (rw_rsync_local_default.log)
  2. Why does the --no-W option not make rsync transfer only the delta when src and dst are both on localhost, as stated here, and why does it still transfer approximately the whole file? (rw_rsync_local_delta_enabled.log)
  3. Bonus: Why does scp transfer approximately twice as many bytes as the original file size? I understand that there is some encryption overhead, but a factor of two seems large to me (rw_scp.log).

Best Answer

In short, rsync appears to write twice the number of bytes because it spawns separate processes/threads to do the copy: one stream of data flows between the processes, and another flows from the receiving process to the target file.

We can see this by looking at the strace output in more detail. The process ID at the start of each line, together with the file descriptor numbers in the write calls, lets us tell the different write "streams" apart.

Presumably, this is so that a local transfer can work just like a remote one, except that the source and destination are on the same system.
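Separating the write streams by process ID and file descriptor, as described above, can be sketched in shell. This is only a rough sketch: the function name sum_writes_by_pid_fd is made up for illustration, and it assumes a log written by strace -f -o, where every line starts with the PID and interrupted calls show up as "<unfinished ...>" / "<... write resumed>" pairs.

```shell
# Hypothetical helper: sum the bytes written per PID and file descriptor
# from a log produced by `strace -f -o LOGFILE ...`.
sum_writes_by_pid_fd() {
  awk '
    # Line of the form:  PID write(FD, "...", LEN) = N
    $2 ~ /^write\(/ {
      pid = $1
      fd = $2
      sub(/^write\(/, "", fd)   # drop the leading "write("
      sub(/,.*$/, "", fd)       # drop the trailing comma
      if ($0 ~ /<unfinished \.\.\.>$/) {
        pending[pid] = fd       # return value arrives on a later line
        next
      }
      n = $NF                   # return value is the last field
      if (n ~ /^[0-9]+$/) total[pid " fd " fd] += n   # skip failed calls
      next
    }
    # Line of the form:  PID <... write resumed> ...) = N
    $2 == "<..." && $3 == "write" {
      pid = $1
      n = $NF
      if ((pid in pending) && n ~ /^[0-9]+$/) total[pid " fd " pending[pid]] += n
      delete pending[pid]
    }
    END { for (k in total) print k, total[k] }
  ' "$1" | sort
}
```

Called as sum_writes_by_pid_fd rsync.log, it prints one line per PID and descriptor. Taking the return value from the last field, rather than splitting on "=", avoids miscounting when the written buffer itself contains an equals sign.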


Using something like strace -e trace=process,socketpair,open,read,write would show the threads being spawned, the socket pair being created between them, and different threads opening the input and output files.

A test run similar to yours:

$ rm test2
$ strace -f -e trace=process,socketpair,open,close,dup,dup2,read,write -o rsync.log rsync -avcz --progress test1 test2
$ ls -l test1 test2
-rw-r--r-- 1 itvirta itvirta 81920004 Jun 21 20:20 test1
-rw-r--r-- 1 itvirta itvirta 81920004 Jun 21 20:20 test2

Let's take a count of bytes written for each thread separately:

$ for x in 15007 15008 15009  ; do echo -en "$x: " ; grep -E "$x (<... )?write"  rsync.log | awk 'BEGIN {FS=" = "} {sum += $2} END {print sum}'  ; done 
15007: 81967265
15008: 49
15009: 81920056

This matches the theory above pretty well. I didn't check what the extra 47 kB or so written by the first thread is, but I assume it is the progress output plus whatever metadata about the synced file rsync needs to transfer to the other end.
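To make the comparison concrete, the overhead of each stream relative to the file size can be worked out from the numbers in the test run above. The variable names are only illustrative.

```shell
# Values copied from the ls -l output and the per-PID write totals above.
file_size=81920004
first_written=81967265     # PID 15007
third_written=81920056     # PID 15009

echo "PID 15007 overhead: $(( first_written - file_size )) bytes"   # 47261
echo "PID 15009 overhead: $(( third_written - file_size )) bytes"   # 52
```

So one process writes the file plus about 47 kB, and another writes the file plus a few dozen bytes, which together give roughly twice the file size.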


I didn't check, but I suspect that even with delta transfer enabled, the "remote" end of rsync still writes out (most of) the file in full, resulting in approximately the same amount of writes as with cp. That would be consistent with rsync's default of rebuilding the destination in a temporary file and renaming it into place, unless --inplace is given. The transfer between the rsync threads is smaller, but the final output is still the same size.
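One way to test this guess without strace might be to ask rsync itself how much data it sent literally and how much it matched against the existing destination file. The sketch below assumes the "Literal data:" and "Matched data:" lines printed by rsync --stats, without -h style unit suffixes. The helper name delta_summary is made up for illustration.

```shell
# Hypothetical helper: read `rsync --stats` output on stdin and print the
# literal (actually sent) and matched (reused from destination) byte counts.
delta_summary() {
  awk -F': ' '
    /^Literal data:/ { v = $2; gsub(/[^0-9]/, "", v); literal = v }
    /^Matched data:/ { v = $2; gsub(/[^0-9]/, "", v); matched = v }
    END { print "literal=" literal " matched=" matched }
  '
}

# Possible usage (not run here):
#   rsync --no-W -av --stats src/ dst/ | delta_summary
```

If the delta algorithm is active, the literal count should be small compared to the matched count, even though the receiving side still writes a full-size file to disk.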
