I have a binary file of approximately 77MB:
nupic@nupic-virtualbox:~/VboxSharedFolder/experiments/sync/exp2$ ls -lah src/
total 77M
drwxrwx--- 1 root vboxsf 0 Jun 21 13:31 .
drwxrwx--- 1 root vboxsf 4.0K Jun 21 16:21 ..
-rwxrwx--- 1 root vboxsf 77M May 27 2014 binary.bin
I've been playing with rsync and its delta algorithm feature to see how it works. The idea was to make small changes to a binary file and see how much data is transferred using several methods. For that purpose I wrote a very simple script:
#!/bin/bash
# rsync does not transfer deltas for local copies by default
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_rsync_local_default.log rsync -avcz --progress src/ dst/
# rsync --no-W should enable delta transfer regardless of whether the copy is local or remote
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_rsync_local_delta_enabled.log rsync --no-W -avcz --progress src/ dst/
# rsync transfers deltas over the network by default
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_rsync_remote.log rsync -avcz -e "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --progress src/ nupic@localhost:/home/nupic/VboxSharedFolder/experiments/sync/exp2/dst/
# scp should transfer the whole file, not a delta
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_scp.log scp src/binary.bin nupic@localhost:/home/nupic/VboxSharedFolder/experiments/sync/exp2/dst/
# cp always transfers the whole file, not a delta
sed 's%\x00\x00\x00\x20\x66\x74\x79\x70\x69\x73\x6f\x6d\x00\x00\x02\x00%\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11\x11%' src/binary.bin > dst/binary.bin
strace -f -e trace=read,write -o rw_cp.log cp src/binary.bin dst/binary.bin
Then I use the following loop to evaluate the results:
for i in *.log; do
    echo "$i"
    grep write "$i" | awk 'BEGIN {FS="="} {sum += $2} END {print sum/1024/1024 "MB"}'
    echo "###########"
done
Here are the results:
rw_cp.log
67.8075MB
###########
rw_rsync_local_default.log
146.697MB
###########
rw_rsync_local_delta_enabled.log
66.8765MB
###########
rw_rsync_remote.log
0.0707941MB
###########
rw_scp.log
136.048MB
###########
Of those five experiments, only two results are clear to me:
- cp writes approximately as many bytes as the size of the original file (rw_cp.log).
- rsync uses the delta algorithm when the destination is remote (over the network) (rw_rsync_remote.log).
And here is what is unclear to me:
- Why does invoking rsync with both src and dst on localhost write approximately twice as many bytes as the size of the original file? (rw_rsync_local_default.log)
- Why does the --no-W option for rsync not transfer only the delta for src and dst on localhost, as stated here, and why does it still transfer approximately the whole file? (rw_rsync_local_delta_enabled.log)
- Bonus: Why does scp transfer approximately twice as many bytes as the original file size? I understand that there is some encryption, but a factor of two seems large to me (rw_scp.log).
Best Answer
To answer the main question in short, rsync seems to write double the number of bytes because it spawns two processes/threads to do the copy: one stream of data goes between the processes, and another goes from the receiving process to the target file.
We can tell this by looking at the strace output in more detail: the process IDs at the beginning of each line, and also the file descriptor numbers in the write calls, can be used to tell the different write "streams" apart.
Presumably, this is so that a local transfer can work just like a remote transfer, only with the source and destination on the same system.
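As a minimal sketch of that idea (not the commands originally used for this answer), the following awk-based shell function groups the bytes written by process ID and file descriptor. It assumes a log produced with strace -f -o, where every line starts with a PID. The function name writes_per_stream is made up for illustration.

```shell
# Sum successfully written bytes per (PID, file descriptor) pair.
# Assumption: the log was produced with `strace -f -o FILE ...`,
# so each line looks like:  1234 write(3, "..."..., 4096) = 4096
writes_per_stream() {
    awk '
        # Only completed write calls with a non-negative return value.
        # "<... write resumed>" lines carry no fd and are skipped here.
        $2 ~ /^write\([0-9]+,$/ && $NF ~ /^[0-9]+$/ {
            fd = $2
            gsub(/[^0-9]/, "", fd)          # "write(3," -> "3"
            bytes[$1 " " fd] += $NF
        }
        END {
            for (key in bytes)
                printf "%s %d\n", key, bytes[key]   # PID FD BYTES
        }
    ' "$1" | sort -k1,1n -k2,2n
}
```

For example, writes_per_stream rw_rsync_local_default.log prints one "PID FD BYTES" line per stream, which makes it easy to see which process writes to the socket and which one writes to the target file.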
Using something like strace -e trace=process,socketpair,open,read,write would show some threads being spawned, the socket pair being created between them, and different threads opening the input and output files.
A test run similar to yours:
Let's take a count of bytes written for each thread separately:
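The exact command and its output are not reproduced here. One way such a per-thread count could be taken is sketched below, again assuming a log written by strace -f -o so that each line begins with a PID. The function name writes_per_pid is hypothetical.

```shell
# Sum successfully written bytes per PID in an strace log.
# Assumption: the log was produced with `strace -f -o FILE ...`.
writes_per_pid() {
    awk '
        # Completed calls:  1234 write(3, "..."..., 4096) = 4096
        # Resumed calls:    1234 <... write resumed> ) = 4096
        # Failed calls (e.g. "= -1 EAGAIN (...)") and "<unfinished ...>"
        # lines do not end in a plain number and are ignored.
        ($2 ~ /^write\(/ || ($2 == "<..." && $3 == "write")) && $NF ~ /^[0-9]+$/ {
            bytes[$1] += $NF
        }
        END {
            for (pid in bytes)
                printf "%s %d\n", pid, bytes[pid]   # PID BYTES
        }
    ' "$1" | sort -n
}
```

Running writes_per_pid on one of the logs prints one "PID BYTES" line per process. Unlike splitting each line on "=", this takes the return value from the last field, so an "=" inside the logged data does not distort the sum.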
The per-thread counts match the theory above pretty well. I didn't check what the other 40kB written by the first thread are, but I assume they are the progress output and whatever metadata about the synced file rsync needs to transfer to the other end.
I didn't check, but I suspect that even with delta transfer enabled, the "remote" (receiving) end of rsync still writes out (most of) the file in full, resulting in approximately the same number of bytes written as with cp. The transfer between the rsync threads is smaller, but the final output is still the same.