Rsync – How to Verify Device Copy When Copy-Devices is Enabled

devices, disk-image, rsync

This is an extension to "Why rsync attempts to copy file that is already up-to-date?"

I'm attempting to use the --copy-devices patch to rsync to copy an entire disk drive and store it as an image on another machine.

The copy appears to have run correctly; however, when I run rsync again with the same arguments, it appears to copy some of the data again every time.

I ran rsync with the verbosity turned up and got this:

$ sudo rsync -vvz --partial --progress --copy-devices /dev/sdb me@otherserver:/backupdisks/mydisk.img
opening connection using: ssh -l me otherserver rsync --server -vvze.Lsfx --partial --copy-devices . /backupdisks/mydisk.img  (11 args)
me@otherserver's password: 
delta-transmission enabled
sdb
320,071,851,520 100%   63.47MB/s    1:20:09 (xfr#1, to-chk=0/1)
total: matches=2441955  hash_hits=2441955  false_alarms=204015955 data=0

sent 188 bytes  received 21,979,001 bytes  2,837.31 bytes/sec
total size is 0  speedup is 0.00

I'm aware that rsync detects changes by modification time, but the disk has not changed between runs (and how would rsync determine the modified time of a block device anyway?). The timestamp on the remote image, however, does get updated each time, so this could be the issue.

The other possibility is that the disk has a bad sector that returns different data on each read, defeating whatever checksum is being used.

My question is two-fold:

  1. Has my image been transferred successfully and, if so, why does it appear to retransmit much of the disk when I run it again? (This may be partly answered by my corollary question What are "matches", "hash_hits", and "false_alarms" in rsync output, and does "data=0" mean success? ) One way to verify the image independently is sketched just after this list.

  2. Am I missing a switch to make this work properly? (Maybe --checksum?) Is it possible to list the block-level checksum mismatches found by the rsync algorithm?
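
One way I could check independently, assuming nothing writes to the device between the copy and the check, would be to hash both ends and compare the digests, e.g. with a small Python sketch like this:

# Hash a device or image end-to-end so the two sides can be compared
# out-of-band. Run it against /dev/sdb here (as root) and against
# /backupdisks/mydisk.img on the other server; equal digests mean the
# image is a faithful copy.
import hashlib
import sys

def strong_hash(path, bufsize=1 << 20):
    h = hashlib.md5()  # MD5, as in recent rsync; sha256 would work too
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    print(strong_hash(sys.argv[1]))

(Plain md5sum on both paths would do the same; the point is just to compare digests computed independently of rsync.)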

Best Answer

By default rsync compares files by size and timestamp, but a device does not report a size, so rsync must calculate differences using the delta algorithm described in this tech report. Loosely, the remote file is divided into blocks of a chosen size, and the checksums of these blocks are sent back. The local file is checksummed in blocks in the same way and compared against that list. The remote end is then told how to reassemble the blocks it already has to remake the file, and data is sent for the blocks that did not match.
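
As a rough illustration of that exchange (a simplified Python sketch with made-up helper names, not rsync's actual code: real rsync also slides a rolling weak checksum over every byte offset of the sender's file), it looks something like this:

import hashlib

BLOCK = 100000  # block size, as with rsync -B 100000

def block_sums(data):
    # Receiver: checksum the existing file in fixed-size blocks.
    return [hashlib.md5(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def delta(sender_data, receiver_sums):
    # Sender: emit ('match', index) for blocks the receiver already has,
    # and ('data', bytes) for blocks that must be transmitted literally.
    known = {s: i for i, s in enumerate(receiver_sums)}
    for off in range(0, len(sender_data), BLOCK):
        block = sender_data[off:off + BLOCK]
        i = known.get(hashlib.md5(block).digest())
        yield ('match', i) if i is not None else ('data', block)

When both sides hold identical data, every block comes out as a match and no literal data needs to be sent, which is exactly what the debug output below shows.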

You can see this by asking for debug output at level 3 just for the deltasum algorithm with option --debug=deltasum3. You can specify a block size with -B to simplify the numbers. For example, for a file that has already been copied once, a second run of

rsync -B 100000 --copy-devices -avv --debug=deltasum3 --no-W /dev/sdd /tmp/mysdd

produces output like this, showing the checksum for each block:

count=164 rem=84000 blength=100000 s2length=2 flength=16384000
chunk[0] offset=0      len=100000 sum1=61f6893e
chunk[1] offset=100000 len=100000 sum1=32f30ba3
chunk[2] offset=200000 len=100000 sum1=45b1f9e5
...

You can then see it matching the checksums of the other device fairly trivially, since there are no differences:

potential match at 0      i=0 sum=61f6893e
match at 0      last_match=0      j=0 len=100000 n=0
potential match at 100000 i=1 sum=32f30ba3
match at 100000 last_match=100000 j=1 len=100000 n=0
potential match at 200000 i=2 sum=45b1f9e5
match at 200000 last_match=200000 j=2 len=100000 n=0
...

At the end the data= field is 0, showing no new data was sent.

total: matches=164  hash_hits=164  false_alarms=0 data=0

If we now corrupt the copy by overwriting the middle of the file:

echo test | dd conv=notrunc seek=80 bs=100000 of=/tmp/mysdd   # overwrite the start of block 80
touch -r /dev/sdd /tmp/mysdd                                  # copy the device's timestamp back

then the rsync debug shows us a new checksum for block 80 but no match for it. We go from match 79 to match 81:

chunk[80] offset=8000000 len=100000 sum1=a73cccfe
...
potential match at 7900000 i=79 sum=58eabec6
match at 7900000 last_match=7900000 j=79 len=100000 n=0
potential match at 8100000 i=81 sum=eba488ba
match at 8100000 last_match=8000000 j=81 len=100000 n=100000

At the end we have data=100000, showing that a whole new data block had to be sent.

total: matches=163  hash_hits=385  false_alarms=0 data=100000

The number of matches has dropped by 1, for the corrupted block whose checksum failed to match. The hash_hits count has risen, perhaps because we lost some sequential matching and more table entries had to be probed.


If we look further into the same tech report, some test results are shown, and false alarms are described as "the number of times the 32 bit rolling checksum matched but the strong checksum did not". Each block gets both a simple rolling checksum and an MD5 checksum (MD4 in older rsync versions). The simple checksum, being a 32-bit integer, is easy to search for using a hash table. Once it matches an entry, the longer 16-byte MD5 checksum is compared as well; if that does not match, it is a false alarm, and the search continues.
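
In outline, the lookup behaves like this Python sketch (illustrative only: zlib.adler32 stands in for rsync's rolling weak checksum, which unlike Adler-32 can be slid along the file one byte at a time):

import hashlib
import zlib
from collections import defaultdict

table = defaultdict(list)   # weak 32-bit sum -> list of (index, md5 digest)
false_alarms = 0

def add_block(index, block):
    table[zlib.adler32(block)].append((index, hashlib.md5(block).digest()))

def find_match(block):
    # Two-level lookup: probe the hash table with the cheap 32-bit sum,
    # then confirm with the 16-byte strong checksum; a weak hit that
    # fails the strong comparison is a false alarm.
    global false_alarms
    strong = hashlib.md5(block).digest()
    for index, digest in table.get(zlib.adler32(block), []):
        if digest == strong:
            return index
        false_alarms += 1
    return None

The cheap 32-bit sum keeps the first pass fast; the expensive 16-byte comparison only runs on candidate hits.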

My example uses a very small (and old) 16 MB USB key device, and the minimum hash table size is 2**16, i.e. 65536 entries, so the table is pretty empty when holding my 164 chunk entries. With a device as large as yours the table is far fuller, so many false alarms are normal and are more an indication of efficiency than of anything wrong.