This is a follow-up to Why rsync attempts to copy file that is already up-to-date?
I'm attempting to use the --copy-devices patch to rsync to copy an entire disk drive and store it as an image on another machine.
The copy appears to have run correctly; however, when I run rsync again with the same values, it appears to copy some of the data again every time.
I ran rsync with the verbosity turned up and got this:
$ sudo rsync -vvz --partial --progress --copy-devices /dev/sdb me@otherserver:/backupdisks/mydisk.img
opening connection using: ssh -l me otherserver rsync --server -vvze.Lsfx --partial --copy-devices . /backupdisks/mydisk.img (11 args)
me@otherserver's password:
delta-transmission enabled
sdb
320,071,851,520 100% 63.47MB/s 1:20:09 (xfr#1, to-chk=0/1)
total: matches=2441955 hash_hits=2441955 false_alarms=204015955 data=0
sent 188 bytes received 21,979,001 bytes 2,837.31 bytes/sec
total size is 0 speedup is 0.00
I'm aware that rsync detects changes by timestamp, but the disk has not changed between rsyncs (and how would it determine the modified time of a block device anyway?). The time on the remote image, however, does get updated each time, so this could be the issue.
The other possibility is that the disk has a bad sector which returns a different value on each read and defeats whatever checksum is being used.
My question is two-fold:
- Has my image been transferred successfully and, if so, why does it appear to retransmit much of the disk when I run it again? (This may be partly answered by my corollary question What are "matches", "hash_hits", and "false_alarms" in rsync output, and does "data=0" mean success?)
- Am I missing a switch to make this work properly (maybe --checksum)? Is it possible to list the block-level checksum failures found by the rsync algorithm?
Best Answer
By default rsync compares files by size and timestamp, but a device does not have a size, so rsync must calculate differences using the delta algorithm described in this tech report. Loosely, the remote file is divided into blocks of a chosen size, and the checksums of these blocks are sent back. The local file is similarly checksummed in blocks and compared against that list. The remote end is then told how to reassemble the blocks it already has to remake the file, and data is sent for the blocks that do not match.
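The block-matching idea can be sketched in a few lines of Python. This is a toy model, not rsync's implementation: it only compares aligned blocks and uses MD5 alone, whereas real rsync also slides a rolling checksum over every byte offset so it can find matches that are not block-aligned.

```python
import hashlib

BLOCK = 4  # tiny block size so the example is easy to follow

def block_sums(data, block=BLOCK):
    """Checksum each fixed-size block, as the receiving side does."""
    return [hashlib.md5(data[i:i + block]).digest()
            for i in range(0, len(data), block)]

def delta(local, remote_sums, block=BLOCK):
    """Emit a match reference for known blocks, literal data otherwise."""
    index = {s: i for i, s in enumerate(remote_sums)}
    ops = []
    for i in range(0, len(local), block):
        chunk = local[i:i + block]
        s = hashlib.md5(chunk).digest()
        ops.append(('match', index[s]) if s in index else ('data', chunk))
    return ops

remote = b'abcdefghijkl'
local  = b'abcdXXXXijkl'   # only the middle block differs
print(delta(local, block_sums(remote)))
# [('match', 0), ('data', b'XXXX'), ('match', 2)]
```

Only the literal `('data', ...)` entries cost bandwidth; the matches are just small block references.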
You can see this by asking for debug output at level 3 just for the deltasum algorithm, with the option --debug=deltasum3. You can specify a block size with -B to simplify the numbers. For example, for a file that has already been copied once, a second run of the same rsync command produces output showing the checksum for each block.
You can then see rsync matching the checksums of the other device fairly trivially, since there are no differences.
At the end the data= field is 0, showing that no new data was sent.
If we now corrupt the copy by overwriting the middle of the file, the rsync debug shows a new checksum for block 80 but no match for it: we go straight from match 79 to match 81.
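The effect can be reproduced in miniature with Python: checksum a copy block by block, corrupt its middle, and exactly one block checksum stops matching. The 100000-byte block size is assumed from the data=100000 figure; the file contents here are synthetic, purely for illustration.

```python
import hashlib

BLOCK = 100_000  # assumed block size, matching the data=100000 figure

# A toy "device image" and a byte-for-byte copy of it.
original = bytes(range(256)) * 4000        # ~1 MB of synthetic data
copy = bytearray(original)

# Corrupt the middle of the copy, as in the example above.
mid = len(copy) // 2
copy[mid:mid + 10] = b'X' * 10

def sums(data):
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

mismatched = [i for i, (a, b) in enumerate(zip(sums(original), sums(copy)))
              if a != b]
print(mismatched)   # [5] -- exactly one block checksum no longer matches
```

Even a 10-byte corruption forces the whole containing block to be resent, which is why data= jumps by a full block size.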
At the end we have data=100000, showing that a whole new data block had to be sent.
The number of matches has been reduced by 1, for the corrupt block whose checksum failed to match. Perhaps the hash hits rise because we lost sequential matching.
If we look further in the same tech report, some test results are shown and the false alarms are described as "the number of times the 32 bit rolling checksum matched but the strong checksum did not". Each block has both a simple checksum and an md5 checksum computed (md4 in older versions). The simple checksum is easy to search for using a hash table, as it is a 32 bit integer. Once it matches an entry, the longer 16 byte md5 checksum is also compared; if that does not match, it is a false alarm, and the search continues.
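The two-level lookup can be sketched in Python. Here zlib.adler32 stands in for rsync's 32-bit rolling checksum (rsync's is Adler-inspired but not identical), and the chunk b'`c`a' was chosen because it happens to share an Adler-32 value with b'aaaa' while having a different MD5: exactly the false-alarm case.

```python
import hashlib
import zlib

def weak(chunk):
    # cheap 32-bit checksum; stands in for rsync's rolling checksum
    return zlib.adler32(chunk)

def strong(chunk):
    return hashlib.md5(chunk).digest()

# Receiver-side hash table keyed on the cheap 32-bit sum.
blocks = [b'aaaa', b'bbbb', b'cccc']
table = {}
for i, blk in enumerate(blocks):
    table.setdefault(weak(blk), []).append((i, strong(blk)))

stats = {'matches': 0, 'hash_hits': 0, 'false_alarms': 0}

def lookup(chunk):
    w = weak(chunk)
    if w in table:
        stats['hash_hits'] += 1
        s = strong(chunk)
        for i, st in table[w]:
            if st == s:
                stats['matches'] += 1
                return i
        stats['false_alarms'] += 1  # weak sum matched, strong sum did not
    return None

lookup(b'bbbb')   # genuine match
lookup(b'`c`a')   # same adler32 as b'aaaa', different md5: false alarm
print(stats)      # {'matches': 1, 'hash_hits': 2, 'false_alarms': 1}
```

The cheap 32-bit check filters the vast majority of positions; only the rare hash hits pay for an MD5 comparison, and a false alarm is just a wasted comparison, not an error.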
My example uses a very small (and old) 16 MB USB key device, and the minimum hash table size is 2**16, i.e. 65536 entries, so the table is pretty empty when holding my 164 chunk entries. So many false alarms are normal, and are more an indication of efficiency than anything else.