Why isn’t rsync using delta transfers for a single file across a network

rsync

I have looked at this question and this question, but they do not seem to address the symptoms I am seeing.

I have a large log file (around 600 MB) that I am trying to transfer across a cellular network. Because it is a log file, it is only ever appended to (although it is actually an SQLite database with only INSERTs being performed, so it isn't quite that simple; still, with the exception of the last 4k page, or maybe a few, the file is identical each time). It is important that only the changes (and whatever checksums need to be transmitted) actually get sent, because the data connection is metered.

Yet when I perform a test across an unmetered connection (e.g. a free WiFi hotspot), I see no speed-up or reduction in data transferred, either observed or reported. Over a slow WiFi connection I see on the order of 1 MB/s or less, with rsync reporting that the transfer will take nearly 20 minutes. Over a fast WiFi connection I see a uniformly faster speed, but no report of a speedup, and a second attempt at the transfer (which should now be fast, because the two files are identical) does not show any difference.

The (sanitized to remove sensitive info) command I am using is:

rsync -e 'ssh -p 9999' --progress LogFile michael@my.host.zzz:/home/michael/logs/LogFile

The output I get at the end looks like this:

LogFile
    640,856,064 100%   21.25MB/s   0:00:28 (xfr#1, to-chk=0/1)

There is no mention of any kind of speedup.

I suspect the problem may be one of the following:

  • I am missing some command line option. However, re-reading the man page seems to suggest that delta transfers are enabled by default: I only see options for disabling them.
  • I'm using rsync over ssh (on a non-standard port even) due to the server being behind a firewall that only allows ssh. I haven't seen anything explicitly saying delta transfers won't work if the rsync daemon isn't running, though. I tried using the "::" notation instead of ":" but the man page isn't very clear about what a "module" is, and my command is rejected for specifying an invalid module.

I have ruled out the following:

  • delta transfers not being performed on a local network. Ruled out because I am trying to perform the transfer across the internet.
  • overhead due to checksum calculation. I have seen this behavior on both a fast and a slow WiFi connection, and the transfer rate does not seem to be compute-bound.

Best Answer

Summary

Databases tend to keep a lot of metadata, organizational data, etc. An insert is very unlikely to be a simple append, as it would be with a text file. Testing SQLite shows that its writes are scattered across the file rather than confined to the end, in both WAL and non-WAL modes. This leads to rsync having to sync a lot more data than you'd expect. You can reduce this overhead somewhat by using a low --block-size (at the cost of more overhead computing and transferring checksums).

A better approach is probably to dump only the new records as a SQL dump, compress it, and transfer that. Alternatively, there appear to be several replication solutions for SQLite; you could use one of those.

roaima suggests that, at a bare minimum, you could probably do a full SQL dump, compress it using gzip --rsyncable, and then rsync that. Worth a test, I suppose, to see if the resulting delta is small enough.
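A minimal sketch of that idea (host, port, and paths are taken from the question; the dump filename is my own, and note that --rsyncable is a Debian patch to gzip that not every build carries):

# Full dump, compressed so that small changes produce small gzip deltas.
sqlite3 LogFile .dump | gzip --rsyncable > LogFile.sql.gz
rsync -e 'ssh -p 9999' --progress LogFile.sql.gz michael@my.host.zzz:/home/michael/logs/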

Details

What you're trying should work. I'd personally add --partial to your rsync options, just in case it's somehow detecting the growing file as a partial transfer. You can also get better transfer stats with --stats.
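With those added, the command from the question would look something like this; --stats breaks the transfer down into literal vs. matched data, which tells you directly how much the delta algorithm saved:

rsync -e 'ssh -p 9999' --partial --stats --progress LogFile michael@my.host.zzz:/home/michael/logs/LogFile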

The second thing to check is whether SQLite is really only touching a few pages. Honestly, I wouldn't be surprised if it's writing pages all over the file. One quick way to check is to use cmp -l on two versions and see whether there are changes to pages other than the final few. Remember that rsync's idea of a "page"/block is different from SQLite's; you can change rsync's via --block-size, and reducing it might help (see the sketch below).
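For example (a sketch, assuming the database uses the 4096-byte page size that recent SQLite defaults to; check with PRAGMA page_size first):

sqlite3 LogFile 'PRAGMA page_size;'
rsync -e 'ssh -p 9999' --block-size=4096 --progress LogFile michael@my.host.zzz:/home/michael/logs/LogFile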

Edit: I did a quick test with SQLite. Even with 32k pages, adding a bunch of log entries scribbled on every page. Details below.

Edit 2: It appears to be better in WAL mode, though you still take a huge amount of overhead, probably from the checkpoint.
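(If you want to reproduce the WAL-mode numbers, switching the test database over is a one-liner, and the mode persists once set. Keep in mind that un-checkpointed writes then live in a separate test.db-wal file next to the main database.)

sqlite3 test.db 'PRAGMA journal_mode=WAL;'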

Edit 3: It is also better the more data you're adding per transfer—I guess it probably scribbles certain blocks over and over again. So you're transferring that same set of blocks regardless of whether it wrote to them once or a hundred times.

BTW: For minimizing transfer, you can probably do much better than rsync. For example, a SQL dump of new records since the last transfer run through xz --best (or even gzip) would probably be a fair bit smaller.
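A sketch of that, assuming you keep track of the highest id already sent (LAST_ID below is hypothetical bookkeeping you would have to maintain yourself, not something SQLite or rsync provides):

# LAST_ID is hypothetical: the highest log.id already on the server.
sqlite3 LogFile ".mode insert log" "SELECT * FROM log WHERE id > $LAST_ID;" | xz --best > new-records.sql.xz
scp -P 9999 new-records.sql.xz michael@my.host.zzz:/home/michael/logs/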

Quick SQLite Test

Schema:

CREATE TABLE log (id integer primary key not null, ts integer not null, app text not null, message text not null);
CREATE INDEX log_ts_idx on log(ts);
CREATE INDEX log_app_idx on log(app);

Perl program:

use 5.022;
use DBI;

my $DBH = DBI->connect('dbi:SQLite:test.db', '', '', {RaiseError => 1, AutoCommit => 0})
    or die "connect...";

my @apps = (
    '[kthreadd]',        '[ksoftirqd/0]',
    ⋮ # there were 191 of these
    '[kworker/5:0H]',
);

my @messages = <DATA>;

# Continue from the newest timestamp in the table, or from
# 978307200 (2001-01-01 00:00:00 UTC) if the table is empty.
(my $curr_time) = $DBH->selectrow_array(<<QUERY);
    SELECT COALESCE(MAX(ts),978307200) FROM log
QUERY

my $n_apps = @apps;
my $n_msgs = @messages;
say "Apps: $n_apps";
say "Messages: $n_msgs";
say 'Start time: ', scalar gmtime($curr_time), ' UTC';

my $sth = $DBH->prepare(<<QUERY);
    INSERT INTO log(ts, app, message) VALUES (?, ?, ?)
QUERY

# Insert 10,000 rows with random apps and messages, nudging the
# clock forward by up to 0.1 s per row; commit once at the end.
for (my $i = 0; $i < 10_000; ++$i) {
    $sth->execute(int($curr_time), $apps[int rand $n_apps], $messages[int rand $n_msgs]);
    $curr_time += rand 0.1;
}
$DBH->commit;

__DATA__
microcode: CPU0 microcode updated early to revision 0x19, date = 2013-06-21
Linux version 4.5.0-2-amd64 (debian-kernel@lists.debian.org) (gcc version 5.3.1 20160528 (Debian 5.3.1-21) ) #1 SMP Debian 4.5.5-1 (2016-05-29)
⋮

There were a lot more example log messages (2076).

Checking for which pages changed:

cp test.db test.db.old
perl test.pl
# Bucket each differing byte offset into 32 KiB blocks and print the
# sorted block numbers that changed:
cmp -l test.db.old test.db | perl -n -E '/^\s*(\d+) / or die "wtf"; $bucket{int $1/32768} = 1; END { say join "\n", sort( { $a <=> $b } keys %bucket) }'