There are two parts to this question. First, why is there a difference between "Number of files" and "Number of files transferred"? This is explained in the rsync manpage:
Number of files: is the count of all "files" (in the generic sense), which includes directories, symlinks, etc.
Number of files transferred: is the count of normal files that were updated via rsync’s delta-transfer algorithm, which does not include created dirs, symlinks, etc.
The difference here should equal the total number of directories, symlinks, and other special files. Those were not "transferred" but simply re-created.
Now for the second part: why is there a size difference with du? du shows the amount of disk space used by a file, not the size of the file. The same file can take up a different amount of disk space if, for example, the filesystems' block sizes differ.
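To see this concretely, here is a small throwaway experiment (the /tmp path is just an example) comparing a file's apparent size with its disk usage:

```shell
# Create a tiny file and compare apparent size vs. disk usage.
# /tmp/du-demo is a throwaway example path.
printf 'hello' > /tmp/du-demo

stat -c '%s' /tmp/du-demo             # apparent size: 5 bytes
du -B1 /tmp/du-demo                   # disk usage: typically one full block, e.g. 4096
du --apparent-size -B1 /tmp/du-demo   # 5, matching stat

rm /tmp/du-demo
```

On most Linux filesystems the block granularity (often 4 KiB) means many small files occupy more space under du than their byte counts suggest, which alone can account for a sizable difference between two filesystems.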
If you are still worried about data integrity, an easy way to be sure is to create hashes for all your files and compare them:
( cd /home/hholtmann && find . -type f -exec md5sum {} \; ) > /tmp/hholtmann.md5sum
( cd /media/wd750/c51/home/ && md5sum -c /tmp/hholtmann.md5sum )
I haven't used any dedicated programs for this, but it is quite easy to organize and fine-tune with a combination of cron, bash, tar (incremental dumps) and/or rsync. In my mind there are two good solutions, and I use one or both depending on the context. I think the first will be more appropriate for you, but I'll describe both here.
Incremental tar archives
The core of this solution is a script that might look something like this:
#!/bin/bash
# You will need to set the variables $EXCLUDE, $DATA and $BACKUPS
# as environment variables, in ~/.bashrc or somewhere.
OPTS="--create --no-check-device --bzip2 --verbose -X $EXCLUDE"
for d in $(ls "$DATA"); do
    SNAPSHOT=$BACKUPS/$d.snar
    if [ "$1" = "full" ]; then
        echo "Archiving $d (full)..."
        rm -rvf "$SNAPSHOT"
        ARCHIVE=$BACKUPS/$d.$(date --iso-8601).full.tar.bz2
        tar $OPTS --file="$ARCHIVE" --listed-incremental="$SNAPSHOT" "$DATA/$d"
    fi
    if [ "$1" = "increment" ]; then
        echo "Archiving $d (increment)..."
        ARCHIVE=$BACKUPS/$d.$(date --iso-8601).tar.bz2
        tar $OPTS --file="$ARCHIVE" --listed-incremental="$SNAPSHOT" "$DATA/$d"
    fi
done
This assumes there are subdirectories in $DATA and backs up each one into a separate archive (written under $BACKUPS, so the archives are not picked up by the next run). If your setup is different, adapt the script.
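For reference, here is a self-contained sketch of the full/incremental cycle and, importantly, how to restore from it (all paths are temporary, and the bzip2/exclude options are dropped for brevity). On extraction, `--listed-incremental=/dev/null` tells GNU tar to apply the incremental member metadata without maintaining a snapshot file:

```shell
set -e
# Throwaway demonstration of a full + incremental tar cycle.
work=$(mktemp -d)
mkdir "$work/data"
echo one > "$work/data/a.txt"

# Full dump: a fresh snapshot file records what the archive contains.
tar --create --file="$work/full.tar" --listed-incremental="$work/snap.snar" -C "$work" data

# Change some data, then take an increment against the same snapshot file.
echo two > "$work/data/b.txt"
tar --create --file="$work/incr.tar" --listed-incremental="$work/snap.snar" -C "$work" data

# Restore: extract the full dump first, then each increment in order.
mkdir "$work/restore"
tar --extract --file="$work/full.tar" --listed-incremental=/dev/null -C "$work/restore"
tar --extract --file="$work/incr.tar" --listed-incremental=/dev/null -C "$work/restore"

ls "$work/restore/data"   # both a.txt and b.txt are back
rm -rf "$work"
```

Note that the increments must be applied in chronological order, since each one records what changed (including deletions) relative to the state in the snapshot file at the time it was made.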
You can schedule the backup in your crontab like so:
# m h dom mon dow command
44 1 1 */2 * ~/bin/backup_data full > ~/backups/data/logs/`date --iso-8601`.full.log 2>&1
44 5 * * * ~/bin/backup_data increment > ~/backups/data/logs/`date --iso-8601`.log 2>&1
As you can see, in this case a full backup is created once every two months, and incremental backups based on that full dump are created every day. Incremental archives in tar become fragile as soon as a single file is lost or even a timestamp changes, so it is prudent to create a full dump once in a while.
As far as synchronizing between machines and removing old files goes, you should separate that task from the backing up itself, since it really is orthogonal. Of course, use rsync for the synchronization, without the --delete option, so that you don't lose any data on the large external drive. Your command for that might be:
rsync -av /backups/data /mnt/external
if the external drive is mounted on the laptop. Otherwise, you will need to do it over the network like so:
rsync -av /backups/data user@external:/backups/data
If you want to clean archives older than 90 days from your laptop, you can do so like this:
find /path/to/files -type f -mtime +90 -delete
Again, put these things in your crontab.
Incremental backups with rsync
You can use rsync alone to back things up incrementally. I especially like using timestamped snapshots and hard links for that, and it takes just one command. Here is an example close to what I normally use:
rsync --verbose --progress --stats --human-readable --archive --link-dest=/backups/data/`date --iso-8601 -d "one day ago"` /data/ /backups/data/`date --iso-8601`/
which basically creates hard links to the snapshot from the previous day (the one given by --link-dest) for files that have not changed. If you will be running it irregularly, you can keep a symbolic link pointing to the latest snapshot and update it after each backup, like so:
rsync --verbose --progress --stats --human-readable --archive --link-dest=/backups/data/last /data/ /backups/data/`date --iso-8601`/ && rm -rvf /backups/data/last && ln -vs /backups/data/`date --iso-8601`/ /backups/data/last
On top of this, you will need to organize the synchronization with the external drive and delete old snapshots. This is generally done the same way as in the first solution outlined above. However, when rsyncing snapshots between machines, make sure to use the -H option to preserve the hard links.
Summary
Compared to the solution using tar, the second one is, in my mind, somewhat simpler to manage, and it keeps all files directly accessible at all times. Using archives, on the other hand, takes advantage of compression, uses fewer inodes, and has other advantages on non-server machines.
Again, put all of this in your crontab whenever possible, so you don't have to remember to do it. If the laptop is not turned on all the time, choose a time when it is often in use, and perhaps schedule the jobs several times a day so that at least some of them run. Better yet, use something like anacron.
You can also run the backup script by hand, and fine-grain the dates in the filenames/directories if you want to do incrementals more than once a day. Obviously, you will need to play around with these solutions to make them fit your use case.
Update: a repository with an example script I use: https://github.com/langner/backup.sh/blob/master/backup.sh
Best Answer
Since you are not copying the metadata (which you would do if you used --archive, or -a, instead of just -r), the metadata (timestamps, ownership, etc.) will differ between the copy and the original. When you run rsync again, the timestamps are different, so the file is copied again. So, you would instead want to use -a. I'm also using -i (--itemize-changes), since it tells me why a file was copied.

Also note that when you do a local copy with rsync, it will not use its delta-transfer algorithm, but will instead behave as if --whole-file (or -W) was specified. This is because the delta algorithm is assumed to be faster than a whole-file transfer only over a network: with the delta algorithm, the whole file has to be read and checksummed on both the source and the target. Doing this locally would be wasteful, so the file is simply copied in full instead.