There are two parts to this question. First, why is there a difference between "Number of files" and "Number of files transferred"? This is explained in the rsync manpage:
Number of files: is the count of all "files" (in the generic sense), which includes directories, symlinks, etc.
Number of files transferred: is the count of normal files that were updated via rsync’s delta-transfer algorithm, which does not include created dirs, symlinks, etc.
The difference here should equal the total number of directories, symlinks, and other special files. Those were not "transferred" but simply re-created.
Now for the second part: why is there a size difference with du? du shows the amount of disk space used by a file, not the size of the file. The same file can take up a different amount of disk space if, for example, the filesystems' block sizes differ.
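To see this concretely, here is a small throwaway experiment (the /tmp path is just an example) comparing a file's apparent size with its disk usage:

```shell
# Create a tiny file and compare apparent size vs. disk usage.
# /tmp/du-demo is a throwaway example path.
printf 'hello' > /tmp/du-demo

stat -c '%s' /tmp/du-demo             # apparent size: 5 bytes
du -B1 /tmp/du-demo                   # disk usage: typically one full block, e.g. 4096
du --apparent-size -B1 /tmp/du-demo   # 5, matching stat

rm /tmp/du-demo
```

On most Linux filesystems the block granularity (often 4 KiB) means many small files occupy more space under du than their byte counts suggest, which alone can account for a sizable difference between two filesystems.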
If you are still worried about data integrity, an easy way to be sure is to create hashes for all your files and compare them:
( cd /home/hholtmann && find . -type f -exec md5sum {} \; ) > /tmp/hholtmann.md5sum
( cd /media/wd750/c51/home/ && md5sum -c /tmp/hholtmann.md5sum )
I haven't used any dedicated programs for this, but it is quite easy to organize and fine-tune with a combination of cron, bash, tar (incremental dumps) and/or rsync. In my mind there are two good solutions, and I use one or both depending on the context. I think the first will be more appropriate for you, but I'll describe both here.
Incremental tar archives
The core of this solution is a script that might look something like this:
#!/bin/bash
# You will need to set the variables $EXCLUDE, $DATA and $BACKUPS
# as environment variables, in ~/.bashrc or somewhere.
OPTS="--create --no-check-device --bzip2 --verbose -X $EXCLUDE"
for d in $(ls "$DATA"); do
    SNAPSHOT=$BACKUPS/$d.snar
    if [ "$1" = "full" ]; then
        echo "Archiving $d (full)..."
        rm -rvf "$SNAPSHOT"
        ARCHIVE=$BACKUPS/$d.$(date --iso-8601).full.tar.bz2
        tar $OPTS --file="$ARCHIVE" --listed-incremental="$SNAPSHOT" "$DATA/$d"
    fi
    if [ "$1" = "increment" ]; then
        echo "Archiving $d (increment)..."
        ARCHIVE=$BACKUPS/$d.$(date --iso-8601).tar.bz2
        tar $OPTS --file="$ARCHIVE" --listed-incremental="$SNAPSHOT" "$DATA/$d"
    fi
done
This assumes there are subdirectories in $DATA and backs up each one into a separate archive (written under $BACKUPS, so the archives are not picked up by the next run). If your setup is different, adapt the script.
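For reference, here is a self-contained sketch of the full/incremental cycle and, importantly, how to restore from it (all paths are temporary, and the bzip2/exclude options are dropped for brevity). On extraction, `--listed-incremental=/dev/null` tells GNU tar to apply the incremental member metadata without maintaining a snapshot file:

```shell
set -e
# Throwaway demonstration of a full + incremental tar cycle.
work=$(mktemp -d)
mkdir "$work/data"
echo one > "$work/data/a.txt"

# Full dump: a fresh snapshot file records what the archive contains.
tar --create --file="$work/full.tar" --listed-incremental="$work/snap.snar" -C "$work" data

# Change some data, then take an increment against the same snapshot file.
echo two > "$work/data/b.txt"
tar --create --file="$work/incr.tar" --listed-incremental="$work/snap.snar" -C "$work" data

# Restore: extract the full dump first, then each increment in order.
mkdir "$work/restore"
tar --extract --file="$work/full.tar" --listed-incremental=/dev/null -C "$work/restore"
tar --extract --file="$work/incr.tar" --listed-incremental=/dev/null -C "$work/restore"

ls "$work/restore/data"   # both a.txt and b.txt are back
rm -rf "$work"
```

Note that the increments must be applied in chronological order, since each one records what changed (including deletions) relative to the state in the snapshot file at the time it was made.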
You can schedule the backup in your crontab like so:
# m h dom mon dow command
44 1 1 */2 * ~/bin/backup_data full > ~/backups/data/logs/`date --iso-8601`.full.log 2>&1
44 5 * * * ~/bin/backup_data increment > ~/backups/data/logs/`date --iso-8601`.log 2>&1
As you can see, in this case a full backup is created once every two months, and incremental backups based on that full dump are created every day. Incremental archives in tar become fragile as soon as a single file is lost or even a timestamp changes, so it is prudent to create a full dump once in a while.
As far as synchronizing between machines and removing old files goes, you should separate that task from the backing up itself, since it really is orthogonal. Of course, use rsync for the synchronization, without the --delete option, so that you don't lose any data on the large external drive. Your command for that might be:
rsync -av /backups/data /mnt/external
if the external drive is mounted on the laptop. Otherwise, you will need to do it over the network like so:
rsync -av /backups/data user@external:/backups/data
If you want to clean archives older than 90 days from your laptop, you can do so like this:
find /path/to/files -type f -mtime +90 -delete
Again, put these things in your crontab.
Incremental backups with rsync
You can use rsync alone to back things up incrementally. I especially like using timestamped snapshots and hard links for that, and it takes just one command. Here is an example close to what I normally use:
rsync --verbose --progress --stats --human-readable --archive --link-dest=/backups/data/`date --iso-8601 -d "one day ago"` /data/ /backups/data/`date --iso-8601`/
which basically creates hard links to the snapshot from the previous day (the one given by --link-dest) for files that have not changed. If you will be running it irregularly, you can keep a symbolic link pointing to the latest snapshot and update it after each backup, like so:
rsync --verbose --progress --stats --human-readable --archive --link-dest=/backups/data/last /data/ /backups/data/`date --iso-8601`/ && rm -rvf /backups/data/last && ln -vs /backups/data/`date --iso-8601`/ /backups/data/last
On top of this, you will need to organize the synchronization with the external drive and delete old snapshots. This is generally done the same way as in the first solution outlined above. However, when rsyncing snapshots between machines, make sure to use the -H option to preserve the hard links.
Summary
Compared to the solution using tar, the second one is, in my mind, somewhat simpler to manage, and it keeps all files directly accessible at all times. Using archives, on the other hand, takes advantage of compression, uses fewer inodes, and has other advantages on non-server machines.
Again, put all of this in your crontab whenever possible, so you don't have to remember to do it. If the laptop is not turned on all the time, choose a time when it is often in use, and perhaps schedule the jobs several times a day so that at least some of them run. Better yet, use something like anacron.
You can also run the backup script by hand, and fine-grain the dates in the filenames/directories if you want to do incrementals more than once a day. Obviously, you will need to play around with these solutions to make them fit your use case.
Update: a repository with an example script I use: https://github.com/langner/backup.sh/blob/master/backup.sh
Best Answer
Since you are not copying the metadata (which you would do if you used --archive, or -a, instead of just -r), the metadata (timestamps, ownership, etc.) will differ between the copy and the original. When you run rsync again, the timestamps are different, so the file is copied again. So, you would instead want to use -a. I'm also using -i (--itemize-changes), since it tells me why a file was copied.

Also note that when you do a local copy with rsync, it will not use its delta-transfer algorithm, but will instead behave as if --whole-file (or -W) was specified. This is because the delta algorithm is assumed to be faster than a whole-file transfer only over a network: with the delta algorithm, the whole file has to be read and checksummed on both the source and the target. Doing this locally would be wasteful, so the file is simply copied in full instead.