How to get rsync to link identical files with the --link-dest option if an old file already exists

rsync

One might think that --link-dest would hard-link to an identical file in all cases. It does not when the destination file already exists, even if that file is out of date or has different contents.

The reason is this passage from the rsync man page on --link-dest:

"This option works best when copying into an empty destination
hierarchy, as rsync treats existing files as definitive (so rsync never
looks in the link-dest dirs when a destination file already exists)"

This means that if y/file is identical to the source, and z/file exists but is out of date,

rsync -a --del --link-dest=y source:/file z

will result in TWO inodes (and twice the diskspace) being used, y/file and z/file, which will have the same contents and datestamps.

I came across this because I do daily backups basically with this script run once per day:

yest=$today; today=`date +%Y%m%d`
mv $somedaysago $today
rsync -avPShyH --del --link-dest=../$yest host:/dirs $today

Because my backups span up to 10M files, doing rm -rf $olddir; rsync source:$dir newdir would take way too long. Only about 0.5% of the files change per day, so that approach would delete and recreate 10M directory entries just to handle 50K new or changed files, and my backups would not complete in time for the next day.

Here's a demo of the situation:

a is our source, 1 through 4 are our numbered backups:

$ mkdir -p 1 2; echo foo > 1/foobar; cp -lrv 1/* 2
`1/foobar' -> `2/foobar'
$ ls -i1 */foobar
1053003 1/foobar
1053003 2/foobar

$ mkdir a; echo quux > a/foobar
$ mv 1 3; rsync -avPhyH --del --link-dest=../2 a/ 3
sending incremental file list
./
foobar
           5 100%    0.00kB/s    0:00:00 (xfer#1, to-check=0/2)

sent 105 bytes  received 34 bytes  278.00 bytes/sec
total size is 5  speedup is 0.04

$ ls -i1 */foobar
1053003 2/foobar
1053007 3/foobar
1053006 a/foobar

$ mv 2 4; rsync -avPhyH --del --link-dest=../3 a/ 4
sending incremental file list
./
foobar
           5 100%    0.00kB/s    0:00:00 (xfer#1, to-check=0/2)

sent 105 bytes  received 34 bytes  278.00 bytes/sec
total size is 5  speedup is 0.04


$ ls -il1 */foobar
1053007 -rw-r--r-- 1 math math 5 Mar 30 00:57 3/foobar
1053008 -rw-r--r-- 1 math math 5 Mar 30 00:57 4/foobar
1053006 -rw-r--r-- 1 math math 5 Mar 30 00:57 a/foobar

$ md5sum [34a]/foobar
d3b07a382ec010c01889250fce66fb13  3/foobar
d3b07a382ec010c01889250fce66fb13  4/foobar
d3b07a382ec010c01889250fce66fb13  a/foobar

Now we have 2 backups of a/foobar that are identical in all ways, including timestamp, but occupying different inodes.
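
For checking this across a pair of backups without comparing inode numbers by eye, something like the following could be used. It is only a sketch: the function names same_inode and wasted_copy are made up for illustration, and it assumes a shell whose test command supports -ef (bash and dash do).

```shell
#!/bin/sh
# Hypothetical helpers; the names are not part of rsync or coreutils.

# same_inode A B: succeeds when A and B are the same file on disk
# (same device and inode), i.e. they are hard links to each other.
same_inode() {
    [ "$1" -ef "$2" ]
}

# wasted_copy A B: succeeds when A and B have identical contents but
# are NOT hard-linked, which is the situation the demo ends up in.
wasted_copy() {
    cmp -s "$1" "$2" && ! same_inode "$1" "$2"
}

# Example call, using the paths from the demo:
#   wasted_copy 3/foobar 4/foobar && echo "3/foobar and 4/foobar duplicate data"
```

Given the inode numbers and checksums listed above, wasted_copy 3/foobar 4/foobar should succeed, since the contents match but the inodes differ.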

One might think a solution would be --delete-before, which kills the benefit of the incremental scan. But this doesn't help either: the file is not deleted, but is used as a basis in case an incremental copy is possible.

One might further surmise that we can turn off this incremental-copy hedge with --whole-file, but that does not change the algorithm either; there's no way to get what we want.

I consider this behaviour another bug in rsync: it looks as though the beneficial behaviour could be constructed from a careful selection of command arguments, but the desired outcome is not available.

A solution would unfortunately mean moving from a single rsync as an atomic operation to a dry run with -n, logging it, processing that log as input to manually pre-delete all changed files, and then running rsync --link-dest to get what we want. That is a big kludge compared to a single clean rsync.
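
The pre-delete step of that kludge might look roughly like this. It is a sketch under assumptions, not something I have run against 10M files: predelete_changed is a hypothetical helper name, and it assumes the dry run is done with -i (--itemize-changes), where lines for regular files that would be transferred start with ">f" and the path follows the first space.

```shell
#!/bin/sh
# predelete_changed DEST (hypothetical helper)
# Reads `rsync -n -i ...` output on stdin and removes from DEST every
# regular file that rsync reports it would transfer (lines starting ">f").
predelete_changed() {
    dest=$1
    while IFS= read -r line; do
        case $line in
            '>f'*)
                path=${line#* }          # drop the itemize flags up to the first space
                rm -f -- "$dest/$path"   # brand-new files are not there yet; -f ignores that
                ;;
        esac
    done
}

# Intended use, with the variables from the backup script above:
#   rsync -a -n -i --del --link-dest=../$yest host:/dirs $today | predelete_changed $today
#   rsync -a --del --link-dest=../$yest host:/dirs $today
```

Caveats: the two runs are no longer atomic, so a file that changes on the source in between is handled by the second run only. Files that really did change lose their old copy as a delta basis and get transferred whole. File names containing newlines or other characters that rsync escapes in its output are not handled.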

Addendum: I tried to pre-link $yesterday and $today on the backup server, before the backup against the production boxes, with rsync --link-dest=../$yesterday $yesterday/ $today, but got the same result: any file that exists in any way, even with 0 length, will never be removed and link-dested. Instead a whole new copy is made from the source dir, with a new inode and using up more disk space.

Looking at pax(1) as a possible pre-linking-before-backup solution.

Best Answer

(Converted from question edit)

This is solved by upgrading rsync. Version 3.1.1 or later will now replace identical files in the target and --link-dest directory with one hardlinked file. Saves lots of space.
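
To check whether an installed rsync is at least 3.1.1, a small version comparison can be scripted. This is a sketch: version_ge and rsync_version are hypothetical helper names, the comparison only handles purely numeric dotted versions, and the parsing assumes the first line of rsync --version looks like "rsync  version 3.1.1  protocol version 31". Presumably the version that matters is the one on the backup server, since that is where the links are created.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted numeric version A >= B.
version_ge() {
    left=$1
    right=$2
    while [ -n "$left" ] || [ -n "$right" ]; do
        l=${left%%.*}    # first remaining field of A (empty counts as 0)
        r=${right%%.*}   # first remaining field of B
        if [ "${l:-0}" -gt "${r:-0}" ]; then return 0; fi
        if [ "${l:-0}" -lt "${r:-0}" ]; then return 1; fi
        # Move on to the next field, or mark the version as exhausted.
        case $left in *.*) left=${left#*.} ;; *) left= ;; esac
        case $right in *.*) right=${right#*.} ;; *) right= ;; esac
    done
    return 0
}

# rsync_version: prints the third word of the first line of
# `rsync --version`, with a leading "v" removed if present.
rsync_version() {
    rsync --version | awk 'NR == 1 { sub(/^v/, "", $3); print $3 }'
}

# Example:
#   version_ge "$(rsync_version)" 3.1.1 || echo "rsync is older than 3.1.1"
```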
