Starting from my comment, I reached a solution:
The whole command line is like this: $ find . -type f -printf "%T@ %p %s\n" | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $2}}' | tar -cvf toto.tar -T - && ssh -n prd "rm -rf dir/*" && scp toto.tar prd:tmp/ && ssh -n prd "tar xvf tmp/toto.tar"
The command starts with a find that looks for all files in the current directory (adapt this to the directory path on server A) and prints three fields:
- %T@ prints the Unix timestamp
- %p prints the file path relative to where find is launched
- %s prints the size of the file in bytes
- \n adds a newline, of course.
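As a small self-contained sketch (using a hypothetical scratch directory, and assuming GNU find, which provides -printf), here is what one such line looks like:

```shell
# Create a scratch file and show the three fields that GNU find's -printf
# emits for it: fractional epoch mtime (%T@), path (%p), size in bytes (%s).
tmp=$(mktemp -d)
touch "$tmp/example.txt"
line=$(find "$tmp" -type f -printf "%T@ %p %s\n")
echo "$line"
rm -r "$tmp"
```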
The output of find is then sorted by sort -n -r, which reverse-sorts numerically on the first field, ordering the Unix timestamps from most recent to oldest.
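A quick sketch with made-up timestamps shows the ordering:

```shell
# Three "timestamp path size" lines in scrambled order; sort -n -r sorts
# numerically on the leading field, descending, so the newest file comes first.
sorted=$(printf '%s\n' \
  '1700000100 ./old 5' \
  '1700000300 ./new 5' \
  '1700000200 ./mid 5' | sort -n -r)
echo "$sorted"
```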
To deal with the size constraint, awk helps a little by printing the second field of the sort output until the sum of the sizes exceeds the limit. For each line it processes, it adds the value of the third field (the size) to the local variable i, then prints the second field of the sort output if i is still under the limit.
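The awk filter on its own behaves like this (made-up timestamps and sizes, same 200000-byte budget as in the run below):

```shell
# Newest-first "timestamp path size" lines; awk adds field 3 (size) to a
# running total i and prints field 2 (path) while i stays within the
# 200000-byte budget. The third line would push i to 250000, so it is dropped.
kept=$(printf '%s\n' \
  '1700000300 ./new.log 150000' \
  '1700000200 ./mid.log 40000' \
  '1700000100 ./old.log 60000' \
  | awk '{ i+=$3; if (i<=200000) {print $2} }')
echo "$kept"
```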
tar -cvf toto.tar -T - builds an archive called toto.tar from the file list provided by the awk output.
On success, the backup on server B is deleted first; then scp toto.tar prd:tmp/ transfers the archive to the remote server (server B), and ssh -n prd "tar xvf tmp/toto.tar" unpacks the transferred archive in the remote folder, preserving the directory structure.
My previous solution based on scp did not preserve the directory structure, which is why I edited this answer.
Here is the result of a run in my home directory with a maximum size of 200 KB:
$ rm toto.tar; find . -type f -printf "%T@ %p %s\n" | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $2}}' | tar -cvf toto.tar -T - && scp toto.tar prd:tmp/ && ssh -n prd "tar xvf tmp/toto.tar"
./.lesshst
./.viminfo
./scpplus
./.config/xfce4/desktop/icons.screen0-1350x650.rc
./.xsession-errors
./.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-panel.xml
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-source
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-sink
./.cache/imsettings/log
./.cache/gpg-agent-info
./.ICEauthority
./.vboxclient-draganddrop.pid
./.vboxclient-seamless.pid
./.vboxclient-display.pid
./.vboxclient-clipboard.pid
./.dbus/session-bus/7f14833c645d4a6abb0beba68b79e0c0-0
./.cache/xscreensaver/xscreensaver-getimage.cache
./.config/xfce4/desktop/icons.screen0-1264x950.rc
./work/fpart-0.9.2/src/fpart
toto.tar 100% 170KB 170.0KB/s 00:00
./.lesshst
./.viminfo
./scpplus
./.config/xfce4/desktop/icons.screen0-1350x650.rc
./.xsession-errors
./.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-panel.xml
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-source
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-sink
./.cache/imsettings/log
./.cache/gpg-agent-info
./.ICEauthority
./.vboxclient-draganddrop.pid
./.vboxclient-seamless.pid
./.vboxclient-display.pid
./.vboxclient-clipboard.pid
./.dbus/session-bus/7f14833c645d4a6abb0beba68b79e0c0-0
./.cache/xscreensaver/xscreensaver-getimage.cache
./.config/xfce4/desktop/icons.screen0-1264x950.rc
./work/fpart-0.9.2/src/fpart
The main concern is that this solution removes the backup folder before transferring the latest 10 GB of data from the primary server. It is not very efficient when the set of newest data and the set of already backed-up data share many files/directories. But it is a very easy way to really track down the latest 10 GB (or whatever limit) of newest data, whatever the data is (quick and dirty).
Update 2: I finally reached a second solution, which I will explain now.
It is not efficiently coded; it is a big one-liner that could be turned into a shell script with basic checks for failures or strangely formatted file names.
The biggest issue with the first solution is that it always tries to back up the last 10 GB of newest files, regardless of what is already backed up. This means that if there are only 100 MB of new files at the next launch, it will erase the whole backup and transfer 10 GB of data again (the 100 MB of newest files, plus 9.9 GB of older ones).
Here is the one-liner:
ssh -n prd 'cd /var/tmp/test/ && find . -type f -printf "%T@ %p %s\n" ' |awk '{ print int($1)" "$2" "$3 }'|sort -n -r >/tmp/remote ; find . -type f -printf "%T@ %p %s\n" |awk '{ print int($1)" "$2" "$3 }'|sort -n -r | awk '{ i+=$3; if (i<=200000) {print $1" "$2" "$3}}'>/tmp/locale; grep -F -x -v -f /tmp/remote /tmp/locale |cut -d" " -f2 >/tmp/newfile;grep -F -x -v -f /tmp/locale /tmp/remote |cut -d" " -f2 >/tmp/toremove; cat /tmp/toremove |while read i; do echo "removing $i on remote server"; ssh -n prd "rm /var/tmp/test/$i"; done ; cat /tmp/newfile | tar -cvf toto.tar -T -&& scp toto.tar prd:/var/tmp/test/ && ssh -n prd "cd /var/tmp/test; tar xvf /var/tmp/test/toto.tar; rm /var/tmp/test/toto.tar"; rm /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar
Of course, replace prd with your server B, as well as all the directory paths on the local/remote server (except for the temporary files being created). Beware: this does not deal with broken file names containing spaces or special characters.
Explanation:
The main idea is to determine which of the newest files are not yet backed up on the backup server, erase the files that are too old on the backup server, and transfer only the newest files not already present, all while keeping the size limit in mind.
- First, connect to the backup server and grab the list of backed-up files:
ssh -n prd 'cd /var/tmp/test/ && find . -type f -printf "%T@ %p %s\n" ' | awk '{ print int($1)" "$2" "$3 }' | sort -n -r >/tmp/remote ;
I have to remove the fractional part of the timestamp because of an issue with tar, which always sets the fractional part to 0. This means the dates on the backup server and the origin server would differ in their fractional parts. The sort orders from the biggest value of the first field to the lowest, which means from the newest file to the oldest. I save the result into the /tmp/remote file. There is no need to check the total size, as I always transferred less than 10 GB in the previous backup.
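The int($1) normalization can be seen in isolation (made-up timestamp and path):

```shell
# awk's int() truncates the fractional part of the %T@ timestamp, so the
# local and remote listings compare equal even though tar zeroes the
# sub-second mtime on the backup side.
normalized=$(echo '1700000000.1234567890 ./somefile 42' \
  | awk '{ print int($1)" "$2" "$3 }')
echo "$normalized"
```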
- Second, I do the same locally to get the list of the newest files whose summed size is under the limit:
find . -type f -printf "%T@ %p %s\n" | awk '{ print int($1)" "$2" "$3 }' | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $1" "$2" "$3}}' >/tmp/locale;
I save the result into /tmp/locale.
So in fact, all files that are in /tmp/locale and not in /tmp/remote are the newest files to be synced to the backup server. All files that are in /tmp/remote and not in /tmp/locale are the files to be removed from the backup server (too old).
To distinguish those subsets, I use grep:
grep -F -x -v -f /tmp/remote /tmp/locale |cut -d" " -f2 >/tmp/newfile;
This prints all lines contained in /tmp/locale and not in /tmp/remote, which I save into /tmp/newfile.
grep -F -x -v -f /tmp/locale /tmp/remote |cut -d" " -f2 >/tmp/toremove;
This prints all lines contained in /tmp/remote and not in /tmp/locale, which I save into /tmp/toremove.
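The grep set-difference trick can be demonstrated with two tiny stand-in lists (the /tmp/demo_* names are just for this example):

```shell
# Stand-in lists: ./b appears in both (already backed up), ./a only
# locally (new file to send), ./c only remotely (stale file to remove).
printf '%s\n' '1700000300 ./a 10' '1700000200 ./b 20' > /tmp/demo_locale
printf '%s\n' '1700000200 ./b 20' '1700000100 ./c 30' > /tmp/demo_remote
new=$(grep -F -x -v -f /tmp/demo_remote /tmp/demo_locale | cut -d" " -f2)
stale=$(grep -F -x -v -f /tmp/demo_locale /tmp/demo_remote | cut -d" " -f2)
echo "new: $new"
echo "stale: $stale"
rm /tmp/demo_locale /tmp/demo_remote
```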
So now I have the list of files to delete remotely and the list of files to transfer to the backup server, keeping the directory structure. I will use tar to build the local archive to send to the backup server, delete the old files remotely, then transfer the archive and unpack it. And then we are almost done: I remove the temporary files in /tmp for cleanup.
In detail, this gives:
cat /tmp/toremove |while read i; do echo "removing $i on remote server"; ssh -n prd "rm /var/tmp/test/$i"; done ;
This loop reads the file list it receives from cat, displays a little message saying which file it deletes, and launches the remote rm via ssh.
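The loop shape can be tried locally by swapping the remote rm for a plain echo (the /tmp/demo_toremove name is just for this example):

```shell
# Same loop shape with the remote rm replaced by a local echo, so it can
# be run anywhere; each line of the list triggers one "removal" message.
printf '%s\n' './old1' './old2' > /tmp/demo_toremove
msgs=$(cat /tmp/demo_toremove | while read i; do
  echo "removing $i on remote server"
done)
echo "$msgs"
rm /tmp/demo_toremove
```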
cat /tmp/newfile | tar -cvf toto.tar -T -&& scp toto.tar prd:/var/tmp/test/ && ssh -n prd "cd /var/tmp/test; tar xvf /var/tmp/test/toto.tar; rm /var/tmp/test/toto.tar";
This builds the local toto.tar archive containing all the files listed in /tmp/newfile. On success, I transfer it to the remote server and unpack it remotely via ssh; I also remove the archive on the backup server so that it will not interfere with the next run.
rm /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar
is the local cleanup of the files used during this run.
This one-liner can be shortened by removing the temporary files and piping the output of grep directly into the while loop and the tar command.
It can also be improved to handle the return status of every command (not enough space to build the archive, scp or ssh errors, ...) and strange file names (with spaces or special characters, to avoid messing up parameter expansion).
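On the strange-file-names point, one possible direction (a minimal sketch, assuming GNU find and GNU tar, and omitting the sort/awk selection step) is to pass names NUL-separated:

```shell
# find -print0 and tar --null pass file names NUL-separated, so names
# with spaces survive; the sort/awk selection logic is omitted here.
tmp=$(mktemp -d)
arc=$(mktemp)
touch "$tmp/file with space.txt"
(cd "$tmp" && find . -type f -print0 | tar --null -cf "$arc" -T -)
listing=$(tar -tf "$arc")
echo "$listing"
rm -r "$tmp" "$arc"
```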
Best Answer
I am not sure whether you can do this with existing Linux commands such as rsync or diff. In my case I had to write my own script in Python, as Python has the "filecmp" module for file comparison. I have posted the whole script and usage on my personal site - http://linuxfreelancer.com/
Its usage is simple - give it the absolute paths of the new directory, the old directory, and the difference directory, in that order.