Synchronize only the newest files and folders, limit by size

rsync

I would like to sync Folder A to Folder B with the following restrictions:

  • Folder A contains thousands of files in hundreds of folders and many hundreds of GBs of files.

  • Folder B will contain the newest 10GB (or whatever) of files (and all necessary folders) from Folder A to replicate the files and structure of just the newest files in Folder A.

  • Each time the sync is run, Folder B will be limited to the newest 10GB of files from Folder A with older files being discarded.

  • Empty directories should be pruned, but I can do that myself if need be.

I am not necessarily attached to doing this via rsync if there is a better tool that can run under bash or OS X, but it needs to be able to run independently and without a login session.

Best Answer


Starting from my comment, I arrived at a solution.

The whole command line looks like this: $ find . -type f -printf "%T@ %p %s\n" | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $2}}' | tar -cvf toto.tar -T - && ssh -n prd "rm -rf dir/*" && scp toto.tar prd:tmp/ && ssh -n prd "tar xvf tmp/toto.tar"

The command starts with a find that looks for every file in the current directory (adapt this to the directory path on server A) and prints three fields per file (a sample of this output is shown after the list):

  • %T@ prints the Unix timestamp
  • %p prints the file path, relative to where find is launched
  • %s prints the size of the file in bytes
  • \n adds a newline, of course.
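Note that -printf is a GNU find feature; the stock BSD find shipped with OS X does not support it, but GNU findutils can be installed. For illustration, a line of this output looks roughly like the following (the values here are made up): timestamp with fractional seconds, path, size in bytes.

1465477200.1234567890 ./work/fpart-0.9.2/src/fpart 1397968
1465391400.0000000000 ./.cache/imsettings/log 12043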

The output of find is then sorted with sort -n -r, which reverse-sorts numerically on the first field, ordering the Unix timestamps from most recent to oldest.

To deal with the size constraint, awk helps a little by printing the second field of the sorted output for as long as the running total of sizes stays under the limit. For each line it processes, it adds the value of the third field (the size) to the variable i, then prints the second field (the path) if i is still under the limit.
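Spelled out with comments, the awk step is simply the following (200000 is the byte limit used in the test run below; for 10 GB you would put 10737418240 instead):

awk '{
    i += $3               # add this file's size (third field) to the running total
    if (i <= 200000)      # while the total is still under the limit...
        print $2          # ...print the file path (second field)
}'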

tar -cvf toto.tar -T - builds an archive called toto.tar from the file list provided by the awk output; -T - tells tar to read that list from standard input.

If the archive is built successfully, the old backup on server B is deleted first (ssh -n prd "rm -rf dir/*"); then scp toto.tar prd:tmp/ transfers the archive to the remote server (server B), and ssh -n prd "tar xvf tmp/toto.tar" unpacks the transferred archive on the remote side, preserving the directory structure.
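For readability, here is the same pipeline split over several lines as a minimal sketch; the source directory and size limit are placeholders to adapt, and the remote host prd and the dir/ and tmp/ paths mirror the original command (note that, as in the original, the final tar unpacks relative to the remote login directory):

#!/bin/sh
# Sketch of the first solution; adjust the placeholders to your setup.
SRC=/path/to/folderA           # directory to back up on server A (hypothetical)
LIMIT=10737418240              # 10 GB expressed in bytes

cd "$SRC" || exit 1
find . -type f -printf "%T@ %p %s\n" \
    | sort -n -r \
    | awk -v limit="$LIMIT" '{ i += $3; if (i <= limit) print $2 }' \
    | tar -cvf toto.tar -T - \
    && ssh -n prd "rm -rf dir/*" \
    && scp toto.tar prd:tmp/ \
    && ssh -n prd "tar xvf tmp/toto.tar"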

My previous solution, based on scp alone, did not preserve the directory structure, which is why I edited this answer.

Here is the result of a run in my home directory with a maximum size of 200 kB:

$ rm toto.tar; find . -type f -printf "%T@ %p %s\n"| sort -n -r | awk '{ i+=$3; if (i<=200000) {print $2}}'  | tar -cvf toto.tar -T -&& scp toto.tar prd:tmp/ && ssh -n prd "tar xvf tmp/toto.tar"
./.lesshst
./.viminfo
./scpplus
./.config/xfce4/desktop/icons.screen0-1350x650.rc
./.xsession-errors
./.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-panel.xml
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-source
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-sink
./.cache/imsettings/log
./.cache/gpg-agent-info
./.ICEauthority
./.vboxclient-draganddrop.pid
./.vboxclient-seamless.pid
./.vboxclient-display.pid
./.vboxclient-clipboard.pid
./.dbus/session-bus/7f14833c645d4a6abb0beba68b79e0c0-0
./.cache/xscreensaver/xscreensaver-getimage.cache
./.config/xfce4/desktop/icons.screen0-1264x950.rc
./work/fpart-0.9.2/src/fpart
toto.tar                                                                                                                              100%  170KB 170.0KB/s   00:00    
./.lesshst
./.viminfo
./scpplus
./.config/xfce4/desktop/icons.screen0-1350x650.rc
./.xsession-errors
./.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-panel.xml
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-source
./.config/pulse/7f14833c645d4a6abb0beba68b79e0c0-default-sink
./.cache/imsettings/log
./.cache/gpg-agent-info
./.ICEauthority
./.vboxclient-draganddrop.pid
./.vboxclient-seamless.pid
./.vboxclient-display.pid
./.vboxclient-clipboard.pid
./.dbus/session-bus/7f14833c645d4a6abb0beba68b79e0c0-0
./.cache/xscreensaver/xscreensaver-getimage.cache
./.config/xfce4/desktop/icons.screen0-1264x950.rc
./work/fpart-0.9.2/src/fpart

The main concern is that this solution removes the backup folder before transferring the latest 10 GB of data from the primary server. It is not very efficient if the set of newest data and the set of already backed-up data have many files and directories in common. But it is a very easy way to really track down the latest 10 GB (or whatever limit) of newest data, whatever that data is (quick and dirty).

Update 2: an explanation of the second solution.

I finally arrived at a second solution, which I will explain now. It is not efficiently coded; it is one big one-liner, and it could be turned into a shell script with basic checks for command failures and oddly formatted file names.

The biggest issue with the first solution is that it always tries to back up the newest 10 GB of files, regardless of what has already been backed up. This means that if there are only 100 MB of new files at the next launch, it will still erase the whole backup and transfer 10 GB of data again (the 100 MB of newest files, plus 9.9 GB of slightly older ones).

Here is the one-liner: ssh -n prd 'cd /var/tmp/test/ && find . -type f -printf "%T@ %p %s\n"' | awk '{ print int($1)" "$2" "$3 }' | sort -n -r >/tmp/remote ; find . -type f -printf "%T@ %p %s\n" | awk '{ print int($1)" "$2" "$3 }' | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $1" "$2" "$3}}' >/tmp/locale ; grep -F -x -v -f /tmp/remote /tmp/locale | cut -d" " -f2 >/tmp/newfile ; grep -F -x -v -f /tmp/locale /tmp/remote | cut -d" " -f2 >/tmp/toremove ; cat /tmp/toremove | while read i; do echo "removing $i on remote server"; ssh -n prd "rm /var/tmp/test/$i"; done ; cat /tmp/newfile | tar -cvf toto.tar -T - && scp toto.tar prd:/var/tmp/test/ && ssh -n prd "cd /var/tmp/test; tar xvf /var/tmp/test/toto.tar; rm /var/tmp/test/toto.tar" ; rm /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar

Of course, change prd to your server B, and adjust all the directory paths on the local and remote servers (except for the temporary files that are created). Beware that this does not handle file names containing spaces or special characters.

Explanation:

The main idea is to figure out which of the newest files are not yet backed up on the backup server, erase the files on the backup server that are now too old, and transfer only the newest files that are missing, all while keeping the size limit in mind.

  • First, connect to the backup server and grab the list of backed-up files: ssh -n prd 'cd /var/tmp/test/ && find . -type f -printf "%T@ %p %s\n"' | awk '{ print int($1)" "$2" "$3 }' | sort -n -r >/tmp/remote. I strip the fractional part of the timestamp because of an issue with tar, which always sets the fractional part to 0; the dates on the backup server and the origin server would otherwise differ in that fractional part. The sort orders from the biggest value of the first field to the lowest, i.e. from the newest file to the oldest, and the result is saved into /tmp/remote. There is no need to check the total size here, since the previous backup never transferred more than 10 GB.
  • Second, do the same locally to get the list of the newest files whose summed size stays under the limit: find . -type f -printf "%T@ %p %s\n" | awk '{ print int($1)" "$2" "$3 }' | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $1" "$2" "$3}}' >/tmp/locale. The result is saved into /tmp/locale; both files use the same line format (see the sample just after this list).
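Both list files then contain lines of the form integer timestamp, path, size, for example (the values here are made up):

1465477200 ./work/fpart-0.9.2/src/fpart 1397968
1465391400 ./.cache/imsettings/log 12043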

So in fact, all files that are in /tmp/locale and not in /tmp/remote are the newest files, to be synced to the backup server.
All files that are in /tmp/remote and not in /tmp/locale are files to be removed from the backup server (too old).

To compute those set differences, I use grep (a small illustration follows this list):

  • grep -F -x -v -f /tmp/remote /tmp/locale | cut -d" " -f2 >/tmp/newfile prints every line contained in /tmp/locale but not in /tmp/remote; the paths (second field) are saved into /tmp/newfile
  • grep -F -x -v -f /tmp/locale /tmp/remote | cut -d" " -f2 >/tmp/toremove prints every line contained in /tmp/remote but not in /tmp/locale; the paths are saved into /tmp/toremove
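As a tiny illustration of those grep options (-F fixed strings, -x whole-line match, -v invert the match, -f read the patterns from a file), using two throwaway files so as not to clobber the real lists:

$ printf '%s\n' "a 1" "b 2" "c 3" > /tmp/demo.locale
$ printf '%s\n' "b 2" "c 3" "d 4" > /tmp/demo.remote
$ grep -F -x -v -f /tmp/demo.remote /tmp/demo.locale
a 1
$ grep -F -x -v -f /tmp/demo.locale /tmp/demo.remote
d 4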

So now I have the list of files to delete remotely and the list of files to transfer to the backup server, preserving the directory structure.

I use tar to build the local archive to send to the backup server, delete the old files remotely, then transfer the archive and unpack it.

Then we are almost done; I remove the temporary files in /tmp to clean up.

In detail, this gives: cat /tmp/toremove | while read i; do echo "removing $i on remote server"; ssh -n prd "rm /var/tmp/test/$i"; done. This loop reads the file list fed to it by cat, prints a small message saying which file is being deleted, and runs the remote rm via ssh.
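The same loop can be written without the cat by reading the list through a redirection (ssh -n matters here, so that ssh does not swallow the loop's input):

# delete each listed path on the backup server, one ssh call per file
while read i; do
    echo "removing $i on remote server"
    ssh -n prd "rm /var/tmp/test/$i"
done < /tmp/toremove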

cat /tmp/newfile | tar -cvf toto.tar -T - && scp toto.tar prd:/var/tmp/test/ && ssh -n prd "cd /var/tmp/test; tar xvf /var/tmp/test/toto.tar; rm /var/tmp/test/toto.tar" builds the local toto.tar archive containing all the files listed in /tmp/newfile. On success, it is transferred to the remote server and unpacked there via ssh; the archive is then removed on the backup server as well, so it does not interfere with the next run.
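Since tar's -T option can read the file list directly from a file, the cat can be dropped here as well; an equivalent form of this step would be:

# build the archive from the list of new files, ship it, unpack it remotely
tar -cvf toto.tar -T /tmp/newfile \
    && scp toto.tar prd:/var/tmp/test/ \
    && ssh -n prd "cd /var/tmp/test; tar xvf toto.tar; rm toto.tar"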

rm /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar is the local cleanup of the files used during this run.

This one-liner could be shortened by removing the temporary files and piping the output of grep directly into the while loop and the tar command.
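One way that could look, keeping /tmp/remote and /tmp/locale as the two lists but dropping /tmp/newfile and /tmp/toremove (a sketch, using the same hypothetical prd host and /var/tmp/test path as above):

# removal step without the /tmp/toremove temporary file
grep -F -x -v -f /tmp/locale /tmp/remote | cut -d" " -f2 | while read i; do
    echo "removing $i on remote server"
    ssh -n prd "rm /var/tmp/test/$i"
done

# archive step without the /tmp/newfile temporary file
grep -F -x -v -f /tmp/remote /tmp/locale | cut -d" " -f2 | tar -cvf toto.tar -T -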

It can also be improved to handle the return status of every command (not enough space to build the archive, scp or ssh errors...) and strange file names (with spaces or special characters, to avoid trouble with parameter expansion).
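As a rough sketch of what that shell-script version could look like (same hypothetical prd host and /var/tmp/test path as above; the checks are deliberately minimal, and file names with spaces are still not handled):

#!/bin/sh
# Sketch of the second solution with basic error checks.
REMOTE=prd                     # server B (placeholder)
REMOTEDIR=/var/tmp/test        # backup directory on server B (placeholder)
SRCDIR=/path/to/folderA        # source directory on server A (placeholder)
LIMIT=200000                   # size limit in bytes (200 kB here, 10 GB in real use)

set -e                         # stop on the first failing command
cd "$SRCDIR"

# 1. list of files already on the backup server (integer timestamp, path, size)
ssh -n "$REMOTE" "cd $REMOTEDIR && find . -type f -printf '%T@ %p %s\n'" \
    | awk '{ print int($1)" "$2" "$3 }' | sort -n -r >/tmp/remote

# 2. newest local files whose summed size stays under the limit
find . -type f -printf "%T@ %p %s\n" \
    | awk '{ print int($1)" "$2" "$3 }' | sort -n -r \
    | awk -v limit="$LIMIT" '{ i+=$3; if (i<=limit) print $1" "$2" "$3 }' >/tmp/locale

# 3. set differences: files to send, files to delete remotely
grep -F -x -v -f /tmp/remote /tmp/locale | cut -d" " -f2 >/tmp/newfile
grep -F -x -v -f /tmp/locale /tmp/remote | cut -d" " -f2 >/tmp/toremove

# 4. delete the files that are now too old on the backup server
while read i; do
    echo "removing $i on remote server"
    ssh -n "$REMOTE" "rm $REMOTEDIR/$i"
done < /tmp/toremove

# 5. archive the missing newest files, ship and unpack them
if [ -s /tmp/newfile ]; then
    tar -cvf toto.tar -T /tmp/newfile
    scp toto.tar "$REMOTE:$REMOTEDIR/"
    ssh -n "$REMOTE" "cd $REMOTEDIR && tar xvf toto.tar && rm toto.tar"
fi

# 6. local cleanup
rm -f /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar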
