I would like to sync Folder A to Folder B with the following restrictions:

- Folder A contains thousands of files in hundreds of folders, totalling many hundreds of GBs.
- Folder B will contain the newest 10 GB (or whatever) of files (and all necessary folders) from Folder A, replicating the files and structure of just the newest files in Folder A.
- Each time the sync is run, Folder B will be limited to the newest 10 GB of files from Folder A, with older files being discarded.
- Empty directories should be pruned, but I can do that myself if need be.
I am not necessarily attached to doing this via rsync if there is a better tool that can run under bash or OS X, but it needs to be able to run independently and without a login session.
Best Answer
Starting from my comment, I arrived at a solution.
The whole command line is like this:
```shell
find . -type f -printf "%T@ %p %s\n" | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $2} }' | tar -cvf toto.tar -T - && ssh -n prd "rm -rf dir/*" && scp toto.tar prd:tmp/ && ssh -n prd "tar xvf tmp/toto.tar"
```
The command starts with `find`, which looks for all files in the current directory (this should be adapted to the directory path on server A) and prints three fields: the modification time as a Unix timestamp, the path, and the size.

The output of `find` is then sorted by `sort -n -r`, which reverse-sorts numerically on the first field, ordering the Unix timestamps from most recent to oldest.

To deal with the size constraint, `awk` helps a little by printing the second field of the `sort` output until the running sum of sizes goes over the limit. For each line it processes, it adds the value of the third field (the size) to the local variable `i`, then prints the second field of the line if `i` is still under the limit.

`tar -cvf toto.tar -T -` builds an archive called `toto.tar` from the file list provided by the `awk` output.

On success, the backup on server B is deleted first; then `scp toto.tar prd:tmp/` transfers the archive to the remote server (server B), and `ssh -n prd "tar xvf tmp/toto.tar"` unpacks the transferred archive in the remote folder, preserving the directory structure.

My previous solution based on plain `scp` did not preserve the directory structure, which is why I edited this answer.
Here is the result of a run in my home directory with a max size of 200 kB:
The main concern is that this solution removes the backup folder before transferring the latest 10 GB of data from the primary server. It is not very efficient if the set of newest data and the set of already backed-up data have many files and directories in common, but it is a very easy way to really track down the latest 10 GB (or whatever) of newest data, whatever the data is (quick and dirty).
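As a sanity check, the size-capped selection step can be demonstrated on synthetic input. The timestamps, paths, and the 300-byte cap below are made up for illustration; they stand in for the real `%T@ %p %s` output and the real limit:

```shell
# Three fake "timestamp path size" lines, newest-first after the sort,
# capped at 300 bytes total by the awk accumulator.
printf '%s\n' \
  '1700000100 ./new.log 200' \
  '1700000050 ./mid.log 150' \
  '1700000000 ./old.log 100' |
  sort -n -r |
  awk '{ i += $3; if (i <= 300) print $2 }'
# Prints only ./new.log: 200 <= 300, but 200+150 = 350 exceeds the cap.
```

Note that once the running total `i` passes the limit it never shrinks, so every file older than the first one that overflows is dropped, even if it would individually fit.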
Update 2: here is an explanation of the second solution.
I finally reached a second solution, which I will explain now. It is not efficiently coded; it is a big one-liner that could be turned into a shell script with basic checks for failures and strange file-name formatting.

The biggest issue with the first solution is that it always tries to back up the newest 10 GB of files, regardless of what is already backed up. This means that if there are only 100 MB of new files at the next launch, it will erase the whole backup and transfer 10 GB of data again (the 100 MB of newest files, plus 9.9 GB of slightly older ones).

Here is the one-liner:
```shell
ssh -n prd 'cd /var/tmp/test/ && find . -type f -printf "%T@ %p %s\n"' | awk '{ print int($1)" "$2" "$3 }' | sort -n -r >/tmp/remote
find . -type f -printf "%T@ %p %s\n" | awk '{ print int($1)" "$2" "$3 }' | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $1" "$2" "$3} }' >/tmp/locale
grep -F -x -v -f /tmp/remote /tmp/locale | cut -d" " -f2 >/tmp/newfile
grep -F -x -v -f /tmp/locale /tmp/remote | cut -d" " -f2 >/tmp/toremove
cat /tmp/toremove | while read i; do echo "removing $i on remote server"; ssh -n prd "rm /var/tmp/test/$i"; done
cat /tmp/newfile | tar -cvf toto.tar -T - && scp toto.tar prd:/var/tmp/test/ && ssh -n prd "cd /var/tmp/test; tar xvf /var/tmp/test/toto.tar; rm /var/tmp/test/toto.tar"
rm /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar
```
Of course, change `prd` to your server B, and adjust all the directory paths on the local and remote servers (except for the temporary files created in `/tmp`). Beware: this does not deal with awkward file names containing spaces or special characters.

Explanation:
The main idea is to know which of the newest files are not yet backed up on the backup server, erase the files on the backup server that are too old, and transfer only the newest files that are not already present, all while keeping the size limit in mind.
```shell
ssh -n prd 'cd /var/tmp/test/ && find . -type f -printf "%T@ %p %s\n"' | awk '{ print int($1)" "$2" "$3 }' | sort -n -r >/tmp/remote
```

This lists the files already present on the backup server. I have to remove the fractional part of the timestamp because of an issue with `tar`, which always sets the fractional part to `0`; the dates on the backup server and the origin server would otherwise differ in their fractional parts. The sort orders from the biggest value of the first field to the lowest, i.e. from the newest file to the oldest, and I save the result into the `/tmp/remote` file. There is no need to enforce the size cap here, as I always transferred less than 10 GB in the previous backup.

```shell
find . -type f -printf "%T@ %p %s\n" | awk '{ print int($1)" "$2" "$3 }' | sort -n -r | awk '{ i+=$3; if (i<=200000) {print $1" "$2" "$3} }' >/tmp/locale
```

This applies the same processing to the local files, with the size cap, and I save the result into `/tmp/locale`.
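The timestamp truncation can be seen in isolation; the timestamp and file name below are invented for the demo:

```shell
# awk's int() drops the fractional part that find's %T@ produces,
# so local and remote lines compare equal after a tar round-trip.
echo '1700000123.4567890000 ./file 42' | awk '{ print int($1)" "$2" "$3 }'
# → 1700000123 ./file 42
```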
So in fact, all the files which are in `/tmp/locale` and not in `/tmp/remote` are the newest files, to be synced to the backup server. All the files which are in `/tmp/remote` and not in `/tmp/locale` are the files to be removed on the backup server (too old). To distinguish these subsets, I use `grep`:

```shell
grep -F -x -v -f /tmp/remote /tmp/locale | cut -d" " -f2 >/tmp/newfile
```

This displays all the lines contained in `/tmp/locale` and not in `/tmp/remote`, which I save into `/tmp/newfile`.

```shell
grep -F -x -v -f /tmp/locale /tmp/remote | cut -d" " -f2 >/tmp/toremove
```

This displays all the lines contained in `/tmp/remote` and not in `/tmp/locale`, which I save into `/tmp/toremove`.
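The `grep -F -x -v -f` combination acts as a set difference between two line lists, which a tiny demo makes concrete (the `/tmp/demo_*` files and their contents are hypothetical):

```shell
# Two hypothetical line lists standing in for /tmp/locale and /tmp/remote.
printf '%s\n' 'a.txt' 'b.txt' 'c.txt' > /tmp/demo_locale
printf '%s\n' 'b.txt' 'c.txt' 'd.txt' > /tmp/demo_remote

# -F fixed strings, -x whole-line match, -v invert, -f patterns from file:
grep -F -x -v -f /tmp/demo_remote /tmp/demo_locale   # → a.txt  (new files)
grep -F -x -v -f /tmp/demo_locale /tmp/demo_remote   # → d.txt  (to remove)

rm /tmp/demo_locale /tmp/demo_remote
```

`-x` matters here: without it, a pattern line like `b.txt` could match a longer path that merely contains it as a substring.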
So now I have the list of files to delete remotely, and the list of files to transfer to the backup server while keeping the directory structure.
I will use `tar` to build the local archive to send to the backup server, delete the old files remotely, then transfer the archive and unpack it. After that we are almost done: I remove the temporary files in `/tmp` for cleanup. In detail, this gives:
```shell
cat /tmp/toremove | while read i; do echo "removing $i on remote server"; ssh -n prd "rm /var/tmp/test/$i"; done
```

This loop reads the file list that `cat` feeds it, displays a little message telling me which file it deletes, and launches the remote `rm` via `ssh`.
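A local stand-in for this loop shows the shape without touching any server; the file names and the `/tmp/demo_toremove` list are invented, and `echo` replaces the real `ssh ... rm`:

```shell
# Hypothetical deletion list; in the real one-liner this is /tmp/toremove.
printf '%s\n' 'old1.log' 'sub/old2.log' > /tmp/demo_toremove

# read -r keeps backslashes in names intact; the real loop would run
# ssh -n prd "rm /var/tmp/test/$i" instead of the echo.
while read -r i; do
  echo "removing $i on remote server"
done < /tmp/demo_toremove

rm /tmp/demo_toremove
```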
```shell
cat /tmp/newfile | tar -cvf toto.tar -T - && scp toto.tar prd:/var/tmp/test/ && ssh -n prd "cd /var/tmp/test; tar xvf /var/tmp/test/toto.tar; rm /var/tmp/test/toto.tar"
```

This builds the local `toto.tar` archive containing all the files listed in `/tmp/newfile`. On success I transfer it to the remote server and unpack it remotely via `ssh`; I also remove the archive on the backup server so that it does not interfere with the next launch.

```shell
rm /tmp/remote /tmp/locale /tmp/toremove /tmp/newfile toto.tar
```

This is the local cleanup of the files used during this launch.

This one-liner can be shortened by removing the temporary files and piping the output of `grep` directly into the `while` loop and the `tar` command. It can also be improved to handle the return status of every command (not enough space to build the archive, `scp` or `ssh` errors, ...) and strange file names (with spaces or special characters, to avoid messing up parameter expansion).
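On the file-name point, a whitespace-safe variant of the selection step is possible by using NUL separators end to end. This is a sketch under the assumption of GNU `find`, `sort`, and `tar` plus bash; the demo directory, file names, and the 200000-byte cap are invented for illustration:

```shell
# Set up a hypothetical source tree: a newer 150 kB file (with a space in
# its name) and an older 100 kB file.
demo=$(mktemp -d) && cd "$demo"
head -c 150000 /dev/zero > 'new file.bin'
head -c 100000 /dev/zero > mid.bin
touch -t 202501020000 'new file.bin'   # newer
touch -t 202401010000 mid.bin          # older

# NUL-terminated records survive spaces: find emits "mtime\tpath\tsize\0",
# sort -z sorts those records newest-first, the bash loop applies the size
# cap, and tar --null reads the NUL-separated file list.
find . -type f -printf '%T@\t%p\t%s\0' |
  sort -z -n -r |
  { total=0
    while IFS=$'\t' read -r -d '' ts path size; do
      total=$(( total + size ))
      if [ "$total" -le 200000 ]; then printf '%s\0' "$path"; fi
    done; } |
  tar --null -T - -cvf toto.tar

tar -tf toto.tar   # → ./new file.bin (mid.bin would push the total past the cap)
```

The same NUL-delimited convention would have to be carried through the `grep` comparison steps (e.g. `grep -z`) for the full one-liner to become whitespace-safe.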