Recursive move (`mv -rn`, like `cp -rn`): a move that only moves files not already present

backup cp files mv rsync

Context

I have user-uploaded content that needs to be backed up. The content is spread over 3 separate servers at /var/www/domain/media/ (the same path on each server). The backup is an NFS-mounted RAID at /var/www/domain/bak/.

media/ is owned by a different user than bak/: the webapp can write to media/ but can only read bak/ (users can delete their uploads only until they are backed up at 00:00 GMT).

This has resulted in two issues: a user can reuse a filename and so overwrite the file in the backup, and a file in media/ can end up on two different servers (exactly the same file, if the user uploads it twice and each upload is handled by a different server).

All this runs on 4 CentOS 7 machines (web x 3 + backup x 1). The "web" servers have limited disk space, and moving things to the backup server is needed to keep their disks from filling up.

There are no race conditions, so that is something we do not need to care about. The backup is driven from the single backup machine, which executes commands over ssh on the other three machines, sequentially.


Current solution

The "move" of the files to backup is done after purging duplicates:

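# List every file under media/ and under bak/ (bak paths rewritten to media/).
find /var/www/domain/media/ -type f > media
find /var/www/domain/bak/ -type f | awk '{a=gensub("bak","media",1); print a}' > bak
# Paths that appear in both lists are duplicates: delete them from media/.
sort bak media | uniq -d > dupes
xargs rm < dupes
# Copy what is left to the backup, then empty media/.
cp -r /var/www/domain/media/* /var/www/domain/bak/
rm -rf /var/www/domain/media/*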

The problem with using mv is that /var/www/domain/media/ has subdirectories per user. For example:

media/user13/myvideo.webm
media/user13/walk-in-the-park.webm
media/user16/cat-video.webm
media/user17/presentation-may-2016.webm

bak/user13/mountai-trip.webm
bak/user13/walk-in-the-park.webm
bak/user14/reax-the-dog.webm

The command must create directories for user16 and user17, whilst it must avoid overwriting bak/user13/walk-in-the-park.webm.


Issue with the current solution

I would like to keep the duplicates in media/ instead of deleting them. Copying them to another place runs into the same problem, since new files will arrive during the day and I'll need to sync the dupes with their copies.

How can I move all files in media/ that are not in bak/, whilst keeping the directory structure and not removing the files already present?

In other words, I'm looking for a move that will perform:

source      | destination         | action
----------- | ------------------- | ----------------------------------
file exists | file does not exist | move (`mv`), source -> destination
file exists | file exists         | do nothing, both files stay as they are
no file     | file exists         | do nothing (will not trigger)
no file     | file does not exist | do nothing (well, there's nothing to do something with!)
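
The four rows of the table reduce to a single per-file rule: move a file only when nothing exists at the same relative path in the destination. A minimal shell sketch of that rule follows. The function name `move_missing` and the scratch directory are hypothetical, used only for illustration; it assumes filenames contain no newlines and, as stated above, that nothing else writes to either tree while it runs.

```shell
# move_missing SRC DST  (hypothetical helper, not a standard tool)
# Moves each regular file under SRC to the same relative path under DST,
# but only when nothing exists at that path yet. Existing files are left
# untouched on both sides.
move_missing() {
    src=${1%/}
    dst=${2%/}
    ( cd "$src" && find . -type f -print ) | while IFS= read -r rel; do
        rel=${rel#./}
        target=$dst/$rel
        if [ -e "$target" ]; then
            continue                      # row 2 of the table: do nothing
        fi
        # row 1 of the table: create the user directory if needed, then move
        mkdir -p "$(dirname "$target")" && mv "$src/$rel" "$target"
    done
}

# Scratch tree mirroring the example layout (paths are made up).
demo=$(mktemp -d)
mkdir -p "$demo/media/user13" "$demo/media/user16" "$demo/bak/user13"
echo new    > "$demo/media/user13/myvideo.webm"
echo upload > "$demo/media/user13/walk-in-the-park.webm"
echo cat    > "$demo/media/user16/cat-video.webm"
echo backup > "$demo/bak/user13/walk-in-the-park.webm"

move_missing "$demo/media" "$demo/bak"
```

After the call, bak/user16/ has been created, the two new files live only in bak/, and both copies of walk-in-the-park.webm are unchanged.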

Attempts at a more elegant solution

I believe rsync should be able to do this. I'm aware of --remove-source-files, but I cannot find a way to stop it from checking timestamps, checksums, file sizes, everything.

I'm keeping and checking checksums as a completely separate process.

I only care about the filenames. I'm aware that this might lead to file corruption, but I'm afraid that it is much easier to get a corrupted file on the normal disk than on the RAID server.

Non-rsync solutions are welcome. I thought of writing a shell script to perform the move (extending the script from the Current solution section). Yet, once I thought about how error-prone it would be, I gave up.

I also tried:

tar -cf - -C /var/www/domain/media . | (cd /var/www/domain/bak && tar -kxf -)

But it is both too slow for media files (which might be rather big) and keeps all files in media/ (which has limited disk space).

Best Answer

To do nothing if the file already exists in the destination tree (regardless of any metadata), pass the option --ignore-existing to rsync.

rsync -a --remove-source-files --ignore-existing /var/www/domain/media/ /var/www/domain/bak/
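
To check the behaviour before pointing it at real data, the command can be tried on a scratch tree. The sketch below is an illustration with made-up paths under a temporary directory. rsync's --remove-source-files removes files only, so emptied directories stay behind in the source; the final `find` cleanup is an addition that is not part of the command above and relies on the `-mindepth`, `-empty` and `-delete` extensions of GNU find (the version shipped with CentOS 7).

```shell
# Scratch tree mirroring the layout from the question (paths are made up).
demo=$(mktemp -d)
mkdir -p "$demo/media/user13" "$demo/media/user16" "$demo/bak/user13"
echo upload > "$demo/media/user13/walk-in-the-park.webm"
echo new    > "$demo/media/user16/cat-video.webm"
echo backup > "$demo/bak/user13/walk-in-the-park.webm"

# Same options as above: files already present in bak/ are skipped,
# and only the files that were actually transferred are removed from media/.
rsync -a --remove-source-files --ignore-existing "$demo/media/" "$demo/bak/"

# Optional follow-up (assumption: GNU find): drop the directories
# that the move left empty, keeping media/ itself.
find "$demo/media" -mindepth 1 -type d -empty -delete
```

Adding -n (--dry-run) together with -v to the rsync line prints what would be transferred without changing either tree.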