I'll go with Gilles and point you to Unison, as suggested by hasen j. Unison was Dropbox 20 years before Dropbox: rock-solid code that a lot of people (myself included) use every day, and very worthwhile to learn. Still, join needs all the publicity it can get :)
This is only half an answer, but I have to get back to work :)
Basically, I wanted to demonstrate the little-known join utility, which does just that: it joins two tables on some field.
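For anyone who hasn't used join before, here is a minimal sketch of what it does on its own. The file names left.txt and right.txt and their contents are made up for illustration; the only real requirement is that both inputs are sorted on the join field (the first column by default).

```shell
# Tiny join demo in a scratch directory (hypothetical file names and data).
work=$(mktemp -d) && cd "$work" || exit 1

# Both inputs are already sorted on their first column.
printf '%s\n' '1 apple' '2 banana' '3 cherry' > left.txt
printf '%s\n' '1 red' '3 dark-red' '4 green' > right.txt

# Prints one line per key present in BOTH files:
#   1 apple red
#   3 cherry dark-red
join left.txt right.txt
```

Keys 2 and 4 each appear in only one file, so they are dropped, which is exactly the behaviour we want when matching hashes below.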
First, set up a test case including file names with spaces:
for d in a b 'c c'; do mkdir -p "old/$d"; echo $RANDOM > "old/${d}/${d}.txt"; done
cp -r old new
(Edit some directory and/or file names in new.)
Now we want to build a map of hash -> filename for each directory and then use join to match up files with the same hash. To generate the map, put the following in makemap.sh:
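# md5 -r is the BSD/macOS form and prints: hash path
find "$1" -type f -exec md5 -r "{}" \; \
| sed "s/\([a-z0-9]*\) ${1}\/\(.*\)/\1 \"\2\"/" \
| sort  # join expects both inputs sorted on the join field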
makemap.sh spits out a file with lines of the form 'hash "filename"', so we just join on the first column:
join <(./makemap.sh 'old') <(./makemap.sh 'new') >moves.txt
This generates moves.txt, which looks like this:
49787681dd7fcc685372784915855431 "a/a.txt" "bar/a.txt"
bfdaa3e91029d31610739d552ede0c26 "c c/c c.txt" "c c/c c.txt"
The next step would be to actually do the moves, but my attempts got stuck on quoting... mv -i and mkdir -p should come in handy.
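One way around the quoting problem might be to drop the quotes altogether and use a tab as the field separator, so that spaces in names are harmless. The following is only a sketch under stated assumptions: no two files share the same content (duplicates would make join emit every pairing), no file name contains a tab or newline, and the goal is to rearrange old so that its layout matches new. The makemap function, the .map files and moves.tsv are names I made up here; it falls back from md5sum to md5 -r depending on what is installed.

```shell
#!/bin/bash
# Sketch: replay renames found by content hash (assumptions in the text above).
work=$(mktemp -d) && cd "$work" || exit 1
tab=$(printf '\t')

# Fixture mirroring the test case above, with one file moved in "new".
for d in a b 'c c'; do
  mkdir -p "old/$d"
  echo "content of $d" > "old/$d/$d.txt"
done
cp -r old new
mkdir new/bar && mv new/a/a.txt new/bar/a.txt

# Print "hash<TAB>relative/path" for every file under $1, sorted by hash.
makemap() {
  (
    cd "$1" || exit 1
    if command -v md5sum >/dev/null 2>&1; then
      find . -type f -exec md5sum {} +    # GNU: "hash  ./path"
    else
      find . -type f -exec md5 -r {} +    # BSD/macOS: "hash ./path"
    fi
  ) | sed "s/^\([0-9a-f]*\) *\.\//\1${tab}/" | LC_ALL=C sort
}

makemap old > old.map
makemap new > new.map

# Each output line is: hash<TAB>path-in-old<TAB>path-in-new
LC_ALL=C join -t "$tab" old.map new.map > moves.tsv

# Read the list on fd 3 so that mv -i can still prompt on stdin.
while IFS="$tab" read -r hash from to <&3; do
  if [ "$from" != "$to" ]; then
    mkdir -p "old/$(dirname "$to")"
    mv -i "old/$from" "old/$to"
  fi
done 3< moves.tsv
```

After this runs, old/bar/a.txt exists and old/a is left behind as an empty directory; cleaning up empty directories is a separate step.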
1st attempt (didn't work)
You need to include the directories in addition to the files:
rsync -zavC --delete --include '*/' --include='*.php' --include='*.js' \
--exclude="*" /media/datacod/Test/ /home/lucas/Desktop/rsync/
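To see why the directory rule matters, here is a small sketch that runs the same kind of transfer with and without --include='*/'. The directory and file names (src, sub, with, without) are made up for the demo.

```shell
# Sketch: compare a filtered transfer with and without --include='*/'.
work=$(mktemp -d) && cd "$work" || exit 1
mkdir -p src/sub
touch src/top.js src/sub/x.php src/sub/notes.txt

# Without the directory rule, sub/ itself matches --exclude='*',
# so rsync never looks inside it.
rsync -a --include='*.php' --include='*.js' --exclude='*' src/ without/

# With it, rsync descends into sub/ and then filters the files inside.
rsync -a --include='*/' --include='*.php' --include='*.js' --exclude='*' src/ with/

# List what actually arrived in each destination.
find without with -type f | sort
```

The first destination should end up with only top.js, while the second also gets sub/x.php (and still no notes.txt). One side effect of including every directory is that empty directories get created too; rsync's -m (--prune-empty-dirs) option is meant for that case.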
2nd attempt
rsync -avzC --filter='-rs_*/.svn*' --include="*/" --include='*.js' \
--include='*.php' --exclude="*" --delete dir1/ dir2/
test data
I wrote this script to create some sample data to test this out on. Here's that script, setup_svn_sample.bash:
#!/bin/bash
# setup .svn dirs
mkdir -p dir{1,2}/dir{1,2,3,4}/.svn
# fake data under .svn
mkdir -p dir1/dir{1,2,3,4}/.svn/origdir
mkdir -p dir2/dir{1,2,3,4}/.svn/keepdir
# files to not sync
touch dir1/dir{1,2,3,4}/file{1,2}
# files to sync
touch dir1/dir{1,2,3,4}/file1.js
touch dir1/dir{1,2,3,4}/file1.php
Running it produces the following directories:
source dir
$ tree -a dir1
dir1
|-- dir1
|   |-- file1
|   |-- file1.js
|   |-- file1.php
|   |-- file2
|   `-- .svn
|       `-- origdir
|-- dir2
|   |-- file1
|   |-- file1.js
|   |-- file1.php
|   |-- file2
|   `-- .svn
|       `-- origdir
|-- dir3
|   |-- file1
|   |-- file1.js
|   |-- file1.php
|   |-- file2
|   `-- .svn
|       `-- origdir
`-- dir4
    |-- file1
    |-- file1.js
    |-- file1.php
    |-- file2
    `-- .svn
        `-- origdir
destination dir
$ tree -a dir2
dir2
|-- dir1
|   `-- .svn
|       `-- keepdir
|-- dir2
|   `-- .svn
|       `-- keepdir
|-- dir3
|   `-- .svn
|       `-- keepdir
`-- dir4
    `-- .svn
        `-- keepdir
Running the above rsync command:
rsync -avzC --filter='-rs_*/.svn*' --include="*/" --include='*.js' \
--include='*.php' --exclude="*" --delete dir1/ dir2/
sending incremental file list
dir1/file1.js
dir1/file1.php
dir2/file1.js
dir2/file1.php
dir3/file1.js
dir3/file1.php
dir4/file1.js
dir4/file1.php
sent 480 bytes received 168 bytes 1296.00 bytes/sec
total size is 0 speedup is 0.00
Resulting dir2 afterwards:
$ tree -a dir2
dir2
|-- dir1
|   |-- file1.js
|   |-- file1.php
|   `-- .svn
|       `-- keepdir
|-- dir2
|   |-- file1.js
|   |-- file1.php
|   `-- .svn
|       `-- keepdir
|-- dir3
|   |-- file1.js
|   |-- file1.php
|   `-- .svn
|       `-- keepdir
`-- dir4
    |-- file1.js
    |-- file1.php
    `-- .svn
        `-- keepdir
Why does it work?
The key piece of this command is rsync's filter capability. Filters let you remove files from the matched set at various points in the command. In our case we're filtering out anything that matches the pattern */.svn*. In the rule -rs_*/.svn*, the leading - marks an exclude rule, the modifiers r and s say that it applies on both the receiving (target) side and the sending (source) side, and the _ separates the rule from the pattern.
excerpt from the FILTER NOTES section of rsync's man page
An s is used to indicate that the rule applies to the sending side.
When a rule affects the sending side, it prevents files from being
transferred. The default is for a rule to affect both sides unless
--delete-excluded was specified, in which case default rules become
sender-side only. See also the hide (H) and show (S) rules, which are
an alternate way to specify sending-side includes/excludes.
An r is used to indicate that the rule applies to the receiving
side. When a rule affects the receiving side, it prevents files from
being deleted. See the s modifier for more info. See also the protect
(P) and risk (R) rules, which are an alternate way to specify
receiver-side includes/excludes.
See man rsync for more details.
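The excerpt mentions hide (H) and protect (P) rules as an alternate spelling. As a sketch, the same .svn handling could be written with those two rules instead of -rs_; the man page describes H as a sender-only exclude and P as a receiver-only exclude. The src, dst and d1 names below are made up, and it's worth confirming the behaviour with --dry-run on your own rsync version.

```shell
# Sketch: the .svn handling written with hide (H) and protect (P) rules.
work=$(mktemp -d) && cd "$work" || exit 1
mkdir -p src/d1/.svn/origdir dst/d1/.svn/keepdir
touch src/d1/file1 src/d1/file1.js src/d1/file1.php

# H keeps the sender's .svn out of the transfer;
# P keeps the receiver's .svn from being removed by --delete.
rsync -a --delete \
  --filter='H */.svn*' --filter='P */.svn*' \
  --include='*/' --include='*.js' --include='*.php' --exclude='*' \
  src/ dst/

# Show what the destination looks like afterwards.
find dst | sort
```

The destination should keep d1/.svn/keepdir, gain file1.js and file1.php, and never receive origdir or the plain file1.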
Tips for figuring this out (hint: use --dry-run)
While describing how to do this I thought I'd mention the --dry-run switch to rsync. It's extremely useful for seeing what will happen without having the rsync actually take place.
For Example
Using the following command will do a test run and show us the decision logic behind rsync:
rsync --dry-run -avvzC --filter='-rs_*/.svn*' --include="*/" \
--include='*.js' --include='*.php' --exclude="*" --delete dir1/ dir2/
sending incremental file list
[sender] showing directory dir3 because of pattern */
[sender] showing directory dir2 because of pattern */
[sender] showing directory dir4 because of pattern */
[sender] showing directory dir1 because of pattern */
[sender] hiding file dir1/file1 because of pattern *
[sender] showing file dir1/file1.js because of pattern *.js
[sender] hiding file dir1/file2 because of pattern *
[sender] showing file dir1/file1.php because of pattern *.php
[sender] hiding directory dir1/.svn because of pattern */.svn*
[sender] hiding file dir2/file1 because of pattern *
[sender] showing file dir2/file1.js because of pattern *.js
[sender] hiding file dir2/file2 because of pattern *
[sender] showing file dir2/file1.php because of pattern *.php
[sender] hiding directory dir2/.svn because of pattern */.svn*
[sender] hiding file dir3/file1 because of pattern *
[sender] showing file dir3/file1.js because of pattern *.js
[sender] hiding file dir3/file2 because of pattern *
[sender] showing file dir3/file1.php because of pattern *.php
[sender] hiding directory dir3/.svn because of pattern */.svn*
[sender] hiding file dir4/file1 because of pattern *
[sender] showing file dir4/file1.js because of pattern *.js
[sender] hiding file dir4/file2 because of pattern *
[sender] showing file dir4/file1.php because of pattern *.php
[sender] hiding directory dir4/.svn because of pattern */.svn*
delta-transmission disabled for local transfer or --whole-file
[generator] risking directory dir3 because of pattern */
[generator] risking directory dir2 because of pattern */
[generator] risking directory dir4 because of pattern */
[generator] risking directory dir1 because of pattern */
[generator] protecting directory dir1/.svn because of pattern */.svn*
dir1/file1.js
dir1/file1.php
[generator] protecting directory dir2/.svn because of pattern */.svn*
dir2/file1.js
dir2/file1.php
[generator] protecting directory dir3/.svn because of pattern */.svn*
dir3/file1.js
dir3/file1.php
[generator] protecting directory dir4/.svn because of pattern */.svn*
dir4/file1.js
dir4/file1.php
total: matches=0 hash_hits=0 false_alarms=0 data=0
sent 231 bytes received 55 bytes 572.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
In the above output you can see that the .svn directories are being protected by our filter rule. This is valuable insight when debugging an rsync command.
Best Answer
If you're not going to use the remote file system as the data source of what has been transferred then you need to externally track the files that have been successfully transferred previously, then exclude them from future transfers.
rsync can include and exclude files based on patterns in a file, so you can include a specific list of files in a transfer, then exclude that list from future transfers.
You might need some more rsync-specific regex quoting, but the Perl quotemeta function and replacements was the first easy solution that came to mind.
The main problem will be dealing with any special characters in file names. If you want to deal with newlines, tabs, and other strange things in the names, then you will have to put a bit more work into the perl (or whatever) that parses and generates the inclusion pattern list. If you can restrict the names of your transfer files to a simple character set, then you don't need to worry about this step as much. The perl is a halfway solution that should get you past most common regex chars.
The reason for using the include list, rather than letting rsync pull the whole directory itself, is so that you have a defined, complete list of files for the subsequent exclude list. You could probably achieve the same result by parsing the rsync output or a --log-file=FILE for the files that were transferred, but that looked a little harder.
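As a rough sketch of the external-tracking idea, here is a variant that avoids pattern quoting entirely: instead of include/exclude patterns it uses rsync's --files-from option, which takes literal names, and subtracts the already-sent names with comm. This is a swapped-in technique, not the perl approach described above. The sync_new function and the sent.list state file are hypothetical names, and it assumes file names contain no newlines.

```shell
# Sketch: remember what was sent in a state file and skip it next time.
sync_new() {
  src=$1 dst=$2 state=$3
  mkdir -p "$dst" && touch "$state"
  # Every file currently in the source, relative to $src, sorted for comm.
  ( cd "$src" && find . -type f | sed 's|^\./||' | LC_ALL=C sort ) > "$state.all"
  # Keep only the names that are not yet recorded in the state file.
  LC_ALL=C comm -23 "$state.all" "$state" > "$state.todo"
  if [ -s "$state.todo" ]; then
    # Record the names only if rsync reports success.
    rsync -a --files-from="$state.todo" "$src/" "$dst/" &&
      LC_ALL=C sort -u -o "$state" "$state" "$state.todo"
  fi
}

# Demo in a scratch directory.
work=$(mktemp -d) && cd "$work" || exit 1
mkdir src && echo 1 > src/a.txt && echo 2 > src/b.txt
sync_new src dst sent.list        # first run sends a.txt and b.txt
rm dst/a.txt                      # the receiving side consumed a.txt
echo 3 > src/c.txt
sync_new src dst sent.list        # second run sends only c.txt
```

Because a.txt is already listed in sent.list, the second run leaves it alone even though it no longer exists on the destination, which is the behaviour you want when the remote side is not the record of what has been transferred.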