rsync – Is There Any Speed Benefit of Using tar + rsync + untar Over Just rsync?


I often find myself sending folders containing 10K–100K files to a remote machine (within the same on-campus network).

I was just wondering if there are reasons to believe that,

 tar + rsync + untar

Or simply

 tar (from src to dest) + untar

could be faster in practice than

rsync 

when transferring the files for the first time.

I am interested in an answer that addresses the above in two scenarios: using compression and not using it.
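
For concreteness, here is a minimal sketch of the pipelines I have in mind (srcdir, destdir and user@server are placeholders):

# tar streamed straight to the destination, without and with gzip compression
tar -C srcdir -cf - . | ssh user@server 'tar -C destdir -xf -'
tar -C srcdir -czf - . | ssh user@server 'tar -C destdir -xzf -'

# plain rsync, without and with compression (-z)
rsync -r srcdir/ user@server:destdir/
rsync -rz srcdir/ user@server:destdir/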

Update

I have just run some experiments moving 10,000 small files (total size = 50 MB), and tar+rsync+untar was consistently faster than running rsync directly (both without compression).

Best Answer

When you send the same set of files again, rsync is better suited because it only sends the differences. tar always sends everything, which is a waste of resources when much of the data is already there. The tar + rsync + untar approach loses this advantage, as well as the advantage of keeping the folders in sync with rsync --delete.
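
For example, a typical re-sync only transfers changed files and can also remove files that have disappeared from the source (the paths here are placeholders):

rsync -a --delete srcdir/ user@server:destdir/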

If you copy the files for the first time, first packing, then sending, then unpacking (AFAIK rsync doesn't take piped input) is cumbersome and always worse than just rsyncing, because rsync won't have to do any more work than tar anyway.

Tip: rsync version 3 or later does incremental recursion, meaning it starts copying almost immediately, before it has counted all the files.

Tip 2: If you use rsync over ssh, you may also use either tar+ssh

tar -C /src/dir -jcf - ./ | ssh user@server 'tar -C /dest/dir -jxf -'

or just scp

scp -Cr srcdir user@server:destdir

General rule: keep it simple.

UPDATE:

I've created 59M of demo data:

mkdir tmp; cd tmp
for i in {1..5000}; do dd if=/dev/urandom of=file$i count=1 bs=10k; done

and tested the file transfer to a remote server (not on the same LAN) several times, using both methods:

time rsync -r  tmp server:tmp2

real    0m11.520s
user    0m0.940s
sys     0m0.472s

time (tar cf demo.tar tmp; rsync demo.tar server: ; ssh server 'tar xf demo.tar; rm demo.tar'; rm demo.tar)

real    0m15.026s
user    0m0.944s
sys     0m0.700s

while keeping separate logs of the ssh traffic packets sent:

wc -l rsync.log rsync+tar.log 
   36730 rsync.log
   37962 rsync+tar.log
   74692 total
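
For reference, per-packet logs like these can be captured with tcpdump; the interface name below is an assumption, and the capture needs root privileges:

tcpdump -n -i eth0 port 22 > rsync.log    # one summary line per ssh packet
wc -l rsync.log                           # line count approximates packets sent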

In this case, I can't see any advantage in terms of less network traffic from using rsync+tar, which is expected given the default MTU of 1500 and files of 10k size. rsync+tar generated more traffic, was 2–3 seconds slower, and left two garbage files that had to be cleaned up.

I did the same tests on two machines on the same LAN, and there rsync+tar achieved much better times and much, much less network traffic. I assume that's because of jumbo frames.
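
To check whether jumbo frames are in play, look at the interface MTU (the interface name eth0 is an assumption):

ip link show eth0    # "mtu 9000" (or anything above 1500) indicates jumbo frames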

Maybe rsync+tar would be better than plain rsync on a much larger data set. But frankly I don't think it's worth the trouble: you need double the space on each side for packing and unpacking, and there are a couple of other options, as I've already mentioned above.
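
If the double-space requirement is the concern, the tar-over-ssh pipe from Tip 2 above sidesteps it entirely, since nothing is written to an intermediate archive (paths match the test above):

tar -C tmp -cf - . | ssh server 'mkdir -p tmp2 && tar -C tmp2 -xf -'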
