%CPU should be low during a copy. The CPU tells the disk controller "grab data from sectors X–Y into the memory buffer at Z". Then it goes and does something else (or sleeps, if there is nothing else to do). The hardware triggers an interrupt when the data is in memory. Then the CPU has to copy it a few times, and tells the network card "transmit packets at memory locations A, B, and C". Then it goes back to doing something else.
You're pushing ~240 Mbps. On a gigabit LAN, you ought to be able to do at least 800 Mbps, but:
- That's shared among everyone using the file server (and possibly a connection between switches, etc.)
- That's limited by the speed the file server can handle the write, keeping in mind its disk I/O bandwidth is shared by everyone using it.
- You didn't specify how you're accessing the file server (NFS, CIFS (Samba), AFS, etc.). You may need to tune your network mount, but on anything half-recent the defaults are usually pretty sane.
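If the mount turns out to be NFS, the read/write block sizes are the usual first knobs to try. A hedged sketch (the server name and export path are invented, and the server will negotiate these values down if they're set too high; check your distribution's defaults before changing anything):

```shell
# Hypothetical NFS mount with larger read/write block sizes.
# "fileserver:/export/share" is a placeholder for your actual server/export.
mount -t nfs -o rsize=1048576,wsize=1048576,hard fileserver:/export/share /mnt/share
```

On modern kernels these sizes are often already the default, which is why tuning usually only pays off on older setups.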
For tracking down the bottleneck, iostat -kx 10 is going to be a useful command. It'll show you the utilization of your local hard disks. If you can run it on the file server, it'll tell you how busy the file server is.
The general solution is going to be to speed up that bottleneck, which of course you don't have the budget for. But, there are a couple of special cases where you can find a faster approach:
- If the files are compressible and you have a fast CPU, doing a minimal compression on the fly might be quicker: something like lzop, or maybe gzip --fast (gzip's lowest, fastest compression level).
- If you are only changing a few bits here and there and then sending the file back, sending only the deltas will be much faster. Unfortunately, rsync won't really help here, as it needs to read the file on both sides to find the delta. Instead, you need something that keeps track of the deltas as you change the file... Most approaches here are app-specific. But it's possible that you could rig something up with, e.g., device-mapper (see the brand-new dm-era target) or btrfs.
- If you're copying the same data to multiple machines, you can use something like udpcast to send it to all the machines at once.
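As a sketch of the compression idea, you can stream a tarball through a fast compressor instead of copying files raw. The paths below are stand-ins (temp directories instead of /work and the file server, and "fileserver" in the comment is a hypothetical host); gzip --fast is used because it's universally available, while lzop is even lighter on the CPU if installed:

```shell
# Demo of copy-through-fast-compression, using temp dirs as stand-ins
# for /work and the file server mount (hypothetical paths).
src=$(mktemp -d); dst=$(mktemp -d)
seq 100000 > "$src/data.txt"          # some compressible sample data
# Over a real network you might instead pipe into ssh, e.g.:
#   tar -C /work -cf - . | gzip --fast | ssh fileserver 'cat > /backup/work.tar.gz'
tar -C "$src" -cf - . | gzip --fast > "$dst/work.tar.gz"
# Sanity check: the archive round-trips correctly.
mkdir "$dst/restore"
tar -xzf "$dst/work.tar.gz" -C "$dst/restore"
cmp "$src/data.txt" "$dst/restore/data.txt" && echo "round-trip ok"
```

Whether this wins depends on the data: already-compressed files (media, archives) gain nothing and just burn CPU.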
And, since you note you're not the sysadmin, I'm guessing that means you have a sysadmin, or at least someone responsible for the file server and network. You should probably ask them; they should be much more familiar with the specifics of your setup. Your sysadmin(s) should at least be able to tell you what transfer rate you can reasonably expect.
I surmise that your external drive uses a filesystem such as VFAT, which doesn't allow colons in file names.
A simple option would be to back up your files as archives (zip, 7z, tar.xz, whatever catches your fancy). This way you wouldn't be limited by any characteristic of the filesystem other than the maximum file size.
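For instance, a hedged sketch of the archive approach: temp directories stand in for your data and the external drive, and the 1 MB split size stands in for something just under FAT32's 4 GiB file-size cap:

```shell
# Archive files whose names VFAT rejects, splitting the result into
# chunks below the filesystem's maximum file size (hypothetical paths).
work=$(mktemp -d); drive=$(mktemp -d)
echo 'colons are fine inside an archive' > "$work/notes:today.txt"
# On a real FAT32 drive you would use e.g. -b 3900m instead of -b 1m.
tar -C "$work" -cf - . | gzip | split -b 1m - "$drive/work.tgz.part-"
# Restoring: concatenate the parts and unpack.
mkdir "$work/restore"
cat "$drive"/work.tgz.part-* | tar -xzf - -C "$work/restore"
cmp "$work/notes:today.txt" "$work/restore/notes:today.txt" && echo ok
```

The colon never touches the VFAT filesystem; it only exists inside the archive, which is why the file-name restriction stops mattering.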
Another possibility would be to use rdiff-backup, which takes care of translating file names that don't fit on the destination filesystem, as suggested by poolie.
A generic approach to unsupported characters is to leverage the filesystem layer to transform the file names. The FUSE filesystem posixovl transforms file names into names that Windows's VFAT supports.
mkdir ~/mnt
mount.posixovl -S /media/extern_drive ~/mnt
rsync -a /work ~/mnt
fusermount -u ~/mnt
See "How can I substitute colons when I rsync on a USB key?" for more details, and check that thread for any new solutions that may come up.
Best Answer
I was recently puzzled by the sometimes slow speed of cp. Specifically, how come df = pandas.read_hdf('file1', 'df') (700ms for a 1.2GB file) followed by df.to_hdf('file2') (530ms) could be so much faster than cp file1 file2 (8s)?

Digging into this:

- cat file1 > file2 isn't any better (8.1s).
- dd bs=1500000000 if=file1 of=file2 isn't either (8.3s).
- rsync file1 file2 is worse (11.4s), because file2 already existed, so rsync tries to do its rolling-checksum and block-update magic.

Oh, wait a second! How about unlinking (deleting) file2 first, if it exists? Now we are talking:

- rm -f file2: 0.2s (to add to any figure below).
- cp file1 file2: 1.0s.
- cat file1 > file2: 1.0s.
- dd bs=1500000000 if=file1 of=file2: 1.2s.
- rsync file1 file2: 4s.

So there you have it. Make sure the target files don't exist (or truncate them, which is presumably what pandas.to_hdf() does).

Edit: this was without emptying the cache before any of the commands, but as noted in the comments, doing so just consistently adds ~3.8s to all the numbers above.
Also noteworthy: this was tried on various Linux versions (CentOS with a 2.6.18-408.el5 kernel, and Ubuntu with a 3.13.0-77-generic kernel), and on ext4 as well as ext3. Interestingly, on a MacBook with Darwin 10.12.6, there is no difference: both versions (with or without an existing file at the destination) are fast.
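The comparison can be reproduced with a small self-contained script. Note this is a sketch with temp paths and a deliberately small file, so the multi-second gap from the figures above will only show up with large files on affected kernel/filesystem combinations:

```shell
# Time cp into an existing target vs. a freshly unlinked one.
d=$(mktemp -d)
dd if=/dev/urandom of="$d/file1" bs=1M count=8 2>/dev/null
cp "$d/file1" "$d/file2"          # create the target first
time cp "$d/file1" "$d/file2"     # overwrite an existing file
rm -f "$d/file2"
time cp "$d/file1" "$d/file2"     # copy after unlinking
cmp "$d/file1" "$d/file2" && echo "copies identical"
```

Either way, the copies end up byte-identical; only the time spent differs, since overwriting in place rewrites the existing blocks while unlinking lets the filesystem allocate fresh ones.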