Faster alternative to cp for copying large files (~20 GB)


I am a graduate student, and the group in which I work maintains a Linux cluster. Each node of the cluster has its own local disk, but these local disks are relatively small and are not equipped with automatic backup, so the group also owns a fileserver with many TBs of storage space. I am a relative Linux novice, so I am not sure what the fileserver's specs are in terms of speed, networking ability, etc. I do know from experience that the local disks are significantly faster than the fileserver in terms of I/O. About a dozen people use the fileserver.

Using cp to copy a ~20 GB file from the fileserver to one of the local disks takes about 11.5 minutes in real time on average (according to time). I know that this cp operation is not very efficient because (1) time tells me that the system time for such a copy is only ~45 seconds; and because (2) when I examine top during the copy, %CPU is quite low (by inspection, roughly 0-10% on average).
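Concretely, the measurements were made with something like the following (the paths here are stand-ins, not the actual ones):

    # Time the copy from the fileserver mount to the local disk
    time cp /mnt/fileserver/bigfile /local/scratch/bigfile

    # Meanwhile, in another shell, watch the cp process's CPU usage
    top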

Using cp to copy the same ~20 GB file from one folder on the local disk to another folder on the same local disk takes less time: about 9 minutes in real time (~51 seconds in system time, according to time). So apparently the fileserver is somewhat slower than the local disk, as expected, but perhaps not significantly slower. I am surprised that copying within the same local disk is not faster than 9 minutes.

I need to copy ~200 large files — each ~20 GB — from the fileserver to one of the local disks. So, my question is: Is there a faster alternative to cp for copying large files in Linux? (Or are there any flags within cp that I could use which would speed up copying?) Even if I could somehow shave a minute off this copying time, that would help immensely.

I am sure that buying new, faster disks would help, but I don't have access to such resources. I am also not a system administrator, only a (novice) user, so I don't have access to more detailed information on the load on the disks. I do know that while about a dozen people use the fileserver daily, I am the only person using this particular node/local disk.

Best Answer

%CPU should be low during a copy. The CPU tells the disk controller "grab data from sectors X–Y into memory buffer at Z". Then it goes and does something else (or sleeps, if there is nothing else). The hardware triggers an interrupt when the data is in memory. Then the CPU has to copy it a few times, and tells the network card "transmit packets at memory locations A, B, and C". Then it goes back to doing something else.

You're pushing ~240 Mbps (~20 GB in 11.5 minutes works out to roughly 30 MB/s, or about 240 Mbit/s). On a gigabit LAN, you ought to be able to do at least 800 Mbps, but:

  1. That's shared among everyone using the file server (and possibly a connection between switches, etc.)
  2. That's limited by the speed at which the file server can read the data off its disks, keeping in mind its disk I/O bandwidth is shared by everyone using it.
  3. You didn't specify how you're accessing the file server (NFS, CIFS (Samba), AFS, etc.). You may need to tune your network mount (see the sketch after this list), but on anything half-recent the defaults are usually pretty sane.
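For example, if the share happens to be mounted over NFS (an assumption; the server name, export, and mount point below are placeholders), you can inspect the options currently in effect and, with root access, experiment with larger transfer sizes:

    # Show the options your NFS mounts are actually using
    nfsstat -m

    # Alternative that works without the nfs-utils tools
    grep nfs /proc/mounts

    # Hypothetical /etc/fstab entry with larger read/write buffers
    # (tuning like this needs root, i.e. your sysadmin):
    # fileserver:/export  /mnt/data  nfs  rsize=1048576,wsize=1048576,hard  0  0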

For tracking down the bottleneck, iostat -kx 10 is going to be a useful command. It'll show you the utilization on your local hard disks. If you can run that on the file server, it'll tell you how busy the file server is.
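For instance, assuming the sysstat package is installed on the node, run this while a copy is in progress:

    # Extended per-device statistics in kB, refreshed every 10 seconds
    iostat -kx 10

    # Columns worth watching:
    #   rkB/s, wkB/s  - read/write throughput per device
    #   await         - average time (ms) each I/O spends waiting
    #   %util         - fraction of time the device was busy;
    #                   near 100% means that disk is the bottleneck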

The general solution is going to be to speed up that bottleneck, which of course you don't have the budget for. But, there are a couple of special cases where you can find a faster approach:

  • If the files are compressible, and you have a fast CPU, doing a minimal compress on-the-fly might be quicker. Something like lzop or maybe gzip --fast (see the sketch after this list).
  • If you are only changing a few bits here and there, and then sending the file back, only sending deltas will be much faster. Unfortunately, rsync won't really help here, as it will need to read the file on both sides to find the delta. Instead, you need something that keeps track of the delta as you change the file... Most approaches here are app-specific. But it's possible that you could rig something up with, e.g., device-mapper (see the brand new dm-era target) or btrfs.
  • If you're copying the same data to multiple machines, you can use something like udpcast to send it to all the machines at once (also sketched below).
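As a rough sketch of the compression idea, assuming you have ssh access to the fileserver and that /data/bigfile and /scratch/bigfile stand in for real paths (both assumptions): compress on the sending side and decompress locally, so the network only carries the smaller stream:

    # lzop: very fast, modest compression
    ssh fileserver 'lzop -c /data/bigfile' | lzop -dc > /scratch/bigfile

    # gzip at its fastest (weakest) setting
    ssh fileserver 'gzip -1 -c /data/bigfile' | gzip -dc > /scratch/bigfile

This only wins if the file actually compresses and the CPUs on both ends can keep up with the network. For the multiple-machines case, udpcast ships a sender/receiver pair along these lines (file names again placeholders):

    # On the machine that has the file
    udp-sender --file /data/bigfile

    # On every machine that needs a copy
    udp-receiver --file /scratch/bigfile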

And, since you note you're not the sysadmin, I'm guessing that means you have a sysadmin, or at least someone responsible for the file server and network. You should probably ask them; they should be much more familiar with the specifics of your setup, and should at least be able to tell you what transfer rate you can reasonably expect.
