Move/copy millions of images from macOS to an external drive to an Ubuntu server

Tags: cp, file-copy, mv, osx

I have created a dataset of millions of images (>15M so far) for a machine-learning project, taking up over 500 GB of storage. I created them on my MacBook Pro but want to get them onto our DGX-1 (GPU cluster) somehow. I thought it would be faster to copy them to a fast external SSD (2x NVMe in RAID 0), plug that drive directly into a local terminal, and copy them from there to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hours now.

I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7zip to create a few archives. I'm now using the Terminal in macOS to copy the files with cp.

I tried cp /path/to/dataset /path/to/external-ssd.

Finder was definitely not the best approach, as it took forever at the "preparing to copy" stage.

Using 7zip to archive the dataset increased the "file" transfer speed, but it took over 4 days(!) to extract the files, and that was for a dataset an order of magnitude smaller.

The command-line cp started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8K IOs on the disk. Well, maybe. iostat reports somewhere between 14 and 16 MB/s, at least during random spot checks. It's been 24 hours and it isn't quite halfway done.

Is there a better way to do this?

It wasn't clear to me that rsync was any better than cp for my purpose:
How to copy a file from a remote server to a local machine?

Best Answer

  1. Archiving your data is a good option in terms of file-transfer speed. However, if those images are mostly JPEGs, the data is already compressed, and you'll end up wasting CPU time compressing it for a final 1 or 2% gain in file size. This is why you may want to give tar a try, since it only packs files together without trying to compress them (unless you ask it to ;-). A minimal sketch is shown right after this list.

  2. Another hack that may be worth trying, if your network setup allows it, is to start a web server on your laptop and then download the files from the destination host. This turns "copy from laptop to external media" + "copy from external media to destination" into a single-step process. I've used it many times (between Linux machines) and it works pretty well.
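For point 1, packing the dataset into an uncompressed tar archive on the external SSD might look like this (the paths below are the placeholders from the question, and dataset.tar is just an example name):

  tar -cf /path/to/external-ssd/dataset.tar -C /path/to dataset

Here -c creates the archive, -f names the output file, and -C /path/to makes tar change into that directory first so the archive stores dataset/ rather than the full absolute path. No compression flag (-z or -j) is used, so the already-compressed JPEGs are only packed together, not recompressed.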

The web-server approach is detailed here. The main steps are:

On the sender side:

  1. cd to the directory containing files to share
  2. start a web server with Python:
    • with Python 2: python -m SimpleHTTPServer port
    • with Python 3: python -m http.server port

On the receiver side, files will be available at http://senderIp:port. You can retrieve them easily with wget -c http://senderIp:port/yourArchiveName (the -c flag lets wget resume an interrupted download).
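Putting the two points together, a concrete run might look like the following sketch; the archive name dataset.tar, port 8000, and the scratch path are illustrative assumptions, not values from the question.

On the sender (the MacBook):

  cd /path/to/external-ssd        # the directory where dataset.tar was created
  python3 -m http.server 8000

On the receiver (the Ubuntu side):

  wget -c http://senderIp:8000/dataset.tar
  tar -xf dataset.tar -C /path/to/scratch

wget -c resumes the download if the connection drops, and tar -xf ... -C unpacks the archive directly into the scratch directory.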
