Can tar archive files in parallel?

Tags: parallelism, tar

I'm trying to move parts of a large directory (~40 GiB and ~8 million files) across multiple machines via Amazon S3. Because I need to preserve symlinks, I'm tarring up the directory and uploading the resulting file rather than syncing directly to S3.

Most of the files are already compressed, so I'm not compressing the archive with gzip or bzip2. My command is along the lines of

tar --create --exclude='*.large-files' --exclude='unimportant-directory-with-many-files' --file /tmp/archive.tar /directory/to/archive

While running this, I've noticed that tar only appears to use one core of the eight-core machine. Based on that core being pegged, the low load average (~1), and the stats I'm seeing from iostat, my impression is that this operation is actually CPU-bound rather than disk-bound, which is the opposite of what I'd expect. Since it's slow (~90 minutes), I'm interested in parallelizing tar to make use of the additional cores.
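
For reference, a rough way to confirm this diagnosis (assuming the sysstat tools are installed) is to sample per-process CPU usage and device utilization while tar runs:

# CPU usage of the newest tar process, sampled every 5 seconds;
# a sustained ~100% of one core suggests the operation is CPU-bound
pidstat -u -p "$(pgrep -xn tar)" 5

# Extended device statistics; %util near 100 would instead point to the disk
iostat -x 5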

Other questions on this topic either focus on compression or create multiple archives (which, due to the directory structure, is not easy in my situation). It seems most people forget that you can create a tarball without compressing it at all.

Best Answer

Because a tar archive stores files sequentially in its output, there is no way to parallelize the archiving itself unless you create more than one archive.
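
If the tree can be cut along subdirectory boundaries (the subtree names below are placeholders), the split can be as simple as running one tar per subtree as shell background jobs:

# Archive two independent subtrees concurrently; paths are illustrative
tar --create --file /tmp/part-a.tar /directory/to/archive/subdir-a &
tar --create --file /tmp/part-b.tar /directory/to/archive/subdir-b &
wait  # block until both background jobs finish

Whether that actually helps depends on where the bottleneck really is, as noted below.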

Note that the bottleneck of the operation is likely to be the hard drive. For that reason, even if you split the task across two or more processes, it would not go faster unless they operated on different drives.
