Can tar archive files in parallel?

Tags: parallelism, tar

I'm trying to move parts of a large directory (~40 GiB and ~8 million files) across multiple machines via Amazon S3. Because I need to preserve symlinks, I'm tarring up the directory and uploading the resulting file rather than syncing directly to S3.

Most of the files are already compressed, so I'm not compressing the archive with gzip or bzip2. My command is along the lines of

tar --create --exclude='*.large-files' --exclude='unimportant-directory-with-many-files' --file /tmp/archive.tar /directory/to/archive

While running this, I've noticed that tar only appears to use one core of the eight-core machine. Based on that core being pegged, the low load average (~1), and the stats I'm seeing from iostat, my impression is that this operation is actually CPU-bound rather than disk-bound, which is the opposite of what I'd expect. Since it's slow (~90 minutes), I'm interested in parallelizing tar to make use of the additional cores.
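
For reference, a rough way to confirm this diagnosis (assuming the sysstat tools are installed) is to sample per-process CPU usage and device utilization while tar runs:

# CPU usage of the newest tar process, sampled every 5 seconds;
# a sustained ~100% of one core suggests the operation is CPU-bound
pidstat -u -p "$(pgrep -xn tar)" 5

# Extended device statistics; %util near 100 would instead point to the disk
iostat -x 5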

Other questions on this topic either focus on compression or create multiple archives (which, due to the directory structure, is not easy in my situation). It seems most people forget that you can create a tarball without compressing it at all.

Best Answer

Because a tar archive stores files sequentially in its output, there is no way to parallelize the archiving itself unless you create more than one archive.
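
If the tree can be cut along subdirectory boundaries (the subtree names below are placeholders), the split can be as simple as running one tar per subtree as shell background jobs:

# Archive two independent subtrees concurrently; paths are illustrative
tar --create --file /tmp/part-a.tar /directory/to/archive/subdir-a &
tar --create --file /tmp/part-b.tar /directory/to/archive/subdir-b &
wait  # block until both background jobs finish

Whether that actually helps depends on where the bottleneck really is, as noted below.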

Note that the bottleneck of the operation is likely to be the hard drive. For that reason, even if you split the task across two or more processes, it would not go faster unless they operated on different drives.
