Linux – Multiple tar processes writing to the same archive file at once

Tags: cluster, linux, parallelism, tar

I am running many tasks on a Linux cluster. Each task creates many output files. When all tasks are finished, I run something like tar cf foo.tar output_files/ to create a tar archive. This is a very slow process since there are many thousands of files and directories.

Is there any way to do this in parallel as the output files are being created?

Is it possible to have multiple tar processes, spread across multiple machines, all adding their files to the same archive at once?

The cluster has a shared filesystem.

I am not interested in compression, since it slows things down even more and the output files are themselves already compressed. Ideally the result would be a tar file, but I would consider other archive formats as well.

Best Answer

You can't have multiple processes adding to the same tar archive (or to any other common archive format, compressed or not). Each member is stored contiguously inside the archive, and there is no way to insert data into the middle of a file, only to append to it or overwrite it in place; so a process that kept writing to a member that isn't the last one would overwrite the members stored after it.
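For a concrete picture of that layout, here is a minimal Python sketch using the standard tarfile module (the member names and sizes are made up). It writes two members and then prints their offsets, showing that each 512-byte header is immediately followed by that member's data, padded to a 512-byte block, with the next member directly after it.

import io
import tarfile

# Build a small archive with two members (names and contents are illustrative).
with tarfile.open("demo.tar", "w") as tar:
    for name, payload in [("task1.out", b"A" * 600), ("task2.out", b"B" * 100)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Re-read it: members sit back to back, so a member can only grow by
# overwriting whatever comes after it.
with tarfile.open("demo.tar", "r") as tar:
    for member in tar.getmembers():
        print(member.name, "header at", member.offset, "data at", member.offset_data)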

If you know each file's size in advance, you could reserve that much space in the tar archive up front and have each writer fill in its own slot. That would require a lot of custom coding: it's a very unusual thing to do.
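A rough sketch of what that reservation could look like, in Python with the standard tarfile module (the archive name, member name, and size are hypothetical): append a header plus a zero-filled, block-padded data region, and return the offset where the real data is later written in place. A real implementation would also have to serialize the appends themselves and write the two zero blocks that terminate a tar archive at the very end.

import tarfile

def reserve_member(archive, name, size):
    """Append a tar header for `name` plus a zero-filled data region of
    `size` bytes (padded to a 512-byte block) and return the offset at
    which the real data should later be written."""
    info = tarfile.TarInfo(name)
    info.size = size
    with open(archive, "ab") as f:
        f.write(info.tobuf(tarfile.GNU_FORMAT))    # 512-byte header
        data_offset = f.tell()
        f.write(b"\0" * (-(-size // 512) * 512))   # reserve, padded to a block
    return data_offset

# A writer that knows its final size up front could then fill in its slot:
# off = reserve_member("foo.tar", "output_files/task42.out", 1_000_000)
# with open("foo.tar", "r+b") as f:
#     f.seek(off)
#     f.write(data)   # must stay within the reserved `size` bytes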

Unix has a feature designed to accommodate a group of files that are written to independently. It's called a directory.

There are very few cases where you'd gain anything from an uncompressed archive over a directory. Reading the archive might be slightly faster in some circumstances, because its contents are laid out contiguously; but that is an intrinsic consequence of the archive format (where each entry is followed directly by its content) as opposed to the directory format (where each entry is a pointer to content stored elsewhere), and that indirection is precisely what makes it possible to build the directory piecewise. Turning a finished directory tree into an archive is post-processing that has to be done sequentially.
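A short Python sketch of the two halves of that trade-off (file names are made up): the tasks populate the shared directory independently, and the archive is produced afterwards in one sequential pass.

import tarfile
from pathlib import Path

out = Path("output_files")
out.mkdir(exist_ok=True)

# Piecewise: each task writes its own file; the directory just gains an
# entry pointing at wherever the filesystem placed that file's content.
(out / "task1.out").write_bytes(b"first task's output")
(out / "task2.out").write_bytes(b"second task's output")

# Sequential post-processing: a single pass copies every file's content
# into the archive back to back, uncompressed.
with tarfile.open("foo.tar", "w") as tar:
    tar.add(str(out))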
