Linux – Multiple tar processes writing to the same archive file at once

Tags: cluster, linux, parallelism, tar

I am running many tasks on a Linux cluster. Each task creates many output files. When all tasks are finished, I run something like tar cf foo.tar output_files/ to create a tar archive. This is a very slow process since there are many thousands of files and directories.

Is there any way to do this in parallel as the output files are being created?

Is it possible to have multiple tar processes, spread across multiple machines, all adding their files to the same archive at once?

The cluster has a shared filesystem.

I am not interested in compression, since it slows things down even more and the output files are themselves already compressed. Ideally the result would be a tar file, but I would consider other archive formats as well.

Best Answer

You can't have multiple processes adding to the same tar archive (or to any other common archive format, compressed or not). Each member is stored contiguously inside the archive, and there is no way to insert data into the middle of a file, only to append to it or overwrite it in place; so a process that kept writing to a member that isn't the last one would overwrite the members stored after it.
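For a concrete picture of that layout, here is a minimal Python sketch using the standard tarfile module (the member names and sizes are made up). It writes two members and then prints their offsets, showing that each 512-byte header is immediately followed by that member's data, padded to a 512-byte block, with the next member directly after it.

import io
import tarfile

# Build a small archive with two members (names and contents are illustrative).
with tarfile.open("demo.tar", "w") as tar:
    for name, payload in [("task1.out", b"A" * 600), ("task2.out", b"B" * 100)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Re-read it: members sit back to back, so a member can only grow by
# overwriting whatever comes after it.
with tarfile.open("demo.tar", "r") as tar:
    for member in tar.getmembers():
        print(member.name, "header at", member.offset, "data at", member.offset_data)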

If you know each file's size in advance, you could reserve that much space in the tar archive up front and have each writer fill in its own slot. That would require a lot of custom coding: it's a very unusual thing to do.
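A rough sketch of what that reservation could look like, in Python with the standard tarfile module (the archive name, member name, and size are hypothetical): append a header plus a zero-filled, block-padded data region, and return the offset where the real data is later written in place. A real implementation would also have to serialize the appends themselves and write the two zero blocks that terminate a tar archive at the very end.

import tarfile

def reserve_member(archive, name, size):
    """Append a tar header for `name` plus a zero-filled data region of
    `size` bytes (padded to a 512-byte block) and return the offset at
    which the real data should later be written."""
    info = tarfile.TarInfo(name)
    info.size = size
    with open(archive, "ab") as f:
        f.write(info.tobuf(tarfile.GNU_FORMAT))    # 512-byte header
        data_offset = f.tell()
        f.write(b"\0" * (-(-size // 512) * 512))   # reserve, padded to a block
    return data_offset

# A writer that knows its final size up front could then fill in its slot:
# off = reserve_member("foo.tar", "output_files/task42.out", 1_000_000)
# with open("foo.tar", "r+b") as f:
#     f.seek(off)
#     f.write(data)   # must stay within the reserved `size` bytes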

Unix has a feature designed to accommodate a group of files that are written to independently. It's called a directory.

There are very few cases where you'd gain anything from an uncompressed archive over a directory. Reading the archive might be slightly faster in some circumstances, because its contents are laid out contiguously; but that is an intrinsic consequence of the archive format (where each entry is followed directly by its content) as opposed to the directory format (where each entry is a pointer to content stored elsewhere), and that indirection is precisely what makes it possible to build the directory piecewise. Turning a finished directory tree into an archive is post-processing that has to be done sequentially.
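A short Python sketch of the two halves of that trade-off (file names are made up): the tasks populate the shared directory independently, and the archive is produced afterwards in one sequential pass.

import tarfile
from pathlib import Path

out = Path("output_files")
out.mkdir(exist_ok=True)

# Piecewise: each task writes its own file; the directory just gains an
# entry pointing at wherever the filesystem placed that file's content.
(out / "task1.out").write_bytes(b"first task's output")
(out / "task2.out").write_bytes(b"second task's output")

# Sequential post-processing: a single pass copies every file's content
# into the archive back to back, uncompressed.
with tarfile.open("foo.tar", "w") as tar:
    tar.add(str(out))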
