Shell – tarring in parallel

archive, files, parallelism, shell-script

An oceanographer friend at work needs to back up many months' worth of data. She is overwhelmed, so I volunteered to do it. There are hundreds of directories to be backed up, and we want to tar/bzip2 them into files with the same name as the directory. I can do this easily enough serially, but I wanted to take advantage of the several hundred cores on my workstation.

Question: using find piped to xargs with the -n and -P args, or GNU Parallel, how do I tar/bzip2 the directories, using as many cores as possible, while naming each end product originalDirName.tar.bz2?

I have used find to bunzip2 100 files simultaneously and it was VERY fast, so this is the way to approach the problem; I just do not know how to make each archive take the name of its directory.
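For the naming part, a minimal sketch of the GNU Parallel approach, assuming the directories are immediate children of the current directory (the find depth bounds and the explicit job count are illustrative, not from the original post):

$ find . -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | parallel -j"$(nproc)" tar cjf {}.tar.bz2 {}

Each {} expands to one directory name, so every archive comes out as dirName.tar.bz2, and parallel runs one tar/bzip2 job per core. Directory names containing newlines would need find -print0 with parallel --null instead.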

Best Answer

Just tar to stdout and pipe it to pigz (you most likely don't want to parallelize disk access, just the compression part):

$ tar cf - myDirectory/ | pigz > myDirectory.tar.gz

A plain tar invocation like the one above essentially just concatenates the directory tree in a reversible way; the compression step can be kept separate, as it is in this example.
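For completeness, extraction is just the mirrored pipeline (the file name here is taken from the example above):

$ pigz -dc myDirectory.tar.gz | tar xf -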

pigz does multithreaded compression. The number of threads it uses can be adjusted with -p and it'll default to the number of cores available.
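As a small sketch of that knob, pinning the thread count explicitly (the value 8 is arbitrary, not from the answer):

$ tar cf - myDirectory/ | pigz -p8 > myDirectory.tar.gz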
