Virtual write-only file system for storing files in archive

filesystems fuse parallelism tar

I have an embarrassingly parallel process that creates a huge number of nearly (but not completely) identical files. Is there a way to archive the files "on the fly", so that the data does not consume more space than necessary?

The process itself accepts command-line parameters and prints the name of each file it creates to stdout. I'm invoking it with parallel --gnu, which takes care of distributing the input (which comes from another process) and collecting the output:

arg_generating_process | parallel --gnu my_process | magic_otf_compressor

SIMPLE EXAMPLE for the first part of the pipe in bash:

for ((f = 0; f < 100000; f++)); do touch "$f"; echo "$f"; done

What could magic_otf_compressor look like? It's supposed to treat each input line as a file name, copy each file into a compressed .tar archive (the same archive for all processed files!) and then delete it. (Actually, it should be enough to print the name of each processed file; another | parallel --gnu rm could take care of deleting the files.)

Is there any such tool? I'm not considering compressing each file individually; that would waste far too much space. I have looked into archivemount (it keeps the file system in memory, which is impossible because my files are too large and too numerous) and avfs (I couldn't get it to work together with FUSE). What have I missed?

I'm just one step away from hacking such a tool myself, but somebody must have done it before…

EDIT: Essentially I think I'm looking for a stdin front-end for libtar (as opposed to the command-line front-end tar that reads arguments from, well, the command line).
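For concreteness, here is a rough sketch of the kind of tool I have in mind, written in plain bash. It assumes GNU tar (whose -r/--append mode only works on uncompressed archives) and uses the placeholder path /tmp/arc.tar:

# Read one file name per line from stdin, append the file to a single
# uncompressed tar archive, and print the name so a downstream
# "parallel --gnu rm" can delete it.
archive=/tmp/arc.tar   # placeholder path
while IFS= read -r file; do
    tar -rf "$archive" "$file" && printf '%s\n' "$file"
done
# Compress once at the end; tar cannot append to a compressed archive.
gzip "$archive"

Invoking tar once per file is exactly the kind of overhead I hope an existing tool avoids.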

Best Answer

It seems tar wants to know all the file names upfront. So it is less on-the-fly and more after-the-fly. cpio does not seem to have that problem:

| cpio -vo 2>&1 > >(gzip > /tmp/arc.cpio.gz) | parallel rm
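
For completeness, this is how it could slot into the pipeline from the question (the process substitution requires bash). Because redirections are processed left to right, 2>&1 routes cpio's verbose file listing (written to stderr) into the pipe to parallel rm, while the archive itself (stdout) goes to gzip. The archive can later be unpacked with cpio:

arg_generating_process | parallel --gnu my_process | cpio -vo 2>&1 > >(gzip > /tmp/arc.cpio.gz) | parallel rm

# Later: decompress and extract, recreating directories as needed
gzip -dc /tmp/arc.cpio.gz | cpio -idv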