I have an embarrassingly parallel process that creates a huge number of nearly (but not completely) identical files. Is there a way to archive the files "on the fly", so that the data does not consume more space than necessary?
The process itself accepts command-line parameters and prints the name of each file created to stdout. I'm invoking it with parallel --gnu, which takes care of distributing input (which comes from another process) and collecting output:
arg_generating_process | parallel --gnu my_process | magic_otf_compressor
SIMPLE EXAMPLE for the first part of the pipe in bash:

for ((f = 0; f < 100000; f++)); do touch "$f"; echo "$f"; done
What would magic_otf_compressor look like? It's supposed to treat each input line as a file name, copy each file into a compressed .tar archive (the same archive for all files processed!) and then delete it. (Actually, it would be enough for it to print the name of each processed file; another | parallel --gnu rm could take care of deleting the files.)
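For concreteness, here is a naive sketch of such a tool as a shell script (archive.tar is an illustrative path). Note that tar's append mode (-r) only works on uncompressed archives, so this version can only compress after the fact, which is exactly what I want to avoid:

#!/bin/bash
# Naive magic_otf_compressor: read one file name per line, append the
# file to a single tar archive, then print the name so a downstream
# "parallel --gnu rm" can delete it.
archive=archive.tar
while IFS= read -r file; do
    # -r appends, but only to an *uncompressed* archive...
    tar -rf "$archive" "$file" && printf '%s\n' "$file"
done
# ...so compression can only happen once everything is archived:
gzip "$archive"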
Is there any such tool? I'm not considering compressing each file individually; that would waste far too much space. I have looked into archivemount (which keeps the file system in memory, impossible since my files are too large and too many) and avfs (which I couldn't get to work together with FUSE). What have I missed?
I'm just one step away from hacking such a tool myself, but somebody must have done it before…
EDIT: Essentially I think I'm looking for a stdin front-end for libtar (as opposed to the command-line front-end tar that reads arguments from, well, the command line).
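The closest I have come with standard tools is GNU tar's -T - option, which reads the list of file names from stdin (a sketch; archive.tar.gz is an illustrative path):

arg_generating_process | parallel --gnu my_process | tar -czvf archive.tar.gz -T -

but as the answer below observes, tar does not seem to process the names as they arrive.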
Best Answer
It seems tar wants to know all the file names upfront. So it is less on-the-fly and more after-the-fly. cpio does not seem to have that problem.
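A sketch of the resulting pipeline, assuming GNU cpio and bash process substitution (/tmp/arc.cpio.gz is an illustrative path):

# -o: copy-out mode, file names on stdin, archive on stdout; -v: list names
arg_generating_process | parallel --gnu my_process |
    cpio -vo 2>&1 > >(gzip > /tmp/arc.cpio.gz) |
    parallel --gnu rm

Here 2>&1 routes cpio's -v listing of processed names into the pipe, so the final parallel --gnu rm deletes each file after it has been archived, while the archive itself streams through the process substitution into gzip and is compressed on the fly.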