I have an embarrassingly parallel process that creates a huge number of nearly (but not completely) identical files. Is there a way to archive the files "on the fly", so that the data does not consume more space than necessary?
The process itself accepts command-line parameters and prints the name of each file created to stdout. I'm invoking it with parallel --gnu, which takes care of distributing input (which comes from another process) and collecting output:
arg_generating_process | parallel --gnu my_process | magic_otf_compressor
SIMPLE EXAMPLE for the first part of the pipe in bash:

for ((f = 0; f < 100000; f++)); do touch "$f"; echo "$f"; done
What would magic_otf_compressor look like? It's supposed to treat each input line as a file name, copy each file into a compressed .tar archive (the same archive for all files processed!) and then delete it. (Actually, it would be enough for it to print the name of each processed file; another | parallel --gnu rm could take care of deleting the files.)
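For concreteness, here is a naive sketch of such a tool as a shell script (archive.tar is an illustrative path). Note that tar's append mode (-r) only works on uncompressed archives, so this version can only compress after the fact, which is exactly what I want to avoid:

#!/bin/bash
# Naive magic_otf_compressor: read one file name per line, append the
# file to a single tar archive, then print the name so a downstream
# "parallel --gnu rm" can delete it.
archive=archive.tar
while IFS= read -r file; do
    # -r appends, but only to an *uncompressed* archive...
    tar -rf "$archive" "$file" && printf '%s\n' "$file"
done
# ...so compression can only happen once everything is archived:
gzip "$archive"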
Is there any such tool? I'm not considering compressing each file individually; that would waste far too much space. I have looked into archivemount (which keeps the file system in memory, impossible since my files are too large and too many) and avfs (which I couldn't get to work together with FUSE). What have I missed?
I'm just one step away from hacking such a tool myself, but somebody must have done it before…
EDIT: Essentially I think I'm looking for a stdin front-end for libtar (as opposed to the command-line front-end tar that reads arguments from, well, the command line).
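The closest I have come with standard tools is GNU tar's -T - option, which reads the list of file names from stdin (a sketch; archive.tar.gz is an illustrative path):

arg_generating_process | parallel --gnu my_process | tar -czvf archive.tar.gz -T -

but as the answer below observes, tar does not seem to process the names as they arrive.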
Best Answer
It seems tar wants to know all the file names upfront. So it is less on-the-fly and more after-the-fly. cpio does not seem to have that problem.
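A sketch of the resulting pipeline, assuming GNU cpio and bash process substitution (/tmp/arc.cpio.gz is an illustrative path):

# -o: copy-out mode, file names on stdin, archive on stdout; -v: list names
arg_generating_process | parallel --gnu my_process |
    cpio -vo 2>&1 > >(gzip > /tmp/arc.cpio.gz) |
    parallel --gnu rm

Here 2>&1 routes cpio's -v listing of processed names into the pipe, so the final parallel --gnu rm deletes each file after it has been archived, while the archive itself streams through the process substitution into gzip and is compressed on the fly.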