An image is a raw (literal, byte-for-byte) copy of a filesystem. Because this includes all of the filesystem metadata, you can mount it the same way you would mount a physical device holding the exact same bytes.
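For example, on Linux you can typically attach an image to a loop device and mount it directly; a minimal sketch, with an illustrative image name and mount point:

# Loop-mount a filesystem image and browse it like any other filesystem
mkdir -p /mnt/img
mount -o loop disk.img /mnt/img    # use -o ro,loop for read-only access
ls /mnt/img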
A tar file (a.k.a. a 'tarchive') is an archival format that is filesystem-agnostic: although it includes information such as permissions and ownership and maintains directory structure, it does not depend further upon the source filesystem. This means tarchives are portable from one type of filesystem to another; anywhere you have a tar utility, you should be able to use a tar file regardless of its origin.
A tarchive is not a literal byte-for-byte copy of a region of storage. It is a set of files structured by tar, and hence, unlike an image, its contents can be analyzed and manipulated externally (by the tar utility itself). This also means it depends on some existing filesystem in order to be unpacked and used.
A tarchive can contain the contents of an entire filesystem, but this is not the same as containing the actual filesystem, as an image does. In order to reproduce the original filesystem, you would have to create a filesystem partition of the same type (n.b., the tarchive contains no indication of which type that was) and unpack into it. Conversely, if you want to "unpack" an image into a subdirectory of an existing filesystem, you must mount it and copy out manually (although there may be tools to aid in this).
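As a rough sketch of both directions (the device, mount point, and filesystem type here are illustrative assumptions):

# Reproduce a tarchive's contents on a freshly made filesystem
mkfs.ext4 /dev/sdb1
mount /dev/sdb1 /mnt/restore
tar -xpf backup.tar -C /mnt/restore    # -p preserves permissions

# "Unpack" an image into a subdirectory of an existing filesystem
mount -o loop disk.img /mnt/img
cp -a /mnt/img/. /srv/restored/        # -a preserves attributes and symlinks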
So, the two methodologies best suit slightly different purposes. With regard to back-ups, tar is the better choice for a number of reasons:
- You are only copying actual files, and not empty space.
- You are not bringing the underlying filesystem and its attendant imperfections with you (fragmentation, inconsistencies).
- You can avoid including things which should never be included (e.g., /proc, /dev); see the sketch after this list.
- Tar files are easier to update.
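For instance, a root-filesystem backup along these lines might look like the following (GNU tar assumed; the archive name is illustrative):

tar -cJf backup.tar.xz --exclude=/proc --exclude=/sys --exclude=/dev /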
Since tar files are a streaming format (you can cat two of them together and get an almost-correct result), you don't need to extract them to disk at all to do this. You can decompress (only) the files, concatenate them together, and recompress that stream:
xzcat *.tar.xz | xz -c > combined.tar.xz
combined.tar.xz will be a compressed tarball of all the files in the component tarballs, and it will be only slightly corrupt. To extract, you'll have to use the --ignore-zeros option (in GNU tar), because each component archive has an "end-of-file" marker that will appear in the middle of the result. Other than that, though, everything will work correctly.
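Extraction is then the usual invocation plus that flag (GNU tar):

tar -xf combined.tar.xz --ignore-zeros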
GNU tar also supports a --concatenate mode for producing combined archives. It has the same limitation as above (you must use --ignore-zeros to extract), but it doesn't work with compressed archives. You can trick it into handling them using process substitution, but it's a hassle and even more fragile.
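On uncompressed archives, that mode looks something like this (file names are illustrative):

tar --concatenate --file=combined.tar next-part.tar
tar -xf combined.tar --ignore-zeros    # still needed on extraction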
If some files appear more than once across the different tar files, this won't work properly, but you've got that problem regardless. Otherwise this gives you what you want; piping the output through xz is how tar compresses its output anyway.
If archives that only work with a particular tar implementation aren't adequate for your purposes, appending to the archive with r is your friend:
# Note: tar cannot append (r) to a compressed archive, so build the
# combined archive uncompressed and compress it once at the end.
tar cf combined.tar dummy-file
for x in db-*.tar.xz
do
    mkdir tmp
    pushd tmp
    tar xJf "../$x"            # unpack one component archive
    tar rf ../combined.tar .   # append its contents
    popd
    rm -r tmp
done
xz combined.tar                # yields combined.tar.xz
This only ever extracts a single archive at a time, so the working space is limited to the size of a single archive's contents. The compression is streaming, just as it would have been had you made the final archive all at once, so it will be as good as it ever could have been. You do a lot of excess decompression and recompression, which makes this slower than the cat versions, but the resulting archive will work anywhere without any special support.
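Extracting that archive is then just the ordinary invocation, with no need for --ignore-zeros:

tar -xJf combined.tar.xz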
Note that — depending on what exactly you want — just adding the uncompressed tar files themselves to an archive might suffice. They will compress (almost) exactly as well as their contents in a single file, and it will reduce the compression overhead for each file. This would look something like:
# Same caveat as above: append to an uncompressed archive and
# compress once at the end.
tar cf combined.tar dummy-file
for x in db-*.tar.xz
do
    xz -dk "$x"                       # decompress, keeping the .xz original
    tar rf combined.tar "${x%.xz}"    # append the uncompressed tarball itself
    rm -f "${x%.xz}"
done
xz combined.tar                       # yields combined.tar.xz
This is slightly less efficient in terms of the final compressed size, because there are extra tar headers in the stream, but it saves the time spent extracting and re-adding all the files as files. You'd end up with combined.tar.xz containing many (uncompressed) db-*.tar files.
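Recovering the original files from that nested layout means extracting twice; a minimal sketch:

tar -xJf combined.tar.xz
for t in db-*.tar
do
    tar -xf "$t"    # unpack each inner (uncompressed) tarball
done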
Best Answer
An option could be to use avfs (here assuming a GNU system):