How should I combine many compressed files into one archive?

compression, tar

I have a few hundred .tar.xz files which are almost identical (they are daily database dumps, and the database changes slowly).

I believe that due to the similarities in the uncompressed files, they will compress very well, and small-scale tests have shown that compressing any number of these uncompressed files together creates an archive only slightly larger than one of them.

My problem is that all the uncompressed files would be a few terabytes (compression ratio is about 25:1), and I don't have that much disk space to use as a working area.

Is there a way I can process the individual compressed files one at a time, adding them to a single archive and retaining the benefits of compressing them together?

Best Answer

Since tar files are a streaming format — you can cat two of them together and get an almost-correct result — you don't need to extract them to disk at all to do this. You can decompress (only) the files, concatenate them together, and recompress that stream:

xzcat *.tar.xz | xz -c > combined.tar.xz

combined.tar.xz will be a compressed tarball of all the files in the component tarballs that is only slightly corrupt. To extract, you'll have to use the --ignore-zeros option (in GNU tar), because the archives do have an "end-of-file" marker that will appear in the middle of the result. Other than that, though, everything will work correctly.
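With GNU tar, extraction would then look like:

tar --ignore-zeros -xJf combined.tar.xz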

GNU tar also supports a --concatenate mode for producing combined archives. That has the same limitations as above — you must use --ignore-zeros to extract — but it doesn't work with compressed archives. You can trick it into working using process substitution, as sketched below, but it's a hassle and even more fragile.
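For what it's worth, a sketch of that trick might look something like this (untested, and it assumes your tar will read the source archives from a non-seekable pipe; note also that the intermediate combined.tar is uncompressed, so it needs the full combined size as working space):

set -- db-*.tar.xz
xzcat "$1" > combined.tar        # seed with the first archive, decompressed
shift
for x in "$@"
do
    tar --concatenate --file=combined.tar <(xzcat "$x")
done
xz combined.tar                  # compress the merged result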

If there are files that appear more than once in different tar files, this won't work properly, but you've got that problem regardless. Otherwise this will give you what you want — piping the output through xz is how tar compresses its output anyway.
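To illustrate that last point, these two commands produce essentially the same output (somedir is just a stand-in here):

tar cJf combined.tar.xz somedir
tar cf - somedir | xz -c > combined.tar.xz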


If archives that only extract correctly with a particular tar implementation aren't adequate for your purposes, appending with tar's r mode is your friend. One caveat: tar can't append to a compressed archive, so you build up an uncompressed combined.tar and compress it once at the end:

tar cf combined.tar --files-from=/dev/null   # start with an empty archive
for x in db-*.tar.xz
do
    mkdir tmp
    pushd tmp
    tar xJf "../$x"              # unpack one day's dump into the scratch directory
    tar rf ../combined.tar .     # append its contents to the combined archive
    popd
    rm -r tmp
done
xz combined.tar                  # compress the whole thing in one pass at the end

This only ever extracts a single archive at a time, so the scratch space for extracted files is limited to a single archive's contents (though note that the uncompressed combined.tar itself grows to the full combined size before the final compression step). The compression happens in a single streaming pass at the end, just as it would have if you had made the final archive all at once, so it will be as good as it ever could have been. You do a lot of excess decompression and extraction that will make this slower than the cat versions, but the resulting archive will work anywhere without any special support.

Note that — depending on what exactly you want — just adding the uncompressed tar files themselves to an archive might suffice. They will compress (almost) exactly as well as their contents in a single file, and it will reduce the compression overhead for each file. This would look something like:

tar cf combined.tar --files-from=/dev/null   # start with an empty archive
for x in db-*.tar.xz
do
    xz -dk "$x"                      # decompress, keeping the original .xz around
    tar rf combined.tar "${x%.xz}"   # append the uncompressed .tar as a single member
    rm -f "${x%.xz}"
done
xz combined.tar                      # compress once at the end

This is slightly less efficient in terms of the final compressed size because there are extra tar headers in the stream, but saves some time on extracting and re-adding all the files as files. You'd end up with combined.tar.xz containing many (uncompressed) db-*.tar files.
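To recover one day's dump later, you'd pull the inner tarball out and unpack it as usual (the file name here is hypothetical):

tar xJf combined.tar.xz db-2020-01-01.tar
tar xf db-2020-01-01.tar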