I have a few hundred `.tar.xz` files which are almost identical (they are daily database dumps, and the database changes slowly).
I believe that, due to the similarities in the uncompressed files, they will compress very well together; small-scale tests have shown that compressing any number of these uncompressed files creates an archive only slightly larger than one of them.
My problem is that all the uncompressed files would be a few terabytes (compression ratio is about 25:1), and I don't have that much disk space to use as a working area.
Is there a way I can process the individual compressed files one at a time, adding them to a single archive and retaining the benefits of compressing them together?
Best Answer
Since tar files are a streaming format — you can `cat` two of them together and get an almost-correct result — you don't need to extract them to disk at all to do this. You can decompress (only) the files, concatenate them together, and recompress that stream into `combined.tar.xz`:
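A runnable sketch of that pipeline, with a throwaway demo setup standing in for the real dumps (the `dump*` names are invented; on the real data, the `xzcat | xz` line is the whole job):

```sh
# Demo setup: two tiny .tar.xz archives standing in for the daily dumps.
echo "day one" > a.txt && tar -cJf dump1.tar.xz a.txt && rm a.txt
echo "day two" > b.txt && tar -cJf dump2.tar.xz b.txt && rm b.txt

# The technique: decompress each archive, concatenate the raw tar streams,
# and recompress; the uncompressed data only ever exists in the pipe.
xzcat dump1.tar.xz dump2.tar.xz | xz -c > combined.tar.xz

# Extraction needs --ignore-zeros because each component archive's
# end-of-archive marker now sits in the middle of the stream.
mkdir -p out && tar -C out -xf combined.tar.xz --ignore-zeros
```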
The result will be a compressed tarball of all the files in the component tarballs, only slightly corrupt. To extract it, you'll have to use the `--ignore-zeros` option (in GNU `tar`), because each component archive ends with an "end-of-archive" marker, and those markers will appear in the middle of the result. Other than that, though, everything will work correctly.

GNU `tar` also supports a `--concatenate` mode for producing combined archives. It has the same limitation as above — you must use `--ignore-zeros` to extract — but it doesn't work with compressed archives. You can trick it into working with process substitution, but it's a hassle and even more fragile.

If the same file appears in more than one of the tar files, this won't work properly, but you've got that problem regardless. Otherwise, this will give you what you want — piping the output through `xz` is how `tar` compresses its output anyway.

If archives that only work with a particular `tar` implementation aren't adequate for your purposes, appending to the archive with `r` is your friend:
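A sketch of that approach, assuming GNU `tar` and an invented `db-*.tar.xz` naming scheme (the demo setup just fabricates two small dumps):

```sh
# Demo setup: two tiny daily dumps (the db-* names are invented).
echo "day one" > a.txt && tar -cJf db-1.tar.xz a.txt && rm a.txt
echo "day two" > b.txt && tar -cJf db-2.tar.xz b.txt && rm b.txt

# Append the contents of each dump to one growing uncompressed archive,
# then compress once at the end. Working space never exceeds one dump.
for f in db-*.tar.xz; do
    mkdir tmp
    tar -C tmp -xf "$f"                   # unpack a single dump
    tar -C tmp -rf "$PWD/combined.tar" .  # append ('r') its contents
    rm -rf tmp
done
xz combined.tar                           # produces combined.tar.xz
```

Because append mode rewrites the end-of-archive marker each time, the result extracts with any `tar`, with no need for `--ignore-zeros`.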
is your friend:This only ever extracts a single archive at a time, so the working space is limited to the size of a single archive's contents. The compression is streaming just like it would have been had you made the final archive all at once, so it will be as good as it ever could have been. You do a lot of excess decompression and recompression that will make this slower than the
cat
versions, but the resulting archive will work anywhere without any special support.Note that — depending on what exactly you want — just adding the uncompressed tar files themselves to an archive might suffice. They will compress (almost) exactly as well as their contents in a single file, and it will reduce the compression overhead for each file. This would look something like:
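Something along these lines, again with an invented `db-*` naming scheme and a throwaway demo setup:

```sh
# Demo setup: two tiny daily dumps (the db-* names are invented).
echo "day one" > a.txt && tar -cJf db-1.tar.xz a.txt && rm a.txt
echo "day two" > b.txt && tar -cJf db-2.tar.xz b.txt && rm b.txt

# Decompress each dump in place, keeping the .xz originals, then pack the
# plain .tar files (still uncompressed) into a single compressed archive.
for f in db-*.tar.xz; do xz -dkf "$f"; done
tar -cJf combined.tar.xz db-*.tar
```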
This is slightly less efficient in terms of the final compressed size, because there are extra tar headers in the stream, but it saves some time on extracting and re-adding all the files as individual files. You'd end up with `combined.tar.xz` containing many (uncompressed) `db-*.tar` files.