Since tar files are a streaming format (you can cat two of them together and get an almost-correct result), you don't need to extract them to disk at all to do this. You can decompress the files (only the xz layer), concatenate them together, and recompress that stream:
xzcat *.tar.xz | xz -c > combined.tar.xz
combined.tar.xz will be a compressed tarball of all the files in the component tarballs that is only slightly corrupt. To extract it, you'll have to use the --ignore-zeros option (in GNU tar), because each component archive has an end-of-file marker that will appear in the middle of the result. Other than that, though, everything will work correctly.
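For example, extraction would look like this (GNU tar; the flag spelling may differ in other implementations):
tar -xJf combined.tar.xz --ignore-zeros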
GNU tar also supports a --concatenate mode for producing combined archives. That has the same limitation as above (you must use --ignore-zeros to extract), but it doesn't work with compressed archives. You can trick it into working with process substitution, but it's a hassle and even more fragile.
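If you'd rather use --concatenate, the uncompressed route is the least fragile; a rough sketch (archive names are illustrative):
xz -dk db-1.tar.xz db-2.tar.xz          # keep the originals, produce db-1.tar and db-2.tar
tar --concatenate -f db-1.tar db-2.tar  # append db-2.tar onto the end of db-1.tar
mv db-1.tar combined.tar
xz -9 combined.tar                      # yields combined.tar.xz
rm db-2.tar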
If the same file appears in more than one of the tar files, this won't work properly, but you've got that problem regardless. Otherwise this will give you what you want; piping the output through xz is how tar compresses its output anyway.
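To make that concrete, these two commands produce essentially the same archive (the directory and file names are illustrative):
tar -cJf archive.tar.xz somedir
tar -cf - somedir | xz > archive.tar.xz   # what -J does for you behind the scenes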
If archives that only work with a particular tar implementation aren't adequate for your purposes, appending to the archive with r is your friend:
tar cf combined.tar dummy-file      # tar can't append to a compressed archive, so build a plain .tar first
for x in db-*.tar.xz
do
    mkdir tmp
    pushd tmp
    tar xJf "../$x"
    tar rf ../combined.tar .
    popd
    rm -r tmp
done
xz -9 combined.tar                  # compress once at the end, producing combined.tar.xz
This only ever extracts a single archive at a time, so the working space is limited to the size of a single archive's contents. Because tar cannot append to a compressed archive, the compression happens in one xz pass at the end; that pass sees the whole stream, just as if you had made the archive in one go, so the result will be as good as it ever could have been. You do a lot of excess decompression and recompression that makes this slower than the cat version, but the resulting archive will work anywhere without any special support.
Note that — depending on what exactly you want — just adding the uncompressed tar files themselves to an archive might suffice. They will compress (almost) exactly as well as their contents in a single file, and it will reduce the compression overhead for each file. This would look something like:
tar cf combined.tar dummy-file
for x in db-*.tar.xz
do
    xz -dk "$x"                     # keep the original, produce the plain db-*.tar
    tar rf combined.tar "${x%.xz}"
    rm -f "${x%.xz}"
done
xz -9 combined.tar                  # produces combined.tar.xz
This is slightly less efficient in terms of the final compressed size, because there are extra tar headers in the stream, but it saves some time on extracting and re-adding all the files individually. You'd end up with combined.tar.xz containing many (uncompressed) db-*.tar files.
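Unpacking that nested layout later is a two-step affair, roughly:
tar -xJf combined.tar.xz      # extracts the individual db-*.tar files
for t in db-*.tar
do
    tar -xf "$t"              # then unpack each inner archive
done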
cpio (the older of the two archiving utilities shipped with UNIX) used to have hard link support only for the -p option (i.e. copying from filesystem to filesystem), but the newc output format (not the default one cpio uses) also supports hard links in the output file.
(GNU) tar supports hard links without any special options. A comparison can be found here.
So if you run a test with a large hard linked file and 100 small files:
$ mkdir tmp
$ dd if=/dev/urandom of=tmp/blabla bs=1k count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1,0 MB) copied, 0,0764345 s, 13,7 MB/s
$ ln tmp/blabla tmp/hardlink
$ tar cvf out.tar tmp
$ find tmp -print0 | cpio -0o > out.cpio
4104 blocks
$ find tmp -print0 | cpio -0o --format=newc > outnewc.cpio
2074 blocks
$ xz -9k out.tar outnewc.cpio
$ bzip2 -9k out.tar outnewc.cpio
$ ls -l out*
-rw-rw-r-- 1 anthon users 2101248 Nov 23 12:30 out.cpio
-rw-rw-r-- 1 anthon users 1061888 Nov 23 12:30 outnewc.cpio
-rw-rw-r-- 1 anthon users 1055935 Nov 23 12:30 outnewc.cpio.bz2
-rw-rw-r-- 1 anthon users 1050652 Nov 23 12:30 outnewc.cpio.xz
-rw-rw-r-- 1 anthon users 1157120 Nov 23 12:30 out.tar
-rw-rw-r-- 1 anthon users 1055402 Nov 23 12:30 out.tar.bz2
-rw-rw-r-- 1 anthon users 1050928 Nov 23 12:30 out.tar.xz
You see that the uncompressed versions (outnewc.cpio and out.tar) give cpio an advantage, that compressing them with xz -9 gives better results than bzip2 -9 (gzip is usually much worse than either), and that compression with xz minimizes the difference between the tar and cpio output. Compression is, however, heavily dependent on the data, and also on the ordering of the data in the archives, so you should really test this on (a sample of) your real data.
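A minimal way to run that comparison on a sample of your own data might look like this (paths are illustrative):
find sample -print0 | cpio -0o --format=newc > sample.cpio   # newc format, hard links preserved
tar -cf sample.tar sample
xz -9k sample.cpio sample.tar                                # -k keeps the originals for bzip2
bzip2 -9k sample.cpio sample.tar
ls -l sample.cpio* sample.tar*                               # compare the resulting sizes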
If you want to compress in parallel, you might want to look at my article here.
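Independently of that, recent xz versions can parallelize on their own with -T; a quick sketch:
tar -cf - somedir | xz -T0 -9 > combined.tar.xz   # -T0 uses all available cores; ratio may drop slightly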
gzip or bzip2 will compress the file and remove the non-compressed one automatically (this is their default behaviour). However, keep in mind that during the compression process, both files will exist.
If you want to compress log files (i.e. files containing text), you may prefer bzip2, since it has a better ratio for text files. Comparison and examples:
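The original comparison output isn't reproduced here, but you can run one yourself on copies of a text file (names are illustrative):
cp somefile.log sample1.log               # work on copies so the original stays untouched
cp somefile.log sample2.log
gzip  -9 sample1.log                      # replaces sample1.log with sample1.log.gz
bzip2 -9 sample2.log                      # replaces sample2.log with sample2.log.bz2
ls -l sample1.log.gz sample2.log.bz2      # compare the resulting sizes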
UPDATE: as @Jjoao told me in a comment, interestingly, xz seems to have the best ratio on plain files with its default options. For more information, here is an interesting benchmark of different tools: http://binfalse.de/2011/04/04/comparison-of-compression/
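For instance (illustrative file name; xz defaults to level -6):
xz -k somefile.log     # produces somefile.log.xz, keeping the original for comparison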
For the examples above, I used -9 for the best compression ratio, but if the time needed to compress the data is more important than the ratio, you'd better not use it (use a lower level, e.g. -1, or something in between).
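Roughly, the trade-off can be measured like this (copies of an illustrative file; actual times and sizes depend entirely on your data):
cp bigfile copy1
cp bigfile copy2
time xz -1 copy1        # fastest, least compression -> copy1.xz
time xz -9 copy2        # slowest, best ratio        -> copy2.xz
ls -l copy1.xz copy2.xz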