I have some process that creates a stream of millions of highly similar lines. I'm piping this to gzip. Does the compression ratio improve over time in such a setup? I.e. is the compression ratio better for 1 million similar lines than for, say, 10,000?
Does gzip compression ratio improve over time
compression gzip
Related Solutions
I did a small benchmark. It only tests writes though.
Test data is a Linux kernel source tree (linux-3.8), already unpacked into memory (/dev/shm/ tmpfs), so there should be as little influence as possible from the data source. I used compressible data for this test since compression with non-compressible files is nonsense regardless of encryption.
Using a btrfs filesystem on a 4GiB LVM volume, on LUKS [aes, xts-plain, sha256], on RAID-5 over 3 disks with a 64KiB chunk size. CPU is an Intel E8400 (2x3GHz) without AES-NI. Kernel is 3.8.2 x86_64.
The script:
#!/bin/bash
PARTITION="/dev/lvm/btrfs"
MOUNTPOINT="/mnt/btrfs"
umount "$MOUNTPOINT" >& /dev/null
for method in no lzo zlib
do
    for iter in {1..3}
    do
        echo Prepare compress="$method", iter "$iter"
        # fresh filesystem for every run
        mkfs.btrfs "$PARTITION" >& /dev/null
        mount -o compress="$method",compress-force="$method" "$PARTITION" "$MOUNTPOINT"
        sync
        # time the copy from tmpfs plus the final flush on umount
        time (cp -a /dev/shm/linux-3.8 "$MOUNTPOINT"/linux-3.8 ; umount "$MOUNTPOINT")
        echo Done compress="$method", iter "$iter"
    done
done
In each iteration it makes a fresh filesystem and measures the time it takes to copy the Linux kernel source from memory and unmount. So it's a pure write test, zero reads.
The results:
Prepare compress=no, iter 1
real 0m12.790s
user 0m0.127s
sys 0m2.033s
Done compress=no, iter 1
Prepare compress=no, iter 2
real 0m15.314s
user 0m0.132s
sys 0m2.027s
Done compress=no, iter 2
Prepare compress=no, iter 3
real 0m14.764s
user 0m0.130s
sys 0m2.039s
Done compress=no, iter 3
Prepare compress=lzo, iter 1
real 0m11.611s
user 0m0.146s
sys 0m1.890s
Done compress=lzo, iter 1
Prepare compress=lzo, iter 2
real 0m11.764s
user 0m0.127s
sys 0m1.928s
Done compress=lzo, iter 2
Prepare compress=lzo, iter 3
real 0m12.065s
user 0m0.132s
sys 0m1.897s
Done compress=lzo, iter 3
Prepare compress=zlib, iter 1
real 0m16.492s
user 0m0.116s
sys 0m1.886s
Done compress=zlib, iter 1
Prepare compress=zlib, iter 2
real 0m16.937s
user 0m0.144s
sys 0m1.871s
Done compress=zlib, iter 2
Prepare compress=zlib, iter 3
real 0m15.954s
user 0m0.124s
sys 0m1.889s
Done compress=zlib, iter 3
With zlib it's a lot slower, with lzo a bit faster, and in general it's not worth the bother (the difference is too small for my taste, considering I used easy-to-compress data for this test).
I'd do a read test as well, but it's more complicated since you have to deal with caching.
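For what it's worth, a read test could look roughly like the sketch below (an assumption on my part, reusing the mount point from the script above): drop the page cache first so the reads actually hit the disk, then time streaming the files back out.

#!/bin/bash
MOUNTPOINT="/mnt/btrfs"
sync
# drop page cache, dentries and inodes so reads are not served from RAM
echo 3 > /proc/sys/vm/drop_caches
# stream every file to /dev/null and time it
time find "$MOUNTPOINT"/linux-3.8 -type f -exec cat {} + > /dev/null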
Since tar files are a streaming format (you can cat two of them together and get an almost-correct result), you don't need to extract them to disk at all to do this. You can decompress (only) the files, concatenate them together, and recompress that stream:
xzcat *.tar.xz | xz -c > combined.tar.xz
combined.tar.xz will be a compressed tarball of all the files in the component tarballs that is only slightly corrupt. To extract, you'll have to use the --ignore-zeros option (in GNU tar), because the archives do have an "end-of-file" marker that will appear in the middle of the result. Other than that, though, everything will work correctly.
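In concrete terms, extraction would be something like this (the archive name follows the example above):

tar --ignore-zeros -xJf combined.tar.xz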
GNU tar also supports a --concatenate mode for producing combined archives. That has the same limitations as above (you must use --ignore-zeros to extract), but it doesn't work with compressed archives. You can build something up to trick it into working using process substitution, but it's a hassle and even more fragile.
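For illustration, with plain uncompressed archives it would look something like this (the part*.tar names are just placeholders):

cp part1.tar combined.tar
# append the remaining archives onto the first one
tar --concatenate --file=combined.tar part2.tar part3.tar
# extraction still needs --ignore-zeros
tar --ignore-zeros -xf combined.tar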
If there are files that appear more than once in different tar files, this won't work properly, but you've got that problem regardless. Otherwise this will give you what you want: piping the output through xz is how tar compresses its output anyway.
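That is, these two commands should produce essentially the same archive (the names here are placeholders):

tar cJf archive.tar.xz somedir
tar cf - somedir | xz > archive.tar.xz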
If archives that only work with a particular tar implementation aren't adequate for your purposes, appending to the archive with r is your friend:
tar cJf combined.tar.xz dummy-file
for x in db-*.tar.xz
do
    mkdir tmp
    pushd tmp
    tar xJf "../$x"
    tar rJf ../combined.tar.xz .
    popd
    rm -r tmp
done
This only ever extracts a single archive at a time, so the working space is limited to the size of a single archive's contents. The compression is streaming just like it would have been had you made the final archive all at once, so it will be as good as it ever could have been. You do a lot of excess decompression and recompression that will make this slower than the cat versions, but the resulting archive will work anywhere without any special support.
Note that — depending on what exactly you want — just adding the uncompressed tar files themselves to an archive might suffice. They will compress (almost) exactly as well as their contents in a single file, and it will reduce the compression overhead for each file. This would look something like:
tar cJf combined.tar.xz dummy-file
for x in db-*.tar.xz
do
    xz -dk "$x"
    tar rJf combined.tar.xz "${x%.xz}"
    rm -f "${x%.xz}"
done
This is slightly less efficient in terms of the final compressed size because there are extra tar headers in the stream, but it saves some time on extracting and re-adding all the files as files. You'd end up with combined.tar.xz containing many (uncompressed) db-*.tar files.
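Unpacking that nested layout would then be a two-step affair, roughly like this (a sketch, assuming the naming above):

tar xJf combined.tar.xz
for t in db-*.tar
do
    # unpack each inner archive, then remove it
    tar xf "$t" && rm -f "$t"
done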
Best Answer
It does, up to a certain point, and then it evens out. The compression algorithms have a restriction on the size of the blocks they look at (bzip2) and/or on the tables they keep with information on previous patterns (gzip). In the case of gzip, once a table is full, old entries get pushed out and compression improves no further. Depending on your compression quality factor (-1 to -9) and the repetitiveness of your input, this filling up can of course take a while, and you might not notice.
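A quick way to see this for yourself (a rough sketch; the line content and the counts are arbitrary, and since gzip's deflate format uses a 32 KiB window, a single repeated line fits easily):

# compare raw vs. compressed size for 10,000 and 1,000,000 similar lines
for n in 10000 1000000
do
    yes "2013-01-01 12:00:00 some highly similar log line" | head -n "$n" > sample.txt
    echo "$n lines: $(wc -c < sample.txt) bytes raw, $(gzip -9 -c sample.txt | wc -c) bytes gzipped"
done
rm -f sample.txt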