compression gzip bzip2 xz zstd – Compression Tool with an Arbitrarily Large Dictionary

I am looking for a compression tool with an arbitrarily large dictionary (and "block size"). Let me explain by way of examples.

First, let us create 32 MiB of random data and then concatenate it with itself to make a 64 MiB file.

head -c32M /dev/urandom > test32.bin
cat test32.bin test32.bin > test64.bin

Of course, test32.bin is not compressible because it is random, but the first half of test64.bin is identical to the second half, so it should be compressible by roughly 50%.

First, let's try some standard tools. test64.bin is exactly 67108864 bytes.

  • gzip -9. Compressed size 67119133.
  • bzip2 -9. Compressed size 67409123. (A really big overhead!)
  • xz -7. Compressed size 67112252.
  • xz -8. Compressed size 33561724.
  • zstd --ultra -22. Compressed size 33558039.

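We learn from this that gzip and bzip2 cannot compress this file at any setting: gzip only looks back 32 KiB and bzip2 works on blocks of at most 900 kB, so neither can see a repeat that is 32 MiB away. With a big enough dictionary, however, xz and zstd can compress the file, and in that case zstd does the best job.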
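The pattern in these numbers follows from how far back each tool can look. Here is a rough sketch of that reasoning, assuming the commonly documented limits (a 32 KiB window for gzip, 900 kB blocks for bzip2 -9, and the 16/32/64 MiB dictionaries of xz -7/-8/-9). The helper name reaches_repeat is made up for illustration.

```shell
# reaches_repeat DISTANCE WINDOW
# Succeeds if a match DISTANCE bytes back fits inside a WINDOW-byte history.
reaches_repeat() {
    [ "$1" -le "$2" ]
}

distance=$((32 * 1024 * 1024))   # test64.bin repeats itself after 32 MiB

# tool:window pairs in bytes (assumed limits, see the text above)
for entry in "gzip-9:32768" "bzip2-9:900000" \
             "xz-7:16777216" "xz-8:33554432" "xz-9:67108864"; do
    tool=${entry%%:*}
    window=${entry##*:}
    if reaches_repeat "$distance" "$window"; then
        echo "$tool: repeat is within reach"
    else
        echo "$tool: repeat is out of reach"
    fi
done
```

Only xz -8 and xz -9 come out as "within reach", which matches the compressed sizes listed above.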

However, now try:

head -c150M /dev/urandom > test150.bin
cat test150.bin test150.bin > test300.bin

test300.bin is of size exactly 314572800. Let's try the best compression algorithms again at their highest settings.

  • xz -9. Compressed size 314588440.
  • zstd --ultra -22. Compressed size 314580017.

In this case neither tool can compress the file.

Is there a tool that has an arbitrarily large dictionary size, so that it can compress a file such as test300.bin?


Thanks to the comment and the answer, it turns out that both zstd and xz can do it. You need zstd version 1.4.x, however.

  • zstd --long=28. Compressed size 157306814.
  • xz -9 --lzma2=dict=150MiB. Compressed size 157317764.
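The 28 passed to --long is a base-2 logarithm of the window size, so it has to be large enough for the window to span the 150 MiB between the two copies. A small sketch of how that value could be derived; the function name zstd_window_log is hypothetical, and the starting value of 10 is just a floor chosen for this sketch.

```shell
# zstd_window_log DISTANCE
# Prints the smallest N such that a 2^N-byte window covers DISTANCE bytes.
zstd_window_log() {
    n=10                                  # floor chosen for this sketch
    while [ $((1 << n)) -lt "$1" ]; do
        n=$((n + 1))
    done
    echo "$n"
}

zstd_window_log $((150 * 1024 * 1024))    # prints 28

# Possible usage, not run here:
#   zstd --long=28 test300.bin
#   zstd -d --long=28 test300.bin.zst
```

As far as I know, windows above 128 MiB (that is, values above 27) also have to be allowed explicitly when decompressing, which is why the commented decompression line repeats --long=28.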

Best Answer

It's at least available with the xz command. The xz manpage has:

The following table summarises the features of the presets:

Preset    DictSize    CompCPU     CompMem     DecMem
    -0    256 KiB        0          3 MiB      1 MiB

[...]

    -9     64 MiB        6        674 MiB     65 MiB

Column descriptions:

DictSize is the LZMA2 dictionary size. It is waste of memory to use a dictionary bigger than the size of the uncompressed file. This is why it is good to avoid using the presets -7 ... -9 when there's no real need for them. [...]

As documented in the Custom compressor filter chains section, you can simply supply the dictionary size to xz manually, for example with --lzma2=dict=150MiB (here we have inside information that 150 MiB is enough; when in doubt, the file size would have to be used).

xz -9 --lzma2=dict=150MiB test300.bin
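Without inside knowledge of where the repeats are, the fallback is the file size. A minimal sketch of that, assuming the limits I recall from the xz man page (4 KiB minimum, 1.5 GiB maximum for compression); the helper name xz_dict_option is hypothetical. It writes the preset inside the filter option (preset=9,dict=...), because as far as I can tell from the man page a -9 given before a custom filter option is not carried over.

```shell
# xz_dict_option FILE
# Prints an --lzma2 option whose dictionary is as large as FILE,
# clamped to the limits assumed above.
xz_dict_option() {
    size=$(wc -c < "$1" | tr -d ' ')      # file size in bytes
    min=4096                              # 4 KiB
    max=1610612736                        # 1.5 GiB
    if [ "$size" -lt "$min" ]; then
        size=$min
    elif [ "$size" -gt "$max" ]; then
        size=$max
    fi
    echo "--lzma2=preset=9,dict=$size"
}

# Possible usage, not run here:
#   xz $(xz_dict_option test300.bin) test300.bin
```

For test300.bin this asks for a 300 MiB dictionary, twice what is strictly needed, so expect memory use to grow accordingly.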

While doing this, the xz process on amd64 stayed at about 1.6g of resident memory most of the time.

$ ls -l test*
-rw-r--r--. 1 user user 157286400 Jan 19 16:03 test150.bin
-rw-r--r--. 1 user user 157317764 Jan 19 16:03 test300.bin.xz
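The resident memory figure is in line with a rule of thumb I recall from the xz man page for the bt4 match finder: roughly 11.5 times the dictionary size up to 32 MiB, and 10.5 times above that. Treat those factors as an assumption, not a guarantee. The function name below is made up for this sketch.

```shell
# xz_bt4_mem_mib DICT_MIB
# Rough compressor memory in MiB for a dictionary of DICT_MIB MiB,
# using the assumed bt4 factors (11.5x up to 32 MiB, 10.5x above).
xz_bt4_mem_mib() {
    if [ "$1" -le 32 ]; then
        echo $(( $1 * 115 / 10 ))
    else
        echo $(( $1 * 105 / 10 ))
    fi
}

xz_bt4_mem_mib 64     # prints 672, close to the 674 MiB listed for -9
xz_bt4_mem_mib 150    # prints 1575, i.e. about 1.5 GiB
```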