I am looking for a compression tool with an arbitrarily large dictionary (and "block size"). Let me explain by way of examples.
First let us create 32 MiB of random data and then concatenate it with itself to make a 64 MiB file.
head -c32M /dev/urandom > test32.bin
cat test32.bin test32.bin > test64.bin
Of course test32.bin is not compressible because it is random, but the first half of test64.bin is the same as the second half, so it should be compressible by roughly 50%.
First let's try some standard tools. test64.bin is exactly 67108864 bytes.
- gzip -9. Compressed size 67119133.
- bzip2 -9. Compressed size 67409123. (A really big overhead!)
- xz -7. Compressed size 67112252.
- xz -8. Compressed size 33561724.
- zstd --ultra -22. Compressed size 33558039.
We learn from this that gzip and bzip2 cannot compress this file at any setting. With a big enough dictionary, however, xz and zstd can compress it, and in that case zstd does the best job.
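The gzip result can be reproduced at a much smaller scale. The sketch below assumes only that gzip's DEFLATE format limits back-references to a 32 KiB window (bzip2 is similarly limited by its block size of at most 900 kB); the function name doubled_gzip_size is made up for this illustration. A repeat 16 KiB back should be found, while a repeat 64 KiB back should not.

```shell
#!/bin/sh
# Scaled-down version of the experiment above, using only gzip.
# Assumption: DEFLATE can only reference data within the previous 32 KiB.

# doubled_gzip_size N: print the gzip -9 size of N random bytes written twice.
doubled_gzip_size() {
    tmp=$(mktemp -d) || return 1
    head -c "$1" /dev/urandom > "$tmp/half.bin"
    cat "$tmp/half.bin" "$tmp/half.bin" | gzip -9 -c | wc -c | tr -d ' '
    rm -rf "$tmp"
}

near=$(doubled_gzip_size 16384)   # repeat distance 16 KiB: inside the window
far=$(doubled_gzip_size 65536)    # repeat distance 64 KiB: outside the window

echo "2 x 16 KiB -> $near bytes (input 32768)"
echo "2 x 64 KiB -> $far bytes (input 131072)"
```

The first size should come out close to half of its input and the second slightly above its input, which is the same pattern as xz -8 versus xz -7 on test64.bin.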
However, now try:
head -c150M /dev/urandom > test150.bin
cat test150.bin test150.bin > test300.bin
test300.bin is exactly 314572800 bytes. Let's try the best compression algorithms again at their highest settings.
- xz -9. Compressed size 314588440.
- zstd --ultra -22. Compressed size 314580017.
In this case neither tool can compress the file.
Is there a tool that has an arbitrarily large dictionary size so it can compress a file such as test300.bin?
Thanks to the comment and the answer, it turns out both zstd and xz can do it. You need zstd version 1.4.x, however.
- zstd --long=28. Compressed size 157306814.
- xz -9 --lzma2=dict=150MiB. Compressed size 157317764.
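Why 28 works can be checked with a little arithmetic. This is a rough sketch, assuming (as the zstd manpage describes it) that --long=N sets a match window of 2^N bytes and that the window must be at least as large as the distance between the two copies; window_log is a hypothetical helper name, not part of zstd.

```shell
#!/bin/sh
# window_log BYTES: print the smallest n such that 2^n >= BYTES.
window_log() {
    n=0
    size=1
    while [ "$size" -lt "$1" ]; do
        size=$((size * 2))
        n=$((n + 1))
    done
    echo "$n"
}

# The two copies in test300.bin start 150 MiB apart.
distance=$((150 * 1024 * 1024))
echo "zstd --long=$(window_log "$distance")"   # prints: zstd --long=28
```

2^27 bytes is 128 MiB, which is too small for a 150 MiB distance, so 28 (256 MiB) is the first value that fits. As far as I recall from the zstd manpage, a window log above 27 also has to be passed when decompressing (for example zstd -d --long=28), so treat that as something to verify for your version.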
Best Answer
It's at least available with the xz command. As documented in the "Custom compressor filter chains" section of the xz manpage, you can simply supply the dictionary size to xz manually with, for example, --lzma2=dict=150MiB (we have inside information telling us that 150MiB is enough; if in doubt, the file size would have to be used).
While doing this, the xz process on amd64 stayed at about 1.6g of resident memory most of the time.
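That memory figure is roughly what one would predict. The sketch below assumes the bt4 match finder and the factor of about 10.5 times the dictionary size that, to my recollection, the xz manpage lists for bt4 with dictionaries above 32 MiB; xz_comp_mem_mib is a hypothetical helper name, and the factor should be checked against your xz version.

```shell
#!/bin/sh
# xz_comp_mem_mib DICT_MIB: rough compressor memory in MiB.
# Assumption: bt4 match finder, dictionary above 32 MiB, factor of about 10.5.
xz_comp_mem_mib() {
    echo $(( $1 * 105 / 10 ))
}

echo "dict=150MiB -> about $(xz_comp_mem_mib 150) MiB"   # prints: dict=150MiB -> about 1575 MiB
```

About 1575 MiB is in line with the 1.6g observed, so a dictionary sized to a whole file of several gigabytes would need roughly ten times that file's size in RAM under the same assumption.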