Compressing many similar big files

archiving, compression, tar, xz

I have hundreds of similar large files (30 MB each) that I want to compress. Any two files share 99% of their data (less than 1% difference), so I expect the archive to be no larger than 40-50 MB.

A single file compresses from 30 MB to 13-15 MB (with xz -1, gzip -1, or bzip2 -1), but when compressing two or more files I want an archive of about 13-15 MB + N*0.3 MB, where N is the number of files.

When I use tar (to create a solid archive) and xz -6 (intending the compression dictionary to be bigger than one file; update: this was not enough!), I still get an archive of about N*13 MB.

I think neither gzip nor bzip2 will help me, because their dictionaries are smaller than 1 MB and my tar stream repeats only every 30 MB.

How can I solve this problem on modern Linux using standard tools?

Is it possible to tune xz to compress fast but use a dictionary bigger than 30-60 MB?

Update: This did the trick: tar c input_directory | xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G > compressed.tar.xz. I am not sure whether the mf=hc4 and --memory=2G options are necessary, but dict=128M makes the dictionary big enough (bigger than one file), and mode=fast makes the process a bit faster than -e.

Best Answer

Given your details, I assume that you have verified that your files really have 99% of data in common, with a contiguous (or almost contiguous) 1% of difference in them.

First, you should use tar to make one archive with your files inside it. For testing, I would create a .tar with 10 files, which gives about 300 MB.

Then configure xz so that its dictionary is bigger than one file. Since you don't say whether you have memory restrictions, I'd go with xz -9. There's no point in not using all available memory.

I'd also try the --extreme option (-e) to test whether it makes a difference.
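
The test described above could be scripted roughly as follows. This is only a sketch and not the answerer's own commands: the function names make_test_tar and try_presets are made up for illustration, and it assumes tar, xz, and wc are on the PATH.

```shell
# make_test_tar OUT FILE...: bundle the given files into one uncompressed tar.
make_test_tar() {
    out=$1; shift
    tar -cf "$out" "$@"
}

# try_presets TAR OPTION...: compress TAR once per xz option and print the
# option and the compressed size in bytes, separated by a tab.
# xz -c writes to stdout, so the tar file itself is left untouched.
try_presets() {
    tarball=$1; shift
    for opt in "$@"; do
        size=$(xz -c "$opt" "$tarball" | wc -c | tr -d ' ')
        printf '%s\t%s\n' "$opt" "$size"
    done
}
```

With ten of your own files bundled by make_test_tar into test.tar, running try_presets test.tar -6 -9 -9e prints one size per option, so the effect of the larger dictionary and of -e can be compared directly. Note that the xz man page lists roughly 674 MiB of compressor memory for -9.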

Dictionary size

According to one piece of documentation that I have available (site), the dictionary size is roughly equal to the decompressor's memory usage. The -1 preset means a dictionary of 1 MiB, and -6 means 10 MiB (or 8 MiB in another part of the same manual; the preset table in the xz man page lists 8 MiB). Either way the dictionary is smaller than one 30 MB file, which is why you get no advantage from tarring those files together. Using -9 would make the decompressor memory (and so the dictionary) 64 MiB, and I think that is what you wanted.
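
The dictionary effect can be reproduced on a small scale. The sketch below is not from the original answer: it uses scaled-down, made-up test data (two 2 MiB files that differ only in their last 20 KiB) and dictionaries of 1 MiB and 8 MiB in place of the real 30 MB files, so that it runs quickly. It assumes head -c, tar, and xz are available.

```shell
set -eu
work=$(mktemp -d)
mkdir "$work/in"

# Base file: 2 MiB of random (incompressible) bytes.
head -c 2097152 /dev/urandom > "$work/in/a.bin"

# Second file: same first 2076672 bytes, then 20 KiB (about 1%) of new data.
head -c 2076672 "$work/in/a.bin" > "$work/in/b.bin"
head -c 20480 /dev/urandom >> "$work/in/b.bin"

# Solid archive: the repeated data sits about 2 MiB apart in the stream.
tar -C "$work" -cf "$work/in.tar" in

# Dictionary smaller than one file: the earlier copy is out of reach.
small=$(xz -c --lzma2=dict=1MiB "$work/in.tar" | wc -c | tr -d ' ')
# Dictionary larger than one file: the second copy is mostly back-references.
large=$(xz -c --lzma2=dict=8MiB "$work/in.tar" | wc -c | tr -d ' ')

echo "dict=1MiB: $small bytes"
echo "dict=8MiB: $large bytes"
rm -rf "$work"
```

With the 1 MiB dictionary the output should stay close to the size of both files together, while the 8 MiB dictionary should bring it down to roughly one file plus the changed part. The same reasoning applies to 30 MB files and the 8 MiB dictionary of -6 versus the 64 MiB dictionary of -9.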

Edit

Another possibility would be to use a different compressor. I'd go with 7zip, but I would tar those files first and then 7zip the resulting tar.

Depending on your files' content, perhaps you could use 7zip with the PPMd method (instead of LZMA or LZMA2, which is the default and the same algorithm used by xz).

Not good: Zip (dict = 32 kB), Bzip2 (dict = 900 kB).
