Does gzip compression ratio improve over time?

compression, gzip

I have a process that creates a stream of millions of highly similar lines, and I'm piping this to gzip. Does the compression ratio improve over time in such a setup? I.e., is the compression ratio better for 1 million similar lines than for, say, 10,000?

Best Answer

It does, up to a certain point, and then it levels off. Compression algorithms limit either the size of the blocks they look at (bzip2, at most 900 kB per block) or the amount of previous input they keep around for finding repeated patterns (gzip, whose DEFLATE format uses a 32 KiB sliding window).

In the case of gzip, once its 32 KiB history window is full, the oldest data is pushed out and compression stops improving. Depending on your compression level (-1 to -9) and the repetitiveness of your input, reaching that point can take a while, and you might not notice it.
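If you want to see the plateau for yourself, here is a rough sketch in Python using the standard `gzip` module. The line format and the `make_lines` / `ratio` helpers are made up for illustration (they are not from the question), and the exact numbers will depend on your real data and compression level.

```python
import gzip

def make_lines(n):
    # Hypothetical "highly similar" lines: only the zero-padded counter changes.
    return "".join(
        f"ts=1700000000 level=INFO msg=request served id={i:08d}\n"
        for i in range(n)
    ).encode("ascii")

def ratio(n, level=6):
    # Uncompressed size divided by gzip-compressed size for n lines.
    raw = make_lines(n)
    packed = gzip.compress(raw, compresslevel=level)
    return len(raw) / len(packed)

sizes = (10, 1_000, 10_000, 100_000)
ratios = {n: ratio(n) for n in sizes}

for n in sizes:
    print(f"{n:>7} lines: ratio {ratios[n]:.1f}")
```

With input like this, the ratio should climb quickly for the first few hundred lines (the fixed gzip header and the first, unmatched line get amortized) and then stay roughly flat. Raise the largest size to 1,000,000 if you want to mirror the numbers in the question; it just takes longer to run.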
