Uncompressed file estimation wrong

compression gzip split

I had a large (~60G) compressed file (tar.gz).

I used split to break it into 4 parts and then cat to join them back together.

However, when I now try to estimate the size of the uncompressed file, the reported size is smaller than the compressed file. How is this possible?

$ gzip -l myfile.tar.gz 
         compressed        uncompressed  ratio uncompressed_name
        60680003101          3985780736 -1422.4% myfile.tar

Best Answer

This is caused by the size of the field used to store the uncompressed size in gzipped files: it’s only 32 bits, so gzip can only record sizes below 4 GiB. Anything larger is still compressed and decompressed correctly, but gzip -l reports the true size modulo 2³², so both the uncompressed size and the ratio it prints are wrong.

So splitting the tarball and reconstructing it hasn’t caused this, and shouldn’t have affected the file. If you want to make sure, you can check it with gzip -tv.
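
A minimal sketch of that check, run on a small stand-in file rather than the real 60G archive. The file names (sample.bin, sample.gz, part.*, rejoined.gz) and the sizes are made up for illustration, and it assumes a system that has mktemp, head -c and /dev/urandom. It splits a gzip file, joins the pieces again, confirms the result is byte-identical, tests it with gzip -t, and measures the real uncompressed size by decompressing into wc -c, which is not limited to 32 bits:

```shell
#!/bin/sh
set -e

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Small stand-in for the real archive (hypothetical names and sizes).
head -c 200000 /dev/urandom > sample.bin
gzip -c sample.bin > sample.gz

# Split into pieces and join them again, as in the question.
split -b 50000 sample.gz part.
cat part.* > rejoined.gz

# A byte-for-byte comparison shows split + cat changed nothing.
cmp sample.gz rejoined.gz && echo "identical"

# Integrity test: decompresses internally and checks the CRC.
gzip -t rejoined.gz && echo "gzip test ok"

# True uncompressed size: count the bytes of the decompressed stream.
true_size=$(( $(gzip -dc rejoined.gz | wc -c) ))
echo "uncompressed bytes: $true_size"
```

On the real archive the equivalent commands would be gzip -tv myfile.tar.gz and gzip -dc myfile.tar.gz | wc -c. Both have to read the whole file, so expect them to take a while.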

See "Fastest way of working out uncompressed size of large GZIPPED file" for more details, and the gzip manual:

The gzip format represents the input size modulo 2³², so the uncompressed size and compression ratio are listed incorrectly for uncompressed files 4 GiB and larger.
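
To make the modulo concrete, here is a rough sketch using the two numbers from the gzip -l output in the question. The true size must be 3985780736 + k × 2³² for some whole number k. Assuming the tarball did not grow when it was compressed (an assumption, not something gzip -l can confirm), the smallest k that gives a size at least as large as the compressed file yields a lower bound. It relies on 64-bit shell arithmetic, which common shells on 64-bit systems provide:

```shell
#!/bin/sh
# Values copied from the gzip -l output in the question.
reported=3985780736      # stored uncompressed size (true size mod 2^32)
compressed=60680003101   # size of myfile.tar.gz on disk
modulus=4294967296       # 2^32

# Candidate true sizes are reported + k * 2^32.
# Assumption: the uncompressed data is at least as large as the compressed file.
k=0
candidate=$reported
while [ "$candidate" -lt "$compressed" ]; do
  k=$((k + 1))
  candidate=$((reported + k * modulus))
done

echo "k=$k candidate=$candidate"
```

This prints k=14 candidate=64115322880, so the tarball is at least about 64 GB uncompressed. The real value could correspond to any larger k; only decompressing the stream gives the exact figure.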