Fastest way of working out uncompressed size of large GZIPPED file

compression gzip

Once a file is gzipped, is there a way to quickly query its uncompressed size without decompressing it, especially when the uncompressed file is larger than 4 GB?

According to RFC 1952 (https://tools.ietf.org/html/rfc1952#page-5), you can read the last 4 bytes of the file (the ISIZE field), but if the uncompressed file was larger than 4 GB, that value is only the uncompressed size modulo 2^32.
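As a minimal sketch of the footer lookup described above, the following Python reads the last 4 bytes of a gzip file and interprets them as the little-endian 32-bit ISIZE field from RFC 1952. The function name read_isize is only illustrative. The sketch assumes a single-member gzip file; for a file made of several concatenated members, the footer describes only the last member.

```python
import struct


def read_isize(path):
    """Return the ISIZE field of a gzip file: uncompressed size modulo 2**32.

    Assumes a single-member gzip file; with concatenated members the
    value only covers the last member.
    """
    with open(path, "rb") as f:
        f.seek(-4, 2)  # 2 == os.SEEK_END: position 4 bytes before end of file
        (isize,) = struct.unpack("<I", f.read(4))  # little-endian unsigned 32-bit
    return isize
```

Because the field is only 32 bits wide, a file whose uncompressed size is 5 * 2^30 bytes would report 2^30 (1073741824) here, which is the wrap-around the question is about.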

The same value can be retrieved by running gunzip -l foo.gz; however, the "uncompressed" column again contains only the uncompressed size modulo 2^32, presumably because it reads the footer as described above.

Is there a way of getting the uncompressed file size without having to decompress the file first? This would be especially useful when gzipped files contain 50 GB or more of data and would take a while to decompress using methods like gzcat foo.gz | wc -c.


EDIT: The 4 GB limitation is openly acknowledged in the man page of the gzip utility included with OS X (Apple gzip 242):

  BUGS
    According to RFC 1952, the recorded file size is stored in a 32-bit
    integer, therefore, it can not represent files larger than 4GB. This
    limitation also applies to -l option of gzip utility.

Best Answer

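I believe the fastest way is to modify gzip so that testing in verbose mode outputs the number of bytes decompressed. This still decompresses the whole file, but it discards the output instead of piping it to another process. On my system, with a 7761108684-byte file, I get: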

% time gzip -tv test.gz
test.gz:     OK (7761108684 bytes)
gzip -tv test.gz  44.19s user 0.79s system 100% cpu 44.919 total

% time zcat test.gz| wc -c
7761108684
zcat test.gz  45.51s user 1.54s system 100% cpu 46.987 total
wc -c  0.09s user 1.46s system 3% cpu 46.987 total

To modify gzip (1.6, as available in Debian), the patch is as follows:

--- a/gzip.c
+++ b/gzip.c
@@ -61,6 +61,7 @@
 #include <stdbool.h>
 #include <sys/stat.h>
 #include <errno.h>
+#include <inttypes.h>

 #include "closein.h"
 #include "tailor.h"
@@ -694,7 +695,7 @@

     if (verbose) {
         if (test) {
-            fprintf(stderr, " OK\n");
+            fprintf(stderr, " OK (%jd bytes)\n", (intmax_t) bytes_out);

         } else if (!decompress) {
             display_ratio(bytes_in-(bytes_out-header_bytes), bytes_in, stderr);
@@ -901,7 +902,7 @@
     /* Display statistics */
     if(verbose) {
         if (test) {
-            fprintf(stderr, " OK");
+            fprintf(stderr, " OK (%jd bytes)", (intmax_t) bytes_out);
         } else if (decompress) {
             display_ratio(bytes_out-(bytes_in-header_bytes), bytes_out,stderr);
         } else {
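
For readers who cannot rebuild gzip, the same idea (decompress, count the bytes, discard the output) can be sketched with Python's standard gzip module. This is not the patch above and is not claimed to be as fast as it; the function name uncompressed_size and the 1 MiB chunk size are illustrative choices only.

```python
import gzip


def uncompressed_size(path, chunk_size=1 << 20):
    """Count uncompressed bytes by streaming through the file.

    The data is decompressed chunk by chunk and thrown away, so memory
    use stays bounded. Python integers do not wrap at 2**32, so sizes
    above 4 GB are reported exactly.
    """
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of stream
                break
            total += len(chunk)
    return total
```

Calling uncompressed_size("test.gz") should return the same number that zcat test.gz | wc -c prints. It still has to read and decompress the whole file, so the running time will be of the same order as the commands timed above.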