Shell – Fastest and most efficient way to get number of records (lines) in a gzip-compressed file

gzip, shell

I am trying to do a record count on a 7.6 GB gzip-compressed file. I found a few approaches using the zcat command.

$ zcat T.csv.gz | wc -l
423668947

This works but it takes too much time (more than 10 minutes to get the count). I tried a few more approaches like

$ sed -n '$=' T.csv.gz
28173811
$ perl -lne 'END { print $. }' < T.csv.gz
28173811
$ awk 'END {print NR}' T.csv.gz
28173811

All three of these commands execute pretty fast but give an incorrect count of 28173811.

How can I perform a record count in a minimal amount of time?

Best Answer

The sed, perl and awk commands that you mention all run, but they read the compressed data and count the newline bytes that happen to occur in it. Those newline characters have nothing to do with the newline characters in the uncompressed data.
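
As a quick sanity check (a hypothetical illustration, assuming standard coreutils), you can count the raw newline bytes in the compressed file directly; it should print roughly the same misleading figure, give or take one if the file's last byte is not a newline:

$ tr -cd '\n' < T.csv.gz | wc -c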

To count the number of lines in the uncompressed data, there is no way around uncompressing it. Your approach with zcat is the correct one, and since the data is so large, decompressing it will take time.
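
A marginal speed-up that is sometimes suggested (assuming pigz is installed): pigz cannot parallelize the inflate step itself, but it offloads reading, writing and checksumming to separate threads, and the pipe already lets wc run on another core:

$ pigz -dc T.csv.gz | wc -l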

Most utilities that deal with gzip compression and decompression use the same shared library (zlib) routines to do so. The only way to speed this up would be to find an implementation of the zlib routines that is somehow faster than the default one and rebuild e.g. zcat to use it.
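
Purely as an illustration of that idea (these tools are assumptions on my part, not something tested here): faster inflate implementations exist, for example igzip from Intel's ISA-L, or zlib-ng as a drop-in zlib replacement. If igzip is installed and accepts its documented gzip-style flags, the count would become:

$ igzip -dc T.csv.gz | wc -l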
