Shell – Fastest and most efficient way to get number of records (lines) in a gzip-compressed file

gzip, shell

I am trying to do a record count on a 7.6 GB gzip-compressed file. I found a few approaches using the zcat command.

$ zcat T.csv.gz | wc -l
423668947

This works but it takes too much time (more than 10 minutes to get the count). I tried a few more approaches like

$ sed -n '$=' T.csv.gz
28173811
$ perl -lne 'END { print $. }' < T.csv.gz
28173811
$ awk 'END {print NR}' T.csv.gz
28173811

All three of these commands execute pretty fast but give an incorrect count of 28173811.

How can I perform a record count in a minimal amount of time?

Best Answer

The sed, perl and awk commands that you mention all run, but they read the compressed data and count the newline bytes that happen to occur in it. Those newline characters have nothing to do with the newline characters in the uncompressed data.
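
As a quick sanity check (a hypothetical illustration, assuming standard coreutils), you can count the raw newline bytes in the compressed file directly; it should print roughly the same misleading figure, give or take one if the file's last byte is not a newline:

$ tr -cd '\n' < T.csv.gz | wc -c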

To count the number of lines in the uncompressed data, there is no way around uncompressing it. Your approach with zcat is the correct one, and since the data is so large, decompressing it will take time.
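
A marginal speed-up that is sometimes suggested (assuming pigz is installed): pigz cannot parallelize the inflate step itself, but it offloads reading, writing and checksumming to separate threads, and the pipe already lets wc run on another core:

$ pigz -dc T.csv.gz | wc -l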

Most utilities that deal with gzip compression and decompression use the same shared library (zlib) routines to do so. The only way to speed this up would be to find an implementation of the zlib routines that is somehow faster than the default one and rebuild e.g. zcat to use it.
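
Purely as an illustration of that idea (these tools are assumptions on my part, not something tested here): faster inflate implementations exist, for example igzip from Intel's ISA-L, or zlib-ng as a drop-in zlib replacement. If igzip is installed and accepts its documented gzip-style flags, the count would become:

$ igzip -dc T.csv.gz | wc -l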
