I am trying to do a record count on a 7.6 GB gzipped file. I found a few approaches using the zcat command.
$ zcat T.csv.gz | wc -l
423668947
This works, but it takes too much time (more than 10 minutes to get the count). I tried a few other approaches:
$ sed -n '$=' T.csv.gz
28173811
$ perl -lne 'END { print $. }' < T.csv.gz
28173811
$ awk 'END {print NR}' T.csv.gz
28173811
All three of these commands execute pretty fast but give an incorrect count of 28173811.
How can I perform a record count in a minimal amount of time?
Best Answer
The sed, perl and awk commands that you mention may all be correct, but they read the compressed data and count the newline characters in that. Those newline characters have nothing to do with the newline characters in the uncompressed data.

To count the number of lines in the uncompressed data, there is no way around uncompressing it. Your approach with zcat is the correct one, and since the data is so large, it will take time to uncompress.

Most utilities that deal with gzip compression and decompression will most likely use the same shared library (zlib) routines to do so. The only way to speed it up would be to find an implementation of the zlib routines that is somehow faster than the default one, and to rebuild e.g. zcat to use it.
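As a small demonstration of why the fast commands give a wrong answer, the sketch below (using a made-up file name, sample.csv) counts newline bytes in the compressed file and compares that with the count from the decompressed stream:

```shell
# Create a tiny CSV with exactly 3 records (hypothetical data for illustration).
printf 'a,1\nb,2\nc,3\n' > sample.csv

# Compress it, keeping the original for comparison.
gzip -c sample.csv > sample.csv.gz

# This counts 0x0a bytes that happen to occur in the compressed byte stream --
# an essentially meaningless number (it is what sed/perl/awk were counting).
wc -l < sample.csv.gz

# This decompresses first and counts the real newlines: prints 3.
zcat sample.csv.gz | wc -l
```

The same effect explains the 28173811 figure above: it is the number of newline bytes in the 7.6 GB of compressed data, not the number of records.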