The simplest method to count lines matching specific patterns, including ‘0’ if a line is not found

grep sort uniq wc

I have very large logs (several gigabytes per day) that may, but need not, contain specific lines. I have to count the number of occurrences of each of these lines on a daily basis.

I have a file, patterns.in, that contains the desired lines. For example:

aaaa
bbbb
cccc
dddd
eeee
ffff

The log files can look like this:

asd
dfg
aaaa
aaaa
sa
sdf
dddd
dddd
dddd
dddd
ghj
bbbb
cccc
cccc
cccc
fgg
fgh
hjk

The first (and perhaps most obvious) approach is to use grep, sort and uniq in the following way:

grep -f patterns.in logfile.txt | sort | uniq -c

which gives the following result:

   2 aaaa
   1 bbbb
   3 cccc
   4 dddd

It is close to what I want to achieve, but my desired result is:

   2 aaaa
   1 bbbb
   3 cccc
   4 dddd
   0 eeee
   0 ffff

So the problem is: how do I print '0' if a line from the patterns.in file is not matched? It needs to be done in the simplest possible way, as all I have available is the Cygwin environment.

Best Answer

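How about feeding the pattern file back in as a data file, so that each pattern finds at least one match, and then subtracting one from the final reported count for each pattern? Because grep is given two input files, it prefixes every matching line with the file name and a colon; cut removes that prefix before the lines are counted.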

grep -f patterns.in logfile.txt patterns.in | cut -f2 -d':' | sort | uniq -c | awk '{print($1 - 1" "$2)}'
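If the patterns are meant to match whole lines exactly, as in the example above, the same counts could probably also be produced in a single awk pass, without sorting the matches. The following is only a sketch under that assumption, not part of the original answer: the function name count_patterns is made up for illustration, the pattern file is assumed to be non-empty, and lines are compared literally rather than as regular expressions.

```shell
# count_patterns PATTERN_FILE LOG_FILE...
# Hypothetical helper (name invented for this sketch).
# Prints "<count> <pattern>" for every line of PATTERN_FILE, in file order,
# including 0 for patterns that never occur in the logs.
# Assumptions: PATTERN_FILE is non-empty; a log line counts only if it is
# exactly equal to a pattern line (no substring or regex matching).
count_patterns() {
    patterns=$1
    shift
    awk '
        # First file (the patterns): remember each one and start its count at 0.
        NR == FNR {
            if (!($0 in count)) {
                order[++n] = $0
                count[$0] = 0
            }
            next
        }
        # Remaining files (the logs): count lines that equal a known pattern.
        $0 in count { count[$0]++ }
        # Report every pattern, matched or not, in its original order.
        END {
            for (i = 1; i <= n; i++)
                printf "%d %s\n", count[order[i]], order[i]
        }
    ' "$patterns" "$@"
}
```

Called as count_patterns patterns.in logfile.txt, this should list all six patterns from the question, with 0 for eeee and ffff. The trade-off compared with the grep pipeline is that the patterns are no longer treated as regular expressions and partial-line matches are ignored, which may or may not be what you want.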