I have very large logs (several gigabytes per day) that may, but need not, contain specific lines. I have to count the number of occurrences of each of these lines on a daily basis.
I have a file, patterns.in, that contains the desired lines. For example:
aaaa
bbbb
cccc
dddd
eeee
ffff
The log files can look like this:
asd
dfg
aaaa
aaaa
sa
sdf
dddd
dddd
dddd
dddd
ghj
bbbb
cccc
cccc
cccc
fgg
fgh
hjk
The first (and perhaps most obvious) approach is to use grep, sort and uniq in the following way:
grep -f patterns.in logfile.txt | sort | uniq -c
which gives the following result:
2 aaaa
1 bbbb
3 cccc
4 dddd
It is close to what I want to achieve, but my desired result is:
2 aaaa
1 bbbb
3 cccc
4 dddd
0 eeee
0 ffff
So the problem is: how do I print '0' when a line from the patterns.in file is not matched? It needs to be done in the simplest possible way, as all I have available is the Cygwin environment.
Best Answer
How about feeding the pattern file back in as a data file, so that each pattern finds at least one match, and then subtracting one from the final reported count for each match?
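A minimal sketch of that idea, using the sample data from the question. The -F and -x flags (fixed strings, whole-line matches) are my assumption about what "specific lines" means; the original command used plain grep -f. The awk step and the scratch directory are only for illustration.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' aaaa bbbb cccc dddd eeee ffff > "$workdir/patterns.in"
printf '%s\n' asd dfg aaaa aaaa sa sdf dddd dddd dddd dddd ghj \
  bbbb cccc cccc cccc fgg fgh hjk > "$workdir/logfile.txt"

# Feed the patterns in together with the log so that every pattern matches
# at least once, then subtract that one extra match from each count.
# -F: treat patterns as fixed strings, -x: match whole lines only
# (assumptions; plain `grep -f` would do regex/substring matching).
result=$(cat "$workdir/patterns.in" "$workdir/logfile.txt" |
  grep -Fxf "$workdir/patterns.in" |
  sort | uniq -c |
  awk '{ n = $1 - 1; sub(/^ *[0-9]+ /, ""); print n, $0 }')

printf '%s\n' "$result"
rm -rf "$workdir"
```

For the sample data this prints the desired result, including the 0 eeee and 0 ffff lines. Note that subtracting exactly one is only safe when each pattern line matches itself and no other pattern line, which is what -x guarantees here; with substring matching, a pattern contained in another pattern would be over-counted.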