Using grep/sed/awk to classify log file entries


I need to process a very large log file with many lines in different formats.

My goal is to extract unique line entries that share the same starting pattern, e.g. '^2011-02-21.*MyKeyword.*Error', effectively obtaining one sample line for each pattern and thereby identifying the patterns.

I only know a few patterns so far, and browsing through the file manually is definitely not an option.

Please note that besides the known patterns, there are a number of unknown ones too, and I'd like to automate extracting those as well.

What is the best way to do this? I know regular expressions quite well, but haven't done much work with awk/sed, which I imagine would be used at some point in this process.

Best Answer

If I understand correctly, you have a bunch of patterns, and you want to extract one match per pattern. The following awk script should do the trick. It prints the first occurrence of the given pattern, and records that the pattern has been seen so as not to print subsequent occurrences.

awk '
# print the first line matching the known pattern, then skip its duplicates
/^2011-02-21.*MyKeyword.*Error/ {
    if (!seen["^2011-02-21.*MyKeyword.*Error"]++) print;
    next;
}
1 {if (!seen[""]++) print}  # also print the first line that matches no pattern
'
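
For a very large file it may be more convenient to save the program to a file and pass the log as an argument (classify.awk and app.log are placeholder names here):

awk -f classify.awk app.log > samples.txt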

Here's a variant that keeps one MyKeyword.*Error line per day. (The {4} interval syntax requires a POSIX-conformant awk; very old gawk versions may need --re-interval.)

awk '
/^[0-9]{4}-[0-9]{2}-[0-9]{2}.*MyKeyword.*Error/ {
    # key on the date (the first 10 characters) so each day contributes one sample
    if (!seen[substr($0,1,10) "MyKeyword.*Error"]++) print;
    next;
}
'
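
For the unknown patterns, one possible way to extend the same seen-array idea, sketched here without knowing your actual log formats, is to derive a rough signature from each line (for example by collapsing runs of digits) and print the first line per signature:

awk '
{
    sig = $0;
    gsub(/[0-9]+/, "N", sig);   # collapse numbers so timestamps and IDs do not make every line unique
    if (!seen[sig]++) print;    # print one sample line per distinct signature
}
' app.log

You will likely want to tune the normalization (e.g. also collapsing hex strings or quoted values) to match how your log lines actually vary.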