Using grep/sed/awk to classify log file entries


I need to process a very large log file with many lines in different formats.

My goal is to extract unique line entries that share the same starting pattern, e.g. '^2011-02-21.*MyKeyword.*Error', effectively obtaining one sample line for each pattern and thereby identifying the patterns.

I only know a few patterns so far, and browsing through the file manually is definitely not an option.

Please note that besides the known patterns, there are a number of unknown ones too, and I'd like to automate extracting those as well.

What is the best way to do this? I know regular expressions quite well, but haven't done much work with awk/sed, which I imagine would be used at some point in this process.

Best Answer

If I understand correctly, you have a bunch of patterns, and you want to extract one match per pattern. The following awk script should do the trick. It prints the first occurrence of the given pattern, and records that the pattern has been seen so as not to print subsequent occurrences.

awk '
# print the first line matching the known pattern, then skip its duplicates
/^2011-02-21.*MyKeyword.*Error/ {
    if (!seen["^2011-02-21.*MyKeyword.*Error"]++) print;
    next;
}
1 {if (!seen[""]++) print}  # also print the first line that matches no pattern
'
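
For a very large file it may be more convenient to save the program to a file and pass the log as an argument (classify.awk and app.log are placeholder names here):

awk -f classify.awk app.log > samples.txt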

Here's a variant that keeps one MyKeyword.*Error line per day. (The {4} interval syntax requires a POSIX-conformant awk; very old gawk versions may need --re-interval.)

awk '
/^[0-9]{4}-[0-9]{2}-[0-9]{2}.*MyKeyword.*Error/ {
    # key on the date (the first 10 characters) so each day contributes one sample
    if (!seen[substr($0,1,10) "MyKeyword.*Error"]++) print;
    next;
}
'
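
For the unknown patterns, one possible way to extend the same seen-array idea, sketched here without knowing your actual log formats, is to derive a rough signature from each line (for example by collapsing runs of digits) and print the first line per signature:

awk '
{
    sig = $0;
    gsub(/[0-9]+/, "N", sig);   # collapse numbers so timestamps and IDs do not make every line unique
    if (!seen[sig]++) print;    # print one sample line per distinct signature
}
' app.log

You will likely want to tune the normalization (e.g. also collapsing hex strings or quoted values) to match how your log lines actually vary.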