Shell Script – Awk Optimization

awk, shell

I am looking for help optimizing a Bro network log parsing script. Here's the background:

I have a large number of Bro logs, but I'm only interested in querying IPs within my scope (multiple variable-length subnets).

So I have a text file with regex patterns to match the IP ranges I'm looking for:
scope.txt:

/^10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5])$/

(scope.txt contains up to 20 more lines of other IP ranges in regex patterns)
findInScope.sh:

#!/bin/sh
for file in /data/bro_logs/2016-11-26/conn.*.log.gz
do
    echo "$file"
    touch /tmp/$file
    for nets in $(cat scope.txt)
    do
        echo "$nets"
        zcat $file | bro-cut -d | awk '$3 ~ '$nets' || $5 ~ '$nets'' >> /tmp/$file
    done
    sort /tmp/$file | uniq > ~/$file
    rm /tmp/$file
done

As more background, each hour of original Bro conn logs is about 100 MB, and my current script takes about 10-20 minutes to parse one hour of log data. One day of logs can take up to 3 hours.

I thought about a single awk statement with 40 or's, but decided against it because I want to keep a separate scope.txt file so I can use the same script for different scopes of IP ranges.

I also tried zcat on multiple conn.log files (i.e. zcat conn.*.log.gz) but the output file ended up being over 1GB, and I wanted to keep hourly logs intact.

Best Answer

You should gain a lot by passing the log file just once through awk. This means combining all the regexps into one. If you don't want to do this in your scope.txt file, then do it before calling awk. For example,

sed <scope.txt 's|^/\^|(|; s|\$/$|)|; $!s/$/|/' | tr -d '\n' >pattern

zcat $file | bro-cut -d |
awk '
BEGIN{ getline pat <"pattern"; pat = "^(" pat ")$" }
$3 ~ pat || $5 ~ pat
'  >~/$file

The sed command replaces the /^ and $/ surrounding each regexp line with an enclosing () pair and adds a | at the end of every line except the last; tr then joins the lines, so the result ends up on a single line in the file pattern. This file therefore holds all the patterns or-ed together. The ^(...)$ anchoring that sed stripped is added back in the awk BEGIN block, which reads the pattern file into the variable pat.
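To see what the sed and tr steps produce, here is a small sketch run against a hypothetical two-line scope.txt. The second pattern and the sample log lines are made up for illustration; real bro-cut output is tab-separated and has more columns, but awk's default field splitting handles both.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical scope file: the first line is from the question,
# the second is an invented extra range.
cat > scope.txt <<'EOF'
/^10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5])$/
/^192\.168\.1\.[0-9]+$/
EOF

# Same sed/tr pipeline as above.
sed <scope.txt 's|^/\^|(|; s|\$/$|)|; $!s/$/|/' | tr -d '\n' >pattern

cat pattern; echo
# (10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5]))|(192\.168\.1\.[0-9]+)

# Invented sample lines standing in for `zcat ... | bro-cut -d` output:
# field 3 is the originating IP, field 5 the responding IP.
printf '%s\n' \
    't1 u1 10.0.0.8 1234 8.8.8.8 53' \
    't2 u2 10.0.0.46 1234 8.8.8.8 53' \
    't3 u3 1.2.3.4 80 192.168.1.77 443' \
    't4 u4 110.0.0.9 80 8.8.4.4 53' |
awk '
BEGIN{ getline pat <"pattern"; pat = "^(" pat ")$" }
$3 ~ pat || $5 ~ pat
' >matched.txt

cat matched.txt
# t1 u1 10.0.0.8 1234 8.8.8.8 53
# t3 u3 1.2.3.4 80 192.168.1.77 443
```

Lines t2 and t4 are dropped: 10.0.0.46 falls outside the 8-45 range, and 110.0.0.9 fails because the outer ^(...)$ anchors the match to the whole field.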

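The above replaces your inner for loop and the sort|uniq: each log line is now tested once against the combined pattern, so it cannot be written out more than once.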
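Putting the pieces together, the whole job might be structured as in the sketch below. The function names (build_pattern, filter_scope, process_logs), the use of mktemp for the pattern file, and writing each result to an output directory under the log's base name are assumptions made for this example, not part of the original script.

```shell
# build_pattern SCOPE_FILE
# Print every regexp in SCOPE_FILE or-ed together on one line.
build_pattern() {
    sed 's|^/\^|(|; s|\$/$|)|; $!s/$/|/' "$1" | tr -d '\n'
}

# filter_scope PATTERN_FILE
# Copy stdin to stdout, keeping lines whose field 3 or 5 matches the pattern.
filter_scope() {
    awk -v patfile="$1" '
        BEGIN { getline pat < patfile; pat = "^(" pat ")$" }
        $3 ~ pat || $5 ~ pat
    '
}

# process_logs SCOPE_FILE OUT_DIR LOG.gz...
# Build the pattern once, then pass each log through awk a single time.
process_logs() {
    scope=$1
    outdir=$2
    shift 2
    patfile=$(mktemp) || return 1
    build_pattern "$scope" >"$patfile"
    for file in "$@"
    do
        [ -e "$file" ] || continue      # skip an unmatched glob
        zcat "$file" | bro-cut -d | filter_scope "$patfile" \
            >"$outdir/$(basename "$file" .gz)"
    done
    rm -f "$patfile"
}

# Example call (log path from the question; the output directory is hypothetical):
# process_logs scope.txt "$HOME/in_scope" /data/bro_logs/2016-11-26/conn.*.log.gz
```

Only the pattern file's name is passed with -v, and the regexp itself is still read with getline. The reason is that awk interprets backslash escapes in -v values, so passing the pattern that way could alter sequences such as \. before they reach the regexp engine.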
