Shell Script – Extracting Queries from Log Files Excluding Banned Lines


I have a log file that looks something like the following:

query1 startQuery
query1 do something
query1 do something else
query2 startQuery
query1 do something banned
query2 do something
query3 startQuery
query2 endQuery 1000
query3 something else to do
query1 endQuery 2003
query3 do something
query4 startQuery
query4 endQuery 100
query3 endQuery 1434

I find the longest-running queries like this:

> grep "endQuery" logfile | awk '{print $3 " " $1}' | sort -nr | head -n 3
2003 query1
1434 query3
1000 query2

However, there are certain operations known to be long, and I want to find the longest running queries that do not include these operations. For example, I want to find the longest running queries that do not, in any of their log lines, include the word "banned".

In this example it would output:

1434 query3
1000 query2
100 query4

In reality these log files are large and contain a lot of queries.

Best Answer

First, note that you don't need the call to grep: the pattern match can be done inside the awk call itself.

<logfile awk '/endQuery/ {print $3 " " $1}'

You can filter out the banned queries at the awk stage. Keep the active queries in an array, delete a query from it as soon as one of its lines matches the banned word, and print a query at its endQuery line only if it is still in the array.

<logfile awk '
    $2 == "startQuery" {q[$1]=1}        # store the names of active queries
    q[$1] && /banned/ {delete q[$1]}    # delete banned queries
    $2 == "endQuery" {
        if (q[$1]) print $3, $1;        # only report non-banned queries
        delete q[$1];
    }
' | sort -nr | head -n 3
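
As a quick check, the script above can be run against the sample log from the question. This is only a sketch: the `log` and `result` variables are illustrative names that are not part of the original answer, and the log is fed in through a pipe rather than read from a file.

```shell
# Sample log from the question, held in a variable for this demo (hypothetical name).
log='query1 startQuery
query1 do something
query1 do something else
query2 startQuery
query1 do something banned
query2 do something
query3 startQuery
query2 endQuery 1000
query3 something else to do
query1 endQuery 2003
query3 do something
query4 startQuery
query4 endQuery 100
query3 endQuery 1434'

# Same awk program as in the answer, reading the log from standard input.
result=$(printf '%s\n' "$log" | awk '
    $2 == "startQuery" {q[$1]=1}        # remember each query when it starts
    q[$1] && /banned/ {delete q[$1]}    # forget a query once a banned line is seen
    $2 == "endQuery" {
        if (q[$1]) print $3, $1         # report only queries still remembered
        delete q[$1]
    }
' | sort -nr | head -n 3)

printf '%s\n' "$result"
```

On this sample it prints 1434 query3, 1000 query2 and 100 query4, one per line, which is the output the question asks for. query1 is dropped even though its endQuery line reports the largest duration, because one of its lines contains "banned".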