Bash – Optimizing a shell script with a long-running while loop

bash, grep, shell, shell-script

I have written a shell script which has to do the following:

  1. Capture session commands into one file.
  2. Write every individual command into a separate file.
  3. Mail the contents of each individual command file based on certain criteria.

From what I have observed, the loop has to iterate at least 25,000 times. My problem is that it takes more than 6 hours to complete all iterations.

Below is the main part of the script, which is what takes so long to process.

 if [ -s "$LOC/check.txt" ]; then

    while read line; do
            echo -e " started processing $line at `date` " >> "$SCRIPT_LOC/running_status.txt"
            TST=`grep -w $line $PERM_LOC/id_processing.txt`
            USER=`echo $TST | grep -w $line | awk -F '"' '{print $10}'`
            HOST=`echo $TST | grep -w $line | awk -F '"' '{print $18}'`
            ID=`echo $TST | echo $line | tr -d '\"'`
            IP=`echo $TST | grep -w $line | awk -F '"' '{print $20}'`
            DB=`echo $TST | grep -w $line | awk -F '"' '{print $22}'`
            CONN_TSMP=`echo $TST | grep -w $line | awk -F '"' '{print $2}'`

            if [ -z "$IP" ]; then
                    IP=`echo "$HOST"`
            fi

            if [ "$USER" == "root" ] && [ -z $DB ]; then
                     TARGET=/data1/sessions/root_sec
                     CMD_TARGET=/data1/commands/root_commands
                     FILE=`echo "$ID-$CONN_TSMP-$USER@$IP.txt"`
            else
                     TARGET=/data1/sessions/user_sec
                     CMD_TARGET=/data1/commands/user_commands
                     FILE=`echo "$ID-$CONN_TSMP-$USER@$IP.txt"`
            fi

            ls $TARGET/$FILE
            if [ $? -ne 0 ]; then
                     echo $TST | awk -F 'STATUS="0"' '{print $2}'| sed "s/[</>]//g" >> "$TARGET/$FILE"
                     echo -e "\n" >> "$TARGET/$FILE"
             fi

             grep $line  $LOC/out.txt  > "$LOC/temp.txt"

             while read val; do
                      TSMP=`echo "$val" | awk -F '"' '{print $2}'`
                      QUERY=`echo "$val" | awk -F 'SQLTEXT=' '{print $2}' | sed "s/[/]//g"`
                       echo " TIMESTAMP=$TSMP " >> "$TARGET/$FILE"
                       echo " QUERY=$QUERY " >> "$TARGET/$FILE"
                       RES=`echo "$QUERY" | awk {'print $1'} | sed 's/["]//g' `
                       TEXT=`grep "$RES" "$PERM_LOC/commands.txt"`
                       if [ -n "$TEXT" ]; then
                               NUM=`expr $NUM + 1`
                               SUB_FILE=`echo "$ID-$command-$NUM-$TSMP-$USER@$IP.txt"`
                               echo -e "===============\n" > "$CMD_TARGET/$SUB_FILE"
                               echo "FILE      =   \"$SUB_FILE\"" >> "$CMD_TARGET/$SUB_FILE"
                                ### same way append 6 more lines to $SUB_FILE            

                                SUB=`echo "$WARN_ME" | grep "$command"`
                                if [ "$command" == "$VC" ]; then
                                      STATE=`echo " very critical "`
                                elif [ -z "$SUB" ]; then
                                      STATE=CRITICAL
                                else
                                       STATE=WARNING
                                fi

                                if [ "$USER" != "root" -a "$command" != "$VC" ]; then
                                       mail command &
                                elif [ "$USER" == "root" -a -z "$HOST" ]; then
                                       mail command &
                                elif [ "$USER" == "root" -a "$command" == "$VC" ]; then
                                       mail command &
                                else
                                       echo -e "some message \n" >> $LOC/operations.txt
                                fi
                       fi
             done < "$LOC/temp.txt"
    done < "$LOC/check.txt"
 fi

Can anyone help me optimize this code, whether by splitting it up, changing the logic, using functions, or anything else?

I have to use a shell script only, and the script must not use more than 3 GB of RAM on the server where it runs.

Any help would be very useful.

Best Answer

Oh my!

I can see why it takes forever to run: you're repeating operations, not caching information, and pretty much beating the computer to death. Poor Computer. :(

Awk is no lightweight, and you're invoking it many, many times over the same data. I was able to run it once and set all five variables.
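
To make the "split once, reuse the pieces" idea concrete, here is a minimal sketch that goes one step further and lets bash itself split the record on double quotes, so no external process is started at all. The sample record is hypothetical (the real layout of id_processing.txt is not shown in the question); the only thing taken from the script is that the wanted values sit in quote-delimited fields 2, 10, 18, 20 and 22. USER_NAME is used here instead of USER purely to avoid clobbering the login variable in the demo.

```shell
#!/usr/bin/env bash
# Hypothetical sample record; attribute names A2..A8 are placeholders.
record='<REC TS="2014-01-01 10:00:00" A2="x" A3="x" A4="x" USER="root" A6="x" A7="x" A8="x" HOST="db01" IP="" DB="sales"/>'

# Split on double quotes once, inside the shell (no awk, grep or sed).
# Empty values such as IP="" survive as empty array elements.
IFS='"' read -r -a f <<< "$record"

# Array indexes are zero-based, so awk's $2 is f[1], $10 is f[9], and so on.
CONN_TSMP=${f[1]}
USER_NAME=${f[9]}
HOST=${f[17]}
IP=${f[19]}
DB=${f[21]}

# Same fallback as the original script: use the host when the IP is empty.
if [ -z "$IP" ]; then
    IP=$HOST
fi

echo "$CONN_TSMP|$USER_NAME|$HOST|$IP|$DB"
```

With 25,000 iterations, every external command removed from the loop body saves 25,000 process launches, which is usually where the hours go.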

Without knowing what this is supposed to be doing or accomplishing, there's only so much that can be done.

Considering that ALL of the processing is greps, awks, seds and trs, you could get an impressive speed boost by writing this script in Perl. Perl was designed to handle text and reports, and it can do all of that grep/awk/sed/tr work internally without repeatedly shelling out to another program.

But here are some improvements:

if [ -s "$LOC/check.txt" ]; then

    while read -r line; do
        echo " started processing ${line} at $(date) " >> "${SCRIPT_LOC}/running_status.txt"
        ID=$(echo "$line" | tr -d '"')
        # are you sure you don't want the FIRST match?  This will give ALL the matches,
        # which will prevent you from getting good values for the variables
        # to only get first entry that matches:
        # TST=$(grep --max-count=1 -w "$line" "$PERM_LOC/id_processing.txt")
        # (or -m 1, but long options document what you're doing better)
        TST=$(grep -w "$line" "$PERM_LOC/id_processing.txt")
        # awk runs ONCE and prints the five fields separated by "|"
        # (this assumes the values themselves never contain a "|")
        VARS=$(echo "${TST}" | awk -F '"' -v OFS='|' '{print $2, $10, $18, $20, $22}')
        #                                                 CONN_TSMP USER HOST IP  DB
        # read splits on "|" and keeps empty fields empty, so the -z tests below
        # still work; if TST held several matches, only the first line is used
        IFS='|' read -r CONN_TSMP USER HOST IP DB <<< "$VARS"

        if [ -z "$IP" ]; then
            IP="$HOST"
        fi

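        FILE="${ID}-${CONN_TSMP}-${USER}@${IP}.txt"

        # root sessions without a DB get their own directories, as in the original
        if [ "$USER" == "root" ] && [ -z "$DB" ]; then
            TARGET="/data1/sessions/root_sec"
            CMD_TARGET="/data1/commands/root_commands"
        else
            TARGET="/data1/sessions/user_sec"
            CMD_TARGET="/data1/commands/user_commands"
        fi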

        # a built-in file test checks existence without forking ls
        # (and without printing the file name on every iteration)
        if [ ! -e "$TARGET/$FILE" ]; then
            # awk can likely do the print and the removal of </> characters in
            # one pass (my awk-fu is weak this morning)
            echo "$TST" | awk -F 'STATUS="0"' '{print $2}'| sed "s/[</>]//g" >> "$TARGET/$FILE"
            echo -e "\n" >> "$TARGET/$FILE"
        fi

        # ALWAYS quote your values, embedded spaces will bite you!
        grep "$line" "$LOC/out.txt" > "$LOC/temp.txt"

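        while read -r val; do
            TSMP=$(echo "$val" | awk -F '"' '{print $2}')
            QUERY=$(echo "$val" | awk -F 'SQLTEXT=' '{print $2}' | tr -d '/')
            echo " TIMESTAMP=$TSMP " >> "$TARGET/$FILE"
            echo " QUERY=$QUERY " >> "$TARGET/$FILE"
            # look up only the first word of the query (quotes stripped), as the
            # original script did; plain bash, so no extra awk/sed processes
            read -r RES _ <<< "$QUERY"
            RES=${RES//\"/}
            TEXT=$(grep "$RES" "$PERM_LOC/commands.txt")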
            if [ -n "$TEXT" ]; then
                # arithmetic expansion is built into the shell, so no expr process is forked
                NUM=$((NUM + 1))
                SUB_FILE="$ID-$command-$NUM-$TSMP-$USER@$IP.txt"
                echo -e "===============\n" > "$CMD_TARGET/$SUB_FILE"
                echo "FILE      =   \"$SUB_FILE\"" >> "$CMD_TARGET/$SUB_FILE"
                ### same way append 6 more lines to $SUB_FILE

                SUB=$(echo "$WARN_ME" | grep "$command")
                if [ "$command" == "$VC" ]; then
                    STATE=" very critical "
                elif [ -z "$SUB" ]; then
                    STATE=" CRITICAL "
                else
                    STATE=" WARNING "
                fi

                if [ "$USER" != "root" -a "$command" != "$VC" ]; then
                    # this should probably be $command instead of command?
                    # oh wait, probably a placeholder statement
                    mail command &
                elif [ "$USER" == "root" -a -z "$HOST" ]; then
                    mail command &
                elif [ "$USER" == "root" -a "$command" == "$VC" ]; then
                    mail command &
                else
                    echo -e "some message \n" >> "$LOC/operations.txt"
                fi
            fi
        done < "$LOC/temp.txt"
    done < "$LOC/check.txt"
fi

Hmm, "shell script only". Well, with that in mind, perhaps you could pre-grep "$LOC/check.txt" and/or "$LOC/temp.txt" so that you could use the 'already grepped' output instead of grepping in the loop.
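
One way to read the pre-grep suggestion, sketched under assumptions: filter out.txt once, before the loop, down to the lines that mention any ID listed in check.txt, and point the per-ID grep at that smaller file. The file names mirror the question, but the contents and the out_filtered.txt name are made up for the demo, and -F assumes the IDs can be matched as fixed strings rather than regular expressions.

```shell
#!/usr/bin/env bash
# Demo data in a scratch directory; the record layout is hypothetical.
work=$(mktemp -d)
printf '%s\n' '"101"' '"103"' > "$work/check.txt"
printf '%s\n' \
  'ID="101" TS="t1" SQLTEXT="select 1"' \
  'ID="102" TS="t2" SQLTEXT="drop table a"' \
  'ID="103" TS="t3" SQLTEXT="delete from b"' \
  'ID="101" TS="t4" SQLTEXT="drop table c"' > "$work/out.txt"

# One pass before the loop: keep only lines that mention an ID from check.txt.
# -F treats each ID as a fixed string, -f reads the patterns from a file.
grep -F -f "$work/check.txt" "$work/out.txt" > "$work/out_filtered.txt"

# Inside the loop, the per-ID grep now scans the smaller file.
kept=$(grep -c '' "$work/out_filtered.txt")           # lines that survived
for_101=$(grep -c -F '"101"' "$work/out_filtered.txt") # lines for one ID
echo "kept=$kept for_101=$for_101"

rm -rf "$work"
```

This only pays off when out.txt contains many lines for sessions that are not in check.txt; if nearly every line belongs to a listed ID, the filtered file is almost as large as the original.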

The more I look at it, the more convinced I am that awk could likely do all this work in a single pass through the data... AND process EVERY entry, not just the first one (as I pointed out in the comments; to handle every match you really need another loop between the "read line" and "read val" loops).

It'd be a long awk script, but definitely doable. And awk is worth knowing: take a moment and play with it. It's not that difficult, just different. Grok Awk!
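
As a rough starting point for such a script, here is a small sketch of the single-pass shape: awk loads the wanted IDs from check.txt into an array, then reads out.txt exactly once and appends the timestamp and query to one file per session. The record layout, the field positions ($2, $4, $6) and the per-ID output file names are assumptions made for the demo; the mail and command-file logic from the real script is left out.

```shell
#!/usr/bin/env bash
# Demo data in a scratch directory; the record layout is hypothetical.
work=$(mktemp -d)
printf '%s\n' '"101"' '"103"' > "$work/check.txt"
printf '%s\n' \
  'ID="101" TS="t1" SQLTEXT="select 1"' \
  'ID="102" TS="t2" SQLTEXT="drop table a"' \
  'ID="103" TS="t3" SQLTEXT="delete from b"' \
  'ID="101" TS="t4" SQLTEXT="drop table c"' > "$work/out.txt"

awk -F '"' -v outdir="$work" '
    # First file (check.txt): remember every wanted ID with its quotes stripped.
    NR == FNR { gsub(/"/, "", $0); wanted[$0] = 1; next }
    # Second file (out.txt): in this made-up layout $2 is the ID,
    # $4 the timestamp and $6 the SQL text.
    ($2 in wanted) {
        file = outdir "/" $2 ".txt"
        print " TIMESTAMP=" $4 " " >> file
        print " QUERY=" $6 " "     >> file
        close(file)   # avoid running out of open file handles
    }
' "$work/check.txt" "$work/out.txt"

lines_101=$(grep -c '' "$work/101.txt")   # two records, two lines each
lines_103=$(grep -c '' "$work/103.txt")
if [ -e "$work/102.txt" ]; then skipped=no; else skipped=yes; fi
echo "101=$lines_101 103=$lines_103 102_skipped=$skipped"

rm -rf "$work"
```

The NR == FNR test only works when check.txt is non-empty, which the `-s` check at the top of the original script already guarantees. Memory use stays small because only the ID list is held in the array, not the log itself.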
