Bash – Preserving Spaces in Array Values and Trimming Sort Command Results

array bash quoting shell variable

In the context of searching for expressions in log files, I was wondering whether there is a general way to quantify and qualify the contents of the logs in /var/log. In particular, does the lexicon used to describe error conditions correspond to general expectations, and how does the terminology vary across logs? To investigate this and validate those expectations, I made this little script which checks, counts and sorts a few things:

#!/bin/bash
# Meh - In /var/log, (1) for each file count lines which
# contain matches for expressions in the set; then, 
# (2), iterate files again and sort (all) words in a file  
# by descending frequency.

FILES="/var/log/* /var/log/**/* /var/log/**/**/*"
WORDS=(error fail.* wrong bad break panic abort.* disaster problem issue 'couldn'\''t' 'didn'\''t' 'wasn'\''t' 'shouldn'\''t' 'isn'\''t' 'don'\''t' "is'\ 'not" 'did\ not' die.* crash.* dump.* seg.* bug.* report.* status)

# Header section 1
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -
printf 'Matches for expressions in set\n' 
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -

# (1)
for f in $FILES
do
    for w in ${WORDS[@]} 
    do 
        echo -n "$f [$w]:"
        grep -aci $w $f 2>/dev/null
    done
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -
done

# Header section 2
printf 'Sorted occurrences for all words in a file\n'
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -

# (2)
for f in $FILES
do
    echo "[total number of lines: $(wc -l $f 2>/dev/null)]"
    cat $f 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' | sort -f | uniq -ci | sort -fnr
    printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -
done

exit

However rough and flawed, it achieves the purpose of giving me a glimpse into frequency and validating the set to some extent. For instance, here's some sample output for dmesg on my system:

/var/log/dmesg [error]:27
/var/log/dmesg [fail.*]:7
/var/log/dmesg [wrong]:0
/var/log/dmesg [bad]:0
/var/log/dmesg [break]:0
/var/log/dmesg [panic]:1
/var/log/dmesg [abort.*]:0
/var/log/dmesg [disaster]:0
/var/log/dmesg [problem]:0
/var/log/dmesg [issue]:0
/var/log/dmesg [couldn't]:2
/var/log/dmesg [didn't]:0
/var/log/dmesg [wasn't]:0
/var/log/dmesg [shouldn't]:0
/var/log/dmesg [isn't]:0
/var/log/dmesg [don't]:0
/var/log/dmesg [is'\]:/var/log/dmesg ['not]:0     <------
/var/log/dmesg [did\]:/var/log/dmesg [not]:14     <------
/var/log/dmesg [die.*]:0
/var/log/dmesg [crash.*]:0
/var/log/dmesg [dump.*]:0
/var/log/dmesg [seg.*]:0
/var/log/dmesg [bug.*]:3
/var/log/dmesg [report.*]:1
/var/log/dmesg [status]:28

    [total number of lines: 1059 /var/log/dmesg]
      14784 
        306 usb
        220 pci
        133 x
        128 hub
        116 acpi
        113 d
        109 mem
         95 a
         94 uhci
         76 hcd
         76 device
         73 bus
         56 io
         55 to
         54 port
         54 e
         54 ata
         53 c
         51 power
         48 interface
         47 registered
         46 ehci
         40 system
         40 for
         38 new
         37 sda
         37 bridge
         36 on
         36 irq
         34 type
         34 reset
         34 probe
         34 nouveau
         33 v
         33 sd
         31 reserved
         30 memory
         29 f
         27 ports
         27 found
         27 error
         27 and
         26 resource
         26 reg
         26 input
         26 driver
         25 id
         25 i
         23 window
         23 disabled
         22 xc
         22 status
         22 from
         22 drm
         22 bit
         ...

It doesn't come as a big surprise that "error" is the top "problem word" on my setup or that lots of hardware related entries appear in dmesg.

Questions

  • How do you preserve the space in the value of the literal expressions made out of two words ("is not", "did not") in the array value list? I tried 'did'\040'not' or "did'\ 'not" and many variations. I'm uncertain how to apply the info from this Q&A.
  • As the second loop (2) ends up outputting every unique occurrence of a word in a log, how can its scope be restricted, or how should I discard:
    • the output consisting of single chars?
    • the output consisting of single occurrences?
  • Are there any recommendations for visually presenting the output, or is there a tool which provides further functionality for either searching or formatting count and word data?

Best Answer

When using shell variables, you can preserve space characters (more precisely, prevent values from being split into words on the field separator characters enumerated in the $IFS shell variable) by surrounding the expansions with double quotes.

for w in "${WORDS[@]}" 
do 
  echo -n "$f [$w]:"
  grep -aci "$w" $f 2>/dev/null
done

(It wouldn't hurt to surround $f with quotes, too, in case you encounter filenames with spaces.)
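
For the literal two-word expressions themselves, no escaping is needed inside the array: a single-quoted element keeps its internal space, provided the expansion is also quoted as above. A minimal sketch (the word list is shortened here for illustration):

WORDS=(error fail.* panic 'is not' 'did not' "couldn't")

for w in "${WORDS[@]}"
do
  # each element reaches grep as a single argument, space included
  echo -n "$f [$w]:"
  grep -aci "$w" "$f" 2>/dev/null
done

With the quoted expansion, grep receives "is not" and "did not" intact, so the split output marked with arrows above goes away.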

As the second loop (2) ends up outputting every unique occurrence of a word in a log, how can its scope be restricted, or how should I discard:

the output consisting of single chars?

Add grep .. to the pipeline to keep only lines with two or more characters.

the output consisting of single occurrences?

Add -d to the uniq in the pipeline, so that it will only show duplicate lines.

cat $f 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' | sort -f | grep .. | uniq -dci | sort -fnr
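
If you later want a higher cut-off than "occurs more than once", one option (just a sketch, not the only way) is to drop uniq's -d and filter on the count column with awk instead; the threshold below is a made-up example value:

min=3   # hypothetical threshold: keep words occurring at least this many times
cat "$f" 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' |
  sort -f | grep .. | uniq -ci | awk -v min="$min" '$1 + 0 >= min' | sort -fnr

uniq -ci prints the count as the first field of each line, so the awk filter simply keeps lines whose count reaches the threshold, while grep .. still removes the single-character words.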

Are there any recommendations for visually presenting the output, or is there a tool which provides further functionality for either searching or formatting count and word data?

There are a bunch of applications out there that will scan and summarize interesting occurrences in log files, some free, some commercial. I'm not sure we're allowed to give broad recommendations, but if you can give examples of queries you'd like to make or output formats you'd like to see, maybe we can answer those types of questions.
