Bash – Preserving Spaces in Array Values and Trimming Sort Command Results

array bash quoting shell variable

In the context of searching for expressions in log files, I was wondering whether there is a general way to quantify and qualify the contents of the logs in /var/log. In particular, does the lexicon used to describe error conditions correspond to general expectations, and how does the terminology vary across logs? To investigate this and validate those expectations, I made this little script which checks, counts and sorts a few things:

#!/bin/bash
# Meh - In /var/log, (1) for each file count lines which
# contain matches for expressions in the set; then, 
# (2), iterate files again and sort (all) words in a file  
# by descending frequency.

FILES="/var/log/* /var/log/**/* /var/log/**/**/*"
WORDS=(error fail.* wrong bad break panic abort.* disaster problem issue 'couldn'\''t' 'didn'\''t' 'wasn'\''t' 'shouldn'\''t' 'isn'\''t' 'don'\''t' "is'\ 'not" 'did\ not' die.* crash.* dump.* seg.* bug.* report.* status)

# Header section 1
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -
printf 'Matches for expressions in set\n' 
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -

# (1)
for f in $FILES
do
    for w in ${WORDS[@]} 
    do 
        echo -n "$f [$w]:"
        grep -aci $w $f 2>/dev/null
    done
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -
done

# Header section 2
printf 'Sorted occurrences for all words in a file\n'
printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -

# (2)
for f in $FILES
do
    echo "[total number of lines: $(wc -l $f 2>/dev/null)]"
    cat $f 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' | sort -f | uniq -ci | sort -fnr
    printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' -
done

exit

However rough and flawed, it achieves the purpose of giving me a glimpse into frequency and validating the set to some extent. For instance, here's some sample output for dmesg on my system:

/var/log/dmesg [error]:27
/var/log/dmesg [fail.*]:7
/var/log/dmesg [wrong]:0
/var/log/dmesg [bad]:0
/var/log/dmesg [break]:0
/var/log/dmesg [panic]:1
/var/log/dmesg [abort.*]:0
/var/log/dmesg [disaster]:0
/var/log/dmesg [problem]:0
/var/log/dmesg [issue]:0
/var/log/dmesg [couldn't]:2
/var/log/dmesg [didn't]:0
/var/log/dmesg [wasn't]:0
/var/log/dmesg [shouldn't]:0
/var/log/dmesg [isn't]:0
/var/log/dmesg [don't]:0
/var/log/dmesg [is'\]:/var/log/dmesg ['not]:0     <------
/var/log/dmesg [did\]:/var/log/dmesg [not]:14     <------
/var/log/dmesg [die.*]:0
/var/log/dmesg [crash.*]:0
/var/log/dmesg [dump.*]:0
/var/log/dmesg [seg.*]:0
/var/log/dmesg [bug.*]:3
/var/log/dmesg [report.*]:1
/var/log/dmesg [status]:28

    [total number of lines: 1059 /var/log/dmesg]
      14784 
        306 usb
        220 pci
        133 x
        128 hub
        116 acpi
        113 d
        109 mem
         95 a
         94 uhci
         76 hcd
         76 device
         73 bus
         56 io
         55 to
         54 port
         54 e
         54 ata
         53 c
         51 power
         48 interface
         47 registered
         46 ehci
         40 system
         40 for
         38 new
         37 sda
         37 bridge
         36 on
         36 irq
         34 type
         34 reset
         34 probe
         34 nouveau
         33 v
         33 sd
         31 reserved
         30 memory
         29 f
         27 ports
         27 found
         27 error
         27 and
         26 resource
         26 reg
         26 input
         26 driver
         25 id
         25 i
         23 window
         23 disabled
         22 xc
         22 status
         22 from
         22 drm
         22 bit
         ...

It doesn't come as a big surprise that "error" is the top "problem word" on my setup or that lots of hardware related entries appear in dmesg.

Questions

  • How do you preserve the space in the value of the literal expressions made out of two words ("is not", "did not") in the array value list? I tried 'did'\040'not' or "did'\ 'not" and many variations. I'm uncertain how to apply the info from this Q&A.
  • As the second loop (2) ends up outputting every unique occurrence of a word in a log, how can its scope be restricted, or how should I discard:
    • the output consisting of single chars?
    • the output consisting of single occurrences?
  • Are there any recommendations for visually presenting the output, or is there a tool which provides further functionality for either searching or formatting count and word data?

Best Answer

When using shell variables, you can preserve space characters (more precisely, prevent values from being split into words on the field separator characters enumerated in the $IFS shell variable) by surrounding the expansions with double quotes.

for w in "${WORDS[@]}" 
do 
  echo -n "$f [$w]:"
  grep -aci "$w" $f 2>/dev/null
done

(It wouldn't hurt to surround $f with quotes, too, in case you encounter filenames with spaces.)
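
For the literal two-word expressions themselves, no escaping is needed inside the array: a single-quoted element keeps its internal space, provided the expansion is also quoted as above. A minimal sketch (the word list is shortened here for illustration):

WORDS=(error fail.* panic 'is not' 'did not' "couldn't")

for w in "${WORDS[@]}"
do
  # each element reaches grep as a single argument, space included
  echo -n "$f [$w]:"
  grep -aci "$w" "$f" 2>/dev/null
done

With the quoted expansion, grep receives "is not" and "did not" intact, so the split output marked with arrows above goes away.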

As the second loop (2) ends up outputting every unique occurrence of a word in a log, how can its scope be restricted, or how should I discard:

the output consisting of single chars?

Add grep .. to the pipeline to keep only lines with two or more characters.

the output consisting of single occurrences?

Add -d to the uniq in the pipeline, so that it will only show duplicate lines.

cat $f 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' | sort -f | grep .. | uniq -dci | sort -fnr
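
If you later want a higher cut-off than "occurs more than once", one option (just a sketch, not the only way) is to drop uniq's -d and filter on the count column with awk instead; the threshold below is a made-up example value:

min=3   # hypothetical threshold: keep words occurring at least this many times
cat "$f" 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' |
  sort -f | grep .. | uniq -ci | awk -v min="$min" '$1 + 0 >= min' | sort -fnr

uniq -ci prints the count as the first field of each line, so the awk filter simply keeps lines whose count reaches the threshold, while grep .. still removes the single-character words.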

Are there any recommendations for visually presenting the output, or is there a tool which provides further functionality for either searching or formatting count and word data?

There are a bunch of applications out there that will scan and summarize interesting occurrences in log files, some free, some commercial. I'm not sure we're allowed to give broad recommendations, but if you can give examples of queries you'd like to make or output formats you'd like to see, maybe we can answer those types of questions.
