When using shell variables, you can preserve space characters (more precisely, prevent values from being split into words on the field-separator characters listed in the $IFS shell variable) by surrounding the variable expansions with double quotes.
for w in "${WORDS[@]}"
do
echo -n "$f [$w]:"
grep -aci "$w" $f 2>/dev/null
done
(It wouldn't hurt to surround $f
with quotes, too, in case you encounter filenames with spaces.)
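A quick way to see the difference word splitting makes (the variable name w and its value here are just for illustration):

```shell
# Unquoted expansion is split on $IFS; quoted expansion is not.
w="two words"
set -- $w        # unquoted: splits into two positional parameters
echo "$#"        # prints 2
set -- "$w"      # quoted: stays a single parameter
echo "$#"        # prints 1
```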
As the second loop (2) ends up outputting every unique occurrence of a word in a log, how can its scope be restricted, or how should I discard:
the output consisting of single chars?
Add grep .. to the pipeline to include only lines with 2 or more characters.
the output consisting of single occurrences?
Add -d to the uniq in the pipeline, so that it will only show duplicate lines.
cat "$f" 2>/dev/null | tr -c '[:alnum:]' '[\n*]' | tr -d '[:digit:]' | sort -f | grep .. | uniq -dci | sort -fnr
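To see what the two added filters do, here is a toy run on made-up input (the sample words are hypothetical):

```shell
# 'grep ..' drops the 1-character line; 'uniq -dci' keeps only
# case-insensitive duplicates, prefixed with their count.
printf 'a\nerror\nError\nwarn\n' | grep .. | sort -f | uniq -dci
# 'a' (too short) and 'warn' (not repeated) are discarded;
# only the duplicated 'error' remains, counted as 2.
```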
Are there any recommendations for visually presenting the output, or is there a tool which provides further functionality for either searching or formatting count and word data?
There are a bunch of applications out there that will scan and summarize interesting occurrences in log files, some free, some commercial. I'm not sure we're allowed to give broad recommendations, but if you can give examples of queries you'd like to make or output formats you'd like to see, maybe we can answer those types of questions.
Your system should have GNU grep, which has a -P option to use Perl regular expressions; you can use that, combined with -c (so there is no need for wc -l):
grep -Pvc '\S' somefile
The '\S' hands the pattern \S to grep and matches all lines containing anything that is not whitespace; -v selects all the other lines (those containing only whitespace, or nothing at all), and -c counts them.
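If -P is not available (it is a GNU extension, and marked experimental), the same count can be obtained with a POSIX bracket expression instead of PCRE:

```shell
# POSIX-portable equivalent of grep -Pvc '\S': count the lines that
# contain no non-whitespace character (empty or whitespace-only lines).
grep -cv '[^[:space:]]' somefile
```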
From the man page for grep:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE, see
below). This is highly experimental and grep -P may warn of
unimplemented features.
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v
is specified by POSIX.)
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines. (-c is specified by POSIX.)
Best Answer
wc counts over the whole file; you can use awk to process the input line by line (not counting the line delimiter), or, since awk is mostly a superset of grep, use an awk pattern much as you would a grep pattern. (Note that some awk implementations report the number of bytes (like wc -c) as opposed to the number of characters (like wc -m), and others will count bytes that don't form part of valid characters in addition to the characters, whereas wc -m would ignore them in most implementations.)
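The awk examples referred to here appear to have been lost in formatting; a plausible sketch of both (the filename somefile is assumed, matching the grep example above) is:

```shell
# Hypothetical reconstruction: print the length of each line in turn
# (length defaults to $0, and excludes the line delimiter).
awk '{print length}' somefile

# And, using an awk pattern the way grep uses a regexp, count the
# whitespace-only lines as the grep -Pvc example does.
awk '!/[^[:space:]]/ {n++} END {print n+0}' somefile
```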