[edit: clarified that I need an in-awk solution, and corrected that I need to sort 'indexes' (or rather, output them in a sorted way) instead of the ambiguous 'values']
In awk, I often count things, or store a set of values, inside an array, using the values as indices (taking advantage of awk's indexes_are_hashes mechanism).
For example: if I want to know how many different values of $2 I encountered, and how often each value was seen:
awk '
... several different treatments ...
{ count[$2]++ }
... other treatments ...
END { for(str in count) {
print "counted: " str " : " count[str] " times."
... and other lines underneath, with additional infos ...
}
}
'
The problem is that regular awk (and regular nawk), unlike GNU awk or other nicer versions:
- [A] doesn't output the different values in the order it 'encountered' them,
- [B] nor provides an easy way to go through the indexes in either numerical or alphabetical order.
For [A]: not too difficult to do; just have another array to index the "newly seen" entries.
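A minimal sketch of that [A] workaround, assuming one value per input line; the array name order, the counter n and the sample input are illustrative choices, not part of my real script:

```shell
# Hypothetical sample input: the values c a c b a c, one per line.
first_seen=$(printf 'c\na\nc\nb\na\nc\n' | awk '
    !($0 in count) { order[++n] = $0 }   # record each index once, on first sight
    { count[$0]++ }
    END {
        # walk the indexes in arrival order instead of using for (str in count)
        for (i = 1; i <= n; i++)
            printf "%s%s:%d", (i > 1 ? " " : ""), order[i], count[order[i]]
        print ""
    }')
printf '%s\n' "$first_seen"
```

This only preserves arrival order; it does nothing for the sorted output asked about in [B].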
The QUESTION is for [B]: how can I do a simple call to sort to reorder the display of the different indexes?
(Note: I am aware that GNU awk has an "easy" way for [B]: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html … But I want a way to do something similar in regular awk/nawk!)
(i.e.: I need to loop over the different indexes seen, sort them, and re-read them [in an old awk…] into "something" (e.g. another array ordered_seen?), then use that something to display the seen[s] in the chosen order. And this needs to happen inside awk, because under each index I often need to output a paragraph of additional info. A "sort" outside of awk would reorder everything.)
So far: I have found no "idiomatic" one-liner (or n-liner?) way to do that.
I end up with a kludge that takes several lines, outputs each value to a file through sort, then re-reads that sorted file and inserts each line in order into sorted_countindexes[n++], and then for(i=0;i<n;i++){ …output count[sorted_countindexes[i]]… }
I'd welcome a better/simpler/more "idiomatic" way to output indexes according to a sort, for regular awk (or nawk).
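For reference, the kludge described above looks roughly like this. It is only a sketch: the scratch file from mktemp (not strictly POSIX, but widely available), the variable names and the sample input are assumptions, and it breaks if an index contains a newline.

```shell
tmp=$(mktemp)    # hypothetical scratch file for the sorted indexes
sorted=$(printf 'c\na\nc\nb\na\nc\n' | awk -v tmp="$tmp" '
    { count[$0]++ }
    END {
        cmd = "sort > \"" tmp "\""
        for (str in count) print str | cmd     # send every index through sort
        close(cmd)                             # wait until sort has written the file
        n = 0
        while ((getline line < tmp) > 0)       # re-read the indexes, now in order
            sorted_countindexes[n++] = line
        close(tmp)
        for (i = 0; i < n; i++)
            printf "%s%s:%d", (i ? " " : ""), sorted_countindexes[i], count[sorted_countindexes[i]]
        print ""
    }')
rm -f "$tmp"
printf '%s\n' "$sorted"
```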
MCVE: here is a simple example : outputting the indexes in alphabetical order would be really nice:
# create the 2 basic files to be parsed by the awk:
printf 'a b a a a c c d e s s s s e f s a e r r f\ng f r e d e z z c s d r\n' >fileA
printf 's f g r e d f g e z s d v f e z a d d g r f e a\ns d f e r\n'>fileB
# and the awk loop: It outputs in 'whatever order', I want in 'alphabetical order'
for f in file? ; do printf 'for file: %s: ' "$f"
tr ' ' '\n' < "$f" | awk '
{ count[$0]++ }
END { for(str in count){
printf("%s:%d ",str,count[str])
}; print ""
} '
done
#this outputs:
for file: fileA: d:3 e:5 f:3 g:1 r:4 s:6 z:2 a:5 b:1 c:3
for file: fileB: d:5 e:5 f:5 g:3 r:3 s:3 v:1 z:2 a:2
# I'd like to have the letters outputted in alphabetical order instead!
Best Answer
The above just builds a newline-separated string from the array indices (quoting it appropriately for sh), creates a shell script that pipes that string to sort, and then loops on the output. If you want to modify sort's behavior just add a string of Unix sort arguments to the sort function call, e.g. sort(seen,"-fu"). It could obviously be modified to print or do whatever else you want inside the sort() function instead of populating an array of indices for you to loop on when it returns, if that's what you prefer, but then the function isn't as cohesive.
Note however that it will be limited to the maximum command line length on your system.
The \047s in the code represent 's, which the shell does not allow to be included in '-delimited strings or scripts. So while we could use ' directly in an awk script being read from a file, as I'm doing above, if you were to use that script on the command line as awk 'script' file you'd need to use something instead of ', and \047 works both when the script is interpreted from the command line and from a file, so it's the most portable choice of '-replacement.
The 's (\047s) are present to quote str in a way that ensures that the shell doesn't expand variables, have mismatched quotes, etc. when the string is being piped to sort, i.e. they hand the shell a fully single-quoted string rather than something vulnerable/buggy.
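The difference can be sketched as follows. This is an illustration only: the sample string and the variable names quoted and unquoted are hypothetical, and the second command is built with double quotes purely to show what goes wrong.

```shell
str="it's \$HOME"    # hypothetical index holding a quote and a dollar sign

# Escaped and wrapped in \047 as described: the shell passes the text through untouched.
quoted=$(printf '%s\n' "$str" | awk '{
    gsub(/\047/, "\047\\\047\047")              # each quote becomes quote-backslash-quote-quote
    cmd = "printf \047%s\\n\047 \047" $0 "\047"
    cmd | getline result; close(cmd)
    print result
}')

# Wrapped in double quotes instead: the shell expands $HOME before printf sees it.
unquoted=$(printf '%s\n' "$str" | awk '{
    cmd = "printf \047%s\\n\047 \"" $0 "\""
    cmd | getline result; close(cmd)
    print result
}')

printf 'quoted:   %s\nunquoted: %s\n' "$quoted" "$unquoted"
```

The first result is identical to the original string; the second has the variable expanded by the shell, which is the kind of bug the \047 quoting avoids.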