awk – Easily Sort Array Indexes to Output in Chosen Order

awk non-gnu

[edit: clarified that I need an in-awk solution, and corrected that I need to sort 'indexes' (or rather, output them in a sorted way) instead of the ambiguous 'values']

In awk, I often count things, or store a set of values, inside an array, using the values as indices (taking advantage of awk's indexes_are_hashes mechanism)

For example, if I want to know how many different values of $2 I encountered, and how often each value was seen:

awk '
   ... several different treatments ...
   { count[$2]++ } 
   ... other treatments ...
   END { for(str in count) { 
           print "counted: " str " : " count[str] " times." 
           ... and other lines underneath, with additional infos ...
          }
       }
 '

The problem is that regular awk and regular nawk (unlike GNU awk or other nicer versions):

  • [A] don't output the different values in the order in which they were 'encountered',
  • [B] nor provide an easy way to go through the indexes in either numerical or alphabetical order

For [A]: not too difficult to do, just keep another array that indexes the "newly seen" entries.
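For reference, the "another array" idea for [A] could look roughly like the sketch below. It only assumes POSIX awk features; the names count_in_seen_order, order and n are mine, purely for illustration.

```shell
# Sketch: print counts in first-seen order using a second array.
# The names "order" and "n" are illustrative, not from the post above.
count_in_seen_order() {
    awk '
        !($0 in count) { order[++n] = $0 }   # remember each new value once
        { count[$0]++ }
        END {
            for (i = 1; i <= n; i++)
                printf "%s:%d%s", order[i], count[order[i]], (i < n ? " " : "\n")
        }
    '
}

printf 'c\na\nc\nb\na\nc\n' | count_in_seen_order
# prints: c:3 a:2 b:1
```

The order[] array records each value the first time it shows up, so the END loop can walk 1..n instead of relying on the unspecified order of for (str in count).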

The QUESTION is for [B]: how can I make a simple call to sort to reorder the display of the different indexes?

(note : I am aware that gnu awk has an "easy" way for [B]: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html … But I want the way to do something similar in regular awk/nawk !)

(i.e. I need to loop to output the different indexes seen, sort them, and re-read them [in an old awk…] into "something" (e.g. another array ordered_seen?), then use that something to display the seen[s] in the chosen order. And this needs to happen inside awk, as under each index I often need to output a paragraph of additional info. A "sort" outside of awk would reorder everything.)

So far: I find no "axiomatic" one-liner (or n-liner?) way to do that.

I end up with a kludge that takes several lines, outputs each value to a file through sort, then re-reads that sorted file and inserts each line in order into sorted_countindexes[n++], and then for(i=0;i<n;i++){ …output count[sorted_countindexes[i]]… }

I'd welcome a better/simpler/more "axiomatic" way to output indexes according to a sort, for regular awk (or nawk).

MCVE: here is a simple example : outputting the indexes in alphabetical order would be really nice:

# create the 2 basic files to be parsed by the awk:
printf 'a b a a a c c d e s s s s e f s a e r r f\ng f r e d e z z c s d r\n' >fileA
printf 's f g r e d f g e z s d v f e z a d d g r f e a\ns d f e r\n'>fileB
# and the awk loop: It outputs in 'whatever order', I want in 'alphabetical order'
for f in file? ; do printf 'for file: %s: ' "$f"
  tr ' ' '\n' < "$f" | awk ' 
       { count[$0]++ } 
   END { for(str in count){ 
           printf("%s:%d ",str,count[str]) 
          }; print "" 
       } '
done
#this outputs:
for file: fileA: d:3 e:5 f:3 g:1 r:4 s:6 z:2 a:5 b:1 c:3
for file: fileB: d:5 e:5 f:5 g:3 r:3 s:3 v:1 z:2 a:2
# I'd like to have the letters outputted in alphabetical order instead!

Best Answer

$ cat tst.awk
{ cnt[$0]++ }
END {
    n = sort(cnt,idxs)
    for (i=1; i<=n; i++) {
        idx = idxs[i]
        printf "%s:%d%s", idx, cnt[idx], (i<n ? OFS : ORS)
    }

}

function sort(arr, idxs, args,      i, str, cmd, idx) {
    for (i in arr) {
        gsub(/\047/, "\047\\\047\047", i)
        str = str i ORS
    }

    cmd = "printf \047%s\047 \047" str "\047 |sort " args

    i = 0
    while ( (cmd | getline idx) > 0 ) {
        idxs[++i] = idx
    }

    close(cmd)

    return i
}

# create the 2 basic files to be parsed by the awk:
printf 'a b a a a c c d e s s s s e f s a e r r f\ng f r e d e z z c s d r\n' >fileA
printf 's f g r e d f g e z s d v f e z a d d g r f e a\ns d f e r\n'>fileB

for f in fileA fileB ; do
    printf 'for file: %s: ' "$f"
    tr ' ' '\n' < "$f" |
    awk -f tst.awk
done
for file: fileA: a:5 b:1 c:3 d:3 e:5 f:3 g:1 r:4 s:6 z:2
for file: fileB: a:2 d:5 e:5 f:5 g:3 r:3 s:3 v:1 z:2

The above just builds a newline-separated string from the array indices (quoting it appropriately for sh), creates a shell command that pipes that string to sort, and then loops on the output. If you want to modify sort's behavior, just pass a string of Unix sort options as the third argument of the sort() call, e.g. sort(cnt, idxs, "-fu"). It could obviously be modified to print, or do whatever else you want, inside the sort() function instead of populating an array of indices for you to loop on when it returns, if that's what you prefer, but then the function isn't as cohesive.
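As an illustration of passing sort options, here is a sketch that wraps the same approach in a shell function and forwards one option string to sort(). The wrapper name sorted_counts, the awk variable sortargs and the local variable key are my own additions; the sort() body otherwise follows the function above.

```shell
# Sketch: forward an option string such as -n or -r to sort(1).
# "sorted_counts", "sortargs" and "key" are illustrative names.
sorted_counts() {
    awk -v sortargs="$1" '
        { cnt[$0]++ }
        END {
            n = sort(cnt, idxs, sortargs)
            for (i = 1; i <= n; i++)
                printf "%s:%d%s", idxs[i], cnt[idxs[i]], (i < n ? OFS : ORS)
        }
        # Same idea as the sort() function above; the index is copied
        # into key before quoting, and idx is kept local.
        function sort(arr, idxs, args,      i, key, str, cmd, idx) {
            for (i in arr) {
                key = i
                gsub(/\047/, "\047\\\047\047", key)
                str = str key ORS
            }
            cmd = "printf \047%s\047 \047" str "\047 |sort " args
            i = 0
            while ((cmd | getline idx) > 0)
                idxs[++i] = idx
            close(cmd)
            return i
        }
    '
}

printf '10\n9\n100\n9\n' | sorted_counts -n
# prints: 9:2 10:1 100:1
```

Without -n the indices would be sorted as strings, so 10 and 100 would normally come before 9.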

Note, however, that the number of indices it can handle is limited by the maximum command-line length on your system, since they are all passed to the shell as a single printf argument.
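If that limit is a concern, one possible workaround is to sort the indices inside awk itself. The sketch below swaps in a plain insertion sort instead of calling sort, so it is a different technique from the function above: it takes no sort options, always compares as strings, and is quadratic in the number of indices, which should be acceptable for small sets. The names counts_sorted_in_awk and isort are hypothetical.

```shell
# Sketch: sort the indices in pure awk (insertion sort), no external sort.
# "counts_sorted_in_awk" and "isort" are illustrative names.
counts_sorted_in_awk() {
    awk '
        { cnt[$0]++ }
        END {
            n = isort(cnt, idxs)
            for (i = 1; i <= n; i++)
                printf "%s:%d%s", idxs[i], cnt[idxs[i]], (i < n ? OFS : ORS)
        }
        # Insert each index of arr into idxs[1..n], keeping idxs sorted.
        # Appending "" forces a string comparison rather than a numeric one.
        function isort(arr, idxs,      k, n, j) {
            n = 0
            for (k in arr) {
                for (j = n++; j >= 1 && (idxs[j] "") > (k ""); j--)
                    idxs[j + 1] = idxs[j]   # shift larger entries up
                idxs[j + 1] = k
            }
            return n
        }
    '
}

printf 'b\na\nc\na\n' | counts_sorted_in_awk
# prints: a:2 b:1 c:1
```

Since nothing is handed to the shell here, no quoting is needed either, at the cost of losing the flexibility of sort's options.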

The \047s in the code represent 's (single quotes, octal 047), which the shell does not allow inside '-delimited strings or scripts. We could use ' directly in an awk script that is read from a file, as I'm doing above, but if you were to use that script on the command line as awk 'script' file you'd need something else instead of '. \047 works both when the script is given on the command line and when it is read from a file, so it's the most portable replacement for '.

The 's (\047s) are present to quote str in a way that ensures the shell doesn't expand variables, run command substitutions, or trip over mismatched quotes when the string is piped to sort, i.e. they do this:

$ echo 'foo'\''bar $(ls) $HOME' | awk '{
    str=$0; gsub(/\047/, "\047\\\047\047", str); print "str="str
    cmd="printf \047%s\047 \047" str "\047"; print "cmd="cmd
}'
str=foo'\''bar $(ls) $HOME
cmd=printf '%s' 'foo'\''bar $(ls) $HOME'

so we don't get something like this instead, which is vulnerable/buggy:

$ echo 'foo'\''bar $(ls) $HOME' | awk '{
    str=$0; print "str="str
    cmd="printf \"%s\" \"" str "\""; print "cmd="cmd
}'
str=foo'bar $(ls) $HOME
cmd=printf "%s" "foo'bar $(ls) $HOME"