Less expensive version of `sort -n | uniq -c | sort -n`

Tags: awk, performance, sort, uniq

I have an unsorted list of IPs that I need to count and sort by occurrences.
I use sort -n | uniq -c | sort -n and that works well, but I'd like something less expensive… surely awk can do this?

Input

1.1.1.1
2.2.2.2
1.1.1.1
3.3.3.3
2.2.2.2
1.1.1.1

Output

3 1.1.1.1
2 2.2.2.2
1 3.3.3.3

Best Answer

With a single awk process:

awk '{ a[$1]++ }END{ for(i in a) print a[i],i }' file

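The output (awk does not specify the traversal order of `for(i in a)`, so the lines may come out in a different order):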

3 1.1.1.1
2 2.2.2.2
1 3.3.3.3

To print the records sorted by number of occurrences in descending order, use the following GNU awk approach:

awk 'BEGIN{ PROCINFO["sorted_in"]="@val_num_desc" }{ a[$1]++ }
     END{ for(i in a) print a[i],i }' file
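
Where GNU awk is not available, a portable variant is to let awk do the counting and hand only the list of distinct addresses to sort, which is usually far shorter than the raw input. The following is a minimal sketch, assuming one address per line in the first field; the function name `count_ips` is made up for illustration and is not part of the original answer.

```shell
# count_ips: hypothetical helper; reads addresses on stdin, one per line.
count_ips() {
    # awk tallies each distinct first field. The order of "for (ip in count)"
    # is unspecified, so sort orders the (short) list of distinct addresses:
    # count descending, then address as a tie-breaker.
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' |
        sort -k1,1nr -k2,2
}

printf '%s\n' 1.1.1.1 2.2.2.2 1.1.1.1 3.3.3.3 2.2.2.2 1.1.1.1 | count_ips
```

The initial `sort -n` of the original pipeline is gone entirely; the remaining sort works on one line per distinct address rather than one line per input record.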

Related Solutions

Where has the `uniq` or `sort -u` line gone, with some unicode characters

Short version: collation doesn't really work in command line utilities.

Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.

Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):

$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b

As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.

$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇 
〼 303030
〇 3c9b

The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.

I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing is

The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.

It looks like Glibc is instead ignoring characters that aren't specified. I don't know whether there's a flaw in my understanding of the POSIX spec, whether I missed something in Glibc's locale definition, or whether there's a bug in the Glibc locale compiler.
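
A workaround that is not part of the explanation above, but follows from it, is to bypass locale collation by running the tool in the C locale, where lines are compared byte by byte and two different characters can never compare equal. A minimal sketch, assuming byte order is acceptable as the sort order; `dedup_bytes` is a hypothetical name.

```shell
# dedup_bytes: hypothetical helper; sorts and de-duplicates stdin bytewise.
dedup_bytes() {
    # LC_ALL=C applies to this one sort invocation only, so the locale
    # collation tables are not consulted.
    LC_ALL=C sort -u
}

# Both distinct characters survive, even though one of them is repeated.
printf '%s\n' '〼' '〇' '〼' | dedup_bytes
```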

Sort and Uniq in Awk – How to Use

To sort, you can also use a pipe inside an awk command, as in:

awk '{ print ... | "sort ..." }'

With this syntax, every line that is printed is passed to the same instance of sort.

Of course, you can do the equivalent at the shell level:

awk '{ print ... }' | sort ...
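
As a concrete instance of the in-awk pipe, the sketch below sends the first field of every line to a single sort process and closes the pipe in the END block. The function name and the sample data are made up for illustration.

```shell
# sorted_first_fields: hypothetical helper; prints the first field of each
# input line, sorted.
sorted_first_fields() {
    # Every print writes to the same "sort" process; close() flushes the
    # pipe and waits for sort to finish.
    awk '{ print $1 | "sort" } END { close("sort") }'
}

printf '%s\n' 'pear 3' 'apple 1' 'fig 2' | sorted_first_fields
```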

Or you can use GNU awk, which has built-in sort functions (asort and asorti).

In awk, the equivalent of uniq is typically achieved by storing each unique data element or key in an associative array and checking whether new data has already been seen. One example to illustrate:

awk '!a[$0]++'

This means: if the current line is not yet in the array, the condition is true and the default action (printing the line) is triggered. Subsequent lines with the same data make the condition false and are not printed.
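
The same logic can be spelled out step by step. This is only an expanded sketch of the one-liner; the names `dedup_keep_order` and `seen` are chosen here for illustration.

```shell
# dedup_keep_order: hypothetical helper; prints each distinct input line
# once, in order of first appearance.
dedup_keep_order() {
    awk '{
        if (seen[$0] == 0) print   # first occurrence: the count is still 0
        seen[$0]++                 # remember that this line has been seen
    }'
}

printf '%s\n' b a b c a | dedup_keep_order
```

Unlike `sort | uniq`, this keeps the input order and needs no sorting, at the cost of holding every distinct line in memory.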