Sort and Uniq in Awk – How to Use

awksortuniq

I know there are "sort" and "uniq" out there, however, today's question is about how to utilise AWK to do that kind of a job. Say if I have a list of anything really (ips, names, or numbers) and I want to sort them;

Here is an example I am taking the IP numbers from a mail log:

awk 'match($0,/\[[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+\]/) { if ( NF == 8 && $6 == "connect" ) {print substr($0, RSTART+1,RLENGTH-2)} }' maillog

Is it possible to sort them, ips, "on the go" within the same awk command? I do not require a complete answer to my question but some hints where to start.

Cheers!

Best Answer

To sort you can use a pipe also inside of an awk command, as in:

awk '{ print ... | "sort ..." }'

The syntax means that all respective lines of the data file will be passed to the same instance of sort.

Of course you can also do that equivalently on shell level:

awk '{ print ... }' | sort ...

Or you can use GNU awk which has a couple sort functions natively defined.

The uniq is in awk typically accomplished by saving the "unique data element or key" in an associative array and checking whether new data need to be memorized. One example to illustrate:

awk '!a[$0]++'

This means: If the current line is not in the array then the condition is true and the default action to print the line triggered. Subsequent lines with the same data will result in a false condition and the data will not be printed.

Related Solutions

Shell – Get lines with maximum values in the column using awk, uniq and sort

I think you want

cat myfile.txt| sort -k1 -r | sort --unique --stable -k2,3

(see my comment regarding cat above). The first sort will put the newest dates to the top. The second sort will sort by user+access, but, by giving --stable, will keep the previous order of lines that have the same user+access combination, i.e. newest still on top. Giving --unique, only the first line of a run with equal user+access combination is shown. (You can replace it with | uniq -f1, I'd think, if it happens to be a GNU extension your sort doesn't have.)

Bash Sort Uniq – Difference Between ‘sort -u’ and ‘sort | uniq’

sort | uniq existed before sort -u, and is compatible with a wider range of systems, although almost all modern systems do support -u -- it's POSIX. It's mostly a throwback to the days when sort -u didn't exist (and people don't tend to change their methods if the way that they know continues to work, just look at ifconfig vs. ip adoption).

The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. It is also faster internally as a result of being able to do both operations at the same time (and due to the fact that it doesn't require IPC between uniq and sort). Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.

On my system I consistently get results like this:

$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null

real        0m0.500s
user        0m0.767s
sys         0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null

real        0m0.772s
user        0m1.137s
sys         0m0.273s

It also doesn't mask the return code of sort, which may be important (in modern shells there are ways to get this, for example, bash's $PIPESTATUS array, but this wasn't always true).

Best Answer

Related Solutions

Shell – Get lines with maximum values in the column using awk, uniq and sort

Bash Sort Uniq – Difference Between ‘sort -u’ and ‘sort | uniq’

Related Question