Bash Sort Uniq – Difference Between ‘sort -u’ and ‘sort | uniq’

Tags: bash, sort, uniq

Everywhere I see someone needing to get a sorted, unique list, they always pipe to sort | uniq. I've never seen any examples where someone uses sort -u instead. Why not? What's the difference, and why is it better to use uniq than the unique flag to sort?

Best Answer

sort | uniq existed before sort -u and is compatible with a wider range of systems, although almost all modern systems do support -u, since it is specified by POSIX. The habit is mostly a throwback to the days when sort -u didn't exist: people tend not to change their methods while the way they know continues to work (just look at ifconfig vs. ip adoption).

The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. sort -u is also faster internally, because it can do both operations at the same time and needs no IPC between sort and uniq. Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.
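As a quick check that the two forms agree when no other options are involved, here is a minimal sketch. The sample input is made up for illustration and is not from the answer.

```shell
# Hypothetical sample input with duplicates
input='pear
apple
pear
fig
apple'

# One process: sort and deduplicate together
with_flag=$(printf '%s\n' "$input" | sort -u)

# Two processes: sort, then drop adjacent duplicates
with_pipe=$(printf '%s\n' "$input" | sort | uniq)

printf '%s\n' "$with_flag"
[ "$with_flag" = "$with_pipe" ] && echo "same output"
```

Both variables should hold apple, fig and pear, one per line.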

On my system I consistently get results like this:

$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null

real        0m0.500s
user        0m0.767s
sys         0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null

real        0m0.772s
user        0m1.137s
sys         0m0.273s

sort -u also doesn't mask the return code of sort, which may be important. In a pipeline, the exit status you see is that of uniq, the last command. (Modern shells offer ways to recover it, for example bash's $PIPESTATUS array, but this wasn't always true.)
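A small sketch of the masking effect, assuming bash (PIPESTATUS is not available in plain sh). The path /nonexistent/input is a hypothetical file that must not exist, so that sort fails.

```shell
# Assumes bash. /nonexistent/input is a hypothetical, missing file.
sort /nonexistent/input 2>/dev/null | uniq >/dev/null
# Copy PIPESTATUS immediately; the next command overwrites it
statuses=("${PIPESTATUS[@]}")
echo "sort | uniq: sort exited ${statuses[0]}, pipeline exited ${statuses[1]}"

# With a single command, the failure is visible directly in $?
single_status=0
sort -u /nonexistent/input >/dev/null 2>&1 || single_status=$?
echo "sort -u exited $single_status"
```

The pipeline as a whole reports success because uniq succeeded on empty input, while sort -u reports the failure itself.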

Related Solutions

Sort and Uniq in Awk – How to Use

To sort, you can also use a pipe inside an awk command, as in:

awk '{ print ... | "sort ..." }'

With this syntax, every line printed that way is passed to the same instance of sort.

Of course, you can do the equivalent at the shell level:

awk '{ print ... }' | sort ...
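To make the two forms concrete, here is a minimal sketch with made-up two-column input. The choice of the second field and of sort -u is only for illustration.

```shell
# Hypothetical sample data: a number and a name per line
data='3 cherry
1 apple
2 banana
1 apple'

# Pipe opened inside awk: every print goes to one sort process
inside=$(printf '%s\n' "$data" | awk '{ print $2 | "sort -u" }')

# Same result with the pipe at the shell level
outside=$(printf '%s\n' "$data" | awk '{ print $2 }' | sort -u)

printf '%s\n' "$inside"
[ "$inside" = "$outside" ] && echo "same output"
```

In the first form, sort only produces its output once awk closes the pipe, which happens when awk exits unless you call close() earlier.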

Or you can use GNU awk, which has a couple of built-in sort functions (asort and asorti).

In awk, uniq is typically done by storing each unique data element or key in an associative array and checking whether new data still needs to be recorded. One example to illustrate:

awk '!a[$0]++'

This means: if the current line is not yet in the array, the condition is true and the default action (printing the line) is triggered. The ++ then records the line, so subsequent lines with the same data make the condition false and are not printed.
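A short sketch of that idiom on made-up input. Unlike sort -u, it keeps the lines in their original order and needs no sorting, at the cost of holding every distinct line in memory.

```shell
# Hypothetical unsorted input with repeats
first_seen=$(printf 'b\na\nb\nc\na\n' | awk '!a[$0]++')

# Each line appears once, in order of first appearance
printf '%s\n' "$first_seen"
```

Here the output should be b, a, c, one per line.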

Uniq and bash for loop not writing to stdout before stdin closing (for one-line website visitor notification system)

I think I understand what you are trying to accomplish:

For each hit to the web site, which is logged by the web server:
If the visit is "unique" (how do you define this??) log the entry and send an audible notification.

The trick is how you define "unique". Is it by URL, by IP address, by cookie? Your approach with awk was arguably the right way to go, but you got snagged by shell-escaping rules.

So here is something that sort of combines your approaches. First, you really need a script on the web server to do this. Otherwise you're going to be lost in complex quotation-escaping rules. Second, I'm assuming your web-server is using the "common-log format", which frankly, sucks for this kind of work, but we can work with it.

while true; do 
  ssh root@speedy remote-log-capturing-script
done > unique-visits.log

Use mikeserv's excellent suggestion about MAILFILE. The script on speedy should look like this:

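#!/bin/sh
# Follow the log, keep status-200 lines, extract the GET request,
# and print each request path only the first time it appears.
tail -n 1 -f /var/log/apache2/www.access.log |
awk '$(NF-1) == 200 { print; fflush() }' |
grep --line-buffered -o '"GET [^"]*"' |
awk '!url[$2] { print; fflush(); url[$2]=1 }'

Awk is not line-buffered when it writes to a pipe; most implementations buffer that output in blocks, so each awk calls fflush() after printing. The first awk ensures you're only getting actual successful hits (status 200), not cached hits or 404s. The grep -o prints out only the matching part of the input, in this case the quoted GET request. (--line-buffered is a GNU grep option, and I assume GNU grep is what you are using. If not, use the stdbuf trick.) The last awk uses a little expression to print the input line only if its request path was never seen before. It keys on $2 because, after grep -o, $1 is always "GET.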
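To see what the filter stages do without a live log, here is a sketch that feeds them a few made-up lines in common log format. The addresses, paths and sizes are hypothetical. Since grep -o trims each line to the quoted request, its first field is always "GET, so the sketch keys the final awk on the second field, the path.

```shell
# Hypothetical access-log lines in common log format (made up for illustration)
log='10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /missing HTTP/1.1" 404 128
10.0.0.3 - - [01/Jan/2024:00:00:04 +0000] "GET /about.html HTTP/1.1" 200 256'

unique_requests=$(printf '%s\n' "$log" |
  awk '$(NF-1) == 200' |      # keep only status 200
  grep -o '"GET [^"]*"' |     # keep only the quoted request
  awk '!seen[$2]++')          # first occurrence of each path

printf '%s\n' "$unique_requests"
```

Only the /index.html and /about.html requests should remain, each printed once. Flushing is left out here because the input is finite.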

You can also do this with perl to achieve more complexity within one fork:

#!/bin/sh
# Print each successfully fetched request once; $url{...}=1 marks it as seen
tail -n 1 -f /var/log/apache2/www.access.log |
perl -lane '$|=1;' \
  -e 'if ($F[$#F-1] eq "200" and ' \
  -e ' /\s"GET\s([^"]*)"\s/ and !$url{$1}) { ' \
  -e '  print $1; $url{$1}=1; }'

Now both of these will only print unique URLs. What if two web clients from different IPs hit the same page? You only get one output. To change that in the perl solution, modify the key that goes into url, both where it is tested and where it is set:

 $url{$F[0],$1}

When using perl -a, $F[0] represents the first white-space-delimited field of input, just like awk's $1, i.e. the connecting hostname/IP address. And perl's $1 represents the first matching subexpression of the regular expression /\s"GET\s([^"]*)"\s/, i.e. everything between GET and the closing quote: the URL followed by the protocol version. The cryptic $F[$#F-1] means the 2nd-to-last field of the input line.
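The same composite-key idea can be sketched in awk rather than perl. This is an illustration, not the answer's code: the log lines are made up, and it assumes a standard common-log-format request, so that the path is the seventh field.

```shell
# Hypothetical log lines; the third one repeats client and page of the first
log='10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.3 - - [01/Jan/2024:00:00:04 +0000] "GET /about.html HTTP/1.1" 200 256'

# Key on (client, path): $1 is the client, $7 the path (assumed layout)
per_client=$(printf '%s\n' "$log" |
  awk '$(NF-1) == 200 && !seen[$1, $7]++ { print $1, $7 }')

printf '%s\n' "$per_client"
```

Two different clients hitting /index.html now produce two lines, while the repeat visit from 10.0.0.1 is dropped.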
