Bash Sort Uniq – Difference Between ‘sort -u’ and ‘sort | uniq’


Whenever I see someone who needs a sorted, unique list, they pipe to sort | uniq. I've never seen an example that uses sort -u instead. Why not? What's the difference, and why would uniq be preferable to sort's unique flag?

Best Answer

sort | uniq existed before sort -u and works on a wider range of systems, although almost all modern systems do support -u, since it is specified by POSIX. The habit is mostly a throwback to the days when sort -u didn't exist: people tend not to change their methods while the way they know continues to work (compare the slow adoption of ip over ifconfig).

The two were likely merged because removing duplicates from a file requires sorting (at least in the standard case) and is an extremely common use of sort. sort -u is also faster, because it can do both operations at once and needs no IPC between sort and uniq. Especially when the file is big, sort -u will likely use fewer intermediate files to sort the data.
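As a quick check that the two forms produce the same result, here is a minimal sketch, assuming a POSIX-like shell with mktemp available. The sample data and the variable names (tmp, with_flag, with_pipe) are made up for illustration and are not from the original answer.

```shell
# Hypothetical sample input containing duplicate lines.
tmp=$(mktemp)
printf '%s\n' pear apple pear banana apple > "$tmp"

# One process: sort and drop duplicates together.
with_flag=$(sort -u "$tmp")

# Two processes: sort first, then uniq drops adjacent duplicates.
with_pipe=$(sort "$tmp" | uniq)

# Both variables now hold the three lines apple, banana, pear.
printf '%s\n' "$with_flag"

rm -f "$tmp"
```

The pipeline form is still the one to reach for when you need uniq's own options, such as -c to count occurrences, which sort -u does not offer.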

Timing both approaches on my system, I consistently get results like this:

$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null

real        0m0.500s
user        0m0.767s
sys         0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null

real        0m0.772s
user        0m1.137s
sys         0m0.273s

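sort -u also doesn't mask the exit status of sort, which may be important: the status of a pipeline is that of its last command, so sort | uniq reports uniq's status rather than sort's. (Modern shells offer ways to recover it, for example bash's PIPESTATUS array, but this wasn't always true.)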
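The exit-status difference can be seen with a minimal sketch like the following, assuming a POSIX-like shell with pipefail not enabled. The path /nonexistent/input is a hypothetical file chosen only so that sort fails, and the variable names are illustrative.

```shell
# Hypothetical path that does not exist, so sort cannot open it.
missing=/nonexistent/input

# Single command: the failure of sort is reported directly.
status_flag=0
sort -u "$missing" 2>/dev/null || status_flag=$?

# Pipeline: the shell reports the status of uniq, the last command,
# which succeeds on the empty input it receives from the failed sort.
status_pipe=0
sort "$missing" 2>/dev/null | uniq || status_pipe=$?

echo "sort -u: $status_flag, sort | uniq: $status_pipe"
```

Here status_flag ends up non-zero while status_pipe is 0. In bash, reading ${PIPESTATUS[0]} immediately after the pipeline recovers sort's status, and set -o pipefail makes the whole pipeline fail when any stage fails.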
