Multicore equivalent for ‘| sort | uniq -c | sort -n’ command

parallelism sort

Is there an out-of-the-box multicore equivalent of a '| sort | uniq -c | sort -n' pipeline?

I know that I can use the procedure below:

split -l5000000 data.tsv '_tmp'
# Sort each chunk in the background, then wait for all of them to finish before merging.
for FILE in _tmp*; do sort "$FILE" -o "$FILE" & done
wait
sort -m _tmp* -o data.tsv.sorted

But that feels a bit overwhelming.

Best Answer

GNU sort has a --parallel flag:

sort --parallel=8 data.tsv | uniq -c | sort --parallel=8 -n

This would use up to eight concurrent threads for each of the two sorting steps. The uniq -c part still runs as a single process.

As Stéphane Chazelas points out in comments, the GNU implementation of sort is already parallelised (it uses POSIX threads), so setting the number of concurrent threads is only needed if you want it to use more or fewer threads than its default. According to the GNU sort manual, that default is the number of available processors, capped at eight.
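If you do want to pin the thread count to the number of cores explicitly, something along these lines should work. This is only a sketch: it assumes GNU coreutils (for nproc and sort --parallel), and sample.tsv is a made-up stand-in for data.tsv.

```shell
# Assumes GNU coreutils; sample.tsv is a hypothetical stand-in for data.tsv.
workdir=$(mktemp -d)
printf '%s\n' pear apple pear fig apple pear > "$workdir/sample.tsv"

# nproc prints the number of processing units available to this process.
threads=$(nproc)

# Pass the count explicitly to both sorts; raise or lower it as needed.
counts=$(sort --parallel="$threads" "$workdir/sample.tsv" | uniq -c | sort --parallel="$threads" -n)
printf '%s\n' "$counts"

rm -rf "$workdir"
```

Replacing $(nproc) with a fixed number is how you would ask for more or fewer threads than the default.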

Note that the second sort will likely get much less data than the first, due to the uniq step, so it will be much quicker.
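To get a feel for how much the uniq step shrinks the input of the second sort, you can count lines at both points in the pipeline. The sketch below uses generated data (1000 lines with only three distinct values), which is an assumption for illustration; how much real data shrinks depends entirely on how many duplicates it contains.

```shell
# Hypothetical sample: 1000 lines, but only 3 distinct values.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 0; i < 1000; i++) print "key" (i % 3) }' > "$workdir/sample.tsv"

# Lines the first sort has to process.
first_sort_lines=$(wc -l < "$workdir/sample.tsv")

# Lines left for the second sort after uniq -c has collapsed duplicates.
second_sort_lines=$(sort "$workdir/sample.tsv" | uniq -c | wc -l)

echo "first sort: $first_sort_lines lines, second sort: $second_sort_lines lines"

rm -rf "$workdir"
```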

You may also (possibly) improve sorting speed by playing around with --buffer-size=SIZE and --batch-size=NMERGE. See the sort manual.
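As a rough sketch of what that looks like on the command line (assuming GNU sort; the values 64M and 32 are arbitrary placeholders rather than recommendations, and sample.tsv is a made-up stand-in for data.tsv):

```shell
# Hypothetical sample input standing in for data.tsv.
workdir=$(mktemp -d)
printf '%s\n' pear apple pear fig apple pear > "$workdir/sample.tsv"

# --buffer-size sets the size of the main memory buffer;
# --batch-size sets how many inputs are merged at once.
# 64M and 32 are placeholder values; measure before settling on anything.
tuned=$(sort --buffer-size=64M --batch-size=32 "$workdir/sample.tsv" | uniq -c | sort -n)
printf '%s\n' "$tuned"

rm -rf "$workdir"
```

Whether either option helps depends on the data size and the machine, so it is worth timing a few values against your real input.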

To speed the sorting up further, make sure that sort writes its temporary files to a fast filesystem (if you have several types of storage attached). You may do this by setting the TMPDIR environment variable to the path of a writable directory on such a mountpoint (or by using sort -T directory).
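Both ways of choosing the temporary directory can be sketched as follows. The fast-tmp directory here is hypothetical and lives under a scratch directory only so the example is self-contained; in practice you would point it at a directory on your fastest storage.

```shell
# Hypothetical paths: fast_tmp stands in for a directory on fast storage.
workdir=$(mktemp -d)
fast_tmp="$workdir/fast-tmp"
mkdir -p "$fast_tmp"
printf '%s\n' pear apple pear fig apple pear > "$workdir/sample.tsv"

# Option 1: set TMPDIR for this one sort invocation.
via_env=$(TMPDIR="$fast_tmp" sort "$workdir/sample.tsv")

# Option 2: name the directory with -T.
via_flag=$(sort -T "$fast_tmp" "$workdir/sample.tsv")

printf '%s\n' "$via_flag"

rm -rf "$workdir"
```

Note that a TMPDIR=... prefix applies only to the command it precedes, so in a pipeline with two sorts you would either prefix both or export the variable first.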
