I would like to ask if there is an out-of-the-box multicore equivalent for a '| sort | uniq -c | sort -n' command?
I know that I can use the procedure below:
split -l5000000 data.tsv '_tmp'
# sort each chunk in the background, then wait for all of them to finish
for FILE in _tmp*; do sort "$FILE" -o "$FILE" & done
wait
sort -m _tmp* -o data.tsv.sorted
But it feels a bit overwhelming.
Best Answer
GNU sort has a --parallel flag. Passing --parallel=8 to both sort invocations in the pipeline would use eight concurrent threads for each of the two sorting steps. The uniq -c part will still be using a single process.
As Stéphane Chazelas points out in comments, the GNU implementation of sort is already parallelised (it's using POSIX threads), so modifying the number of concurrent threads is only needed if you want it to use more or fewer threads than you have cores.
Note that the second sort will likely get much less data than the first, due to the uniq step, so it will be much quicker.
You may also (possibly) improve sorting speed by playing around with --buffer-size=SIZE and --batch-size=NMERGE. See the sort manual.
To further speed the sorting up, make sure that sort writes its temporary files to a fast filesystem (if you have several types of storage attached). You may do this by setting the TMPDIR environment variable to the path of a writable directory on such a mountpoint (or use sort -T directory).
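A minimal sketch of the whole pipeline with these options, assuming GNU sort. The helper name count_sorted is hypothetical, and the thread count, the 64M buffer size, and the temporary directory are illustrative values rather than recommendations:

```shell
# Sketch: frequency count using GNU sort's threading and temp-dir options.
# count_sorted is a hypothetical helper name, not part of coreutils.
count_sorted() {
    input=$1                       # file to process
    threads=${2:-8}                # value passed to --parallel
    tmpdir=${3:-${TMPDIR:-/tmp}}   # where sort writes its temporary files

    # The first sort sees the full input, so it gets the buffer and
    # temp-dir settings; the second sort only sees the uniq -c output.
    sort --parallel="$threads" --buffer-size=64M -T "$tmpdir" -- "$input" |
        uniq -c |
        sort --parallel="$threads" -n
}

# Tiny demonstration input; replace with the real file, e.g. data.tsv.
sample=$(mktemp)
printf '%s\n' b a b c b a > "$sample"
count_sorted "$sample" 8
rm -f "$sample"
```

On the sample input this lists c with a count of 1, a with 2, and b with 3, in that order. Whether a larger buffer or a different thread count actually helps depends on the data and the machine, so it is worth timing a few settings on a realistic file.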