Bash – How to use “parallel” to speed up “sort” for big files fitting in RAM

bash, multithreading, performance, sort

I have a 100 M line file that fits in RAM on a GNU/Linux system.

This is rather slow:

sort bigfile > bigfile.sorted

and does not use all 48 cores on my machine.

How do I sort that file fast?

Best Answer

Let us assume you have 48 cores, 500 GB of free RAM, and that the file has 100 M lines and fits in memory.
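
If you want to reproduce the numbers, here is a minimal sketch for generating a comparable test file (assuming seq and shuf are available; the exact content does not matter much):

$ # ~100 M lines of shuffled numbers - adjust the count to taste
$ seq 100000000 | shuf > bigfile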

If you use normal sort it is rather slow:

$ time sort bigfile > bigfile.sort
real    4m48.664s
user    21m15.259s
sys     0m42.184s

You can make it a bit faster by ignoring your locale (with LC_ALL=C, sort compares raw bytes instead of doing locale-aware collation):

$ export LC_ALL=C
$ time sort bigfile > bigfile.sort
real    1m51.957s
user    6m2.053s
sys     0m42.524s

You can make it faster by telling sort to use more cores:

$ export LC_ALL=C
$ time sort --parallel=48 bigfile > bigfile.sort
real    1m39.977s
user    15m32.202s
sys     1m1.336s

You can also try giving sort more working memory (this does not help if sort already has enough memory):

$ export LC_ALL=C
$ time sort --buffer-size=80% --parallel=48 bigfile > bigfile.sort
real    1m39.779s
user    14m31.033s
sys     1m0.304s

But it seems sort still does a lot of the work single-threaded. You can force it to parallelize more with:

# Merge the sort commands on stdin two by two (with "sort -m") until only
# one command is left; $1 controls the recursion depth and is halved each
# round. --dr (--dry-run) makes parallel print the command instead of running it.
$ merge() {
    if [ $1 -le 1 ] ; then
        parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;'
    else
        parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;' |
          merge $(( $1/2 ));
    fi
  }
# Generate commands that will read blocks of bigfile and sort those
# This only builds the command - it does not run anything
$ parallel --pipepart -a bigfile --block -1 --dr -vv sort |
    # Merge these commands 2 by 2 until only one is left
    # This only builds the command - it does not run anything
    merge $(parallel --number-of-threads) |
    # Execute the command
    # This runs the command built in the previous step
    bash > bigfile.sort
real    0m30.906s
user    0m21.963s
sys     0m28.870s

It chops the file on the fly into 48 blocks (one block per core) and sorts those blocks in parallel. Then it merge sorts the sorted blocks pairwise, then merge sorts pairs of the results, and so on, halving the number of streams at each level until only a single sorted stream is left. All of this is done in parallel when possible.
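
To see the shape of the command this builds, here is a minimal hand-written sketch for just 4 blocks (block1..block4 are hypothetical pre-split files; the real pipeline reads the blocks straight out of bigfile via --pipepart and puts mbuffer between the stages):

$ sort -m \
    <(sort -m <(sort block1) <(sort block2)) \
    <(sort -m <(sort block3) <(sort block4)) \
    > bigfile.sort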

For a 100 GB file with 4 G lines the timings are:

$ LC_ALL=C time sort --parallel=48 -S 80% --compress-program pzstd bigfile >/dev/null
real    77m22.255s
$ LC_ALL=C time parsort bigfile >/dev/null
649.49user 727.04system 18:10.37elapsed 126%CPU (0avgtext+0avgdata 32896maxresident)k

So the parallelization speeds things up by roughly a factor of 4 (77m22s vs. 18m10s elapsed).

To make it easier to use, I have made it into a small tool: parsort, which is now part of GNU Parallel.

It supports sort options and reading from stdin, too (parsort -k2rn < bigfile).
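
For example, combined with the locale trick from above (a minimal sketch; the sort options are passed straight through to sort):

$ LC_ALL=C parsort bigfile > bigfile.sort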
