Bash – How to use “parallel” to speed up “sort” for big files fitting in RAM

bash, multithreading, performance, sort

I have a 100 M line file that fits in RAM on a GNU/Linux system.

This is rather slow:

sort bigfile > bigfile.sorted

and does not use all 48 cores on my machine.

How do I sort that file fast?

Best Answer

Let us assume you have 48 cores, 500 GB of free RAM, and that the file has 100 M lines and fits in memory.
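
If you want to reproduce the numbers, here is a minimal sketch for generating a comparable test file (assuming seq and shuf are available; the exact content does not matter much):

$ # ~100 M lines of shuffled numbers - adjust the count to taste
$ seq 100000000 | shuf > bigfile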

If you use normal sort it is rather slow:

$ time sort bigfile > bigfile.sort
real    4m48.664s
user    21m15.259s
sys     0m42.184s

You can make it a bit faster by ignoring your locale (with LC_ALL=C, sort compares raw bytes instead of doing locale-aware collation):

$ export LC_ALL=C
$ time sort bigfile > bigfile.sort
real    1m51.957s
user    6m2.053s
sys     0m42.524s

You can make it faster by telling sort to use more cores:

$ export LC_ALL=C
$ time sort --parallel=48 bigfile > bigfile.sort
real    1m39.977s
user    15m32.202s
sys     1m1.336s

You can also try giving sort more working memory (this does not help if sort already has enough memory):

$ export LC_ALL=C
$ time sort --buffer-size=80% --parallel=48 bigfile > bigfile.sort
real    1m39.779s
user    14m31.033s
sys     1m0.304s

But it seems sort still does a lot of the work single-threaded. You can force it to parallelize more with:

# Merge the sort commands on stdin two by two (with "sort -m") until only
# one command is left; $1 controls the recursion depth and is halved each
# round. --dr (--dry-run) makes parallel print the command instead of running it.
$ merge() {
    if [ $1 -le 1 ] ; then
        parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;'
    else
        parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;' |
          merge $(( $1/2 ));
    fi
  }
# Generate commands that will read blocks of bigfile and sort those
# This only builds the command - it does not run anything
$ parallel --pipepart -a bigfile --block -1 --dr -vv sort |
    # Merge these commands 2 by 2 until only one is left
    # This only builds the command - it does not run anything
    merge $(parallel --number-of-threads) |
    # Execute the command
    # This runs the command built in the previous step
    bash > bigfile.sort
real    0m30.906s
user    0m21.963s
sys     0m28.870s

It chops the file on the fly into 48 blocks (one block per core) and sorts those blocks in parallel. Then it merge sorts the sorted blocks pairwise, then merge sorts pairs of the results, and so on, halving the number of streams at each level until only a single sorted stream is left. All of this is done in parallel when possible.
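
To see the shape of the command this builds, here is a minimal hand-written sketch for just 4 blocks (block1..block4 are hypothetical pre-split files; the real pipeline reads the blocks straight out of bigfile via --pipepart and puts mbuffer between the stages):

$ sort -m \
    <(sort -m <(sort block1) <(sort block2)) \
    <(sort -m <(sort block3) <(sort block4)) \
    > bigfile.sort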

For a 100 GB file with 4 G lines the timings are:

$ LC_ALL=C time sort --parallel=48 -S 80% --compress-program pzstd bigfile >/dev/null
real    77m22.255s
$ LC_ALL=C time parsort bigfile >/dev/null
649.49user 727.04system 18:10.37elapsed 126%CPU (0avgtext+0avgdata 32896maxresident)k

So the parallelization speeds things up by roughly a factor of 4 (77m22s vs. 18m10s elapsed).

To make it easier to use, I have made it into a small tool: parsort, which is now part of GNU Parallel.

It supports sort options and reading from stdin, too (parsort -k2rn < bigfile).
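
For example, combined with the locale trick from above (a minimal sketch; the sort options are passed straight through to sort):

$ LC_ALL=C parsort bigfile > bigfile.sort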
