I have a 100 M line file that fits in RAM on a GNU/Linux system.
This is rather slow:
sort bigfile > bigfile.sorted
and does not use all 48 cores on my machine.
How do I sort that file fast?
Tags: bash, multithreading, performance, sort
Best Answer
Let us assume you have 48 cores, 500 GB of free RAM, and a file of 100 M lines that fits in memory.
If you use normal sort it is rather slow:
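sort bigfile > bigfile.sorted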
You can make it a bit faster by ignoring your locale:
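LC_ALL=C sort bigfile > bigfile.sorted
(LC_ALL=C makes sort compare raw bytes instead of applying the locale's collation rules, which is much cheaper.)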
You can make it faster by telling sort to use more cores:
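LC_ALL=C sort --parallel=48 bigfile > bigfile.sorted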
You can also try giving sort more working memory (this does not help if sort already has enough memory):
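LC_ALL=C sort --parallel=48 -S 80% bigfile > bigfile.sorted
(-S/--buffer-size also accepts absolute sizes, e.g. -S 100G.)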
But it seems sort really likes to do a lot of single threading. You can force it to parallelize more with:
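A minimal sketch of that idea (not the actual parsort implementation), assuming GNU Parallel is installed and TMPDIR has room for the temporary sorted blocks; --block -1 tells --pipepart to split the file into one block per jobslot:

# Sort the blocks in parallel; --files stores each sorted block
# in a tmpfile under TMPDIR and prints the tmpfile's name.
parallel --pipepart -a bigfile --block -1 --files "LC_ALL=C sort" > parts.lst
# Merge the sorted blocks pairwise, round after round, until a
# single file remains.
while [ "$(wc -l < parts.lst)" -gt 1 ]; do
    parallel -N2 --files "LC_ALL=C sort -m" :::: parts.lst > next.lst
    xargs rm -f < parts.lst
    mv next.lst parts.lst
done
mv "$(cat parts.lst)" bigfile.sorted
rm -f parts.lst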
It chops the file into 48 blocks on the fly (one block per core) and sorts those blocks in parallel. Then pairs of sorted blocks are merge-sorted; then pairs of the merged results are merge-sorted again; and so on, until only a single output remains. All of this is done in parallel when possible.
For a 100 GB file with 4 G lines, this parallelization speeds sorting up by around a factor of 4.
To make it easier to use I have made it into a small tool:
parsort
which is now part of GNU Parallel. It supports sort options and reading from stdin, too (parsort -k2rn < bigfile).
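With that, sorting the original file is simply:
parsort bigfile > bigfile.sorted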