How to Sort Large Files – Efficient Sorting Techniques


I have a PC with an Intel(R) Pentium(R) CPU G640 @ 2.80 GHz and 8 GB of RAM. I am running Scientific Linux 6.5 on it with an ext3 filesystem.

On this setup, what is the fastest way I can do a sort -u on a 200 gigabyte file?

Should I split the file into smaller files (smaller than 8 GB), sort -u each of them, concatenate them, split again at a different size, sort -u again, and so on? Or is there a sorting script or program that can handle files this big with my limited amount of RAM?
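For concreteness, the split/sort/merge approach I have in mind would look roughly like this (the chunk size, prefix, and file names are just placeholders):

# split on line boundaries into pieces of at most 4 GB each
split -C 4G list.txt chunk-
# sort and de-duplicate each piece on its own
for f in chunk-*; do sort -u -o "$f.sorted" "$f"; done
# merge the already-sorted pieces: -m merges without re-sorting, -u drops duplicates across pieces
sort -m -u -o list-sorted.txt chunk-*.sorted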

Best Answer

GNU sort (which is the default on most Linux systems) has a --parallel option. From http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html:

‘--parallel=n’

Set the number of sorts run in parallel to n. By default, n is set to the number of available processors, but limited to 8, as there are diminishing performance gains after that. Note also that using n threads increases the memory usage by a factor of log n. Also see nproc invocation.

Since your CPU has 2 cores, you could do:

sort --parallel=2 -uo list-sorted.txt list.txt

It is better to specify the actual number of physical cores, since hyper-threading can make more processors appear available than there really are.
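If memory or temporary disk space turns out to be the limiting factor, GNU sort's -S (buffer size) and -T (temporary directory) options let you control both explicitly; the 4G buffer and the /mnt/scratch directory below are only example values for an 8 GB machine:

# use a large in-memory buffer (fewer, larger temporary runs) and keep
# sort's temporary files on a filesystem with plenty of free space
sort --parallel=2 -S 4G -T /mnt/scratch -uo list-sorted.txt list.txt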

You could also experiment with nice to influence the processor scheduling priority and ionice to influence I/O scheduling. You can raise the priority over other processes like this, although I don't think it will give you large savings, as these tools are usually better suited to making sure a background process doesn't use too many resources. Nevertheless, you can combine them with something like:

nice -n -20 ionice -c2 -n7 sort --parallel=2 -uo list-sorted.txt list.txt
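Bear in mind that a negative nice value such as -20 normally requires root privileges, so you would either run the command via sudo or, as an ordinary user, rely on ionice alone:

# raising CPU priority (negative nice) needs root
sudo nice -n -20 ionice -c2 -n7 sort --parallel=2 -uo list-sorted.txt list.txt
# without root, ionice can still be used on its own
ionice -c2 -n7 sort --parallel=2 -uo list-sorted.txt list.txt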

Note also that, as Gilles commented, using a single GNU sort command will be faster than any other way of breaking the sort down, since the algorithm is already optimised to handle large files. Anything else will likely just slow things down.
