I have a PC with Intel(R) Pentium(R) CPU G640 @ 2.80 GHz and 8 GB of RAM. I am running Scientific Linux 6.5 on it with EXT3 filesystem.
On this setup, what is the fastest way I can do a sort -u on a 200 gigabyte file?

Should I split the file into smaller files (smaller than 8 GB), sort -u them, put them together, then split them again at a different size, sort -u again, and so on? Or are there any sorting scripts or programs that could handle files this big with my limited amount of RAM?
Best Answer
GNU sort (which is the default on most Linux systems) has a --parallel option, documented at http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html. Since your CPU has 2 cores, you could run sort with --parallel=2.
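A minimal sketch of that invocation (the file names bigfile and bigfile.sorted are placeholders for your actual files):

```shell
# Small stand-in for the real 200 GB input (name is a placeholder)
printf 'banana\napple\nbanana\ncherry\n' > bigfile

# Run two sorts in parallel and drop duplicates in the same pass;
# -o names the output file and is safe even if it equals the input
sort --parallel=2 -u -o bigfile.sorted bigfile
```

On a machine with 8 GB of RAM and a 200 GB input, it may also help to cap sort's memory buffer with -S (e.g. -S 6G) and point its temporary files at a disk with enough free space using -T /some/dir, since sort spills to temporary files when the input exceeds the buffer.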
It is better to specify the actual number of physical cores, since hyper-threading can make the processor appear to have more than it really does.
You could also experiment with nice to influence the processor scheduling priority and ionice to influence I/O scheduling. Raising sort's priority over other processes this way is unlikely to give you large savings, since these tools are usually better suited to making sure a background process doesn't use too many resources, but you can nevertheless combine them with sort.

Note also that, as Gilles commented, a single GNU sort command will be faster than any other method of breaking down the sorting, since the algorithm is already optimised to handle large files. Anything else will likely just slow things down.
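Combining nice and ionice with sort might look like this (a sketch: the file names and priority values are illustrative, and -n -20 requires root):

```shell
# Small stand-in for the real input file (name is a placeholder)
printf 'banana\napple\nbanana\n' > bigfile

# nice -n -20 requests the highest CPU priority (root only; without root,
# GNU nice warns and runs the command at the current priority anyway).
# ionice -c 2 -n 0 selects the best-effort I/O class at its highest level.
nice -n -20 ionice -c 2 -n 0 \
  sort --parallel=2 -u -o bigfile.sorted bigfile
```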