What should the buffer size be for the sort command?


I have a machine with 2 TB of RAM, and I am running a sort command on a 150 GB file with the buffer size set to 1000 GB. After doing a bit of research on Google, I found this piece of advice: "the bigger the buffer size, the better the performance". This is the command that I ran:

sort -rk2 --buffer-size=1000G master_matrix_unsorted.csv > master_matrix_sorted.csv

But this is taking a long time, and I have no clue about the progress of the task.

Any idea what the best buffer size for this operation would be? I am planning to re-run this task with a new buffer size.

Best Answer

You don't specify the OS or the sort implementation; I assume you mean GNU sort. You also don't say how long "a lot of time" is, or how long you expect it to take. Most important, you don't mention the I/O subsystem's capability, which will be the governing factor.

An ordinary SATA drive delivers roughly 150 MB/s. At that rate your 150 GB file will take about 1000 seconds just to read, roughly 17 minutes. Try $ time cat master_matrix_unsorted.csv > /dev/null to see. If ~17 minutes (or whatever time cat shows) is acceptable, you might be able to get sort(1) to finish in about 3× that time, because the output has to be written, too.
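A minimal way to take that measurement (a sketch; the cache-drop line is Linux-specific, needs root, and is only there so the number reflects the disk rather than the page cache):

# Drop the page cache so the timing measures the disk, not RAM:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time cat master_matrix_unsorted.csv > /dev/null
# file size / real time ~= sequential read throughput; no sort of this
# file can finish faster than this read.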

Your best bet for speedup would seem to be --parallel, because your data fit in memory and you have spare processors. According to the info page, --buffer-size won't matter, because

... this option affects only the initial buffer size. The buffer grows beyond SIZE if `sort' encounters input lines larger than SIZE.

That said, a quick search indicates GNU sort uses a merge sort, which is amenable to parallelization.
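A sketch of the re-run I would try, assuming GNU coreutils sort; the thread count and temporary directory below are placeholders for your machine:

# --parallel caps sort's worker threads; pick a value near your core count.
# -T points sort's spill files at fast storage (it may still spill to disk
#    even with a large buffer). If column 2 is numeric, -k2,2nr (numeric,
#    reverse, that field only) is likely what you want instead of -rk2.
sort --parallel=16 --buffer-size=200G -T /fast/tmp \
     -rk2 master_matrix_unsorted.csv > master_matrix_sorted.csv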

If you really want to know how GNU sort determines buffer sizes and what algorithm it uses for parallel sorting, the coreutils source code and accompanying documentation are readily available.

But if I were you I wouldn't bother. Whatever you're doing with master_matrix_unsorted.csv, sort(1) is surely not up to the task.

First, a CSV file will, one day, trip you up, because CSV syntax is far beyond sort's ken: a quoted field can contain the delimiter, or even a newline. Second, it is the slowest possible way, because sort(1) is forced to compare entire rows of indeterminate length, not just the second column. Third, when you're done, what will you have? A sorted CSV file. Is that really better? Why does the order matter so much?
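To make that first point concrete, here is a contrived example (the data are hypothetical) where sort's comma splitting disagrees with a CSV parser:

# Mixed quoted/unquoted names: sort keys on the raw bytes, so the leading
# quote makes "Zed sort before Brown, and the embedded comma shifts Zed's
# count into sort's field 4 instead of field 3.
printf '%s\n' 'a,"Zed, Z",10' 'b,Brown,2' | sort -t, -k2,2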

Sorting sounds like one step along the way toward a goal, one that likely includes some kind of computation on the data, and that computation will require numbers in binary format. If that's the case, you might as well get the CSV file into a more tractable, computable, binary format first, say into a DBMS. You may find that sorting it turns out to be unnecessary to the ultimate goal.
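A minimal sketch of that route, assuming SQLite is an acceptable DBMS and that the file's first line is a header row; the table name matrix and column name col2 are hypothetical stand-ins for whatever your header actually says:

# Import the CSV once; SQLite parses quoting correctly and stores a
# binary, indexable copy.
sqlite3 matrix.db <<'EOF'
.mode csv
.import master_matrix_unsorted.csv matrix
CREATE INDEX matrix_col2 ON matrix(col2);
EOF
# Descending order on demand, without re-sorting the whole file.
# .import creates TEXT columns, hence the CAST for numeric ordering.
sqlite3 -csv matrix.db 'SELECT * FROM matrix ORDER BY CAST(col2 AS REAL) DESC;'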
